Search | arXiv e-print repository

LLaMA-MoE: Building Mixture-of-Experts from LLaMA with Continual Pre-training

Authors: Tong Zhu, Xiaoye Qu, Daize Dong, Jiacheng Ruan, **gqi Tong, Conghui He, Yu Cheng

Abstract: Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B mod… ▽ More Mixture-of-Experts (MoE) has gained increasing popularity as a promising framework for scaling up large language models (LLMs). However, training MoE from scratch in a large-scale setting still suffers from data-hungry and instability problems. Motivated by this limit, we investigate building MoE models from existing dense large language models. Specifically, based on the well-known LLaMA-2 7B model, we obtain an MoE model by: (1) Expert Construction, which partitions the parameters of original Feed-Forward Networks (FFNs) into multiple experts; (2) Continual Pre-training, which further trains the transformed MoE model and additional gate networks. In this paper, we comprehensively explore different methods for expert construction and various data sampling strategies for continual pre-training. After these stages, our LLaMA-MoE models could maintain language abilities and route the input tokens to specific experts with part of the parameters activated. Empirically, by training 200B tokens, LLaMA-MoE-3.5B models significantly outperform dense models that contain similar activation parameters. The source codes and models are available at https://github.com/pjlab-sys4nlp/llama-moe . △ Less

Submitted 24 June, 2024; originally announced June 2024.

arXiv:2406.11256 [pdf, other]

Dynamic Data Mixing Maximizes Instruction Tuning for Mixture-of-Experts

Authors: Tong Zhu, Daize Dong, Xiaoye Qu, Jiacheng Ruan, Wenliang Chen, Yu Cheng

Abstract: Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales. However, previous methods simply merge all training tasks (e.g. creative writing, coding, and mathematics) and apply fixed sampling weights, without considering the importance of different tasks as the model training state changes. In this way, the most helpful data c… ▽ More Mixture-of-Experts (MoE) models have shown remarkable capability in instruction tuning, especially when the number of tasks scales. However, previous methods simply merge all training tasks (e.g. creative writing, coding, and mathematics) and apply fixed sampling weights, without considering the importance of different tasks as the model training state changes. In this way, the most helpful data cannot be effectively distinguished, leading to suboptimal model performance. To reduce the potential redundancies of datasets, we make the first attempt and propose a novel dynamic data mixture for MoE instruction tuning. Specifically, inspired by MoE's token routing preference, we build dataset-level representations and then capture the subtle differences among datasets. Finally, we propose to dynamically adjust the sampling weight of datasets by their inter-redundancies, thus maximizing global performance under a limited training budget. The experimental results on two MoE models demonstrate the effectiveness of our approach on both downstream knowledge \& reasoning tasks and open-ended queries. Code and models are available at https://github.com/Spico197/MoE-SFT . △ Less

Submitted 17 June, 2024; originally announced June 2024.

arXiv:2406.02500 [pdf, other]

Demystifying the Compression of Mixture-of-Experts Through a Unified Framework

Authors: Shwai He, Daize Dong, Liang Ding, Ang Li

Abstract: Scaling large language models has revolutionized the performance across diverse domains, yet the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach addresses this by dynamically selecting and activating only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE intro… ▽ More Scaling large language models has revolutionized the performance across diverse domains, yet the continual growth in model size poses significant challenges for real-world deployment. The Mixture of Experts (MoE) approach addresses this by dynamically selecting and activating only a subset of experts, significantly reducing computational costs while maintaining high performance. However, MoE introduces potential redundancy (e.g., parameters) and extra costs (e.g., communication overhead). Despite numerous compression techniques developed for mitigating the redundancy in dense models, the compression of MoE remains under-explored. We first bridge this gap with a cutting-edge unified framework that not only seamlessly integrates mainstream compression methods but also helps systematically understand MoE compression. This framework approaches compression from two perspectives: Expert Slimming which compresses individual experts and Expert Trimming which removes structured modules. Within this framework, we explore the optimization space unexplored by existing methods,and further introduce aggressive Expert Trimming techniques, i.e., Layer Drop and Block Drop, to eliminate redundancy at larger scales. Based on these insights,we present a comprehensive recipe to guide practitioners in compressing MoE effectively. Extensive experimental results demonstrate the effectiveness of the compression methods under our framework and the proposed recipe, achieving a 6.05x speedup and only 20.0GB memory usage while maintaining over 92% of performance on Mixtral-8x7B. Code is released at \url{https://github.com/DaizeDong/Unified-MoE-Compression}. △ Less

Submitted 24 June, 2024; v1 submitted 4 June, 2024; originally announced June 2024.

Comments: 20 pages, 15 figures, 5 tables

arXiv:2405.17870 [pdf, other]

Full-Stack Allreduce on Multi-Rail Networks

Authors: Enda Yu, Dezun Dong, Xiangke Liao

Abstract: The high communication costs impede scalability in distributed systems. Multimodal models like Sora exacerbate this issue by requiring more resources than current networks can support. However, existing network architectures fail to address this gap. In this paper, we provide full-stack support for allreduce on multi-rail networks, aiming to overcome the scalability limitations of large-scale netw… ▽ More The high communication costs impede scalability in distributed systems. Multimodal models like Sora exacerbate this issue by requiring more resources than current networks can support. However, existing network architectures fail to address this gap. In this paper, we provide full-stack support for allreduce on multi-rail networks, aiming to overcome the scalability limitations of large-scale networks by facilitating collaborative data transfer across various networks. To achieve this, we propose the Nezha system, which integrates TCP, in-network computing protocol SHARP, and RDMA-based protocol GLEX. To maximize data transfer rates, Nezha incorporates a load balancing data allocation scheme based on cost feedback and combines exception handling to achieve reliable data transmission. Our experiments on a six-node cluster demonstrate that Nezha significantly enhances allreduce performance by 58\% to 87\% in homogeneous dual-rail configurations and offers considerable acceleration in heterogeneous settings, contingent on the performance variance among networks. △ Less

Submitted 28 May, 2024; originally announced May 2024.

Comments: Submitted to SC'2024

arXiv:2405.06948 [pdf, other]

Training-free Subject-Enhanced Attention Guidance for Compositional Text-to-image Generation

Authors: Shengyuan Liu, Bo Wang, Ye Ma, Te Yang, Xipeng Cao, Quan Chen, Han Li, Di Dong, Peng Jiang

Abstract: Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. For generating compositional subjects, it often encounters problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these… ▽ More Existing subject-driven text-to-image generation models suffer from tedious fine-tuning steps and struggle to maintain both text-image alignment and subject fidelity. For generating compositional subjects, it often encounters problems such as object missing and attribute mixing, where some subjects in the input prompt are not generated or their attributes are incorrectly combined. To address these limitations, we propose a subject-driven generation framework and introduce training-free guidance to intervene in the generative process during inference time. This approach strengthens the attention map, allowing for precise attribute binding and feature injection for each subject. Notably, our method exhibits exceptional zero-shot generation ability, especially in the challenging task of compositional generation. Furthermore, we propose a novel metric GroundingScore to evaluate subject alignment thoroughly. The obtained quantitative results serve as compelling evidence showcasing the effectiveness of our proposed method. The code will be released soon. △ Less

Submitted 11 May, 2024; originally announced May 2024.

Comments: 26 pages, 13 figures

arXiv:2404.13391 [pdf, other]

Online Planning of Power Flows for Power Systems Against Bushfires Using Spatial Context

Authors: Jianyu Xu, Qiuzhuang Sun, Yang Yang, Huadong Mo, Daoyi Dong

Abstract: The 2019-20 Australia bushfire incurred numerous economic losses and significantly affected the operations of power systems. A power station or transmission line can be significantly affected due to bushfires, leading to an increase in operational costs. We study a fundamental but challenging problem of planning the optimal power flow (OPF) for power systems subject to bushfires. Considering the s… ▽ More The 2019-20 Australia bushfire incurred numerous economic losses and significantly affected the operations of power systems. A power station or transmission line can be significantly affected due to bushfires, leading to an increase in operational costs. We study a fundamental but challenging problem of planning the optimal power flow (OPF) for power systems subject to bushfires. Considering the stochastic nature of bushfire spread, we develop a model to capture such dynamics based on Moore's neighborhood model. Under a periodic inspection scheme that reveals the in-situ bushfire status, we propose an online optimization modeling framework that sequentially plans the power flows in the electricity network. Our framework assumes that the spread of bushfires is non-stationary over time, and the spread and containment probabilities are unknown. To meet these challenges, we develop a contextual online learning algorithm that treats the in-situ geographical information of the bushfire as a 'spatial context'. The online learning algorithm learns the unknown probabilities sequentially based on the observed data and then makes the OPF decision accordingly. The sequential OPF decisions aim to minimize the regret function, which is defined as the cumulative loss against the clairvoyant strategy that knows the true model parameters. We provide a theoretical guarantee of our algorithm by deriving a bound on the regret function, which outperforms the regret bound achieved by other benchmark algorithms. Our model assumptions are verified by the real bushfire data from NSW, Australia, and we apply our model to two power systems to illustrate its applicability. △ Less

Submitted 20 April, 2024; originally announced April 2024.

arXiv:2403.15750 [pdf, other]

iDAT: inverse Distillation Adapter-Tuning

Authors: Jiacheng Ruan, **gsheng Gao, Mingye Xie, Daize Dong, Suncheng Xiang, Ting Liu, Yuzhuo Fu

Abstract: Adapter-Tuning (AT) method involves freezing a pre-trained model and introducing trainable adapter modules to acquire downstream knowledge, thereby calibrating the model for better adaptation to downstream tasks. This paper proposes a distillation framework for the AT method instead of crafting a carefully designed adapter module, which aims to improve fine-tuning performance. For the first time,… ▽ More Adapter-Tuning (AT) method involves freezing a pre-trained model and introducing trainable adapter modules to acquire downstream knowledge, thereby calibrating the model for better adaptation to downstream tasks. This paper proposes a distillation framework for the AT method instead of crafting a carefully designed adapter module, which aims to improve fine-tuning performance. For the first time, we explore the possibility of combining the AT method with knowledge distillation. Via statistical analysis, we observe significant differences in the knowledge acquisition between adapter modules of different models. Leveraging these differences, we propose a simple yet effective framework called inverse Distillation Adapter-Tuning (iDAT). Specifically, we designate the smaller model as the teacher and the larger model as the student. The two are jointly trained, and online knowledge distillation is applied to inject knowledge of different perspective to student model, and significantly enhance the fine-tuning performance on downstream tasks. Extensive experiments on the VTAB-1K benchmark with 19 image classification tasks demonstrate the effectiveness of iDAT. The results show that using existing AT method within our iDAT framework can further yield a 2.66% performance gain, with only an additional 0.07M trainable parameters. Our approach compares favorably with state-of-the-arts without bells and whistles. Our code is available at https://github.com/JCruan519/iDAT. △ Less

Submitted 23 March, 2024; originally announced March 2024.

Comments: 10 pages, 9 figures, 13 tables. This paper has been accepted by ICME 2024

arXiv:2403.09195 [pdf, other]

SAM-Lightening: A Lightweight Segment Anything Model with Dilated Flash Attention to Achieve 30 times Acceleration

Authors: Yanfei Song, Bangzheng Pu, Peng Wang, Hongxu Jiang, Dong Dong, Yongxiang Cao, Yiqing Shen

Abstract: Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to their zero-shot generalization ability. However, a broader application of SAMs to real-world practice has been restricted by their low inference speed and high computational memory demands, which mainly stem from the attention mechanism. Existing work concentrated on optimizing the encoder, yet has not ade… ▽ More Segment Anything Model (SAM) has garnered significant attention in segmentation tasks due to their zero-shot generalization ability. However, a broader application of SAMs to real-world practice has been restricted by their low inference speed and high computational memory demands, which mainly stem from the attention mechanism. Existing work concentrated on optimizing the encoder, yet has not adequately addressed the inefficiency of the attention mechanism itself, even when distilled to a smaller model, which thus leaves space for further improvement. In response, we introduce SAM-Lightening, a variant of SAM, that features a re-engineered attention mechanism, termed Dilated Flash Attention. It not only facilitates higher parallelism, enhancing processing efficiency but also retains compatibility with the existing FlashAttention. Correspondingly, we propose a progressive distillation to enable an efficient knowledge transfer from the vanilla SAM without costly training from scratch. Experiments on COCO and LVIS reveal that SAM-Lightening significantly outperforms the state-of-the-art methods in both run-time efficiency and segmentation accuracy. Specifically, it can achieve an inference speed of 7 milliseconds (ms) per image, for images of size 1024*1024 pixels, which is 30.1 times faster than the vanilla SAM and 2.1 times than the state-of-the-art. Moreover, it takes only 244MB memory, which is 3.5\% of the vanilla SAM. The code and weights are available at https://anonymous.4open.science/r/SAM-LIGHTENING-BC25/. △ Less

Submitted 17 March, 2024; v1 submitted 14 March, 2024; originally announced March 2024.

arXiv:2402.02464 [pdf, other]

A Graph is Worth $K$ Words: Euclideanizing Graph using Pure Transformer

Authors: Zhangyang Gao, Daize Dong, Cheng Tan, Jun Xia, Bozhen Hu, Stan Z. Li

Abstract: Can we model Non-Euclidean graphs as pure language or even Euclidean vectors while retaining their inherent information? The Non-Euclidean property have posed a long term challenge in graph modeling. Despite recent graph neural networks and graph transformers efforts encoding graphs as Euclidean vectors, recovering the original graph from vectors remains a challenge. In this paper, we introduce Gr… ▽ More Can we model Non-Euclidean graphs as pure language or even Euclidean vectors while retaining their inherent information? The Non-Euclidean property have posed a long term challenge in graph modeling. Despite recent graph neural networks and graph transformers efforts encoding graphs as Euclidean vectors, recovering the original graph from vectors remains a challenge. In this paper, we introduce GraphsGPT, featuring an Graph2Seq encoder that transforms Non-Euclidean graphs into learnable Graph Words in the Euclidean space, along with a GraphGPT decoder that reconstructs the original graph from Graph Words to ensure information equivalence. We pretrain GraphsGPT on $100$M molecules and yield some interesting findings: (1) The pretrained Graph2Seq excels in graph representation learning, achieving state-of-the-art results on $8/9$ graph classification and regression tasks. (2) The pretrained GraphGPT serves as a strong graph generator, demonstrated by its strong ability to perform both few-shot and conditional graph generation. (3) Graph2Seq+GraphGPT enables effective graph mixup in the Euclidean space, overcoming previously known Non-Euclidean challenges. (4) The edge-centric pretraining framework GraphsGPT demonstrates its efficacy in graph domain tasks, excelling in both representation and generation. Code is available at \href{https://github.com/A4Bio/GraphsGPT}{GitHub}. △ Less

Submitted 29 May, 2024; v1 submitted 4 February, 2024; originally announced February 2024.

arXiv:2401.11724 [pdf, other]

Augmenting Prototype Network with TransMix for Few-shot Hyperspectral Image Classification

Authors: Chun Liu, Longwei Yang, Dongmei Dong, Zheng Li, Wei Yang, Zhigang Han, Jiayao Wang

Abstract: Few-shot hyperspectral image classification aims to identify the classes of each pixel in the images by only marking few of these pixels. And in order to obtain the spatial-spectral joint features of each pixel, the fixed-size patches centering around each pixel are often used for classification. However, observing the classification results of existing methods, we found that boundary patches corr… ▽ More Few-shot hyperspectral image classification aims to identify the classes of each pixel in the images by only marking few of these pixels. And in order to obtain the spatial-spectral joint features of each pixel, the fixed-size patches centering around each pixel are often used for classification. However, observing the classification results of existing methods, we found that boundary patches corresponding to the pixels which are located at the boundary of the objects in the hyperspectral images, are hard to classify. These boundary patchs are mixed with multi-class spectral information. Inspired by this, we propose to augment the prototype network with TransMix for few-shot hyperspectrial image classification(APNT). While taking the prototype network as the backbone, it adopts the transformer as feature extractor to learn the pixel-to-pixel relation and pay different attentions to different pixels. At the same time, instead of directly using the patches which are cut from the hyperspectral images for training, it randomly mixs up two patches to imitate the boundary patches and uses the synthetic patches to train the model, with the aim to enlarge the number of hard training samples and enhance their diversity. And by following the data agumentation technique TransMix, the attention returned by the transformer is also used to mix up the labels of two patches to generate better labels for synthetic patches. Compared with existing methods, the proposed method has demonstrated sate of the art performance and better robustness for few-shot hyperspectral image classification in our experiments. △ Less

Submitted 22 January, 2024; originally announced January 2024.

arXiv:2401.02708 [pdf, other]

TripleSurv: Triplet Time-adaptive Coordinate Loss for Survival Analysis

Authors: Liwen Zhang, Lianzhen Zhong, Fan Yang, Di Dong, Hui Hui, Jie Tian

Abstract: A core challenge in survival analysis is to model the distribution of censored time-to-event data, where the event of interest may be a death, failure, or occurrence of a specific event. Previous studies have showed that ranking and maximum likelihood estimation (MLE)loss functions are widely-used for survival analysis. However, ranking loss only focus on the ranking of survival time and does not… ▽ More A core challenge in survival analysis is to model the distribution of censored time-to-event data, where the event of interest may be a death, failure, or occurrence of a specific event. Previous studies have showed that ranking and maximum likelihood estimation (MLE)loss functions are widely-used for survival analysis. However, ranking loss only focus on the ranking of survival time and does not consider potential effect of samples for exact survival time values. Furthermore, the MLE is unbounded and easily subject to outliers (e.g., censored data), which may cause poor performance of modeling. To handle the complexities of learning process and exploit valuable survival time values, we propose a time-adaptive coordinate loss function, TripleSurv, to achieve adaptive adjustments by introducing the differences in the survival time between sample pairs into the ranking, which can encourage the model to quantitatively rank relative risk of pairs, ultimately enhancing the accuracy of predictions. Most importantly, the TripleSurv is proficient in quantifying the relative risk between samples by ranking ordering of pairs, and consider the time interval as a trade-off to calibrate the robustness of model over sample distribution. Our TripleSurv is evaluated on three real-world survival datasets and a public synthetic dataset. The results show that our method outperforms the state-of-the-art methods and exhibits good model performance and robustness on modeling various sophisticated data distributions with different censor rates. Our code will be available upon acceptance. △ Less

Submitted 5 January, 2024; originally announced January 2024.

Comments: 9 pages,6 figures

arXiv:2401.01571 [pdf, other]

CodeFuse-Query: A Data-Centric Static Code Analysis System for Large-Scale Organizations

Authors: Xiaoheng Xie, Gang Fan, Xiaojun Lin, Ang Zhou, Shijie Li, Xun** Zheng, Yinan Liang, Yu Zhang, Na Yu, Haokun Li, Xinyu Chen, Yingzhuang Chen, Yi Zhen, Dejun Dong, Xian** Fu, **zhou Su, Fuxiong Pan, Pengshuai Luo, Youzheng Feng, Ruoxiang Hu, **g Fan, **guo Zhou, Xiao Xiao, Peng Di

Abstract: In the domain of large-scale software development, the demands for dynamic and multifaceted static code analysis exceed the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis through the fusion of Domain Optimized System Design and Logic Oriented Computation Design. CodeFuse-Query reimagines code analysis as a data compu… ▽ More In the domain of large-scale software development, the demands for dynamic and multifaceted static code analysis exceed the capabilities of traditional tools. To bridge this gap, we present CodeFuse-Query, a system that redefines static code analysis through the fusion of Domain Optimized System Design and Logic Oriented Computation Design. CodeFuse-Query reimagines code analysis as a data computation task, support scanning over 10 billion lines of code daily and more than 300 different tasks. It optimizes resource utilization, prioritizes data reusability, applies incremental code extraction, and introduces tasks types specially for Code Change, underscoring its domain-optimized design. The system's logic-oriented facet employs Datalog, utilizing a unique two-tiered schema, COREF, to convert source code into data facts. Through Godel, a distinctive language, CodeFuse-Query enables formulation of complex tasks as logical expressions, harnessing Datalog's declarative prowess. This paper provides empirical evidence of CodeFuse-Query's transformative approach, demonstrating its robustness, scalability, and efficiency. We also highlight its real-world impact and diverse applications, emphasizing its potential to reshape the landscape of static code analysis in the context of large-scale software development.Furthermore, in the spirit of collaboration and advancing the field, our project is open-sourced and the repository is available for public access △ Less

Submitted 3 January, 2024; originally announced January 2024.

arXiv:2311.07766 [pdf, other]

Vision-Language Integration in Multimodal Video Transformers (Partially) Aligns with the Brain

Authors: Dota Tianai Dong, Mariya Toneva

Abstract: Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from modalities stil… ▽ More Integrating information from multiple modalities is arguably one of the essential prerequisites for grounding artificial intelligence systems with an understanding of the real world. Recent advances in video transformers that jointly learn from vision, text, and sound over time have made some progress toward this goal, but the degree to which these models integrate information from modalities still remains unclear. In this work, we present a promising approach for probing a pre-trained multimodal video transformer model by leveraging neuroscientific evidence of multimodal information processing in the brain. Using brain recordings of participants watching a popular TV show, we analyze the effects of multi-modal connections and interactions in a pre-trained multi-modal video transformer on the alignment with uni- and multi-modal brain regions. We find evidence that vision enhances masked prediction performance during language processing, providing support that cross-modal representations in models can benefit individual modalities. However, we don't find evidence of brain-relevant information captured by the joint multi-modal transformer representations beyond that captured by all of the individual modalities. We finally show that the brain alignment of the pre-trained joint representation can be improved by fine-tuning using a task that requires vision-language inferences. Overall, our results paint an optimistic picture of the ability of multi-modal transformers to integrate vision and language in partially brain-relevant ways but also show that improving the brain alignment of these models may require new approaches. △ Less

Submitted 13 November, 2023; originally announced November 2023.

arXiv:2310.15204 [pdf]

Mid-Long Term Daily Electricity Consumption Forecasting Based on Piecewise Linear Regression and Dilated Causal CNN

Authors: Zhou Lan, Ben Liu, Yi Feng, Danhuang Dong, Peng Zhang

Abstract: Daily electricity consumption forecasting is a classical problem. Existing forecasting algorithms tend to have decreased accuracy on special dates like holidays. This study decomposes the daily electricity consumption series into three components: trend, seasonal, and residual, and constructs a two-stage prediction method using piecewise linear regression as a filter and Dilated Causal CNN as a pr… ▽ More Daily electricity consumption forecasting is a classical problem. Existing forecasting algorithms tend to have decreased accuracy on special dates like holidays. This study decomposes the daily electricity consumption series into three components: trend, seasonal, and residual, and constructs a two-stage prediction method using piecewise linear regression as a filter and Dilated Causal CNN as a predictor. The specific steps involve setting breakpoints on the time axis and fitting the piecewise linear regression model with one-hot encoded information such as month, weekday, and holidays. For the challenging prediction of the Spring Festival, distance is introduced as a variable using a third-degree polynomial form in the model. The residual sequence obtained in the previous step is modeled using Dilated Causal CNN, and the final prediction of daily electricity consumption is the sum of the two-stage predictions. Experimental results demonstrate that this method achieves higher accuracy compared to existing approaches. △ Less

Submitted 23 October, 2023; originally announced October 2023.

Comments: Key words: Daily electricity consumption forecasting; time series decomposition; piecewise linear regression; Dilated Causal CNN

arXiv:2310.00518 [pdf, other]

Learning Informative Latent Representation for Quantum State Tomography

Authors: Hailan Ma, Zhenhong Sun, Daoyi Dong, Dong Gong

Abstract: Quantum state tomography (QST) is the process of reconstructing the complete state of a quantum system (mathematically described as a density matrix) through a series of different measurements. These measurements are performed on a number of identical copies of the quantum system, with outcomes gathered as frequencies. QST aims to recover the density matrix and the corresponding properties of the… ▽ More Quantum state tomography (QST) is the process of reconstructing the complete state of a quantum system (mathematically described as a density matrix) through a series of different measurements. These measurements are performed on a number of identical copies of the quantum system, with outcomes gathered as frequencies. QST aims to recover the density matrix and the corresponding properties of the quantum state from the measured frequencies. Although an informationally complete set of measurements can specify quantum state accurately in an ideal scenario with a large number of identical copies, both measurements and identical copies are restricted and imperfect in practical scenarios, making QST highly ill-posed. The conventional QST methods usually assume adequate or accurate measured frequencies or rely on manually designed regularizers to handle the ill-posed reconstruction problem, suffering from limited applications in realistic scenarios. Recent advances in deep neural networks (DNNs) led to the emergence of deep learning (DL) in QST. However, existing DL-based QST approaches often employ generic DNN models that are not optimized for imperfect conditions of QST. In this paper, we propose a transformer-based autoencoder architecture tailored for QST with imperfect measurement data. Our method leverages a transformer-based encoder to extract an informative latent representation (ILR) from imperfect measurement data and employs a decoder to predict the quantum states based on the ILR. We anticipate that the high-dimensional ILR will capture more comprehensive information about quantum states. To achieve this, we conduct pre-training of the encoder using a pretext task that involves reconstructing high-quality frequencies from measured frequencies. Extensive simulations and experiments demonstrate the remarkable ability of the ILR in dealing with imperfect measurement data in QST. △ Less

Submitted 30 September, 2023; originally announced October 2023.

arXiv:2308.07622 [pdf, other]

EMID: An Emotional Aligned Dataset in Audio-Visual Modality

Authors: Jialing Zou, Jiahao Mei, Guangze Ye, Tianyu Huai, Qiwei Shen, Daoguo Dong

Abstract: In this paper, we propose Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that primarily focus on semantic correlations or roughly divided emotional relations, EMID emphasizes the significance of emotional consisten… ▽ More In this paper, we propose Emotionally paired Music and Image Dataset (EMID), a novel dataset designed for the emotional matching of music and images, to facilitate auditory-visual cross-modal tasks such as generation and retrieval. Unlike existing approaches that primarily focus on semantic correlations or roughly divided emotional relations, EMID emphasizes the significance of emotional consistency between music and images using an advanced 13-dimension emotional model. By incorporating emotional alignment into the dataset, it aims to establish pairs that closely align with human perceptual understanding, thereby raising the performance of auditory-visual cross-modal tasks. We also design a supplemental module named EMI-Adapter to optimize existing cross-modal alignment methods. To validate the effectiveness of the EMID, we conduct a psychological experiment, which has demonstrated that considering the emotional relationship between the two modalities effectively improves the accuracy of matching in abstract perspective. This research lays the foundation for future cross-modal research in domains such as psychotherapy and contributes to advancing the understanding and utilization of emotions in cross-modal alignment. The EMID dataset is available at https://github.com/ecnu-aigc/EMID. △ Less

Submitted 15 August, 2023; originally announced August 2023.

arXiv:2306.03387 [pdf, other]

ColdNAS: Search to Modulate for User Cold-Start Recommendation

Authors: Shiguang Wu, Yaqing Wang, Qinghe **g, Daxiang Dong, De**g Dou, Quanming Yao

Abstract: Making personalized recommendation for cold-start users, who only have a few interaction histories, is a challenging problem in recommendation systems. Recent works leverage hypernetworks to directly map user interaction histories to user-specific parameters, which are then used to modulate predictor by feature-wise linear modulation function. These works obtain the state-of-the-art performance. H… ▽ More Making personalized recommendation for cold-start users, who only have a few interaction histories, is a challenging problem in recommendation systems. Recent works leverage hypernetworks to directly map user interaction histories to user-specific parameters, which are then used to modulate predictor by feature-wise linear modulation function. These works obtain the state-of-the-art performance. However, the physical meaning of scaling and shifting in recommendation data is unclear. Instead of using a fixed modulation function and deciding modulation position by expertise, we propose a modulation framework called ColdNAS for user cold-start problem, where we look for proper modulation structure, including function and position, via neural architecture search. We design a search space which covers broad models and theoretically prove that this search space can be transformed to a much smaller space, enabling an efficient and robust one-shot search algorithm. Extensive experimental results on benchmark datasets show that ColdNAS consistently performs the best. We observe that different modulation functions lead to the best performance on different datasets, which validates the necessity of designing a searching-based method. △ Less

Submitted 6 June, 2023; originally announced June 2023.

arXiv:2305.13850 [pdf, other]

Global Structure Knowledge-Guided Relation Extraction Method for Visually-Rich Document

Authors: Xiangnan Chen, Qian Xiao, Juncheng Li, Duo Dong, Jun Lin, Xiaozhong Liu, Siliang Tang

Abstract: Visual Relation Extraction (VRE) is a powerful means of discovering relationships between entities within visually-rich documents. Existing methods often focus on manipulating entity features to find pairwise relations, yet neglect the more fundamental structural information that links disparate entity pairs together. The absence of global structure information may make the model struggle to learn… ▽ More Visual Relation Extraction (VRE) is a powerful means of discovering relationships between entities within visually-rich documents. Existing methods often focus on manipulating entity features to find pairwise relations, yet neglect the more fundamental structural information that links disparate entity pairs together. The absence of global structure information may make the model struggle to learn long-range relations and easily predict conflicted results. To alleviate such limitations, we propose a GlObal Structure knowledge-guided relation Extraction (GOSE) framework. GOSE initiates by generating preliminary relation predictions on entity pairs extracted from a scanned image of the document. Subsequently, global structural knowledge is captured from the preceding iterative predictions, which are then incorporated into the representations of the entities. This "generate-capture-incorporate" cycle is repeated multiple times, allowing entity representations and global structure knowledge to be mutually reinforced. Extensive experiments validate that GOSE not only outperforms existing methods in the standard fine-tuning setting but also reveals superior cross-lingual learning capabilities; indeed, even yields stronger data-efficient performance in the low-resource setting. The code for GOSE will be available at https://github.com/chenxn2020/GOSE. △ Less

Submitted 27 October, 2023; v1 submitted 23 May, 2023; originally announced May 2023.

Comments: Accepted by EMNLP 2023 (Findings)

arXiv:2305.11643 [pdf, other]

Time Optimal Ergodic Search

Authors: Dayi Dong, Henry Berger, Ian Abraham

Abstract: Robots with the ability to balance time against the thoroughness of search have the potential to provide time-critical assistance in applications such as search and rescue. Current advances in ergodic coverage-based search methods have enabled robots to completely explore and search an area in a fixed amount of time. However, optimizing time against the quality of autonomous ergodic search has yet… ▽ More Robots with the ability to balance time against the thoroughness of search have the potential to provide time-critical assistance in applications such as search and rescue. Current advances in ergodic coverage-based search methods have enabled robots to completely explore and search an area in a fixed amount of time. However, optimizing time against the quality of autonomous ergodic search has yet to be demonstrated. In this paper, we investigate solutions to the time-optimal ergodic search problem for fast and adaptive robotic search and exploration. We pose the problem as a minimum time problem with an ergodic inequality constraint whose upper bound regulates and balances the granularity of search against time. Solutions to the problem are presented analytically using Pontryagin's conditions of optimality and demonstrated numerically through a direct transcription optimization approach. We show the efficacy of the approach in generating time-optimal ergodic search trajectories in simulation and with drone experiments in a cluttered environment. Obstacle avoidance is shown to be readily integrated into our formulation, and we perform ablation studies that investigate parameter dependence on optimized time and trajectory sensitivity for search. △ Less

Submitted 19 May, 2023; originally announced May 2023.

Comments: 13 pages, 8 figures, Robotics: Science and Systems

arXiv:2305.05433 [pdf, other]

Tomography of Quantum States from Structured Measurements via quantum-aware transformer

Authors: Hailan Ma, Zhenhong Sun, Daoyi Dong, Chunlin Chen, Herschel Rabitz

Abstract: Quantum state tomography (QST) is the process of reconstructing the state of a quantum system (mathematically described as a density matrix) through a series of different measurements, which can be solved by learning a parameterized function to translate experimentally measured statistics into physical density matrices. However, the specific structure of quantum measurements for characterizing a q… ▽ More Quantum state tomography (QST) is the process of reconstructing the state of a quantum system (mathematically described as a density matrix) through a series of different measurements, which can be solved by learning a parameterized function to translate experimentally measured statistics into physical density matrices. However, the specific structure of quantum measurements for characterizing a quantum state has been neglected in previous work. In this paper, we explore the similarity between highly structured sentences in natural language and intrinsically structured measurements in QST. To fully leverage the intrinsic quantum characteristics involved in QST, we design a quantum-aware transformer (QAT) model to capture the complex relationship between measured frequencies and density matrices. In particular, we query quantum operators in the architecture to facilitate informative representations of quantum data and integrate the Bures distance into the loss function to evaluate quantum state fidelity, thereby enabling the reconstruction of quantum states from measured data with high fidelity. Extensive simulations and experiments (on IBM quantum computers) demonstrate the superiority of the QAT in reconstructing quantum states with favorable robustness against experimental noise. △ Less

Submitted 17 November, 2023; v1 submitted 9 May, 2023; originally announced May 2023.

arXiv:2304.11384 [pdf, other]

Large Language Models are Few-Shot Summarizers: Multi-Intent Comment Generation via In-Context Learning

Authors: Mingyang Geng, Shangwen Wang, Dezun Dong, Haotian Wang, Ge Li, Zhi **, Xiaoguang Mao, Xiangke Liao

Abstract: Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite being studied for a long time, a bottleneck for existing approaches is that given a code snippet, they can only generate one comment while developers usually need to know information from diverse perspectives such as what is the functionali… ▽ More Code comment generation aims at generating natural language descriptions for a code snippet to facilitate developers' program comprehension activities. Despite being studied for a long time, a bottleneck for existing approaches is that given a code snippet, they can only generate one comment while developers usually need to know information from diverse perspectives such as what is the functionality of this code snippet and how to use it. To tackle this limitation, this study empirically investigates the feasibility of utilizing large language models (LLMs) to generate comments that can fulfill developers' diverse intents. Our intuition is based on the facts that (1) the code and its pairwise comment are used during the pre-training process of LLMs to build the semantic connection between the natural language and programming language, and (2) comments in the real-world projects, which are collected for the pre-training, usually contain different developers' intents. We thus postulate that the LLMs can already understand the code from different perspectives after the pre-training. Indeed, experiments on two large-scale datasets demonstrate the rationale of our insights: by adopting the in-context learning paradigm and giving adequate prompts to the LLM (e.g., providing it with ten or more examples), the LLM can significantly outperform a state-of-the-art supervised learning approach on generating comments with multiple intents. Results also show that customized strategies for constructing the prompts and post-processing strategies for reranking the results can both boost the LLM's performances, which shed light on future research directions for using LLMs to achieve comment generation. △ Less

Submitted 14 June, 2023; v1 submitted 22 April, 2023; originally announced April 2023.

Comments: Accepted by the 46th International Conference on Software Engineering (ICSE 2024)

arXiv:2302.14312 [pdf, other]

Auxiliary Task-based Deep Reinforcement Learning for Quantum Control

Authors: Shumin Zhou, Hailan Ma, Sen Kuang, Daoyi Dong

Abstract: Due to its property of not requiring prior knowledge of the environment, reinforcement learning has significant potential for quantum control problems. In this work, we investigate the effectiveness of continuous control policies based on deep deterministic policy gradient. To solve the sparse reward signal in quantum learning control problems, we propose an auxiliary task-based deep reinforcement… ▽ More Due to its property of not requiring prior knowledge of the environment, reinforcement learning has significant potential for quantum control problems. In this work, we investigate the effectiveness of continuous control policies based on deep deterministic policy gradient. To solve the sparse reward signal in quantum learning control problems, we propose an auxiliary task-based deep reinforcement learning (AT-DRL) for quantum control. In particular, we first design a guided reward function based on the fidelity of quantum states that enables incremental fidelity improvement. Then, we introduce the concept of an auxiliary task whose network shares parameters with the main network to predict the reward provided by the environment (called the main task). The auxiliary task learns synchronously with the main task, allowing one to select the most relevant features of the environment, thus aiding the agent in comprehending how to achieve the desired state. The numerical simulations demonstrate that the proposed AT-DRL can provide a solution to the sparse reward in quantum systems, and has great potential in designing control pulses that achieve efficient quantum state preparation. △ Less

Submitted 28 February, 2023; originally announced February 2023.

Comments: 13 pages, 11 figures

arXiv:2302.00282 [pdf, other]

Xenos: Dataflow-Centric Optimization to Accelerate Model Inference on Edge Devices

Authors: Zhang Runhua, Jiang Hongxu, Tian Fangzheng, Geng **kun, Li Xiaobin, Ma Yuhang, Zhu Chenhui, Dong Dong, Li Xin, Wang Haojie

Abstract: Edge computing has been emerging as a popular scenario for model inference. However, the inference performance on edge devices (e.g., Multi-Core DSP, FGPA, etc.) suffers from inefficiency due to the lack of highly optimized inference frameworks. Previous model inference frameworks are mainly developed in an operator-centric way, which provides insufficient acceleration to edge-based inference. Bes… ▽ More Edge computing has been emerging as a popular scenario for model inference. However, the inference performance on edge devices (e.g., Multi-Core DSP, FGPA, etc.) suffers from inefficiency due to the lack of highly optimized inference frameworks. Previous model inference frameworks are mainly developed in an operator-centric way, which provides insufficient acceleration to edge-based inference. Besides, the operator-centric framework incurs significant costs for continuous development and maintenance. In this paper, we propose Xenos, which can automatically conduct dataflow-centric optimization of the computation graph and accelerate inference in two dimensions. Vertically, Xenos develops operator linking technique to improve data locality by restructuring the inter-operator dataflow. Horizontally, Xenos develops DSP-aware operator split technique to enable higher parallelism across multiple DSP units. Our evaluation proves the effectiveness of vertical and horizontal dataflow optimization, which reduce the inference time by 21.2\%--84.9\% and 17.9\%--96.2\% , respectively. Besides, Xenos also outperforms the widely-used TVM by 3.22$\times$--17.92$\times$. Moreover, we extend Xenos to a distributed solution, which we call d-Xenos. d-Xenos employs multiple edge devices to jointly conduct the inference task and achieves a speedup of 3.68x--3.78x compared with the single device. △ Less

Submitted 1 February, 2023; originally announced February 2023.

Comments: The preliminary version is accepted by the 28th International Conference on Database Systems for Advanced Applications (DASFAA-2023)

arXiv:2211.05528 [pdf, other]

PAD-Net: An Efficient Framework for Dynamic Networks

Authors: Shwai He, Liang Ding, Daize Dong, Boan Liu, Fuqiang Yu, Dacheng Tao

Abstract: Dynamic networks, e.g., Dynamic Convolution (DY-Conv) and the Mixture of Experts (MoE), have been extensively explored as they can considerably improve the model's representation power with acceptable computational cost. The common practice in implementing dynamic networks is to convert the given static layers into fully dynamic ones where all parameters are dynamic (at least within a single layer… ▽ More Dynamic networks, e.g., Dynamic Convolution (DY-Conv) and the Mixture of Experts (MoE), have been extensively explored as they can considerably improve the model's representation power with acceptable computational cost. The common practice in implementing dynamic networks is to convert the given static layers into fully dynamic ones where all parameters are dynamic (at least within a single layer) and vary with the input. However, such a fully dynamic setting may cause redundant parameters and high deployment costs, limiting the applicability of dynamic networks to a broader range of tasks and models. The main contributions of our work are challenging the basic commonsense in dynamic networks and proposing a partially dynamic network, namely PAD-Net, to transform the redundant dynamic parameters into static ones. Also, we further design Iterative Mode Partition to partition dynamic and static parameters efficiently. Our method is comprehensively supported by large-scale experiments with two typical advanced dynamic architectures, i.e., DY-Conv and MoE, on both image classification and GLUE benchmarks. Encouragingly, we surpass the fully dynamic networks by $+0.7\%$ top-1 acc with only $30\%$ dynamic parameters for ResNet-50 and $+1.9\%$ average score in language understanding with only $50\%$ dynamic parameters for BERT. Code will be released at: \url{https://github.com/Shwai-He/PAD-Net}. △ Less

Submitted 31 May, 2023; v1 submitted 10 November, 2022; originally announced November 2022.

Comments: Proceedings of ACL 2023

arXiv:2211.04310 [pdf, other]

Safety-Critical Ergodic Exploration in Cluttered Environments via Control Barrier Functions

Authors: Cameron Lerch, Dayi Dong, Ian Abraham

Abstract: In this paper, we address the problem of safe trajectory planning for autonomous search and exploration in constrained, cluttered environments. Guaranteeing safe (collision-free) trajectories is a challenging problem that has garnered significant due to its importance in the successful utilization of robots in search and exploration tasks. This work contributes a method that generates guaranteed s… ▽ More In this paper, we address the problem of safe trajectory planning for autonomous search and exploration in constrained, cluttered environments. Guaranteeing safe (collision-free) trajectories is a challenging problem that has garnered significant due to its importance in the successful utilization of robots in search and exploration tasks. This work contributes a method that generates guaranteed safety-critical search trajectories in a cluttered environment. Our approach integrates safety-critical constraints using discrete control barrier functions (DCBFs) with ergodic trajectory optimization to enable safe exploration. Ergodic trajectory optimization plans continuous exploratory trajectories that guarantee complete coverage of a space. We demonstrate through simulated and experimental results on a drone that our approach is able to generate trajectories that enable safe and effective exploration. Furthermore, we show the efficacy of our approach for safe exploration using real-world single- and multi- drone platforms. △ Less

Submitted 29 April, 2023; v1 submitted 8 November, 2022; originally announced November 2022.

arXiv:2210.16640 [pdf]

doi 10.1109/JBHI.2020.3002805

2D and 3D CT Radiomic Features Performance Comparison in Characterization of Gastric Cancer: A Multi-center Study

Authors: Lingwei Meng, Di Dong, Xin Chen, Mengjie Fang, Rongpin Wang, **g Li, Zaiyi Liu, Jie Tian

Abstract: Objective: Radiomics, an emerging tool for medical image analysis, is potential towards precisely characterizing gastric cancer (GC). Whether using one-slice 2D annotation or whole-volume 3D annotation remains a long-time debate, especially for heterogeneous GC. We comprehensively compared 2D and 3D radiomic features' representation and discrimination capacity regarding GC, via three tasks. Meth… ▽ More Objective: Radiomics, an emerging tool for medical image analysis, is potential towards precisely characterizing gastric cancer (GC). Whether using one-slice 2D annotation or whole-volume 3D annotation remains a long-time debate, especially for heterogeneous GC. We comprehensively compared 2D and 3D radiomic features' representation and discrimination capacity regarding GC, via three tasks. Methods: Four-center 539 GC patients were retrospectively enrolled and divided into the training and validation cohorts. From 2D or 3D regions of interest (ROIs) annotated by radiologists, radiomic features were extracted respectively. Feature selection and model construction procedures were customed for each combination of two modalities (2D or 3D) and three tasks. Subsequently, six machine learning models (Model_2D^LNM, Model_3D^LNM; Model_2D^LVI, Model_3D^LVI; Model_2D^pT, Model_3D^pT) were derived and evaluated to reflect modalities' performances in characterizing GC. Furthermore, we performed an auxiliary experiment to assess modalities' performances when resampling spacing is different. Results: Regarding three tasks, the yielded areas under the curve (AUCs) were: Model_2D^LNM's 0.712 (95% confidence interval, 0.613-0.811), Model_3D^LNM's 0.680 (0.584-0.775); Model_2D^LVI's 0.677 (0.595-0.761), Model_3D^LVI's 0.615 (0.528-0.703); Model_2D^pT's 0.840 (0.779-0.901), Model_3D^pT's 0.813 (0.747-0.879). Moreover, the auxiliary experiment indicated that Models_2D are statistically more advantageous than Models3D with different resampling spacings. Conclusion: Models constructed with 2D radiomic features revealed comparable performances with those constructed with 3D features in characterizing GC. Significance: Our work indicated that time-saving 2D annotation would be the better choice in GC, and provided a related reference to further radiomics-based researches. △ Less

Submitted 29 October, 2022; originally announced October 2022.

Comments: Published in IEEE Journal of Biomedical and Health Informatics

Journal ref: IEEE.J.Biomed.Health.Inf. 25 (2021) 755-763

arXiv:2210.04284 [pdf, other]

SparseAdapter: An Easy Approach for Improving the Parameter-Efficiency of Adapters

Authors: Shwai He, Liang Ding, Daize Dong, Miao Zhang, Dacheng Tao

Abstract: Adapter Tuning, which freezes the pretrained language models (PLMs) and only fine-tunes a few extra modules, becomes an appealing efficient alternative to the full model fine-tuning. Although computationally efficient, the recent Adapters often increase parameters (e.g. bottleneck dimension) for matching the performance of full model fine-tuning, which we argue goes against their original intentio… ▽ More Adapter Tuning, which freezes the pretrained language models (PLMs) and only fine-tunes a few extra modules, becomes an appealing efficient alternative to the full model fine-tuning. Although computationally efficient, the recent Adapters often increase parameters (e.g. bottleneck dimension) for matching the performance of full model fine-tuning, which we argue goes against their original intention. In this work, we re-examine the parameter-efficiency of Adapters through the lens of network pruning (we name such plug-in concept as \texttt{SparseAdapter}) and find that SparseAdapter can achieve comparable or better performance than standard Adapters when the sparse ratio reaches up to 80\%. Based on our findings, we introduce an easy but effective setting ``\textit{Large-Sparse}'' to improve the model capacity of Adapters under the same parameter budget. Experiments on five competitive Adapters upon three advanced PLMs show that with proper sparse method (e.g. SNIP) and ratio (e.g. 40\%) SparseAdapter can consistently outperform their corresponding counterpart. Encouragingly, with the \textit{Large-Sparse} setting, we can obtain further appealing gains, even outperforming the full fine-tuning by a large margin. Our code will be released at: https://github.com/Shwai-He/SparseAdapter. △ Less

Submitted 10 November, 2022; v1 submitted 9 October, 2022; originally announced October 2022.

Comments: Findings of EMNLP 2022

arXiv:2209.04894 [pdf, ps, other]

Nearly all $k$-SAT functions are unate

Authors: József Balogh, Dingding Dong, Bernard Lidický, Nitya Mani, Yufei Zhao

Abstract: We prove that $1-o(1)$ fraction of all $k$-SAT functions on $n$ Boolean variables are unate (i.e., monotone after first negating some variables), for any fixed positive integer $k$ and as $n \to \infty$. This resolves a conjecture by Bollobás, Brightwell, and Leader from 2003. We prove that $1-o(1)$ fraction of all $k$-SAT functions on $n$ Boolean variables are unate (i.e., monotone after first negating some variables), for any fixed positive integer $k$ and as $n \to \infty$. This resolves a conjecture by Bollobás, Brightwell, and Leader from 2003. △ Less

Submitted 3 October, 2023; v1 submitted 11 September, 2022; originally announced September 2022.

Comments: 43 pages. v2 merges arXiv:2107.09233 (SODA22) and arXiv:2209.04894v1 (STOC23) along with expository improvements. This combined version is intended for journal submission

MSC Class: 05A16; 05C65 ACM Class: G.2.1; G.2.2

arXiv:2207.06667 [pdf, other]

Large-scale Knowledge Distillation with Elastic Heterogeneous Computing Resources

Authors: Ji Liu, Daxiang Dong, Xi Wang, An Qin, Xingjian Li, Patrick Valduriez, De**g Dou, Dianhai Yu

Abstract: Although more layers and more parameters generally improve the accuracy of the models, such big models generally have high computational complexity and require big memory, which exceed the capacity of small devices for inference and incurs long training time. In addition, it is difficult to afford long training time and inference time of big models even in high performance servers, as well. As an… ▽ More Although more layers and more parameters generally improve the accuracy of the models, such big models generally have high computational complexity and require big memory, which exceed the capacity of small devices for inference and incurs long training time. In addition, it is difficult to afford long training time and inference time of big models even in high performance servers, as well. As an efficient approach to compress a large deep model (a teacher model) to a compact model (a student model), knowledge distillation emerges as a promising approach to deal with the big models. Existing knowledge distillation methods cannot exploit the elastic available computing resources and correspond to low efficiency. In this paper, we propose an Elastic Deep Learning framework for knowledge Distillation, i.e., EDL-Dist. The advantages of EDL-Dist are three-fold. First, the inference and the training process is separated. Second, elastic available computing resources can be utilized to improve the efficiency. Third, fault-tolerance of the training and inference processes is supported. We take extensive experimentation to show that the throughput of EDL-Dist is up to 3.125 times faster than the baseline method (online knowledge distillation) while the accuracy is similar or higher. △ Less

Submitted 14 July, 2022; originally announced July 2022.

Comments: To appear in Concurrency and Computation: Practice and Experience, 16 pages, 7 figures, 5 tables

arXiv:2206.02480 [pdf, ps, other]

doi 10.1109/TIT.2024.3386821

Subspace Phase Retrieval

Authors: Mengchu Xu, Dekuan Dong, Jian Wang

Abstract: In recent years, phase retrieval has received much attention in statistics, applied mathematics and optical engineering. In this paper, we propose an efficient algorithm, termed Subspace Phase Retrieval (SPR), which can accurately recover an $n$-dimensional $k$-sparse complex-valued signal $\x$ given its $Ω(k^2\log n)$ magnitude-only Gaussian samples if the minimum nonzero entry of $\x$ satisfies… ▽ More In recent years, phase retrieval has received much attention in statistics, applied mathematics and optical engineering. In this paper, we propose an efficient algorithm, termed Subspace Phase Retrieval (SPR), which can accurately recover an $n$-dimensional $k$-sparse complex-valued signal $\x$ given its $Ω(k^2\log n)$ magnitude-only Gaussian samples if the minimum nonzero entry of $\x$ satisfies $|x_{\min}| = Ω(\|\x\|/\sqrt{k})$. Furthermore, if the energy sum of the most significant $\sqrt{k}$ elements in $\x$ is comparable to $\|\x\|^2$, the SPR algorithm can exactly recover $\x$ with $Ω(k \log n)$ magnitude-only samples, which attains the information-theoretic sampling complexity for sparse phase retrieval. Numerical Experiments demonstrate that the proposed algorithm achieves the state-of-the-art reconstruction performance compared to existing ones. △ Less

Submitted 7 April, 2024; v1 submitted 6 June, 2022; originally announced June 2022.

Comments: To appear in IEEE Transactions on Information Theory, 2024, 33 pages, 10 figures

arXiv:2205.12363 [pdf, ps, other]

On the number of error correcting codes

Authors: Dingding Dong, Nitya Mani, Yufei Zhao

Abstract: We show that for a fixed $q$, the number of $q$-ary $t$-error correcting codes of length $n$ is at most $2^{(1 + o(1)) H_q(n,t)}$ for all $t \leq (1 - q^{-1})n - C_q\sqrt{n \log n}$ (for sufficiently large constant $C_q$), where $H_q(n, t) = q^n / V_q(n,t)$ is the Hamming bound and $V_q(n,t)$ is the cardinality of the radius $t$ Hamming ball. This proves a conjecture of Balogh, Treglown, and Wagne… ▽ More We show that for a fixed $q$, the number of $q$-ary $t$-error correcting codes of length $n$ is at most $2^{(1 + o(1)) H_q(n,t)}$ for all $t \leq (1 - q^{-1})n - C_q\sqrt{n \log n}$ (for sufficiently large constant $C_q$), where $H_q(n, t) = q^n / V_q(n,t)$ is the Hamming bound and $V_q(n,t)$ is the cardinality of the radius $t$ Hamming ball. This proves a conjecture of Balogh, Treglown, and Wagner, who showed the result for $t = o(n^{1/3} (\log n)^{-2/3})$. △ Less

Submitted 24 May, 2022; originally announced May 2022.

Comments: 13 pages. Comments welcome!

MSC Class: 05C30 ACM Class: G.2.2

arXiv:2205.10787 [pdf, other]

doi 10.1109/TCYB.2022.3170485

A Dirichlet Process Mixture of Robust Task Models for Scalable Lifelong Reinforcement Learning

Authors: Zhi Wang, Chunlin Chen, Daoyi Dong

Abstract: While reinforcement learning (RL) algorithms are achieving state-of-the-art performance in various challenging tasks, they can easily encounter catastrophic forgetting or interference when faced with lifelong streaming information. In the paper, we propose a scalable lifelong RL method that dynamically expands the network capacity to accommodate new knowledge while preventing past memories from be… ▽ More While reinforcement learning (RL) algorithms are achieving state-of-the-art performance in various challenging tasks, they can easily encounter catastrophic forgetting or interference when faced with lifelong streaming information. In the paper, we propose a scalable lifelong RL method that dynamically expands the network capacity to accommodate new knowledge while preventing past memories from being perturbed. We use a Dirichlet process mixture to model the non-stationary task distribution, which captures task relatedness by estimating the likelihood of task-to-cluster assignments and clusters the task models in a latent space. We formulate the prior distribution of the mixture as a Chinese restaurant process (CRP) that instantiates new mixture components as needed. The update and expansion of the mixture are governed by the Bayesian non-parametric framework with an expectation maximization (EM) procedure, which dynamically adapts the model complexity without explicit task boundaries or heuristics. Moreover, we use the domain randomization technique to train robust prior parameters for the initialization of each task model in the mixture, thus the resulting model can better generalize and adapt to unseen tasks. With extensive experiments conducted on robot navigation and locomotion domains, we show that our method successfully facilitates scalable lifelong RL and outperforms relevant existing methods. △ Less

Submitted 22 May, 2022; originally announced May 2022.

Comments: Manuscript accepted by IEEE Transactions on Cybernetics, 2022, DOI: DOI: 10.1109/TCYB.2022.3170485

arXiv:2204.07729 [pdf, other]

doi 10.1109/TNNLS.2023.3281604

Efficient Bayesian Policy Reuse with a Scalable Observation Model in Deep Reinforcement Learning

Authors: **mei Liu, Zhi Wang, Chunlin Chen, Daoyi Dong

Abstract: Bayesian policy reuse (BPR) is a general policy transfer framework for selecting a source policy from an offline library by inferring the task belief based on some observation signals and a trained observation model. In this paper, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as… ▽ More Bayesian policy reuse (BPR) is a general policy transfer framework for selecting a source policy from an offline library by inferring the task belief based on some observation signals and a trained observation model. In this paper, we propose an improved BPR method to achieve more efficient policy transfer in deep reinforcement learning (DRL). First, most BPR algorithms use the episodic return as the observation signal that contains limited information and cannot be obtained until the end of an episode. Instead, we employ the state transition sample, which is informative and instantaneous, as the observation signal for faster and more accurate task inference. Second, BPR algorithms usually require numerous samples to estimate the probability distribution of the tabular-based observation model, which may be expensive and even infeasible to learn and maintain, especially when using the state transition sample as the signal. Hence, we propose a scalable observation model based on fitting state transition functions of source tasks from only a small number of samples, which can generalize to any signals observed in the target task. Moreover, we extend the offline-mode BPR to the continual learning setting by expanding the scalable observation model in a plug-and-play fashion, which can avoid negative transfer when faced with new unknown tasks. Experimental results show that our method can consistently facilitate faster and more efficient policy transfer. △ Less

Submitted 13 July, 2023; v1 submitted 16 April, 2022; originally announced April 2022.

Comments: Published in IEEE Transactions on Neural Networks and Learning Systems, 2023, DOI: 10.1109/TNNLS.2023.3281604

arXiv:2204.02227 [pdf, other]

SD-Conv: Towards the Parameter-Efficiency of Dynamic Convolution

Authors: Shwai He, Chenbo Jiang, Daize Dong, Liang Ding

Abstract: Dynamic convolution achieves better performance for efficient CNNs at the cost of negligible FLOPs increase. However, the performance increase can not match the significantly expanded number of parameters, which is the main bottleneck in real-world applications. Contrastively, mask-based unstructured pruning obtains a lightweight network by removing redundancy in the heavy network. In this paper,… ▽ More Dynamic convolution achieves better performance for efficient CNNs at the cost of negligible FLOPs increase. However, the performance increase can not match the significantly expanded number of parameters, which is the main bottleneck in real-world applications. Contrastively, mask-based unstructured pruning obtains a lightweight network by removing redundancy in the heavy network. In this paper, we propose a new framework, \textbf{Sparse Dynamic Convolution} (\textsc{SD-Conv}), to naturally integrate these two paths such that it can inherit the advantage of dynamic mechanism and sparsity. We first design a binary mask derived from a learnable threshold to prune static kernels, significantly reducing the parameters and computational cost but achieving higher performance in Imagenet-1K. We further transfer pretrained models into a variety of downstream tasks, showing consistently better results than baselines. We hope our SD-Conv could be an efficient alternative to conventional dynamic convolutions. △ Less

Submitted 26 May, 2023; v1 submitted 5 April, 2022; originally announced April 2022.

Comments: WACV 2023

arXiv:2203.02896 [pdf, other]

doi 10.1109/TNNLS.2022.3230701

Depthwise Convolution for Multi-Agent Communication with Enhanced Mean-Field Approximation

Authors: Donghan Xie, Zhi Wang, Chunlin Chen, Daoyi Dong

Abstract: Multi-agent settings remain a fundamental challenge in the reinforcement learning (RL) domain due to the partial observability and the lack of accurate real-time interactions across agents. In this paper, we propose a new method based on local communication learning to tackle the multi-agent RL (MARL) challenge within a large number of agents coexisting. First, we design a new communication protoc… ▽ More Multi-agent settings remain a fundamental challenge in the reinforcement learning (RL) domain due to the partial observability and the lack of accurate real-time interactions across agents. In this paper, we propose a new method based on local communication learning to tackle the multi-agent RL (MARL) challenge within a large number of agents coexisting. First, we design a new communication protocol that exploits the ability of depthwise convolution to efficiently extract local relations and learn local communication between neighboring agents. To facilitate multi-agent coordination, we explicitly learn the effect of joint actions by taking the policies of neighboring agents as inputs. Second, we introduce the mean-field approximation into our method to reduce the scale of agent interactions. To more effectively coordinate behaviors of neighboring agents, we enhance the mean-field approximation by a supervised policy rectification network (PRN) for rectifying real-time agent interactions and by a learnable compensation term for correcting the approximation bias. The proposed method enables efficient coordination as well as outperforms several baseline approaches on the adaptive traffic signal control (ATSC) task and the StarCraft II multi-agent challenge (SMAC). △ Less

Submitted 1 January, 2023; v1 submitted 6 March, 2022; originally announced March 2022.

Comments: Accepted by IEEE Transactions on Neural Networks, 2022, DOI: 10.1109/TNNLS.2022.3230701

arXiv:2112.15270 [pdf]

Echo state graph neural networks with analogue random resistor arrays

Authors: Shaocong Wang, Yi Li, Dingchen Wang, Woyu Zhang, Xi Chen, Danian Dong, Songqi Wang, Xumeng Zhang, Peng Lin, Claudio Gallicchio, Xiaoxin Xu, Qi Liu, Kwang-Ting Cheng, Zhongrui Wang, Dashan Shang, Ming Liu

Abstract: Recent years have witnessed an unprecedented surge of interest, from social networks to drug discovery, in learning representations of graph-structured data. However, graph neural networks, the machine learning models for handling graph-structured data, face significant challenges when running on conventional digital hardware, including von Neumann bottleneck incurred by physically separated memor… ▽ More Recent years have witnessed an unprecedented surge of interest, from social networks to drug discovery, in learning representations of graph-structured data. However, graph neural networks, the machine learning models for handling graph-structured data, face significant challenges when running on conventional digital hardware, including von Neumann bottleneck incurred by physically separated memory and processing units, slowdown of Moore's law due to transistor scaling limit, and expensive training cost. Here we present a novel hardware-software co-design, the random resistor array-based echo state graph neural network, which addresses these challenges. The random resistor arrays not only harness low-cost, nanoscale and stackable resistors for highly efficient in-memory computing using simple physical laws, but also leverage the intrinsic stochasticity of dielectric breakdown to implement random projections in hardware for an echo state network that effectively minimizes the training cost thanks to its fixed and random weights. The system demonstrates state-of-the-art performance on both graph classification using the MUTAG and COLLAB datasets and node classification using the CORA dataset, achieving 34.2x, 93.2x, and 570.4x improvement of energy efficiency and 98.27%, 99.46%, and 95.12% reduction of training cost compared to conventional graph learning on digital hardware, respectively, which may pave the way for the next generation AI system for graph learning. △ Less

Submitted 30 December, 2021; originally announced December 2021.

Comments: 24 pages, 4 figures

arXiv:2110.07770 [pdf]

Human factors engineering research on single pilot operations for large commercial aircraft: Status and prospect

Authors: Wei Xu, Yong Chen, Wenjun Dong, Dayong Dong, Liezhong Ge

Abstract: The civil aviation community is actively exploring and develo** the solutions of single pilot operations SPO for large commercial aircraft. Human factors engineering research for SPO has been launched, and the research mainly focuses on three research solutions: flight deck airborne equipment upgrade, flight support from ground stations, and the combined SPO solution of "flight deck airborne equ… ▽ More The civil aviation community is actively exploring and develo** the solutions of single pilot operations SPO for large commercial aircraft. Human factors engineering research for SPO has been launched, and the research mainly focuses on three research solutions: flight deck airborne equipment upgrade, flight support from ground stations, and the combined SPO solution of "flight deck airborne equipment upgrade, flight support from ground stations". This paper reviews and analyzez the progress of human factors engineering research on SPO. The preliminary research outcome tends to support the combined SPO solution. However, the current human factors engineering research is not comprehensive and cannot provide a complete human factors engineering solution for SPO. For future human factors engineering research, this paper analyzes the key human factors issues on SPO and points out the gaps in the current research and the areas for future work. Finally, this paper puts forward an overall strategy and recommendations for future human factors engineering research on SPO. △ Less

Submitted 14 October, 2021; originally announced October 2021.

Comments: in Chinese language

arXiv:2108.08659 [pdf, other]

doi 10.1109/TAI.2022.3194132

Residual Tensor Train: A Quantum-inspired Approach for Learning Multiple Multilinear Correlations

Authors: Yiwei Chen, Yu Pan, Daoyi Dong

Abstract: States of quantum many-body systems are defined in a high-dimensional Hilbert space, where rich and complex interactions among subsystems can be modelled. In machine learning, complex multiple multilinear correlations may also exist within input features. In this paper, we present a quantum-inspired multilinear model, named Residual Tensor Train (ResTT), to capture the multiple multilinear correla… ▽ More States of quantum many-body systems are defined in a high-dimensional Hilbert space, where rich and complex interactions among subsystems can be modelled. In machine learning, complex multiple multilinear correlations may also exist within input features. In this paper, we present a quantum-inspired multilinear model, named Residual Tensor Train (ResTT), to capture the multiple multilinear correlations of features, from low to high orders, within a single model. ResTT is able to build a robust decision boundary in a high-dimensional space for solving fitting and classification tasks. In particular, we prove that the fully-connected layer and the Volterra series can be taken as special cases of ResTT. Furthermore, we derive the rule for weight initialization that stabilizes the training of ResTT based on a mean-field analysis. We prove that such a rule is much more relaxed than that of TT, which means ResTT can easily address the vanishing and exploding gradient problem that exists in the existing TT models. Numerical experiments demonstrate that ResTT outperforms the state-of-the-art tensor network and benchmark deep learning models on MNIST and Fashion-MNIST datasets. Moreover, ResTT achieves better performance than other statistical methods on two practical examples with limited data which are known to have complex feature interactions. △ Less

Submitted 1 August, 2022; v1 submitted 19 August, 2021; originally announced August 2021.

Comments: 12 pages, 6 figures

arXiv:2106.10796 [pdf, other]

CD-SGD: Distributed Stochastic Gradient Descent with Compression and Delay Compensation

Authors: Enda Yu, Dezun Dong, Yemao Xu, Shuo Ouyang, Xiangke Liao

Abstract: Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combining with parallel communication mechanism method like pipeline, gradient compression technique can greatly alleviate the impact of communication overhead. However, there exists two problems of gradient compression technique to be solved. F… ▽ More Communication overhead is the key challenge for distributed training. Gradient compression is a widely used approach to reduce communication traffic. When combining with parallel communication mechanism method like pipeline, gradient compression technique can greatly alleviate the impact of communication overhead. However, there exists two problems of gradient compression technique to be solved. Firstly, gradient compression brings in extra computation cost, which will delay the next training iteration. Secondly, gradient compression usually leads to the decrease of convergence accuracy. △ Less

Submitted 6 September, 2021; v1 submitted 20 June, 2021; originally announced June 2021.

Comments: 12 pages

arXiv:2106.01674 [pdf, other]

JIZHI: A Fast and Cost-Effective Model-As-A-Service System for Web-Scale Online Inference at Baidu

Authors: Hao Liu, Qian Gao, Jiang Li, Xiaochao Liao, Hao Xiong, Guangxing Chen, Wenlin Wang, Guobao Yang, Zhiwei Zha, Daxiang Dong, De**g Dou, Haoyi Xiong

Abstract: In modern internet industries, deep learning based recommender systems have became an indispensable building block for a wide spectrum of applications, such as search engine, news feed, and short video clips. However, it remains challenging to carry the well-trained deep models for online real-time inference serving, with respect to the time-varying web-scale traffics from billions of users, in a… ▽ More In modern internet industries, deep learning based recommender systems have became an indispensable building block for a wide spectrum of applications, such as search engine, news feed, and short video clips. However, it remains challenging to carry the well-trained deep models for online real-time inference serving, with respect to the time-varying web-scale traffics from billions of users, in a cost-effective manner. In this work, we present JIZHI - a Model-as-a-Service system - that per second handles hundreds of millions of online inference requests to huge deep models with more than trillions of sparse parameters, for over twenty real-time recommendation services at Baidu, Inc. In JIZHI, the inference workflow of every recommendation request is transformed to a Staged Event-Driven Pipeline (SEDP), where each node in the pipeline refers to a staged computation or I/O intensive task processor. With traffics of real-time inference requests arrived, each modularized processor can be run in a fully asynchronized way and managed separately. Besides, JIZHI introduces heterogeneous and hierarchical storage to further accelerate the online inference process by reducing unnecessary computations and potential data access latency induced by ultra-sparse model parameters. Moreover, an intelligent resource manager has been deployed to maximize the throughput of JIZHI over the shared infrastructure by searching the optimal resource allocation plan from historical logs and fine-tuning the load shedding policies over intermediate system feedback. Extensive experiments have been done to demonstrate the advantages of JIZHI from the perspectives of end-to-end service latency, system-wide throughput, and resource consumption. JIZHI has helped Baidu saved more than ten million US dollars in hardware and utility costs while handling 200% more traffics without sacrificing inference efficiency. △ Less

Submitted 3 June, 2021; originally announced June 2021.

Comments: Accepted to SIGKDD 2021 applied data science track

arXiv:2104.07282 [pdf, other]

doi 10.1109/TMECH.2021.3072675

Rule-Based Reinforcement Learning for Efficient Robot Navigation with Space Reduction

Authors: Yuanyang Zhu, Zhi Wang, Chunlin Chen, Daoyi Dong

Abstract: For real-world deployments, it is critical to allow robots to navigate in complex environments autonomously. Traditional methods usually maintain an internal map of the environment, and then design several simple rules, in conjunction with a localization and planning approach, to navigate through the internal map. These approaches often involve a variety of assumptions and prior knowledge. In cont… ▽ More For real-world deployments, it is critical to allow robots to navigate in complex environments autonomously. Traditional methods usually maintain an internal map of the environment, and then design several simple rules, in conjunction with a localization and planning approach, to navigate through the internal map. These approaches often involve a variety of assumptions and prior knowledge. In contrast, recent reinforcement learning (RL) methods can provide a model-free, self-learning mechanism as the robot interacts with an initially unknown environment, but are expensive to deploy in real-world scenarios due to inefficient exploration. In this paper, we focus on efficient navigation with the RL technique and combine the advantages of these two kinds of methods into a rule-based RL (RuRL) algorithm for reducing the sample complexity and cost of time. First, we use the rule of wall-following to generate a closed-loop trajectory. Second, we employ a reduction rule to shrink the trajectory, which in turn effectively reduces the redundant exploration space. Besides, we give the detailed theoretical guarantee that the optimal navigation path is still in the reduced space. Third, in the reduced space, we utilize the Pledge rule to guide the exploration strategy for accelerating the RL process at the early stage. Experiments conducted on real robot navigation problems in hex-grid environments demonstrate that RuRL can achieve improved navigation performance. △ Less

Submitted 15 April, 2021; originally announced April 2021.

Comments: Accepted by IEEE/ASME Transactions on Mechatronics, 2021, DOI: 10.1109/TMECH.2021.3072675

Journal ref: IEEE/ASME Transactions on Mechatronics, 2021

arXiv:2104.02774 [pdf, other]

Bayesian adversarial multi-node bandit for optimal smart grid protection against cyber attacks

Authors: Jianyu Xu, Bin Liu, Huadong Mo, Daoyi Dong

Abstract: The cybersecurity of smart grids has become one of key problems in develo** reliable modern power and energy systems. This paper introduces a non-stationary adversarial cost with a variation constraint for smart grids and enables us to investigate the problem of optimal smart grid protection against cyber attacks in a relatively practical scenario. In particular, a Bayesian multi-node bandit (MN… ▽ More The cybersecurity of smart grids has become one of key problems in develo** reliable modern power and energy systems. This paper introduces a non-stationary adversarial cost with a variation constraint for smart grids and enables us to investigate the problem of optimal smart grid protection against cyber attacks in a relatively practical scenario. In particular, a Bayesian multi-node bandit (MNB) model with adversarial costs is constructed and a new regret function is defined for this model. An algorithm called Thompson-Hedge algorithm is presented to solve the problem and the superior performance of the proposed algorithm is proven in terms of the convergence rate of the regret function. The applicability of the algorithm to real smart grid scenarios is verified and the performance of the algorithm is also demonstrated by numerical examples. △ Less

Submitted 20 February, 2021; originally announced April 2021.

Journal ref: Automatica, 2021

arXiv:2101.02034 [pdf, other]

Deep Reinforcement Learning with Quantum-inspired Experience Replay

Authors: Qing Wei, Hailan Ma, Chunlin Chen, Daoyi Dong

Abstract: In this paper, a novel training paradigm inspired by quantum computation is proposed for deep reinforcement learning (DRL) with experience replay. In contrast to traditional experience replay mechanism in DRL, the proposed deep reinforcement learning with quantum-inspired experience replay (DRL-QER) adaptively chooses experiences from the replay buffer according to the complexity and the replayed… ▽ More In this paper, a novel training paradigm inspired by quantum computation is proposed for deep reinforcement learning (DRL) with experience replay. In contrast to traditional experience replay mechanism in DRL, the proposed deep reinforcement learning with quantum-inspired experience replay (DRL-QER) adaptively chooses experiences from the replay buffer according to the complexity and the replayed times of each experience (also called transition), to achieve a balance between exploration and exploitation. In DRL-QER, transitions are first formulated in quantum representations, and then the preparation operation and the depreciation operation are performed on the transitions. In this progress, the preparation operation reflects the relationship between the temporal difference errors (TD-errors) and the importance of the experiences, while the depreciation operation is taken into account to ensure the diversity of the transitions. The experimental results on Atari 2600 games show that DRL-QER outperforms state-of-the-art algorithms such as DRL-PER and DCRL on most of these games with improved training efficiency, and is also applicable to such memory-based DRL approaches as double network and dueling network. △ Less

Submitted 6 January, 2021; originally announced January 2021.

arXiv:2012.15427 [pdf, other]

Curriculum-based Deep Reinforcement Learning for Quantum Control

Authors: Hailan Ma, Daoyi Dong, Steven X. Ding, Chunlin Chen

Abstract: Deep reinforcement learning has been recognized as an efficient technique to design optimal strategies for different complex systems without prior knowledge of the control landscape. To achieve a fast and precise control for quantum systems, we propose a novel deep reinforcement learning approach by constructing a curriculum consisting of a set of intermediate tasks defined by a fidelity threshold… ▽ More Deep reinforcement learning has been recognized as an efficient technique to design optimal strategies for different complex systems without prior knowledge of the control landscape. To achieve a fast and precise control for quantum systems, we propose a novel deep reinforcement learning approach by constructing a curriculum consisting of a set of intermediate tasks defined by a fidelity threshold. Tasks among a curriculum can be statically determined using empirical knowledge or adaptively generated with the learning process. By transferring knowledge between two successive tasks and sequencing tasks according to their difficulties, the proposed curriculum-based deep reinforcement learning (CDRL) method enables the agent to focus on easy tasks in the early stage, then move onto difficult tasks, and eventually approaches the final task. Numerical simulations on closed quantum systems and open quantum systems demonstrate that the proposed method exhibits improved control performance for quantum systems and also provides an efficient way to identify optimal strategies with fewer control pulses. △ Less

Submitted 2 January, 2021; v1 submitted 30 December, 2020; originally announced December 2020.

arXiv:2012.05396 [pdf, other]

SSD-SSD: Communication sparsification for distributed deep learning training

Authors: Yemao Xu, Dezun Dong, Yawei Zhao, Weixia Xu, Xiangke Liao

Abstract: Intensive communication and synchronization cost for gradients and parameters is the well-known bottleneck of distributed deep learning training. Based on the observations that Synchronous SGD (SSGD) obtains good convergence accuracy while asynchronous SGD (ASGD) delivers a faster raw training speed, we propose Several Steps Delay SGD (SSD-SGD) to combine their merits, aiming at tackling the commu… ▽ More Intensive communication and synchronization cost for gradients and parameters is the well-known bottleneck of distributed deep learning training. Based on the observations that Synchronous SGD (SSGD) obtains good convergence accuracy while asynchronous SGD (ASGD) delivers a faster raw training speed, we propose Several Steps Delay SGD (SSD-SGD) to combine their merits, aiming at tackling the communication bottleneck via communication sparsification. SSD-SGD explores both global synchronous updates in the parameter servers and asynchronous local updates in the workers in each periodic iteration. The periodic and flexible synchronization makes SSD-SGD achieve good convergence accuracy and fast training speed. To the best of our knowledge, we strike the new balance between synchronization quality and communication sparsification, and improve the trade-off between accuracy and training speed. Specifically, the core components of SSD-SGD include proper warm-up stage, steps delay stage, and our novel algorithm of global gradient for local update (GLU). GLU is critical for local update operations to effectively compensate the delayed local weights. Furthermore, we implement SSD-SGD on MXNet framework and comprehensively evaluate its performance with CIFAR-10 and ImageNet datasets. Experimental results show that SSD-SGD can accelerate distributed training speed under different experimental configurations, by up to 110%, while achieving good convergence accuracy. △ Less

Submitted 9 April, 2021; v1 submitted 9 December, 2020; originally announced December 2020.

arXiv:2010.08191 [pdf, other]

RocketQA: An Optimized Training Approach to Dense Passage Retrieval for Open-Domain Question Answering

Authors: Yingqi Qu, Yuchen Ding, **g Liu, Kai Liu, Ruiyang Ren, Wayne Xin Zhao, Daxiang Dong, Hua Wu, Haifeng Wang

Abstract: In open-domain question answering, dense passage retrieval has become a new paradigm to retrieve relevant passages for finding answers. Typically, the dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to effectively train a dual-encoder due to the challenges including the discrepancy between training and in… ▽ More In open-domain question answering, dense passage retrieval has become a new paradigm to retrieve relevant passages for finding answers. Typically, the dual-encoder architecture is adopted to learn dense representations of questions and passages for semantic matching. However, it is difficult to effectively train a dual-encoder due to the challenges including the discrepancy between training and inference, the existence of unlabeled positives and limited training data. To address these challenges, we propose an optimized training approach, called RocketQA, to improving dense passage retrieval. We make three major technical contributions in RocketQA, namely cross-batch negatives, denoised hard negatives and data augmentation. The experiment results show that RocketQA significantly outperforms previous state-of-the-art models on both MSMARCO and Natural Questions. We also conduct extensive experiments to examine the effectiveness of the three strategies in RocketQA. Besides, we demonstrate that the performance of end-to-end QA can be improved based on our RocketQA retriever. △ Less

Submitted 12 May, 2021; v1 submitted 16 October, 2020; originally announced October 2020.

Comments: NAACL 2021

arXiv:2010.04605 [pdf, other]

doi 10.1109/TNNLS.2022.3160173

Instance Weighted Incremental Evolution Strategies for Reinforcement Learning in Dynamic Environments

Authors: Zhi Wang, Chunlin Chen, Daoyi Dong

Abstract: Evolution strategies (ES), as a family of black-box optimization algorithms, recently emerge as a scalable alternative to reinforcement learning (RL) approaches such as Q-learning or policy gradient, and are much faster when many central processing units (CPUs) are available due to better parallelization. In this paper, we propose a systematic incremental learning method for ES in dynamic environm… ▽ More Evolution strategies (ES), as a family of black-box optimization algorithms, recently emerge as a scalable alternative to reinforcement learning (RL) approaches such as Q-learning or policy gradient, and are much faster when many central processing units (CPUs) are available due to better parallelization. In this paper, we propose a systematic incremental learning method for ES in dynamic environments. The goal is to adjust previously learned policy to a new one incrementally whenever the environment changes. We incorporate an instance weighting mechanism with ES to facilitate its learning adaptation, while retaining scalability of ES. During parameter updating, higher weights are assigned to instances that contain more new knowledge, thus encouraging the search distribution to move towards new promising areas of parameter space. We propose two easy-to-implement metrics to calculate the weights: instance novelty and instance quality. Instance novelty measures an instance's difference from the previous optimum in the original environment, while instance quality corresponds to how well an instance performs in the new environment. The resulting algorithm, Instance Weighted Incremental Evolution Strategies (IW-IES), is verified to achieve significantly improved performance on challenging RL tasks ranging from robot navigation to locomotion. This paper thus introduces a family of scalable ES algorithms for RL domains that enables rapid learning adaptation to dynamic environments. △ Less

Submitted 31 March, 2022; v1 submitted 9 October, 2020; originally announced October 2020.

Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems, 2022, DOI: 10.1109/TNNLS.2022.3160173

arXiv:2008.09943 [pdf, other]

doi 10.1109/TCYB.2021.3131252

Quantum Language Model with Entanglement Embedding for Question Answering

Authors: Yiwei Chen, Yu Pan, Daoyi Dong

Abstract: Quantum Language Models (QLMs) in which words are modelled as quantum superposition of sememes have demonstrated a high level of model transparency and good post-hoc interpretability. Nevertheless, in the current literature word sequences are basically modelled as a classical mixture of word states, which cannot fully exploit the potential of a quantum probabilistic description. A full quantum mod… ▽ More Quantum Language Models (QLMs) in which words are modelled as quantum superposition of sememes have demonstrated a high level of model transparency and good post-hoc interpretability. Nevertheless, in the current literature word sequences are basically modelled as a classical mixture of word states, which cannot fully exploit the potential of a quantum probabilistic description. A full quantum model is yet to be developed to explicitly capture the non-classical correlations within the word sequences. We propose a neural network model with a novel Entanglement Embedding (EE) module, whose function is to transform the word sequences into entangled pure states of many-body quantum systems. Strong quantum entanglement, which is the central concept of quantum information and an indication of parallelized correlations among the words, is observed within the word sequences. Numerical experiments show that the proposed QLM with EE (QLM-EE) achieves superior performance compared with the classical deep neural network models and other QLMs on Question Answering (QA) datasets. In addition, the post-hoc interpretability of the model can be improved by quantizing the degree of entanglement among the words. △ Less

Submitted 20 December, 2021; v1 submitted 22 August, 2020; originally announced August 2020.

arXiv:2007.14196 [pdf, other]

doi 10.1109/TNNLS.2021.3055499

Lifelong Incremental Reinforcement Learning with Online Bayesian Inference

Authors: Zhi Wang, Chunlin Chen, Daoyi Dong

Abstract: A central capability of a long-lived reinforcement learning (RL) agent is to incrementally adapt its behavior as its environment changes, and to incrementally build upon previous experiences to facilitate future learning in real-world scenarios. In this paper, we propose LifeLong Incremental Reinforcement Learning (LLIRL), a new incremental algorithm for efficient lifelong adaptation to dynamic en… ▽ More A central capability of a long-lived reinforcement learning (RL) agent is to incrementally adapt its behavior as its environment changes, and to incrementally build upon previous experiences to facilitate future learning in real-world scenarios. In this paper, we propose LifeLong Incremental Reinforcement Learning (LLIRL), a new incremental algorithm for efficient lifelong adaptation to dynamic environments. We develop and maintain a library that contains an infinite mixture of parameterized environment models, which is equivalent to clustering environment parameters in a latent space. The prior distribution over the mixture is formulated as a Chinese restaurant process (CRP), which incrementally instantiates new environment models without any external information to signal environmental changes in advance. During lifelong learning, we employ the expectation maximization (EM) algorithm with online Bayesian inference to update the mixture in a fully incremental manner. In EM, the E-step involves estimating the posterior expectation of environment-to-cluster assignments, while the M-step updates the environment parameters for future learning. This method allows for all environment models to be adapted as necessary, with new models instantiated for environmental changes and old models retrieved when previously seen environments are encountered again. Experiments demonstrate that LLIRL outperforms relevant existing methods, and enables effective incremental adaptation to various dynamic environments for lifelong learning. △ Less

Submitted 12 February, 2021; v1 submitted 28 July, 2020; originally announced July 2020.

Comments: Accepted by IEEE Transactions on Neural Networks and Learning Systems, 2021

arXiv:2006.08181 [pdf, ps, other]

Derivative-free global minimization for a class of multiple minima problems

Authors: Xiaopeng Luo, Xin Xu, Daoyi Dong

Abstract: We prove that the finite-difference based derivative-free descent (FD-DFD) methods have a capability to find the global minima for a class of multiple minima problems. Our main result shows that, for a class of multiple minima objectives that is extended from strongly convex functions with Lipschitz-continuous gradients, the iterates of FD-DFD converge to the global minimizer $x_*$ with the linear… ▽ More We prove that the finite-difference based derivative-free descent (FD-DFD) methods have a capability to find the global minima for a class of multiple minima problems. Our main result shows that, for a class of multiple minima objectives that is extended from strongly convex functions with Lipschitz-continuous gradients, the iterates of FD-DFD converge to the global minimizer $x_*$ with the linear convergence $\|x_{k+1}-x_*\|_2^2\leqslantρ^k \|x_1-x_*\|_2^2$ for a fixed $0<ρ<1$ and any initial iteration $x_1\in\mathbb{R}^d$ when the parameters are properly selected. Since the per-iteration cost, i.e., the number of function evaluations, is fixed and almost independent of the dimension $d$, the FD-DFD algorithm has a complexity bound $\mathcal{O}(\log\frac{1}ε)$ for finding a point $x$ such that the optimality gap $\|x-x_*\|_2^2$ is less than $ε>0$. Numerical experiments in various dimensions from $5$ to $500$ demonstrate the benefits of the FD-DFD method. △ Less

Submitted 25 June, 2020; v1 submitted 15 June, 2020; originally announced June 2020.

Comments: 14 pages, 3 figures

MSC Class: 65K05; 68Q25; 90C26; 90C56

Showing 1–50 of 67 results for author: Dong, D