Optimization-based Structural Pruning for Large Language Models without Back-Propagation

Authors: Yuan Gao, Zujing Liu, Weizhong Zhang, Bo Du, Gui-Song Xia

Abstract: Compared to the moderate size of neural network models, structural weight pruning on the Large-Language Models (LLMs) imposes a novel challenge on the efficiency of the pruning algorithms, due to the heavy computation/memory demands of the LLMs. Recent efficient LLM pruning methods typically operate at the post-training phase without the expensive weight finetuning; however, their pruning criteria often rely on heuristically designed metrics, potentially leading to suboptimal performance. We instead propose a novel optimization-based structural pruning that learns the pruning masks in a probabilistic space directly by optimizing the loss of the pruned model. To preserve the efficiency, our method 1) works at the post-training phase and 2) eliminates the back-propagation through the LLM per se during the optimization (i.e., only requires the forward pass of the LLM). We achieve this by learning an underlying Bernoulli distribution to sample binary pruning masks, where we decouple the Bernoulli parameters from the LLM loss, thus facilitating an efficient optimization via a policy gradient estimator without back-propagation. As a result, our method is able to 1) operate at structural granularities of channels, heads, and layers, 2) support global and heterogeneous pruning (i.e., our method automatically determines different redundancy for different layers), and 3) optionally use a metric-based method as initialization (of our Bernoulli distributions). Extensive experiments on LLaMA, LLaMA-2, and Vicuna using the C4 and WikiText2 datasets demonstrate that our method operates for 2.7 hours with around 35GB memory for the 13B models on a single A100 GPU, and our pruned models outperform the state-of-the-art methods w.r.t. perplexity. Codes will be released.

Submitted 15 June, 2024; originally announced June 2024.

Comments: 17 pages

arXiv:2404.01273 [pdf, other]

TWIN-GPT: Digital Twins for Clinical Trials via Large Language Model

Authors: Yue Wang, Tianfan Fu, Yinlong Xu, Zihan Ma, Hongxia Xu, Yingzhou Lu, Bang Du, Honghao Gao, Jian Wu

Abstract: Clinical trials are indispensable for medical research and the development of new treatments. However, clinical trials often involve thousands of participants and can span several years to complete, with a high probability of failure during the process. Recently, there has been a burgeoning interest in virtual clinical trials, which simulate real-world scenarios and hold the potential to significantly enhance patient safety, expedite development, reduce costs, and contribute to the broader scientific knowledge in healthcare. Existing research often focuses on leveraging electronic health records (EHRs) to support clinical trial outcome prediction. Yet, trained with limited clinical trial outcome data, existing approaches frequently struggle to perform accurate predictions. Some research has attempted to generate EHRs to augment model development but has fallen short in personalizing the generation for individual patient profiles. Recently, the emergence of large language models has illuminated new possibilities, as their embedded comprehensive clinical knowledge has proven beneficial in addressing medical issues. In this paper, we propose a large language model-based digital twin creation approach, called TWIN-GPT. TWIN-GPT can establish cross-dataset associations of medical information given limited data, generating unique personalized digital twins for different patients, thereby preserving individual patient characteristics. Comprehensive experiments show that using digital twins created by TWIN-GPT can boost the clinical trial outcome prediction, exceeding various previous prediction approaches.

Submitted 28 June, 2024; v1 submitted 1 April, 2024; originally announced April 2024.

arXiv:2110.02588 [pdf, ps, other]

Hypothesis Testing of One-Sample Mean Vector in Distributed Frameworks

Authors: Bin Du, Junlong Zhao

Abstract: Distributed frameworks are widely used to handle massive data, where sample size $n$ is very large, and data are often stored in $k$ different machines. For a random vector $X\in \mathbb{R}^p$ with expectation $μ$, testing the mean vector $H_0: μ=μ_0$ vs $H_1: μ\ne μ_0$ for a given vector $μ_0$ is a basic problem in statistics. The centralized test statistics require heavy communication costs, which can be a burden when $p$ or $k$ is large. To reduce the communication cost, distributed test statistics are proposed in this paper for this problem based on the divide and conquer technique, a commonly used approach for distributed statistical inference. Specifically, we extend two commonly used centralized test statistics to the distributed ones, that apply to low and high dimensional cases, respectively. Comparing the power of centralized test statistics and the distributed ones, it is observed that there is a fundamental tradeoff between communication costs and the powers of the tests. This is quite different from the application of the divide and conquer technique in many other problems such as estimation, where the associated distributed statistics can be as good as the centralized ones. Numerical results confirm the theoretical findings.

Submitted 6 October, 2021; originally announced October 2021.

arXiv:2103.00719 [pdf, ps, other]

doi 10.1109/TPAMI.2021.3061463

LocalDrop: A Hybrid Regularization for Deep Neural Networks

Authors: Ziqing Lu, Chang Xu, Bo Du, Takashi Ishida, Lefei Zhang, Masashi Sugiyama

Abstract: In neural networks, developing regularization algorithms to settle overfitting is one of the major study areas. We propose a new approach for the regularization of neural networks by the local Rademacher complexity called LocalDrop. A new regularization function for both fully-connected networks (FCNs) and convolutional neural networks (CNNs), including drop rates and weight matrices, has been developed based on the proposed upper bound of the local Rademacher complexity by the strict mathematical deduction. The analyses of dropout in FCNs and DropBlock in CNNs with keep rate matrices in different layers are also included in the complexity analyses. With the new regularization function, we establish a two-stage procedure to obtain the optimal keep rate matrix and weight matrix to realize the whole training model. Extensive experiments have been conducted to demonstrate the effectiveness of LocalDrop in different models by comparing it with several algorithms and the effects of different hyperparameters on the final performances.

Submitted 28 February, 2021; originally announced March 2021.

arXiv:2011.05885 [pdf, other]

Leveraged Matrix Completion with Noise

Authors: Xinjian Huang, Weiwei Liu, Bo Du, Dacheng Tao

Abstract: Completing low-rank matrices from subsampled measurements has received much attention in the past decade. Existing works indicate that $\mathcal{O}(nr\log^2(n))$ observations are required to theoretically secure the completion of an $n \times n$ noisy matrix of rank $r$ with high probability, under some quite restrictive assumptions: (1) the underlying matrix must be incoherent; (2) observations follow the uniform distribution. The restrictiveness is partially due to ignoring the roles of the leverage score and the oracle information of each element. In this paper, we employ the leverage scores to characterize the importance of each element and significantly relax the assumptions to: (1) no other structural assumptions are imposed on the underlying low-rank matrix; (2) the elements being observed depend appropriately on their importance via the leverage score. Under these assumptions, instead of uniform sampling, we devise a nonuniform/biased sampling procedure that can reveal the "importance" of each observed element. Our proofs are supported by a novel approach that phrases sufficient optimality conditions based on the Golfing Scheme, which would be of independent interest to the wider areas. Theoretical findings show that we can provably recover an unknown $n\times n$ matrix of rank $r$ from just about $\mathcal{O}(nr\log^2 (n))$ entries, even when the observed entries are corrupted with a small amount of noisy information. The empirical results align precisely with our theories.

Submitted 14 August, 2023; v1 submitted 11 November, 2020; originally announced November 2020.

Comments: This manuscript has been accepted for publication as a regular paper in the IEEE Transactions on Cybernetics

arXiv:1909.02902 [pdf, other]

Dynamic Spatial-Temporal Representation Learning for Traffic Flow Prediction

Authors: Lingbo Liu, Jiajie Zhen, Guanbin Li, Geng Zhan, Zhaocheng He, Bowen Du, Liang Lin

Abstract: As a crucial component in intelligent transportation systems, traffic flow prediction has recently attracted widespread research interest in the field of artificial intelligence (AI) with the increasing availability of massive traffic mobility data. Its key challenge lies in how to integrate diverse factors (such as temporal rules and spatial dependencies) to infer the evolution trend of traffic flow. To address this problem, we propose a unified neural network called Attentive Traffic Flow Machine (ATFM), which can effectively learn the spatial-temporal feature representations of traffic flow with an attention mechanism. In particular, our ATFM is composed of two progressive Convolutional Long Short-Term Memory (ConvLSTM) units connected with a convolutional layer. Specifically, the first ConvLSTM unit takes normal traffic flow features as input and generates a hidden state at each time-step, which is further fed into the connected convolutional layer for spatial attention map inference. The second ConvLSTM unit aims at learning the dynamic spatial-temporal representations from the attentionally weighted traffic flow features. Further, we develop two deep learning frameworks based on ATFM to predict citywide short-term/long-term traffic flow by adaptively incorporating the sequential and periodic data as well as other external influences. Extensive experiments on two standard benchmarks well demonstrate the superiority of the proposed method for traffic flow prediction. Moreover, to verify the generalization of our method, we also apply the customized framework to forecast the passenger pickup/dropoff demands in traffic prediction and show its superior performance. Our code and data are available at https://github.com/liulingbo918/ATFM.

Submitted 12 June, 2020; v1 submitted 1 September, 2019; originally announced September 2019.

Comments: Accepted by IEEE Transactions on Intelligent Transportation Systems. arXiv admin note: text overlap with arXiv:1809.00101

arXiv:1908.09002 [pdf, other]

doi 10.1145/3308558.3313398

Autonomous Learning for Face Recognition in the Wild via Ambient Wireless Cues

Authors: Chris Xiaoxuan Lu, Xuan Kan, Bowen Du, Changhao Chen, Hongkai Wen, Andrew Markham, Niki Trigoni, John Stankovic

Abstract: Facial recognition is a key enabling component for emerging Internet of Things (IoT) services such as smart homes or responsive offices. Through the use of deep neural networks, facial recognition has achieved excellent performance. However, this is only possible when trained with hundreds of images of each user in different viewing and lighting conditions. Clearly, this level of effort in enrolment and labelling is impossible for wide-spread deployment and adoption. Inspired by the fact that most people carry smart wireless devices with them, e.g., smartphones, we propose to use this wireless identifier as a supervisory label. This allows us to curate a dataset of facial images that are unique to a certain domain, e.g., a set of people in a particular office. This custom corpus can then be used to finetune existing pre-trained models, e.g., FaceNet. However, due to the vagaries of wireless propagation in buildings, the supervisory labels are noisy and weak. We propose a novel technique, AutoTune, which learns and refines the association between a face and wireless identifier over time, by increasing the inter-cluster separation and minimizing the intra-cluster distance. Through extensive experiments with multiple users on two sites, we demonstrate the ability of AutoTune to design an environment-specific, continually evolving facial recognition system with entirely no user effort.

Submitted 14 August, 2019; originally announced August 2019.

Comments: 11 pages, accepted in the Web Conference (WWW'2019)

arXiv:1905.06133 [pdf, other]

Multi-scale Dynamic Graph Convolutional Network for Hyperspectral Image Classification

Authors: Sheng Wan, Chen Gong, Ping Zhong, Bo Du, Lefei Zhang, Jian Yang

Abstract: Convolutional Neural Network (CNN) has demonstrated impressive ability to represent hyperspectral images and to achieve promising results in hyperspectral image classification. However, traditional CNN models can only operate convolution on regular square image regions with fixed size and weights, so they cannot universally adapt to the distinct local regions with various object distributions and geometric appearances. Therefore, their classification performances are still to be improved, especially in class boundaries. To alleviate this shortcoming, we consider employing the recently proposed Graph Convolutional Network (GCN) for hyperspectral image classification, as it can conduct the convolution on arbitrarily structured non-Euclidean data and is applicable to the irregular image regions represented by graph topological information. Different from the commonly used GCN models which work on a fixed graph, we enable the graph to be dynamically updated along with the graph convolution process, so that these two steps can benefit from each other to gradually produce the discriminative embedded features as well as a refined graph. Moreover, to comprehensively deploy the multi-scale information inherited by hyperspectral images, we establish multiple input graphs with different neighborhood scales to extensively exploit the diversified spectral-spatial correlations at multiple scales. Therefore, our method is termed 'Multi-scale Dynamic Graph Convolutional Network' (MDGCN). The experimental results on three typical benchmark datasets firmly demonstrate the superiority of the proposed MDGCN to other state-of-the-art methods in both qualitative and quantitative aspects.

Submitted 14 May, 2019; originally announced May 2019.

arXiv:1904.06685 [pdf, other]

Exploring Representativeness and Informativeness for Active Learning

Authors: Bo Du, Zengmao Wang, Lefei Zhang, Liangpei Zhang, Wei Liu, Jialie Shen, Dacheng Tao

Abstract: How can we find a general way to choose the most suitable samples for training a classifier? Even with very limited prior information? Active learning, which can be regarded as an iterative optimization procedure, plays a key role to construct a refined training set to improve the classification performance in a variety of applications, such as text analysis, image recognition, social network modeling, etc. Although combining representativeness and informativeness of samples has been proven promising for active sampling, state-of-the-art methods perform well under certain data structures. Then can we find a way to fuse the two active sampling criteria without any assumption on data? This paper proposes a general active learning framework that effectively fuses the two criteria. Inspired by a two-sample discrepancy problem, triple measures are elaborately designed to guarantee that the query samples not only possess the representativeness of the unlabeled data but also reveal the diversity of the labeled data. Any appropriate similarity measure can be employed to construct the triple measures. Meanwhile, an uncertain measure is leveraged to generate the informativeness criterion, which can be carried out in different ways. Rooted in this framework, a practical active learning algorithm is proposed, which exploits a radial basis function together with the estimated probabilities to construct the triple measures and a modified Best-versus-Second-Best strategy to construct the uncertain measure, respectively. Experimental results on benchmark datasets demonstrate that our algorithm consistently achieves superior performance over the state-of-the-art active learning algorithms.

Submitted 14 April, 2019; originally announced April 2019.

arXiv:1809.00101 [pdf, other]

Attentive Crowd Flow Machines

Authors: Lingbo Liu, Ruimao Zhang, Jiefeng Peng, Guanbin Li, Bowen Du, Liang Lin

Abstract: Traffic flow prediction is crucial for urban traffic management and public safety. Its key challenges lie in how to adaptively integrate the various factors that affect the flow changes. In this paper, we propose a unified neural network module to address this problem, called Attentive Crowd Flow Machine (ACFM), which is able to infer the evolution of the crowd flow by learning dynamic representations of temporally-varying data with an attention mechanism. Specifically, the ACFM is composed of two progressive ConvLSTM units connected with a convolutional layer for spatial weight prediction. The first LSTM takes the sequential flow density representation as input and generates a hidden state at each time-step for attention map inference, while the second LSTM aims at learning the effective spatial-temporal feature expression from attentionally weighted crowd flow features. Based on the ACFM, we further build a deep architecture with the application to citywide crowd flow prediction, which naturally incorporates the sequential and periodic data as well as other external influences. Extensive experiments on two standard benchmarks (i.e., crowd flow in Beijing and New York City) show that the proposed method achieves significant improvements over the state-of-the-art methods.

Submitted 31 August, 2018; originally announced September 2018.

Comments: ACM MM, full paper

arXiv:1808.06206 [pdf, other]

TLR: Transfer Latent Representation for Unsupervised Domain Adaptation

Authors: Pan Xiao, Bo Du, Jia Wu, Lefei Zhang, Ruimin Hu, Xuelong Li

Abstract: Domain adaptation refers to the process of learning prediction models in a target domain by making use of data from a source domain. Many classic methods solve the domain adaptation problem by establishing a common latent space, which may cause the loss of many important properties across both domains. In this manuscript, we develop a novel method, transfer latent representation (TLR), to learn a better latent space. Specifically, we design an objective function based on a simple linear autoencoder to derive the latent representations of both domains. The encoder in the autoencoder aims to project the data of both domains into a robust latent space. Besides, the decoder imposes an additional constraint to reconstruct the original data, which can preserve the common properties of both domains and reduce the noise that causes domain shift. Experiments on cross-domain tasks demonstrate the advantages of TLR over competing methods.

Submitted 19 August, 2018; originally announced August 2018.

Showing 1–11 of 11 results for author: Du, B