Beyond the Limits: A Survey of Techniques to Extend the Context Length in Large Language Models

Authors: Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, Armaghan Eshaghi

Abstract: Recently, large language models (LLMs) have shown remarkable capabilities including understanding context, engaging in logical reasoning, and generating responses. However, this is achieved at the expense of stringent computational and memory requirements, hindering their ability to effectively support long input sequences. This survey provides an inclusive review of the recent techniques and methods devised to extend the sequence length in LLMs, thereby enhancing their capacity for long-context understanding. In particular, we review and categorize a wide range of techniques including architectural modifications, such as modified positional encoding and altered attention mechanisms, which are designed to enhance the processing of longer sequences while avoiding a proportional increase in computational requirements. The diverse methodologies investigated in this study can be leveraged across different phases of LLMs, i.e., training, fine-tuning and inference. This enables LLMs to efficiently process extended sequences. The limitations of the current methodologies are discussed in the last section along with suggestions for future research directions, underscoring the importance of sequence length in the continued advancement of LLMs.

Submitted 29 May, 2024; v1 submitted 3 February, 2024; originally announced February 2024.

Comments: Accepted to IJCAI 2024 Survey Track -- camera-ready version

arXiv:2310.18975

Blacksmith: Fast Adversarial Training of Vision Transformers via a Mixture of Single-step and Multi-step Methods

Authors: Mahdi Salmani, Alireza Dehghanpour Farashah, Mohammad Azizmalayeri, Mahdi Amiri, Navid Eslami, Mohammad Taghi Manzuri, Mohammad Hossein Rohban

Abstract: Despite the remarkable success achieved by deep learning algorithms in various domains, such as computer vision, they remain vulnerable to adversarial perturbations. Adversarial Training (AT) stands out as one of the most effective solutions to address this issue; however, single-step AT can lead to Catastrophic Overfitting (CO). This scenario occurs when the adversarially trained network suddenly loses robustness against multi-step attacks like Projected Gradient Descent (PGD). Although several approaches have been proposed to address this problem in Convolutional Neural Networks (CNNs), we found out that they do not perform well when applied to Vision Transformers (ViTs). In this paper, we propose Blacksmith, a novel training strategy to overcome the CO problem, specifically in ViTs. Our approach utilizes either of PGD-2 or Fast Gradient Sign Method (FGSM) randomly in a mini-batch during the adversarial training of the neural network. This will increase the diversity of our training attacks, which could potentially mitigate the CO issue. To manage the increased training time resulting from this combination, we craft the PGD-2 attack based on only the first half of the layers, while FGSM is applied end-to-end. Through our experiments, we demonstrate that our novel method effectively prevents CO, achieves PGD-2 level performance, and outperforms other existing techniques including N-FGSM, which is the state-of-the-art method in fast training for CNNs.

Submitted 29 October, 2023; originally announced October 2023.

arXiv:2308.12871

doi 10.5070/SR34163500

IPA: Inference Pipeline Adaptation to Achieve High Accuracy and Cost-Efficiency

Authors: Saeid Ghafouri, Kamran Razavi, Mehran Salmani, Alireza Sanaee, Tania Lorido-Botran, Lin Wang, Joseph Doyle, Pooyan Jamshidi

Abstract: Efficiently optimizing multi-model inference pipelines for fast, accurate, and cost-effective inference is a crucial challenge in machine learning production systems, given their tight end-to-end latency requirements. To simplify the exploration of the vast and intricate trade-off space of latency, accuracy, and cost in inference pipelines, providers frequently opt to consider one of them. However, the challenge lies in reconciling latency, accuracy, and cost trade-offs. To address this challenge and propose a solution to efficiently manage model variants in inference pipelines, we present IPA, an online deep learning Inference Pipeline Adaptation system that efficiently leverages model variants for each deep learning task. Model variants are different versions of pre-trained models for the same deep learning task with variations in resource requirements, latency, and accuracy. IPA dynamically configures batch size, replication, and model variants to optimize accuracy, minimize costs, and meet user-defined latency Service Level Agreements (SLAs) using Integer Programming. It supports multi-objective settings for achieving different trade-offs between accuracy and cost objectives while remaining adaptable to varying workloads and dynamic traffic patterns. Navigating a wider variety of configurations allows IPA to achieve better trade-offs between cost and accuracy objectives compared to existing methods. Extensive experiments in a Kubernetes implementation with five real-world inference pipelines demonstrate that IPA improves end-to-end accuracy by up to 21% with a minimal cost increase. The code and data for replications are available at https://github.com/reconfigurable-ml-pipeline/ipa.

Submitted 26 May, 2024; v1 submitted 24 August, 2023; originally announced August 2023.

Journal ref: Journal of Systems Research, 4(1) (2024)

arXiv:2304.10892

doi 10.1145/3578356.3592578

Reconciling High Accuracy, Cost-Efficiency, and Low Latency of Inference Serving Systems

Authors: Mehran Salmani, Saeid Ghafouri, Alireza Sanaee, Kamran Razavi, Max Mühlhäuser, Joseph Doyle, Pooyan Jamshidi, Mohsen Sharifi

Abstract: The use of machine learning (ML) inference for various applications is growing drastically. ML inference services engage with users directly, requiring fast and accurate responses. Moreover, these services face dynamic workloads of requests, imposing changes in their computing resources. Failing to right-size computing resources results in either latency service level objectives (SLOs) violations or wasted computing resources. Adapting to dynamic workloads considering all the pillars of accuracy, latency, and resource cost is challenging. In response to these challenges, we propose InfAdapter, which proactively selects a set of ML model variants with their resource allocations to meet latency SLO while maximizing an objective function composed of accuracy and cost. InfAdapter decreases SLO violation and costs up to 65% and 33%, respectively, compared to a popular industry autoscaler (Kubernetes Vertical Pod Autoscaler).

Submitted 24 April, 2023; v1 submitted 21 April, 2023; originally announced April 2023.

arXiv:2103.07406

doi 10.1364/OE.423747

Photonic Computing to Accelerate Data Processing in Wireless Communications

Authors: Mahsa Salmani, Armaghan Eshaghi, Enxiao Luan, Sreenil Saha

Abstract: Massive multiple-input multiple-output (MIMO) systems are considered as one of the leading technologies employed in the next generations of wireless communication networks (5G), which promise to provide higher spectral efficiency, lower latency, and more reliability. Due to the massive number of devices served by the base stations (BS) equipped with large antenna arrays, massive-MIMO systems need to perform high-dimensional signal processing in a considerably short amount of time. The computational complexity of such data processing, while satisfying the energy and latency requirements, is beyond the capabilities of the conventional widely-used digital electronics-based computing, i.e., Field-Programmable Gate Arrays (FPGAs) and Application-Specific Integrated Circuits (ASICs). In this paper, the speed and lossless propagation of light is exploited to introduce a photonic computing approach that addresses the high computational complexity required by massive-MIMO systems. The proposed computing approach is based on photonic implementation of multiply and accumulate (MAC) operation achieved by broadcast-and-weight (B&W) architecture. The B&W protocol is limited to real and positive values to perform MAC operations. In this work, preprocessing steps are developed to enable the proposed photonic computing architecture to accept any arbitrary values as the input. This is a requirement for wireless communication systems that typically deal with complex values. Numerical analysis shows that the performance of the wireless communication system is not degraded by the proposed photonic computing architecture, while it provides significant improvements in time and energy efficiency for massive-MIMO systems as compared to the most powerful Graphics Processing Units (GPUs).

Submitted 12 March, 2021; originally announced March 2021.

arXiv:1809.07453

Uplink Resource Allocation for Multiple Access Computational Offloading (Extended Version)

Authors: Mahsa Salmani, Timothy N. Davidson

Abstract: The mobile edge computing framework offers the opportunity to reduce the energy that devices must expend to complete computational tasks. The extent of that energy reduction depends on the nature of the tasks, and on the choice of the multiple access scheme. In this paper, we first address the uplink communication resource allocation for offloading systems that exploit the full capabilities of the multiple access channel (FullMA). For indivisible tasks we provide a closed-form optimal solution of the energy minimization problem when a given set of users with different latency constraints are offloading, and a tailored greedy search algorithm for finding a good set of offloading users. For divisible tasks we develop a low-complexity algorithm to find a stationary solution. To highlight the impact of the choice of multiple access scheme, we also consider the TDMA scheme, which, in general, cannot exploit the full capabilities of the channel, and we develop low-complexity optimal resource allocation algorithms for indivisible and divisible tasks under that scheme. The energy reduction facilitated by FullMA is illustrated in our numerical experiments. Further, those results show that the proposed algorithms outperform existing algorithms in terms of energy consumption and computational cost.

Submitted 29 April, 2019; v1 submitted 19 September, 2018; originally announced September 2018.

arXiv:1805.04981

Multiple Access Computational Offloading: Communication Resource Allocation in the Two-User Case (Extended Version)

Authors: Mahsa Salmani, Timothy N. Davidson

Abstract: By offering shared computational facilities to which mobile devices can offload their computational tasks, the mobile edge computing framework is expanding the scope of applications that can be provided on resource-constrained devices. When multiple devices seek to use such a facility simultaneously, both the available computational resources and the available communication resources need to be appropriately allocated. In this manuscript, we seek insight into the impact of the choice of the multiple access scheme by developing solutions to the mobile energy minimization problem in the two-user case with plentiful shared computational resources. In that setting, the allocation of communication resources is constrained by the latency constraints of the applications, the computational capabilities and the transmission power constraints of the devices, and the achievable rate region of the chosen multiple access scheme. For both indivisible tasks and the limiting case of tasks that can be infinitesimally partitioned, we provide a closed-form and quasi-closed-form solution, respectively, for systems that can exploit the full capabilities of the multiple access channel, and for systems based on time-division multiple access (TDMA). For indivisible tasks, we also provide quasi-closed-form solutions for systems that employ sequential decoding without time sharing or independent decoding. Analyses of our results show that when the channel gains are equal and the transmission power budgets are larger than a threshold, TDMA (and the suboptimal multiple access schemes that we have considered) can achieve an optimal solution. However, when the channel gains of each user are significantly different and the latency constraints are tight, systems that take advantage of the full capabilities of the multiple access channel can substantially reduce the energy required to offload.

Submitted 14 October, 2018; v1 submitted 13 May, 2018; originally announced May 2018.

Comments: 50 pages (single-column), 12 figures, A condensed version of this manuscript is submitted to TSP

Showing 1–7 of 7 results for author: Salmani, M