Search | arXiv e-print repository

CLIP-Driven Cloth-Agnostic Feature Learning for Cloth-Changing Person Re-Identification

Authors: Shuang Li, Jiaxu Leng, Guozhang Li, Ji Gan, Haosheng Chen, Xinbo Gao

Abstract: Contrastive Language-Image Pre-Training (CLIP) has shown impressive performance in short-term Person Re-Identification (ReID) due to its ability to extract high-level semantic features of pedestrians, yet its direct application to Cloth-Changing Person Re-Identification (CC-ReID) faces challenges due to CLIP's image encoder overly focusing on clothes clues. To address this, we propose a novel framework called CLIP-Driven Cloth-Agnostic Feature Learning (CCAF) for CC-ReID. Accordingly, two modules were custom-designed: the Invariant Feature Prompting (IFP) and the Clothes Feature Minimization (CFM). These modules guide the model to extract cloth-agnostic features positively and attenuate clothes-related features negatively. Specifically, IFP is designed to extract fine-grained semantic features unrelated to clothes from the raw image, guided by the cloth-agnostic text prompts. This module first covers the clothes in the raw image at the pixel level to obtain the shielding image and then utilizes CLIP's knowledge to generate cloth-agnostic text prompts. Subsequently, it aligns the raw image-text and the raw image-shielding image in the feature space, emphasizing discriminative clues related to identity but unrelated to clothes. Furthermore, CFM is designed to examine and weaken the image encoder's ability to extract clothes features. It first generates text prompts corresponding to clothes pixels. Then, guided by these clothes text prompts, it iteratively examines and disentangles clothes features from pedestrian features, ultimately retaining inherent discriminative features. Extensive experiments have demonstrated the effectiveness of the proposed CCAF, achieving new state-of-the-art performance on several popular CC-ReID benchmarks without any additional inference time.

Submitted 13 June, 2024; originally announced June 2024.

arXiv:2404.14691 [pdf, other]

Towards Fast Setup and High Throughput of GPU Serverless Computing

Authors: Han Zhao, Weihao Cui, Quan Chen, Shulai Zhang, Zijun Li, Jingwen Leng, Chao Li, Deze Zeng, Minyi Guo

Abstract: Integrating GPUs into serverless computing platforms is crucial for improving efficiency. However, existing solutions for GPU-enabled serverless computing platforms face two significant problems due to coarse-grained GPU management: long setup time and low function throughput. To address these issues, we propose SAGE, a GPU serverless framework with fast setup and high throughput. First, based on the data knowability of GPU functions ahead of actual execution, SAGE devises a parallelized function setup mechanism, which parallelizes the data preparation and context creation. In this way, SAGE achieves fast setup of GPU function invocations. Second, SAGE proposes a sharing-based memory management mechanism, which shares the read-only memory and context memory across multiple invocations of the same function. The memory sharing mechanism avoids repeated data preparation and the resulting unnecessary data-loading contention. As a consequence, the function throughput can be improved. Our experimental results show that SAGE reduces function duration by 11.3X and improves function density by 1.22X compared to the state-of-the-art serverless platform.

Submitted 22 April, 2024; originally announced April 2024.

arXiv:2404.11852 [pdf, other]

Cicero: Addressing Algorithmic and Architectural Bottlenecks in Neural Rendering by Radiance Warping and Memory Optimizations

Authors: Yu Feng, Zihan Liu, Jingwen Leng, Minyi Guo, Yuhao Zhu

Abstract: Neural Radiance Field (NeRF) is widely seen as an alternative to traditional physically-based rendering. However, NeRF has not yet been adopted in resource-limited mobile systems such as Virtual and Augmented Reality (VR/AR), because it is extremely slow. On a mobile Volta GPU, even the state-of-the-art NeRF models generally execute only at 0.8 FPS. We show that the main performance bottlenecks are both algorithmic and architectural. We introduce CICERO to tame both forms of inefficiency. We first introduce two algorithms: one fundamentally reduces the amount of work any NeRF model has to execute, and the other eliminates irregular DRAM accesses. We then describe an on-chip data layout strategy that eliminates SRAM bank conflicts. A pure software implementation of CICERO offers an 8.0x speed-up and 7.9x energy saving over a mobile Volta GPU. When compared to a baseline with a dedicated DNN accelerator, our speed-up and energy reduction increase to 28.2x and 37.8x, respectively - all with minimal quality loss (less than 1.0 dB peak signal-to-noise ratio reduction).

Submitted 17 April, 2024; originally announced April 2024.

arXiv:2404.07773 [pdf, other]

ConsistencyDet: A Robust Object Detector with a Denoising Paradigm of Consistency Model

Authors: Lifan Jiang, Zhihui Wang, Changmiao Wang, Ming Li, Jiaxu Leng, Xindong Wu

Abstract: Object detection, a quintessential task in the realm of perceptual computing, can be tackled using a generative methodology. In the present study, we introduce a novel framework designed to articulate object detection as a denoising diffusion process, which operates on the perturbed bounding boxes of annotated entities. This framework, termed ConsistencyDet, leverages an innovative denoising concept known as the Consistency Model. The hallmark of this model is its self-consistency feature, which empowers the model to map distorted information from any temporal stage back to its pristine state, thereby realizing a "one-step denoising" mechanism. Such an attribute markedly elevates the operational efficiency of the model, setting it apart from the conventional Diffusion Model. Throughout the training phase, ConsistencyDet initiates the diffusion sequence with noise-infused boxes derived from the ground-truth annotations and conditions the model to perform the denoising task. Subsequently, in the inference stage, the model employs a denoising sampling strategy that commences with bounding boxes randomly sampled from a normal distribution. Through iterative refinement, the model transforms an assortment of arbitrarily generated boxes into definitive detections. Comprehensive evaluations employing standard benchmarks, such as MS-COCO and LVIS, corroborate that ConsistencyDet surpasses other leading-edge detectors in performance metrics. Our code is available at https://github.com/Tankowa/ConsistencyDet.

Submitted 14 May, 2024; v1 submitted 11 April, 2024; originally announced April 2024.

arXiv:2404.03432 [pdf, other]

Piecemeal Quantum Telescope with Superresolution

Authors: Jian Leng, Yi-Xin Shen, Zhou-Kai Cao, Xiang-Bin Wang

Abstract: Detecting remote objects with higher precision and resolution plays a crucial role in many scientific tasks, such as astronomical observation. Compared with classical telescopes, quantum telescopes can detect a more precise angle value for a single-star target. The precision of existing quantum telescopes improves as the square root of the number of incident single photons. Here we propose the piecemeal quantum telescope with bit-by-bit iteration. It improves precision exponentially with the number of incident single photons in detecting the star angle. As a result, it needs to detect only a few hundred photons to reach a precision that beats the classical limit by 4 to 5 orders of magnitude. Moreover, it can detect a general astronomical target consisting of an unknown number of stars.

Submitted 4 April, 2024; originally announced April 2024.

Comments: 6 figures

arXiv:2402.17995 [pdf, ps, other]

Improved Bounds for Szemerédi's Theorem

Authors: James Leng, Ashwin Sah, Mehtaab Sawhney

Abstract: Let $r_k(N)$ denote the size of the largest subset of $[N] = \{1,\ldots,N\}$ with no $k$-term arithmetic progression. We show that for $k\ge 5$, there exists $c_k>0$ such that \[r_k(N)\ll N\exp(-(\log\log N)^{c_k}).\] Our proof is a consequence of recent quasipolynomial bounds on the inverse theorem for the Gowers $U^k$-norm as well as the density increment strategy of Heath-Brown and Szemerédi as reformulated by Green and Tao.

Submitted 29 February, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 13 pages

arXiv:2402.17994 [pdf, ps, other]

Quasipolynomial bounds on the inverse theorem for the Gowers $U^{s+1}[N]$-norm

Authors: James Leng, Ashwin Sah, Mehtaab Sawhney

Abstract: We prove quasipolynomial bounds on the inverse theorem for the Gowers $U^{s+1}[N]$-norm. The proof is modeled after work of Green, Tao, and Ziegler and uses as a crucial input recent work of the first author regarding the equidistribution of nilsequences. In a companion paper, this result will be used to improve the bounds on Szemerédi's theorem.

Submitted 10 April, 2024; v1 submitted 27 February, 2024; originally announced February 2024.

Comments: 100 pages

arXiv:2402.10876 [pdf, other]

Accelerating Sparse DNNs Based on Tiled GEMM

Authors: Cong Guo, Fengchen Xue, Jingwen Leng, Yuxian Qiu, Yue Guan, Weihao Cui, Quan Chen, Minyi Guo

Abstract: Network pruning can reduce the computation cost of deep neural network (DNN) models. However, sparse models often produce randomly-distributed weights to maintain accuracy, leading to irregular computations. Consequently, unstructured sparse models cannot achieve meaningful speedup on commodity hardware built for dense matrix computations. Accelerators are usually modified or designed with structured sparsity-optimized architectures for exploiting sparsity. For example, the Ampere architecture introduces a sparse tensor core, which adopts the 2:4 sparsity pattern. We propose a pruning method that builds upon the insight that matrix multiplication generally breaks the large matrix into multiple smaller tiles for parallel execution. We present the tile-wise sparsity pattern, which maintains a structured sparsity pattern at the tile level for efficient execution but allows for irregular pruning at the global scale to maintain high accuracy. In addition, the tile-wise sparsity is implemented at the global memory level, and the 2:4 sparsity executes at the register level inside the sparse tensor core. We can combine these two patterns into a tile-vector-wise (TVW) sparsity pattern to explore more fine-grained sparsity and further accelerate the sparse DNN models. We evaluate the TVW on the GPU, achieving averages of $1.85\times$, $2.75\times$, and $22.18\times$ speedups over the dense model, block sparsity, and unstructured sparsity.

Submitted 16 February, 2024; originally announced February 2024.

Comments: Accepted by IEEE Transactions on Computers. arXiv admin note: substantial text overlap with arXiv:2008.13006
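
The 2:4 sparsity pattern mentioned in the abstract above keeps at most two nonzero values in every group of four consecutive weights. The following is a minimal sketch of that pattern only, assuming magnitude-based selection; the function name `prune_2_of_4` is hypothetical, and this is not the paper's tile-wise or TVW pruning method.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four.

    Illustrative only: magnitude-based selection is an assumption,
    not necessarily the criterion used by the paper.
    """
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


weights = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.4, 0.01]
print(prune_2_of_4(weights))
# -> [0.9, 0.0, 0.0, -0.7, 0.0, 0.3, -0.4, 0.0]
```

Every group of four keeps exactly two values, which is the regularity that makes the pattern hardware-friendly.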

arXiv:2401.13472 [pdf, other]

Segmenting Cardiac Muscle Z-disks with Deep Neural Networks

Authors: Mihaela Croitor Ibrahim, Nishant Ravikumar, Alistair Curd, Joanna Leng, Oliver Umney, Michelle Peckham

Abstract: Z-disks are complex structures that delineate repeating sarcomeres in striated muscle. They play significant roles in cardiomyocytes such as providing mechanical stability for the contracting sarcomere, cell signalling and autophagy. Changes in Z-disk architecture have been associated with impaired cardiac function. Hence, there is a strong need to create tools to segment Z-disks from microscopy images that overcome traditional limitations such as variability in image brightness and staining technique. In this study, we apply deep learning based segmentation models to extract Z-disks in images of striated muscle tissue. We leverage a novel Airyscan confocal dataset, which comprises high resolution images of Z-disks of healthy heart tissue, stained with Affimers for specific Z-disk proteins. We employed an interactive labelling tool, Ilastik, to obtain ground truth segmentation masks and use the resulting data set to train and evaluate the performance of several state-of-the-art segmentation networks. On the test set, UNet++ achieves best segmentation performance for Z-disks in cardiomyocytes, with an average Dice score of 0.91 and outperforms other established segmentation methods including UNet, FPN, DeepLabv3+ and pix2pix. However, pix2pix demonstrates improved generalisation when tested on an additional dataset of cardiomyocytes with a titin mutation. This is the first study to demonstrate that automated machine learning-based segmentation approaches may be used effectively to segment Z-disks in confocal microscopy images. Automated segmentation approaches and predicted segmentation masks could be used to derive morphological features of Z-disks (e.g. width and orientation), and subsequently, to quantify disease-related changes to cardiac microstructure.

Submitted 24 January, 2024; originally announced January 2024.
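
The Dice score reported in the abstract above measures overlap between a predicted mask and a ground-truth mask as twice the intersection divided by the sum of the two mask sizes. The following is a minimal sketch of that standard formula on flat binary masks; the function name `dice_score` and the convention of returning 1.0 for two empty masks are assumptions made for illustration, not details taken from the paper.

```python
def dice_score(pred, truth):
    """Dice coefficient for two flat binary masks of equal length."""
    if len(pred) != len(truth):
        raise ValueError("masks must have the same length")
    # Pixels marked foreground in both masks.
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in truth if t)
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2.0 * intersection / total


pred = [1, 1, 0, 0]
truth = [1, 0, 0, 0]
print(dice_score(pred, truth))  # 2 * 1 / (2 + 1)
```

A score of 1.0 means the masks coincide exactly, and 0.0 means they share no foreground pixels.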

arXiv:2401.08550 [pdf, other]

Expanding Hardware-Efficiently Manipulable Hilbert Space via Hamiltonian Embedding

Authors: Jiaqi Leng, Joseph Li, Yuxiang Peng, Xiaodi Wu

Abstract: Many promising quantum applications depend on the efficient quantum simulation of an exponentially large sparse Hamiltonian, a task known as sparse Hamiltonian simulation, which is fundamentally important in quantum computation. Although several theoretically appealing quantum algorithms have been proposed for this task, they typically require a black-box query model of the sparse Hamiltonian, rendering them impractical for near-term implementation on quantum devices. In this paper, we propose a technique named Hamiltonian embedding. This technique simulates a desired sparse Hamiltonian by embedding it into the evolution of a larger and more structured quantum system, allowing for more efficient simulation through hardware-efficient operations. We conduct a systematic study of this new technique and demonstrate significant savings in computational resources for implementing prominent quantum applications. As a result, we can now experimentally realize quantum walks on complicated graphs (e.g., binary trees, glued-tree graphs), quantum spatial search, and the simulation of real-space Schrödinger equations on current trapped-ion and neutral-atom platforms. Given the fundamental role of Hamiltonian evolution in the design of quantum algorithms, our technique markedly expands the horizon of implementable quantum advantages in the NISQ era.

Submitted 16 January, 2024; originally announced January 2024.

Comments: 68 pages, 10 figures, an accompanying GitHub repository is at https://github.com/jiaqileng/hamiltonian-embedding

arXiv:2401.08156 [pdf, other]

GMLake: Efficient and Transparent GPU Memory Defragmentation for Large-scale DNN Training with Virtual Memory Stitching

Authors: Cong Guo, Rui Zhang, Jiale Xu, Jingwen Leng, Zihan Liu, Ziyu Huang, Minyi Guo, Hao Wu, Shouren Zhao, Junping Zhao, Ke Zhang

Abstract: Large-scale deep neural networks (DNNs), such as large language models (LLMs), have revolutionized the artificial intelligence (AI) field and become increasingly popular. However, training or fine-tuning such models requires substantial computational power and resources, where the memory capacity of a single acceleration device like a GPU is one of the most important bottlenecks. Owing to the prohibitively large overhead (e.g., $10 \times$) of GPUs' native memory allocator, DNN frameworks like PyTorch and TensorFlow adopt a caching allocator that maintains a memory pool with a splitting mechanism for fast memory (de)allocation. Unfortunately, the caching allocator's efficiency degrades quickly for popular memory reduction techniques such as recomputation, offloading, distributed training, and low-rank adaptation. The primary reason is that those memory reduction techniques introduce frequent and irregular memory (de)allocation requests, leading to severe fragmentation problems for the splitting-based caching allocator. To mitigate this fragmentation problem, we propose a novel memory allocation framework based on low-level GPU virtual memory management called GPU memory lake (GMLake). GMLake employs a novel virtual memory stitching (VMS) mechanism, which can fuse or combine non-contiguous memory blocks with a virtual memory address mapping. GMLake can reduce an average of 9.2 GB (up to 25 GB) GPU memory usage and 15% (up to 33%) fragmentation among eight LLM models on GPU A100 with 80 GB memory. GMLake is completely transparent to the DNN models and memory reduction techniques and ensures the seamless execution of resource-intensive deep-learning tasks. We have open-sourced GMLake at https://github.com/intelligent-machine-learning/glake/tree/main/GMLake.

Submitted 16 January, 2024; originally announced January 2024.

Comments: Accepted by ASPLOS24

arXiv:2312.10776 [pdf, ps, other]

Improved bounds for five-term arithmetic progressions

Authors: James Leng, Ashwin Sah, Mehtaab Sawhney

Abstract: Let $r_5(N)$ be the largest cardinality of a set in $\{1,\ldots,N\}$ which does not contain $5$ elements in arithmetic progression. Then there exists a constant $c\in (0,1)$ such that \[r_5(N)\ll \frac{N}{\exp((\log\log N)^{c})}.\] Our work is a consequence of recent improved bounds on the $U^4$-inverse theorem of the first author and the fact that $3$-step nilsequences may be approximated by locally cubic functions on shifted Bohr sets. This combined with the density increment strategy of Heath-Brown and Szemerédi, codified by Green and Tao, gives the desired result.

Submitted 10 April, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: 35 pages, comments welcome!

arXiv:2312.10772 [pdf, ps, other]

Efficient Equidistribution of Nilsequences

Authors: James Leng

Abstract: We give improved bounds for the equidistribution of (multiparameter) nilsequences subject to any degree filtration. The bounds we obtain are single exponential in dimension, improving on double exponential bounds of Green and Tao. To obtain these bounds, we avoid "induction of dimension" which is ubiquitous throughout higher order Fourier analysis. These improved equidistribution results are a crucial ingredient in joint work with Sah and Sawhney proving quasi-polynomial bounds for $U^{s + 1}[N]$ inverse theorem and proving bounds for linear equations in the primes which save an arbitrary power of logarithm.

Submitted 27 February, 2024; v1 submitted 17 December, 2023; originally announced December 2023.

Comments: 56 pages, comments welcome! v4. Reorganized content from arXiv:2306.13820 and updated citations

arXiv:2312.01712 [pdf, other]

JUNO: Optimizing High-Dimensional Approximate Nearest Neighbour Search with Sparsity-Aware Algorithm and Ray-Tracing Core Mapping

Authors: Zihan Liu, Wentao Ni, Jingwen Leng, Yu Feng, Cong Guo, Quan Chen, Chao Li, Minyi Guo, Yuhao Zhu

Abstract: Approximate nearest neighbor (ANN) search is a widely applied technique in modern intelligent applications, such as recommendation systems and vector databases. Therefore, efficient and high-throughput execution of ANN search has become increasingly important. In this paper, we first characterize the state-of-the-art product quantization-based method of ANN search and identify a significant source of inefficiency in the form of unnecessary pairwise distance calculations and accumulations. To improve efficiency, we propose JUNO, an end-to-end ANN search system that adopts a carefully designed sparsity- and locality-aware search algorithm. We also present an efficient hardware mapping that utilizes ray tracing cores in modern GPUs with pipelined execution on tensor cores to execute our sparsity-aware ANN search algorithm. Our evaluations on four datasets ranging in size from 1 to 100 million search points demonstrate 2.2x-8.5x improvements in search throughput. Moreover, our algorithmic enhancements alone achieve a maximal 2.6x improvement on the hardware without the acceleration of the RT core.

Submitted 4 December, 2023; originally announced December 2023.

arXiv:2311.15145 [pdf, other]

Choosing Wisely and Learning Deeply: Selective Cross-Modality Distillation via CLIP for Domain Generalization

Authors: Jixuan Leng, Yijiang Li, Haohan Wang

Abstract: Domain Generalization (DG), a crucial research area, seeks to train models across multiple domains and test them on unseen ones. In this paper, we introduce a novel approach, namely, Selective Cross-Modality Distillation for Domain Generalization (SCMD). SCMD leverages the capabilities of large vision-language models, specifically CLIP, to train a more efficient model, ensuring it acquires robust generalization capabilities across unseen domains. Our primary contribution is a unique selection framework strategically designed to identify hard-to-learn samples for distillation. In parallel, we introduce a novel cross-modality module that seamlessly combines the projected features of the student model with the text embeddings from CLIP, ensuring the alignment of similarity distributions. We assess SCMD's performance on various benchmarks, where it empowers a ResNet50 to deliver state-of-the-art performance, surpassing existing domain generalization methods. Furthermore, we provide a theoretical analysis of our selection strategy, offering deeper insight into its effectiveness and potential in the field of DG.

Submitted 21 April, 2024; v1 submitted 25 November, 2023; originally announced November 2023.

arXiv:2311.08217 [pdf, other]

Peer is Your Pillar: A Data-unbalanced Conditional GANs for Few-shot Image Generation

Authors: Ziqiang Li, Chaoyue Wang, Xue Rui, Chao Xue, Jiaxu Leng, Bin Li

Abstract: Few-shot image generation aims to train generative models using a small number of training images. When there are few images available for training (e.g. 10 images), Learning From Scratch (LFS) methods often generate images that closely resemble the training data while Transfer Learning (TL) methods try to improve performance by leveraging prior knowledge from GANs pre-trained on large-scale datasets. However, current TL methods may not allow for sufficient control over the degree of knowledge preservation from the source model, making them unsuitable for setups where the source and target domains are not closely related. To address this, we propose a novel pipeline called Peer is your Pillar (PIP), which combines a target few-shot dataset with a peer dataset to create a data-unbalanced conditional generation. Our approach includes a class embedding method that separates the class space from the latent space, and we use a direction loss based on pre-trained CLIP to improve image diversity. Experiments on various few-shot datasets demonstrate the advancement of the proposed PIP, which in particular reduces the training requirements of few-shot image generation.

Submitted 14 November, 2023; originally announced November 2023.

Comments: Under Review

arXiv:2311.08134 [pdf, other]

Applying hybrid clustering in pulsar candidate sifting with multi-modality for FAST survey

Authors: Zi-Yi You, Yun-Rong Pan, Zhi Ma, Li Zhang, Shuo Xiao, Dan-Dan Zhang, Shi-Jun Dang, Ru-Shuang Zhao, Pei Wang, Ai-Jun Dong, Jia-Tao Jiang, Ji-Bing Leng, Wei-An Li, Si-Yao Li

Abstract: Pulsar search is the basis of pulsar navigation, gravitational wave detection and other research topics. Currently, the volume of pulsar candidates collected by the Five-hundred-meter Aperture Spherical radio Telescope (FAST) is growing explosively, which has brought challenges for its pulsar candidate filtering system. In particular, the multi-view heterogeneous data and the class imbalance between true pulsars and non-pulsar candidates have negative effects on traditional single-modal supervised classification methods. In this study, a multi-modal and semi-supervised learning based pulsar candidate sifting algorithm is presented, which adopts a hybrid ensemble clustering scheme of density-based and partition-based methods combined with a feature-level fusion strategy for input data and a data partition strategy for parallelization. Experiments on both HTRU (The High Time Resolution Universe Survey) 2 and FAST actual observation data demonstrate that the proposed algorithm identifies pulsars well: on HTRU2, the precision and recall rates of its parallel mode reach 0.981 and 0.988; on FAST data, those of its parallel mode reach 0.891 and 0.961. Meanwhile, the running time decreases significantly as the number of parallel nodes increases, within limits. We therefore conclude that our algorithm could be a feasible approach for large-scale pulsar candidate sifting in FAST drift scan observations.

Submitted 14 November, 2023; originally announced November 2023.

arXiv:2311.07102 [pdf, other]

Fovea Transformer: Efficient Long-Context Modeling with Structured Fine-to-Coarse Attention

Authors: Ziwei He, Jian Yuan, Le Zhou, Jingwen Leng, Bo Jiang

Abstract: The quadratic complexity of self-attention in Transformers has hindered the processing of long text. To alleviate this problem, previous works have proposed to sparsify the attention matrix, taking advantage of the observation that crucial information about a token can be derived from its neighbors. These methods typically combine one or another form of local attention and global attention. Such combinations introduce abrupt changes in contextual granularity when going from local to global, which may be undesirable. We believe that a smoother transition could potentially enhance the model's ability to capture long-context dependencies. In this study, we introduce Fovea Transformer, a long-context focused transformer that addresses the challenges of capturing global dependencies while maintaining computational efficiency. To achieve this, we construct a multi-scale tree from the input sequence, and use representations of context tokens with a progressively coarser granularity in the tree, as their distance to the query token increases. We evaluate our model on three long-context summarization tasks (our code is publicly available at https://github.com/ZiweiHe/Fovea-Transformer). It achieves state-of-the-art performance on two of them, and competitive results on the third with mixed improvement and setback of the evaluation metrics.

Submitted 11 January, 2024; v1 submitted 13 November, 2023; originally announced November 2023.

arXiv:2311.03977 [pdf, ps, other]

A quantum central path algorithm for linear optimization

Authors: Brandon Augustino, Jiaqi Leng, Giacomo Nannicini, Tamás Terlaky, Xiaodi Wu

Abstract: We propose a novel quantum algorithm for solving linear optimization problems by quantum-mechanical simulation of the central path. While interior point methods follow the central path with an iterative algorithm that works with successive linearizations of the perturbed KKT conditions, we perform a single simulation working directly with the nonlinear complementarity equations. Combining our approach with iterative refinement techniques, we obtain an exact solution to a linear optimization problem involving $m$ constraints and $n$ variables using at most $\mathcal{O} \left( (m + n) \text{nnz} (A) \kappa(\mathcal{M}) L \cdot \text{polylog} \left(m, n, \kappa(\mathcal{M}) \right) \right)$ elementary gates and $\mathcal{O} \left( \text{nnz} (A) L \right)$ classical arithmetic operations, where $\text{nnz} (A)$ is the total number of non-zero elements found in the constraint matrix, $L$ denotes binary input length of the problem data, and $\kappa(\mathcal{M})$ is a condition number that depends only on the problem data.

Submitted 7 November, 2023; originally announced November 2023.

arXiv:2311.00811 [pdf, other]

A quantum-classical performance separation in nonconvex optimization

Authors: Jiaqi Leng, Yufan Zheng, Xiaodi Wu

Abstract: In this paper, we identify a family of nonconvex continuous optimization instances, each $d$-dimensional instance with $2^d$ local minima, to demonstrate a quantum-classical performance separation. Specifically, we prove that the recently proposed Quantum Hamiltonian Descent (QHD) algorithm [Leng et al., arXiv:2303.01471] is able to solve any $d$-dimensional instance from this family using $\widetilde{\mathcal{O}}(d^3)$ quantum queries to the function value and $\widetilde{\mathcal{O}}(d^4)$ additional 1-qubit and 2-qubit elementary quantum gates. On the other hand, a comprehensive empirical study suggests that representative state-of-the-art classical optimization algorithms/solvers (including Gurobi) would require super-polynomial time to solve such optimization instances.

Submitted 1 November, 2023; originally announced November 2023.

Comments: 32 pages, 7 figures. More details of the original Quantum Hamiltonian Descent (QHD) algorithm can be found at arXiv:2303.01471

arXiv:2310.17952 [pdf, other]

Shape-centered Representation Learning for Visible-Infrared Person Re-identification

Authors: Shuang Li, Jiaxu Leng, Ji Gan, Mengjingcheng Mo, Xinbo Gao

Abstract: Current Visible-Infrared Person Re-Identification (VI-ReID) methods prioritize extracting distinguishing appearance features, ignoring the natural resistance of body shape against modality changes. Initially, we gauged the discriminative potential of shapes by a straightforward concatenation of shape and appearance features. However, two unresolved issues persist in the utilization of shape features. One pertains to the dependence on auxiliary models for shape feature extraction in the inference phase, along with the errors in generated infrared shapes due to the intrinsic modality disparity. The other issue involves the inadequately explored correlation between shape and appearance features. To tackle the aforementioned challenges, we propose the Shape-centered Representation Learning framework (ScRL), which focuses on learning shape features and appearance features associated with shapes. Specifically, we devise the Shape Feature Propagation (SFP), facilitating direct extraction of shape features from original images with minimal complexity costs during inference. To restitute inaccuracies in infrared body shapes at the feature level, we present the Infrared Shape Restitution (ISR). Furthermore, to acquire appearance features related to shape, we design the Appearance Feature Enhancement (AFE), which accentuates identity-related features while suppressing identity-unrelated features guided by shape features. Extensive experiments are conducted to validate the effectiveness of the proposed ScRL.
Achieving remarkable results, the Rank-1 (mAP) accuracy attains 76.1%, 71.2%, 92.4% (72.6%, 52.9%, 86.7%) on the SYSU-MM01, HITSZ-VCM, RegDB datasets respectively, outperforming existing state-of-the-art methods.

Submitted 29 October, 2023; v1 submitted 27 October, 2023; originally announced October 2023.

arXiv:2310.15725 [pdf, other]

Ranking-based Adaptive Query Generation for DETRs in Crowded Pedestrian Detection

Authors: Feng Gao, Jiaxu Leng, Ji Gan, Xinbo Gao

Abstract: DEtection TRansformer (DETR) and its variants (DETRs) have been successfully applied to crowded pedestrian detection, which achieved promising performance. However, we find that, in different degrees of crowded scenes, the number of DETRs' queries must be adjusted manually, otherwise, the performance would degrade to varying degrees. In this paper, we first analyze the two current query generation methods and summarize four guidelines for designing the adaptive query generation method. Then, we propose Rank-based Adaptive Query Generation (RAQG) to alleviate the problem. Specifically, we design a rank prediction head that can predict the rank of the lowest confidence positive training sample produced by the encoder. Based on the predicted rank, we design an adaptive selection method that can adaptively select coarse detection results produced by the encoder to generate queries. Moreover, to train the rank prediction head better, we propose Soft Gradient L1 Loss. The gradient of Soft Gradient L1 Loss is continuous, which can describe the relationship between the loss value and the updated value of model parameters granularly. Our method is simple and effective, which can be plugged into any DETRs to make it query-adaptive in theory. The experimental results on Crowdhuman dataset and Citypersons dataset show that our method can adaptively generate queries for DETRs and achieve competitive results. Especially, our method achieves state-of-the-art 39.4% MR on Crowdhuman dataset.

Submitted 8 January, 2024; v1 submitted 24 October, 2023; originally announced October 2023.

Comments: 10 pages, 6 figures

arXiv:2308.08174 [pdf, other]

Accelerating Generic Graph Neural Networks via Architecture, Compiler, Partition Method Co-Design

Authors: Shuwen Lu, Zhihui Zhang, Cong Guo, Jingwen Leng, Yangjie Zhou, Minyi Guo

Abstract: Graph neural networks (GNNs) have shown significant accuracy improvements in a variety of graph learning domains, sparking considerable research interest. To translate these accuracy improvements into practical applications, it is essential to develop high-performance and efficient hardware acceleration for GNN models. However, designing GNN accelerators faces two fundamental challenges: the high bandwidth requirement of GNN models and the diversity of GNN models. Previous works have addressed the first challenge by using more expensive memory interfaces to achieve higher bandwidth. For the second challenge, existing works either support specific GNN models or have generic designs with poor hardware utilization. In this work, we tackle both challenges simultaneously. First, we identify a new type of partition-level operator fusion, which we utilize to internally reduce the high bandwidth requirement of GNNs. Next, we introduce partition-level multi-threading to schedule the concurrent processing of graph partitions, utilizing different hardware resources. To further reduce the extra on-chip memory required by multi-threading, we propose fine-grained graph partitioning to generate denser graph partitions. Importantly, these three methods make no assumptions about the targeted GNN models, addressing the challenge of model variety. We implement these methods in a framework called SwitchBlade, consisting of a compiler, a graph partitioner, and a hardware accelerator.
Our evaluation demonstrates that SwitchBlade achieves an average speedup of $1.85\times$ and energy savings of $19.03\times$ compared to the NVIDIA V100 GPU. Additionally, SwitchBlade delivers performance comparable to state-of-the-art specialized accelerators.

Submitted 16 August, 2023; originally announced August 2023.

arXiv:2307.09146 [pdf, other]

PRO-Face S: Privacy-preserving Reversible Obfuscation of Face Images via Secure Flow

Authors: Lin Yuan, Kai Liang, Xiao Pu, Yan Zhang, Jiaxu Leng, Tao Wu, Nannan Wang, Xinbo Gao

Abstract: This paper proposes a novel paradigm for facial privacy protection that unifies multiple characteristics including anonymity, diversity, reversibility and security within a single lightweight framework. We name it PRO-Face S, short for Privacy-preserving Reversible Obfuscation of Face images via Secure flow-based model. In the framework, an Invertible Neural Network (INN) is utilized to process the input image along with its pre-obfuscated form, and generate the privacy-protected image that visually approximates the pre-obfuscated one, thus ensuring privacy. The pre-obfuscation applied can take diverse forms, with different strengths and styles specified by users. Along with protection, a secret key is injected into the network such that the original image can only be recovered from the protected image via the same model given the correct key. Two modes of image recovery are devised to deal with malicious recovery attempts in different scenarios. Finally, extensive experiments conducted on three public image datasets demonstrate the superiority of the proposed framework over multiple state-of-the-art approaches.

Submitted 18 July, 2023; originally announced July 2023.

arXiv:2306.13820 [pdf, ps, other]

Efficient equidistribution of periodic nilsequences and applications

Authors: James Leng

Abstract: This is a companion paper to arXiv:2312.10772. We deduce an equidistribution theorem for periodic nilsequences and use this theorem to give two applications in arithmetic combinatorics. The first application is quasi-polynomial bounds for a certain complexity one polynomial progression, improving the iterated logarithm bound previously obtained. The second application is a proof of the quasi-polynomial $U^4[N]$ inverse theorem. In work with Sah and Sawhney, we obtain improved bounds for sets lacking nontrivial $5$-term arithmetic progressions.

Submitted 27 February, 2024; v1 submitted 23 June, 2023; originally announced June 2023.

Comments: 50 pages, comments welcome! v5. Reorganized content from arXiv:2312.10772

arXiv:2306.11043 [pdf, other]

DFlow: Efficient Dataflow-based Invocation Workflow Execution for Function-as-a-Service

Authors: Xiaoxiang Shi, Chao Li, Zijun Li, Zihan Liu, Dianmo Sheng, Quan Chen, Jingwen Leng, Minyi Guo

Abstract: Serverless computing is becoming increasingly popular due to its ease of use and fine-grained billing. These features make it appealing for stateful applications and serverless workflows. However, current serverless workflow systems utilize a controlflow-based invocation pattern to invoke functions. In this execution pattern, the function invocation depends on the state of the function. A function can only begin executing once all its precursor functions have completed. As a result, this pattern may lead to longer end-to-end execution time. We design and implement DFlow, a novel dataflow-based serverless workflow system that achieves high performance for serverless workflows. DFlow introduces a distributed scheduler (DScheduler) that uses the dataflow-based invocation pattern to invoke functions. In this pattern, the function invocation depends on the data dependency between functions. A function can start to execute even if its precursor functions are still running. DFlow further features a distributed store (DStore) that utilizes effective fine-grained optimization techniques to eliminate function interaction, thereby enabling efficient data exchange. With the support of DScheduler and DStore, DFlow can achieve an average improvement of 60% over CFlow, 40% over FaaSFlow, 25% over FaaSFlowRedis, and 40% over KNIX on 99%-ile latency respectively. Further, it can improve network bandwidth utilization by 2x-4x over CFlow and 1.5x-3x over FaaSFlow, FaaSFlowRedis and KNIX, respectively.
DFlow effectively reduces the cold startup latency, achieving an average improvement of 5.6x over CFlow and 1.1x over FaaSFlow.

Submitted 4 July, 2023; v1 submitted 19 June, 2023; originally announced June 2023.

Comments: 22 pages, 13 figures

arXiv:2306.10855 [pdf, other]

Quantum Advantage of Noisy Grover's Algorithm

Authors: Jian Leng, Fan Yang, Xiang-Bin Wang

Abstract: Quantum advantage is the core of quantum computing. Grover's search algorithm is the only quantum algorithm with proven advantage over any possible classical search algorithm. However, realizing this quantum advantage in practice is quite challenging since Grover's algorithm is very sensitive to noise. Here we present a noise-tolerant method that exponentially improves the noise threshold of Grover's algorithm. We present a lower bound for average fidelity of any quantum circuit with O(log D log D) cost under time-independent noise, where D is the dimension of Hilbert space. According to this bound value, we determine the number of iterations to be applied in Grover's algorithm. Numerical simulation shows that, with our noise-tolerant method, the noise threshold for quantum advantage of Grover's algorithm improves by a factor that grows exponentially with the number of qubits.

Submitted 19 June, 2023; originally announced June 2023.

Comments: 6 figures

arXiv:2306.08423 [pdf, other]

DistSim: A performance model of large-scale hybrid distributed DNN training

Authors: Guandong Lu, Runzhe Chen, Yakai Wang, Yangjie Zhou, Rui Zhang, Zheng Hu, Yanming Miao, Zhifang Cai, Li Li, Jingwen Leng, Minyi Guo

Abstract: With the ever-increasing computational demand of DNN training workloads, distributed training has been widely adopted. A combination of data, model and pipeline parallelism strategies, called hybrid parallelism distributed training, has been introduced to tackle the problem of deploying large-scale models. However, how to evaluate the hybrid strategy and the utilization of each device remains a challenge since existing works either profile on a real large-scale cluster with high time and money costs or only analyze a specific type of parallelism without considering the hybrid parallelism. In this work, we propose DistSim, an event-based performance model to accurately analyze each device's computation and communication activities with low profiling costs. DistSim breaks down the model into events according to the given distributed strategy, which can be profiled on two nodes. Then DistSim leverages the hierarchy of different parallel strategies to generate the computation and communication event-flow from layer level to model level and finally the activity timeline of each device participating in training. Experiments show that DistSim can reach <4% errors when predicting distributed training batch time and <5% errors when predicting a single device's activity time in various hybrid strategy settings. We also provide a use case of DistSim, automatically evaluating and searching for the best distributed training strategy, and find a hybrid strategy with at most $7.37\times$ throughput improvement.

Submitted 14 June, 2023; originally announced June 2023.

arXiv:2305.17408 [pdf, other]

AdaptGear: Accelerating GNN Training via Adaptive Subgraph-Level Kernels on GPUs

Authors: Yangjie Zhou, Yaoxu Song, Jingwen Leng, Zihan Liu, Weihao Cui, Zhendong Zhang, Cong Guo, Quan Chen, Li Li, Minyi Guo

Abstract: Graph neural networks (GNNs) are powerful tools for exploring and learning from graph structures and features. As such, achieving high-performance execution for GNNs becomes crucially important. Prior works have proposed to explore the sparsity (i.e., low density) in the input graph to accelerate GNNs, which uses the full-graph-level or block-level sparsity format. We show that they fail to balance the sparsity benefit and kernel execution efficiency. In this paper, we propose a novel system, referred to as AdaptGear, that addresses the challenge of optimizing GNNs performance by leveraging kernels tailored to the density characteristics at the subgraph level. Meanwhile, we also propose a method that dynamically chooses the optimal set of kernels for a given input graph. Our evaluation shows that AdaptGear can achieve a significant performance improvement, up to $6.49 \times$ ($1.87 \times$ on average), over the state-of-the-art works on two mainstream NVIDIA GPUs across various datasets.

Submitted 27 May, 2023; originally announced May 2023.

arXiv:2305.15099 [pdf, other]

doi 10.18653/v1/2023.findings-acl.570

Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

Authors: Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, Zhouhan Lin

Abstract: The transformer model is known to be computationally demanding, and prohibitively costly for long sequences, as the self-attention module uses a quadratic time and space complexity with respect to sequence length. Many researchers have focused on designing new forms of self-attention or introducing new parameters to overcome this limitation; however, a large portion of these approaches prevent the model from inheriting weights from large pretrained models. In this work, the transformer's inefficiency is addressed from another perspective. We propose Fourier Transformer, a simple yet effective approach that progressively removes redundancies in the hidden sequence using the ready-made Fast Fourier Transform (FFT) operator to perform Discrete Cosine Transformation (DCT). Fourier Transformer is able to significantly reduce computational costs while retaining the ability to inherit from various large pretrained models. Experiments show that our model achieves state-of-the-art performance among all transformer-based models on the long-range modeling benchmark LRA with significant improvement in both speed and space. For generative seq-to-seq tasks including CNN/DailyMail and ELI5, by inheriting the BART weights our model outperforms the standard BART and other efficient models. Our code is publicly available at https://github.com/LUMIA-Group/FourierTransformer.

Submitted 24 May, 2023; originally announced May 2023.

Journal ref: Findings of the Association for Computational Linguistics: ACL 2023

arXiv:2305.15018 [pdf, other]

Modifying $n$-qubit controlled-$ZX$ gate to be $n$-qubit Toffoli gate

Authors: Jian Leng, Fan Yang, Xiang-Bin Wang

Abstract: The decomposition for controlled-$ZX$ gate in [Phys. Rev. A, 87, 062318 (2013)] has a shallow circuit depth $8n-20$ with no ancilla. Here we modify this decomposition to decompose $n$-qubit Toffoli gate with only $2n-3$ additional single-qubit gates. The circuit depth is unchanged and no ancilla is needed. We explicitly show that the circuit after decomposition can be easily constructed in present physical systems.

Submitted 24 May, 2023; originally announced May 2023.

Comments: 6 pages, 9 figures

arXiv:2305.12300 [pdf, other]

Improving D2p Grover's algorithm to reach performance upper bound under phase noise

Authors: Jian Leng, Fan Yang, Xiang-Bin Wang

Abstract: The original Grover's algorithm has a success probability to output a correct solution, while deterministic Grover's algorithms improve the success probability to 100%. However, the success probability of deterministic Grover's algorithm decreases in a noisy environment. Here we improve the deterministic two-parameter (D2p) Grover's algorithm to reach the upper bound for success probability under phase noise. We prove that it is not possible to design any deterministic Grover's algorithm whose success probability is higher than our improved D2p protocol's under phase noise.

Submitted 20 May, 2023; originally announced May 2023.

Comments: 7 pages, 8 figures

arXiv:2305.10801 [pdf, other]

Selecting Learnable Training Samples is All DETRs Need in Crowded Pedestrian Detection

Authors: Feng Gao, Jiaxu Leng, Gan Ji, Xinbo Gao

Abstract: DEtection TRansformer (DETR) and its variants (DETRs) achieved impressive performance in general object detection. However, in crowded pedestrian detection, the performance of DETRs is still unsatisfactory due to the inappropriate sample selection method which results in more false positives. To settle the issue, we propose a simple but effective sample selection method for DETRs, Sample Selection for Crowded Pedestrians (SSCP), which consists of the constraint-guided label assignment scheme (CGLA) and the utilizability-aware focal loss (UAFL). Our core idea is to select learnable samples for DETRs and adaptively regulate the loss weights of samples based on their utilizability. Specifically, in CGLA, we proposed a new cost function to ensure that only learnable positive training samples are retained and the rest are negative training samples. Further, considering the utilizability of samples, we designed UAFL to adaptively assign different loss weights to learnable positive samples depending on their gradient ratio and IoU. Experimental results show that the proposed SSCP effectively improves the baselines without introducing any overhead in inference. Especially, Iter Deformable DETR is improved to 39.7(-2.0)% MR on Crowdhuman and 31.8(-0.4)% MR on Citypersons.

Submitted 18 May, 2023; originally announced May 2023.

arXiv:2304.14846 [pdf]

doi 10.1021/acs.nanolett.3c00213

Ultrafast and Electrically Tunable Rabi Frequency in a Germanium Hut Wire Hole Spin Qubit

Authors: He Liu, Ke Wang, Fei Gao, Jin Leng, Yang Liu, Yu-Chen Zhou, Gang Cao, Ting Wang, Jianjun Zhang, Peihao Huang, Hai-Ou Li, Guo-Ping Guo

Abstract: Hole spin qubits based on germanium (Ge) have strong tunable spin orbit interaction (SOI) and ultrafast qubit operation speed. Here we report that the Rabi frequency (f_Rabi) of a hole spin qubit in a Ge hut wire (HW) double quantum dot (DQD) is electrically tuned through the detuning energy and middle gate voltage (V_M). f_Rabi gradually decreases with increasing detuning energy; on the contrary, f_Rabi is positively correlated with V_M. We attribute our results to the change of electric field on SOI and the contribution of the excited state in quantum dots to f_Rabi. We further demonstrate an ultrafast f_Rabi exceeding 1.2 GHz, which evidences the strong SOI in our device. The discovery of an ultrafast and electrically tunable f_Rabi in a hole spin qubit has potential applications in semiconductor quantum computing.

Submitted 28 April, 2023; originally announced April 2023.

Comments: 19 pages, 4 figures

Journal ref: Nano Letters 23, 3810-3817 (2023)

arXiv:2304.07493 [pdf, other]

doi 10.1145/3579371.3589038

OliVe: Accelerating Large Language Models via Hardware-friendly Outlier-Victim Pair Quantization

Authors: Cong Guo, Jiaming Tang, Weiming Hu, Jingwen Leng, Chen Zhang, Fan Yang, Yunxin Liu, Minyi Guo, Yuhao Zhu

Abstract: Transformer-based large language models (LLMs) have achieved great success with the growing model size. LLMs' size grows by $240\times$ every two years, which outpaces the hardware progress and makes model inference increasingly costly. Model quantization is a promising approach to mitigate the widening gap between LLM size and hardware capacity. However, the existence of outliers, values with significant magnitudes, in LLMs makes existing quantization methods less effective. Prior outlier-aware quantization schemes adopt sparsity encoding techniques to separate outliers from normal values where the process requires global coordination (e.g., a global sparsity coordination list). This incurs complex encoding/decoding hardware logic and an extra orchestration controller for the computation between outlier and normal values. As such, it is not hardware-efficient and hence only achieves sub-optimal quantization benefits. We propose OliVe, an algorithm/architecture co-designed solution that adopts an outlier-victim pair (OVP) quantization and handles outlier values locally with low hardware overheads and high performance gains. The key insight of OliVe is that outliers are important while the normal values next to them are not. Thus those normal values (called victims) can be sacrificed to accommodate outliers. This enables a memory-aligned OVP encoding scheme, which can be efficiently integrated into existing hardware accelerators like systolic arrays and tensor cores.
As a result, the OliVe-based accelerator surpasses the existing outlier-aware accelerator, GOBO, by 4.5$\times$ speedup and 4.0$\times$ energy reduction, respectively, with a superior model accuracy.

Submitted 15 April, 2023; originally announced April 2023.

Comments: ISCA 2023

arXiv:2304.03352 [pdf, other]

ImaGen: A General Framework for Generating Memory- and Power-Efficient Image Processing Accelerators

Authors: Nisarg Ujjainkar, Jingwen Leng, Yuhao Zhu

Abstract: Image processing algorithms are prime targets for hardware acceleration as they are commonly used in resource- and power-limited applications. Today's image processing accelerator designs make rigid assumptions about the algorithm structures and/or on-chip memory resources. As a result, they either have narrow applicability or result in inefficient designs. This paper presents a compiler framework that automatically generates memory- and power-efficient image processing accelerators. We allow programmers to describe generic image processing algorithms (in a domain specific language) and specify on-chip memory structures available. Our framework then formulates a constrained optimization problem that minimizes on-chip memory usage while maintaining theoretical maximum throughput. The key challenge we address is to analytically express the throughput bottleneck, on-chip memory contention, to enable a lightweight compilation. FPGA prototyping and ASIC synthesis show that, compared to existing approaches, accelerators generated by our framework reduce the on-chip memory usage and/or power consumption by double digits.

Submitted 6 April, 2023; originally announced April 2023.

arXiv:2303.01471 [pdf, other]

Quantum Hamiltonian Descent

Authors: Jiaqi Leng, Ethan Hickman, Joseph Li, Xiaodi Wu

Abstract: Gradient descent is a fundamental algorithm in both theory and practice for continuous optimization. Identifying its quantum counterpart would be appealing to both theoretical and practical quantum applications. A conventional approach to quantum speedups in optimization relies on the quantum acceleration of intermediate steps of classical algorithms, while keeping the overall algorithmic trajectory and solution quality unchanged. We propose Quantum Hamiltonian Descent (QHD), which is derived from the path integral of dynamical systems referring to the continuous-time limit of classical gradient descent algorithms, as a truly quantum counterpart of classical gradient methods where the contribution from classically-prohibited trajectories can significantly boost QHD's performance for non-convex optimization. Moreover, QHD is described as a Hamiltonian evolution efficiently simulatable on both digital and analog quantum computers. By embedding the dynamics of QHD into the evolution of the so-called Quantum Ising Machine (including D-Wave and others), we empirically observe that the D-Wave-implemented QHD outperforms a selection of state-of-the-art gradient-based classical solvers and the standard quantum adiabatic algorithm, based on the time-to-solution metric, on non-convex constrained quadratic programming instances up to 75 dimensions. Finally, we propose a "three-phase picture" to explain the behavior of QHD, especially its difference from the quantum adiabatic algorithm.

Submitted 2 March, 2023; originally announced March 2023.

Comments: 71 pages, 13 figures, an accompanying website is at https://jiaqileng.github.io/quantum-hamiltonian-descent/

arXiv:2302.11708 [pdf, other]

The fractal uncertainty principle via Dolgopyat's method in higher dimensions

Authors: Aidan Backus, James Leng, Zhongkai Tao

Abstract: We prove a fractal uncertainty principle with exponent $\frac{d}{2} - δ + \varepsilon$, $\varepsilon > 0$, for Ahlfors--David regular subsets of $\mathbb R^d$ with dimension $δ$ which satisfy a suitable "nonorthogonality condition". This generalizes the application of Dolgopyat's method by Dyatlov--Jin (arXiv:1702.03619) to prove the same result in the special case $d = 1$. As a corollary, we get a quantitative spectral gap for the Laplacian on convex cocompact hyperbolic manifolds of arbitrary dimension with Zariski dense fundamental groups.

Submitted 9 October, 2023; v1 submitted 22 February, 2023; originally announced February 2023.

Comments: 33 pages, 5 figures, comments welcome. Contains corrections and improved graphics

MSC Class: 28A80; 35B34; 81Q50

arXiv:2212.09635 [pdf, other]

Improved quadratic Gowers uniformity for the Möbius function

Authors: James Leng

Abstract: We demonstrate that $$\|μ\|_{U^3([N])} \ll_{A}^{\text{ineff}} \log^{-A}(N)$$ $$\|Λ - Λ_Q\|_{U^3([N])} \ll_{A}^{\text{ineff}} \log^{-A}(N)$$ for any $A > 0$ where $Λ_Q$ is an approximant to the von Mangoldt function and will be defined below, improving upon a bound of Tao-Teräväinen (2021). As a consequence, among other things, we have the following: $$\mathbb{E}_{x, y \in [N], x + 3y \in [N]} Λ(x)Λ(x + y)Λ(x + 2y)Λ(x + 3y) = \mathfrak{S} + O_A(\log^{-A}(N))$$ where $\mathfrak{S}$ is the singular series for the configuration $(x, x + y, x + 2y, x + 3y)$. In fact, we show that $$\|μ - μ_{Siegel}\|_{U^3([N])} \ll \exp(-O(\log^{1/C}(N)))$$ $$\|Λ - Λ_{Siegel}\|_{U^3([N])} \ll \exp(-O(\log^{1/C}(N)))$$ where $μ_{Siegel}$ and $Λ_{Siegel}$ are approximants of $μ$ and $Λ$, respectively, representing the Siegel zero contribution of $μ$ and are defined in the above article. To do so, we use an improvement of the $U^3$ inverse theorem due to Sanders and we follow the approach of Green and Tao (2007), opting to use the "old-fashioned" approach to equidistribution on two-step nilmanifolds which was also considered by Green and Tao (2017), and by Gowers and Wolf (2010). To the author's knowledge, this is the first time that quadratic Fourier analysis over $\mathbb{Z}/N\mathbb{Z}$ has achieved quasi-polynomial type bounds in applications.

Submitted 20 March, 2023; v1 submitted 19 December, 2022; originally announced December 2022.

Comments: 38 pages. Comments welcome! v3: Fixed sections 1.2 and 8.1

arXiv:2211.10188 [pdf, other]

Piecewise Affine Curvature model: a reduced-order model for soft robot-environment interaction beyond PCC

Authors: Francesco Stella, Qinghua Guan, Jinsong Leng, Cosimo Della Santina, Josie Hughes

Abstract: Soft robots are celebrated for their propensity to enable compliant and complex robot-environment interactions. Soft robotic manipulators, or slender continuum structure robots, have the potential to exploit these interactions to enable new exploration and manipulation capabilities and safe human-robot interactions. However, the interactions, or perturbations by external forces, cause the soft structure to deform in an infinite degree of freedom (DOF) space. To control such systems, reduced-order models are needed; typically, models consider piecewise sections of constant curvature, although external forces often deform the structure out of the constant curvature hypothesis. In this work we perform an analysis of the trade-off between computational treatability and modelling accuracy. We then propose a new kinematic model, the Piecewise Affine Curvature (PAC) model, which we validate theoretically and experimentally, showing that this higher-order model better captures the configuration of a soft continuum body robot when perturbed by external forces. In comparison to the current state-of-the-art Piecewise Constant Curvature (PCC) model, we demonstrate up to 30\% reduction in error for the end position of a soft continuum body robot.

Submitted 18 November, 2022; originally announced November 2022.

Comments: Submitted to IEEE RoboSoft 2023

arXiv:2210.15972 [pdf, other]

Contextual Learning in Fourier Complex Field for VHR Remote Sensing Images

Authors: Yan Zhang, Xiyuan Gao, Qingyan Duan, Jiaxu Leng, Xiao Pu, Xinbo Gao

Abstract: Very high-resolution (VHR) remote sensing (RS) image classification is the fundamental task for RS image analysis and understanding. Recently, transformer-based models demonstrated outstanding potential for learning high-order contextual relationships from natural images with general resolution (224x224 pixels) and achieved remarkable results on general image classification tasks. However, the complexity of the naive transformer grows quadratically with the increase in image size, which prevents transformer-based models from being applied to VHR RS image (500x500 pixels) classification and other computationally expensive downstream tasks. To this end, we propose to decompose the expensive self-attention (SA) into real and imaginary parts via discrete Fourier transform (DFT) and therefore propose an efficient complex self-attention (CSA) mechanism. Benefiting from the conjugate symmetric property of DFT, CSA is capable of modeling the high-order contextual information with less than half the computation of naive SA. To overcome the gradient explosion in the Fourier complex field, we replace the Softmax function with the carefully designed Logmax function to normalize the attention map of CSA and stabilize the gradient propagation. By stacking various layers of CSA blocks, we propose the Fourier Complex Transformer (FCT) model to learn global contextual information from VHR aerial images in a hierarchical manner. Extensive experiments conducted on commonly used RS classification data sets demonstrate the effectiveness and efficiency of FCT, especially on very high-resolution RS images.

Submitted 28 October, 2022; originally announced October 2022.

arXiv:2210.15812 [pdf, other]

Differentiable Analog Quantum Computing for Optimization and Control

Authors: Jiaqi Leng, Yuxiang Peng, Yi-Ling Qiao, Ming Lin, Xiaodi Wu

Abstract: We formulate the first differentiable analog quantum computing framework with a specific parameterization design at the analog signal (pulse) level to better exploit near-term quantum devices via variational methods. We further propose a scalable approach to estimate the gradients of quantum dynamics using a forward pass with Monte Carlo sampling, which leads to a quantum stochastic gradient descent algorithm for scalable gradient-based training in our framework. Applying our framework to quantum optimization and control, we observe a significant advantage of differentiable analog quantum computing against SOTAs based on parameterized digital quantum circuits by orders of magnitude.

Submitted 27 October, 2022; originally announced October 2022.

Comments: Code available at https://github.com/YilingQiao/diffquantum

Journal ref: In the Proceedings of the 36th Conference on Neural Information Processing Systems (NeurIPS 2022)

arXiv:2209.10778 [pdf, other]

Nesting Forward Automatic Differentiation for Memory-Efficient Deep Neural Network Training

Authors: Cong Guo, Yuxian Qiu, Jingwen Leng, Chen Zhang, Ying Cao, Quanlu Zhang, Yunxin Liu, Fan Yang, Minyi Guo

Abstract: An activation function is an element-wise mathematical function and plays a crucial role in deep neural networks (DNN). Many novel and sophisticated activation functions have been proposed to improve the DNN accuracy, but they also consume massive memory in the training process with back-propagation. In this study, we propose the nested forward automatic differentiation (Forward-AD), specifically for the element-wise activation function, for memory-efficient DNN training. We deploy nested Forward-AD in two widely-used deep learning frameworks, TensorFlow and PyTorch, which support the static and dynamic computation graph, respectively. Our evaluation shows that nested Forward-AD reduces the memory footprint by up to 1.97x compared with the baseline model and outperforms recomputation by 20% under the same memory reduction ratio.

Submitted 22 September, 2022; originally announced September 2022.

Comments: 8 pages, ICCD 2022

arXiv:2208.14286 [pdf, other]

ANT: Exploiting Adaptive Numerical Data Type for Low-bit Deep Neural Network Quantization

Authors: Cong Guo, Chen Zhang, Jingwen Leng, Zihan Liu, Fan Yang, Yunxin Liu, Minyi Guo, Yuhao Zhu

Abstract: Quantization is a technique to reduce the computation and memory cost of DNN models, which are getting increasingly large. Existing quantization solutions use fixed-point integer or floating-point types, which have limited benefits, as both require more bits to maintain the accuracy of original models. On the other hand, variable-length quantization uses low-bit quantization for normal values and high-precision for a fraction of outlier values. Even though this line of work brings algorithmic benefits, it also introduces significant hardware overheads due to variable-length encoding and decoding. In this work, we propose a fixed-length adaptive numerical data type called ANT to achieve low-bit quantization with tiny hardware overheads. Our data type ANT leverages two key innovations to exploit the intra-tensor and inter-tensor adaptive opportunities in DNN models. First, we propose a particular data type, flint, that combines the advantages of float and int for adapting to the importance of different values within a tensor. Second, we propose an adaptive framework that selects the best type for each tensor according to its distribution characteristics. We design a unified processing element architecture for ANT and show its ease of integration with existing DNN accelerators. Our design results in 2.8$\times$ speedup and 2.5$\times$ energy efficiency improvement over the state-of-the-art quantization accelerators.

Submitted 30 August, 2022; originally announced August 2022.

Comments: 20 pages, accepted by MICRO 2022

arXiv:2208.11945 [pdf, other]

Efficient Adaptive Activation Rounding for Post-Training Quantization

Authors: Zhengyi Li, Cong Guo, Zhanda Zhu, Yangjie Zhou, Yuxian Qiu, Xiaotian Gao, Jingwen Leng, Minyi Guo

Abstract: Post-training quantization attracts increasing attention due to its convenience in deploying quantized neural networks. Although rounding-to-nearest remains the prevailing method for DNN quantization, prior research has demonstrated its suboptimal nature when applied to weight quantization. They propose optimizing weight rounding schemes by leveraging output error rather than the traditional weight quantization error. Our study reveals that similar rounding challenges also extend to activation quantization. Despite the easy generalization, the challenges lie in the dynamic nature of activation. Adaptive rounding is expected for varying activations and the method is subjected to runtime overhead. To tackle this, we propose the AQuant quantization framework with a novel perspective to reduce output error by adjusting rounding schemes of activations. Instead of using the constant rounding border 0.5 of the rounding-to-nearest operation, we make the border become a function w.r.t. the activation value to change the activation rounding by the adaptive border. To deal with the runtime overhead, we use a coarse-grained version of the border function. Finally, we introduce our framework to optimize the border function. Extensive experiments show that AQuant achieves notable improvements compared to state-of-the-art works and pushes the accuracy of ResNet-18 up to 60.31% under the 2-bit weight and activation quantization.

Submitted 23 August, 2023; v1 submitted 25 August, 2022; originally announced August 2022.

arXiv:2206.14550 [pdf, other]

SALO: An Efficient Spatial Accelerator Enabling Hybrid Sparse Attention Mechanisms for Long Sequences

Authors: Guan Shen, Jieru Zhao, Quan Chen, Jingwen Leng, Chao Li, Minyi Guo

Abstract: The attention mechanisms of transformers effectively extract pertinent information from the input sequence. However, the quadratic complexity of self-attention w.r.t. the sequence length incurs heavy computational and memory burdens, especially for tasks with long sequences. Existing accelerators face performance degradation in these tasks. To this end, we propose SALO to enable hybrid sparse attention mechanisms for long sequences. SALO contains a data scheduler to map hybrid sparse attention patterns onto hardware and a spatial accelerator to perform the efficient attention computation. We show that SALO achieves 17.66x and 89.33x speedup on average compared to GPU and CPU implementations, respectively, on typical workloads, i.e., Longformer and ViL.

Submitted 29 June, 2022; originally announced June 2022.

Comments: Accepted by 59th DAC

arXiv:2205.07324 [pdf, other]

Transkimmer: Transformer Learns to Layer-wise Skim

Authors: Yue Guan, Zhengyi Li, Jingwen Leng, Zhouhan Lin, Minyi Guo

Abstract: The Transformer architecture has become the de-facto model for many machine learning tasks from natural language processing and computer vision. As such, improving its computational efficiency becomes paramount. One of the major computational inefficiencies of Transformer-based models is that they spend the identical amount of computation throughout all layers. Prior works have proposed to augment the Transformer model with the capability of skimming tokens to improve its computational efficiency. However, they suffer from not having effectual and end-to-end optimization of the discrete skimming predictor. To address the above limitations, we propose the Transkimmer architecture, which learns to identify hidden state tokens that are not required by each layer. The skimmed tokens are then forwarded directly to the final output, thus reducing the computation of the successive layers. The key idea in Transkimmer is to add a parameterized predictor before each layer that learns to make the skimming decision. We also propose to adopt the reparameterization trick and add a skim loss for the end-to-end training of Transkimmer. Transkimmer achieves 10.97x average speedup on the GLUE benchmark compared with the vanilla BERT-base baseline with less than 1% accuracy degradation.

Submitted 15 May, 2022; originally announced May 2022.

Comments: Published as a conference paper at ACL 2022

arXiv:2205.05540 [pdf, other]

A Quantitative Bound For Szemerédi's Theorem for a Complexity One Polynomial Progression over $\mathbb{Z}/N\mathbb{Z}$

Authors: James Leng

Abstract: Let $N$ be a large prime and $P, Q \in \mathbb{Z}[x]$ two linearly independent polynomials with $P(0) = Q(0) = 0$. We show that if a subset $A$ of $\mathbb{Z}/N\mathbb{Z}$ lacks a progression of the form $(x, x + P(y), x + Q(y), x + P(y) + Q(y))$, then $$|A| \le O\left(\frac{N}{\log_{(O(1))}(N)}\right)$$ where $\log_{C}(N)$ is an iterated logarithm of order $C$ (e.g., $\log_{2}(N) = \log\log(N)$). To establish this bound, we adapt Peluse's (2018) degree lowering argument to the quadratic Fourier analysis setting to obtain quantitative bounds on the true complexity of the above progression. Our method also shows that for a large class of polynomial progressions, if one can establish polynomial-type bounds on the true complexity of those progressions, then one can establish polynomial-type bounds on Szemerédi's theorem for that type of polynomial progression.

Submitted 20 May, 2024; v1 submitted 11 May, 2022; originally announced May 2022.

Comments: 33 pages. Journal version

arXiv:2203.17006 [pdf, other]

doi 10.22331/q-2022-11-17-860

Quantum simulation of real-space dynamics

Authors: Andrew M. Childs, Jiaqi Leng, Tongyang Li, Jin-Peng Liu, Chenyi Zhang

Abstract: Quantum simulation is a prominent application of quantum computers. While there is extensive previous work on simulating finite-dimensional systems, less is known about quantum algorithms for real-space dynamics. We conduct a systematic study of such algorithms. In particular, we show that the dynamics of a $d$-dimensional Schrödinger equation with $η$ particles can be simulated with gate complexity $\tilde{O}\bigl(ηd F \text{poly}(\log(g'/ε))\bigr)$, where $ε$ is the discretization error, $g'$ controls the higher-order derivatives of the wave function, and $F$ measures the time-integrated strength of the potential. Compared to the best previous results, this exponentially improves the dependence on $ε$ and $g'$ from $\text{poly}(g'/ε)$ to $\text{poly}(\log(g'/ε))$ and polynomially improves the dependence on $T$ and $d$, while maintaining best known performance with respect to $η$. For the case of Coulomb interactions, we give an algorithm using $η^{3}(d+η)T\text{poly}(\log(ηdTg'/(Δε)))/Δ$ one- and two-qubit gates, and another using $η^{3}(4d)^{d/2}T\text{poly}(\log(ηdTg'/(Δε)))/Δ$ one- and two-qubit gates and QRAM operations, where $T$ is the evolution time and the parameter $Δ$ regulates the unbounded Coulomb interaction. We give applications to several computational problems, including faster real-space simulation of quantum chemistry, rigorous analysis of discretization error for simulation of a uniform electron gas, and a quadratic improvement to a quantum algorithm for escaping saddle points in nonconvex optimization.

Submitted 7 November, 2022; v1 submitted 31 March, 2022; originally announced March 2022.

Journal ref: Quantum 6, 860 (2022)

arXiv:2203.14101

A Roadmap for Big Model

Authors: Sha Yuan, Hanyu Zhao, Shuai Zhao, Jiahong Leng, Yangxiao Liang, Xiaozhi Wang, Jifan Yu, Xin Lv, Zhou Shao, Jiaao He, Yankai Lin, Xu Han, Zhenghao Liu, Ning Ding, Yongming Rao, Yizhao Gao, Liang Zhang, Ming Ding, Cong Fang, Yisen Wang, Mingsheng Long, Jing Zhang, Yinpeng Dong, Tianyu Pang, Peng Cui, et al. (75 additional authors not shown)

Abstract: With the rapid development of deep learning, training Big Models (BMs) for multiple downstream tasks has become a popular paradigm. Researchers have achieved various outcomes in the construction of BMs and the application of BMs in many fields. At present, there is a lack of research work that sorts out the overall progress of BMs and guides follow-up research. In this paper, we cover not only the BM technologies themselves but also the prerequisites for BM training and applications with BMs, dividing the BM review into four parts: Resource, Models, Key Technologies and Application. We introduce 16 specific BM-related topics in those four parts: Data, Knowledge, Computing System, Parallel Training System, Language Model, Vision Model, Multi-modal Model, Theory&Interpretability, Commonsense Reasoning, Reliability&Security, Governance, Evaluation, Machine Translation, Text Generation, Dialogue and Protein Research. In each topic, we summarize the current studies and propose some future research directions. At the end of this paper, we discuss the further development of BMs from a more general view.

Submitted 20 April, 2022; v1 submitted 26 March, 2022; originally announced March 2022.

Comments: This report has been withdrawn by the authors due to critical issues in Section 2.3.1 of Article 2

Showing 1–50 of 107 results for author: Leng, J