
Showing 1–2 of 2 results for author: Bikshandi, G

  1. arXiv:2407.08608  [pdf, other]

    cs.LG cs.AI

    FlashAttention-3: Fast and Accurate Attention with Asynchrony and Low-precision

    Authors: Jay Shah, Ganesh Bikshandi, Ying Zhang, Vijay Thakkar, Pradeep Ramani, Tri Dao

    Abstract: Attention, as a core layer of the ubiquitous Transformer architecture, is the bottleneck for large language models and long-context applications. FlashAttention elaborated an approach to speed up attention on GPUs through minimizing memory reads/writes. However, it has yet to take advantage of new capabilities present in recent hardware, with FlashAttention-2 achieving only 35% utilization on the…

    Submitted 11 July, 2024; originally announced July 2024.

  2. arXiv:2312.11918  [pdf, other]

    cs.LG cs.DC

    A Case Study in CUDA Kernel Fusion: Implementing FlashAttention-2 on NVIDIA Hopper Architecture using the CUTLASS Library

    Authors: Ganesh Bikshandi, Jay Shah

    Abstract: We provide an optimized implementation of the forward pass of FlashAttention-2, a popular memory-aware scaled dot-product attention algorithm, as a custom fused CUDA kernel targeting NVIDIA Hopper architecture and written using the open-source CUTLASS library. In doing so, we explain the challenges and techniques involved in fusing online-softmax with back-to-back GEMM kernels, utilizing the Hoppe…

    Submitted 19 December, 2023; originally announced December 2023.

    Comments: 13 pages, comments welcome