License: CC BY-NC-ND 4.0
arXiv:2401.11459v1 [cs.AR] 21 Jan 2024

AttentionLego: An Open-Source Building Block For Spatially-Scalable Large Language Model Accelerator With Processing-In-Memory Technology

Rongqing Cong, Wenyang He*, Mingxuan Li*, Bangning Luo*, Zebin Yang*,
Yuchao Yang†, Ru Huang†, Bonan Yan
Peking University
https://bonany.cc/attentionlego
All authors contributed equally, listed in alphabetical order by last names. Corresponding authors: Yuchao Yang ([email protected]), Ru Huang ([email protected]) and Bonan Yan ([email protected]).
Abstract

Large language models (LLMs) with Transformer architectures have become phenomenal in natural language processing, multimodal generative artificial intelligence, and agent-oriented artificial intelligence. The self-attention module is the dominant sub-structure inside Transformer-based LLMs. Computing it on general-purpose graphics processing units (GPUs) places heavy demands on I/O bandwidth, because intermediate calculation results must be transferred between memories and processing units. To tackle this challenge, this work develops a fully customized vanilla self-attention accelerator, AttentionLego, as the basic building block for constructing spatially expandable LLM processors. AttentionLego provides a basic implementation in fully customized digital logic incorporating Processing-In-Memory (PIM) technology. It is based on PIM-based matrix-vector multiplication and a look-up-table-based Softmax design. The open-source code is available online: https://bonany.cc/attentionleg.

1 Introduction

The Transformer architecture has attracted significant attention due to its exceptional performance in a variety of natural language processing, vision, and multimodal generative tasks Touvron et al. (2023a, b); Workshop et al. (2023); Dey et al. (2023); Dao et al. (2022); Black et al. (2022); Li et al. (2023); Biderman et al. (2023). The Transformer, first introduced in 2017, is built entirely around the self-attention mechanism Vaswani et al. (2023). It can model complex relationships between different parts of a sequence, making it an ideal choice for handling long-range dependencies and capturing contextual information in sequential signal processing. Experiments have shown that stacking self-attention modules beyond a certain scale leads to emergent abilities in large language models (LLMs), such as planning, arithmetic, and summarizing messages.

The increasing demand for efficient, intelligent devices and systems highlights the importance of building LLM accelerators. LLMs are becoming a key component in Artificial Intelligence and the Internet of Things (AIoT), bringing natural language processing (NLP) capabilities to various Internet of Things (IoT) applications and allowing for more intuitive and user-friendly interfaces Yan et al. (2019a). However, training and deploying LLMs are computationally intensive and incur substantial carbon emissions, making it challenging to scale them to meet the demands of IoT devices and systems. In addition, the complex computing mechanism of the self-attention module calls for primers with source code, especially for hardware designers who need a starting point for LLM accelerator design.

LLM accelerators should focus on implementing self-attention modules, because these modules account for over 68% of operations in the prevailing LLM architectures (as shown in Fig. 1) Touvron et al. (2023b, a); Dey et al. (2023); Black et al. (2022); Biderman et al. (2023); Li et al. (2023). Based on this observation, this work develops a fully customized vanilla self-attention accelerator, AttentionLego. It aims to provide a fundamental building block for constructing spatially expandable LLM processors. AttentionLego implements hardware computation for self-attention with fully customized digital logic incorporating Processing-In-Memory (PIM) technology to boost computing efficiency. It is based on PIM-based matrix-vector multiplication and a look-up-table-based Softmax design. This work aims to improve the performance and efficiency of LLMs, making them more accessible to developers and users.

Figure 1: Operation number breakdown for popular large language models. The self-attention module dominates the operation counts in LLMs. One multiply-accumulate (MAC) counts as 2 operations. We unify the operation counts for floating-point numbers and integers.

2 Preliminaries

2.1 Processing-In-Memory Technology

Figure 2: Processing-in-memory macro to perform in situ general matrix-vector multiplication.

Processing-in-memory (PIM) technology integrates processing units and memory on the same physical chip Yan et al. (2019b). By collocating memory and processing units, PIM largely removes the need to move data between the processor and memory, which reduces latency and improves performance in various scientific and engineering applications. One application where PIM can have a profound impact is matrix-vector multiplication, a fundamental operation in LLMs. According to Fig. 1, the primary operations are the self-attention and feed-forward layers. Both rely heavily on storage and matrix multiplication, which are the key features offered by PIM. PIM implementations of the feed-forward layer have been investigated intensively. Therefore, this work centers on the self-attention module.

Fig. 2 illustrates the principle of using PIM to execute in-memory matrix multiplications. General matrix multiplication can be decomposed into matrix-vector multiplications (Fig. 2(a)). The deep learning network parameters (synaptic weights) are pre-loaded into the PIM macro array. With the help of PIM peripheral circuits, an input vector is fed into the PIM macro and interacts with the parameters stored there (as depicted in Fig. 2(b)). Such PIM macros can be tiled into a spatially expandable architecture (Fig. 2(c)) to store all the parameters/weights used in deep learning networks. Further, the parallelism of the matrix multiplication per operation is tunable by choosing different design parameters for the PIM macro. For example, 4, 8, or 16 word lines of the PIM macro can be turned on at each computing step, providing different computing throughput by trading off power consumption and chip area Yan et al. (2019a, b). In this way, PIM macros, with the help of peripheral digital control logic, form accelerators that can exploit the inherent parallelism in matrix-vector multiplication to achieve higher speeds and lower latencies. This work aims to answer the question of how to utilize the primary matrix-vector multiplication operator (realized by the open-source PIM behavioral model from https://bonany.gitlab.io/pis Yan (2024)) to construct a vanilla (baseline) design for the self-attention module in LLMs.
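The tiling and tunable word-line parallelism described above can be illustrated with a small behavioral sketch in Python. This is not the open-source PIM behavioral model; the function names `pim_macro_mvm` and `tiled_mvm` and the parameter `rows_per_step` are hypothetical, and the sketch assumes ideal integer arithmetic with no ADC quantization.

```python
def pim_macro_mvm(weights, x, rows_per_step=4):
    """Behavioral sketch of one PIM macro computing y[c] = sum_r x[r] * W[r][c].

    weights[r][c] is the cell on word line r and bit line c (hypothetical layout).
    rows_per_step is the number of word lines turned on per computing step.
    Returns (y, steps), where steps counts the computing steps used.
    """
    n_rows, n_cols = len(weights), len(weights[0])
    y = [0] * n_cols
    steps = 0
    for start in range(0, n_rows, rows_per_step):
        # One step: a group of word lines is active and all bit lines accumulate.
        for r in range(start, min(start + rows_per_step, n_rows)):
            for c in range(n_cols):
                y[c] += x[r] * weights[r][c]
        steps += 1
    return y, steps


def tiled_mvm(weights, x, macro_rows, macro_cols, rows_per_step=4):
    """Split a large weight matrix over a grid of macros and merge partial sums."""
    n_rows, n_cols = len(weights), len(weights[0])
    y = [0] * n_cols
    for r0 in range(0, n_rows, macro_rows):
        for c0 in range(0, n_cols, macro_cols):
            tile = [row[c0:c0 + macro_cols] for row in weights[r0:r0 + macro_rows]]
            part, _ = pim_macro_mvm(tile, x[r0:r0 + macro_rows], rows_per_step)
            # Macros that share output columns contribute partial sums.
            for i, value in enumerate(part):
                y[c0 + i] += value
    return y
```

Raising `rows_per_step` lowers the step count for the same result, which mirrors the throughput versus power and area trade-off mentioned in the text.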

2.2 Self-Attention Module in Large Language Models

Figure 3: (a) Basic block diagram and calculations for the self-attention module. (b) LLaMA 2 model architecture diagram Touvron et al. (2023b).

A self-attention module in Transformer-like architectures computes the attention weights and output values for a set of input vectors based on similarity. The computation flow can be broken down into the following steps:

  1. Compute Query, Key, and Value vectors: The input vectors (each token is a vector) are transformed into three sets of vectors, Query ($Q$), Key ($K$), and Value ($V$), by multiplying the input vectors by the weight matrices ($\mathbf{W_Q}$, $\mathbf{W_K}$, $\mathbf{W_V}$), i.e., by linear transformations.

  2. Compute Attention Weights: The attention weights are computed by taking the dot product of the Query vectors and Key vectors, then applying a softmax function to obtain probabilities. These probabilities (also called “scores”) represent the attention that should be paid to each input vector when computing the output values.

  3. Compute Output Values: The output values are computed by taking the weighted sum of the Value vectors, where the weights are the attention probabilities calculated in the previous step. This is done for each input vector, resulting in a set of output vectors that take into account the relationships between all of the input vectors.

  4. Apply Final Linear Transformation: The output values are then passed through a final linear transformation (feed forward layer) to produce the final output vectors of the self-attention module.

This computation flow allows the self-attention module to selectively focus on different parts of the input sequence and to compute output values that take into account the relationships between all of the input vectors.

Formally, given a set of query vectors $\mathbf{Q}$, key vectors $\mathbf{K}$, and value vectors $\mathbf{V}$, the Scaled Dot-Product Attention is defined as:

$$\mathrm{Attention}(\mathbf{Q},\mathbf{K},\mathbf{V})=\mathrm{softmax}\left(\frac{\mathbf{Q}\mathbf{K}^{\intercal}}{\sqrt{d_k}}\right)\mathbf{V}$$ (1)

where $\mathbf{Q}\mathbf{K}^{\intercal}$ denotes the dot product between each query vector and each key vector, and $\sqrt{d_k}$ is a scaling factor that helps to stabilize the gradient during training. The resulting output has the same shape as the value vectors $\mathbf{V}$, with the same number of elements along each dimension.
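As a software reference for Eq. (1), the following minimal Python sketch computes scaled dot-product attention on plain lists. It is a floating-point illustration, not the hardware data path; the function names are chosen here for illustration, and subtracting the row maximum inside `softmax` is a common numerical-stability step that is not part of Eq. (1) and does not change the result.

```python
import math


def softmax(row):
    """Softmax of one score row (max subtraction added for numerical stability)."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def attention(Q, K, V):
    """Scaled dot-product attention: softmax(Q K^T / sqrt(d_k)) V.

    Q, K, V are lists of row vectors; K and V must have the same number of rows.
    """
    d_k = len(K[0])
    scale = math.sqrt(d_k)
    out = []
    for q in Q:
        # One row of Q K^T, scaled by sqrt(d_k).
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in K]
        weights = softmax(scores)
        # Weighted sum of the value vectors.
        out.append([sum(w * v[j] for w, v in zip(weights, V))
                    for j in range(len(V[0]))])
    return out
```

When every key is identical, the attention weights are uniform and each output row is the mean of the value vectors, which is a convenient sanity check.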

3 AttentionLego Design

Fig. 4 illustrates the core idea of this work: a vanilla self-attention computation building block implemented with Verilog HDL and the PIM macro behavioral model Yan (2024). The block can be conveniently tiled to build LLMs with repeated self-attention modules.

Figure 4: Core idea and the spatial scalability of AttentionLego.

3.1 Architecture

As illustrated in Fig. 5, AttentionLego is divided into 5 parts:

Table 1: AttentionLego Modules

No. | Module               | Description
1   | Input Process module | computes $\mathbf{XW_Q}$, $\mathbf{XW_K}$, and $\mathbf{XW_V}$
2   | Score module         | computes $\mathbf{QK}^{\intercal}$
3   | Softmax module       | computes the softmax nonlinear activation function in a vector manner
4   | DMA module           | controls data transfer among modules and between AttentionLego and external storage
5   | Top controller       | controls the computing pipeline of the above modules

Figure 5: Architecture of AttentionLego

3.2 Input Process Module

The Input Process module is responsible for (a) writing and reading the weight matrices $\mathbf{W_Q}$, $\mathbf{W_K}$, and $\mathbf{W_V}$, and (b) multiplying the input token by each of the three matrices. The Input Process module works in 4 possible states:

  a) IDLE: stand-by mode, do nothing;

  b) WRITE: load the pretrained parameters of $\mathbf{W_Q}$, $\mathbf{W_K}$, and $\mathbf{W_V}$ into the Input Process module;

  c) READ: read out the parameters of $\mathbf{W_Q}$, $\mathbf{W_K}$, and $\mathbf{W_V}$ for testing and checking purposes;

  d) CIM (we use “CIM” and “PIM” interchangeably): compute $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ from the input $\mathbf{X}$ and the pre-loaded $\mathbf{W_Q}$, $\mathbf{W_K}$, and $\mathbf{W_V}$.

The input ports of the Input Process module are:

  1. Data:

    a) Weight data to be written, or input for CIM calculation: [DATA_WIDTH * D_MODEL-1:0] data_in;

    b) Column address to write to, ranging from 0 to 127, only for WRITE mode: [ADDR_COL_WIDTH-1:0] col_sel;

  2. Control:

    a) Clock and reset signals, rising-edge triggered: clk, rst;

    b) Chip selection signal, 1 for valid: cs;

    c) Mode selection, jointly determined by (web, cimeb): READ: (1,1), WRITE: (0,1), IDLE: (0,0), CIM: (1,0);

    d) Weight matrix selection signal: [2:0] weight_sel, where $\mathbf{Q}$: weight_sel = (0,0); $\mathbf{K}$: weight_sel = (0,1); $\mathbf{V}$: weight_sel = (1,0).
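The control encoding listed above can be captured in a few lines of Python, which may help when writing a testbench. This is a sketch and not the Verilog source: the function name `decode_control` is hypothetical, and treating `cs = 0` as IDLE regardless of the other signals is an assumption, since the text only states that `cs = 1` means valid.

```python
# (web, cimeb) -> mode, as listed for the Input Process module.
MODE_BY_WEB_CIMEB = {(1, 1): "READ", (0, 1): "WRITE", (0, 0): "IDLE", (1, 0): "CIM"}

# weight_sel -> selected weight matrix, as listed for the Input Process module.
MATRIX_BY_WEIGHT_SEL = {(0, 0): "Q", (0, 1): "K", (1, 0): "V"}


def decode_control(cs, web, cimeb, weight_sel):
    """Return (mode, matrix) for one set of control inputs.

    Assumption: when cs is not 1 the module stays idle and ignores other inputs.
    """
    if cs != 1:
        return "IDLE", None
    mode = MODE_BY_WEB_CIMEB[(web, cimeb)]
    if mode == "IDLE":
        return mode, None
    matrix = MATRIX_BY_WEIGHT_SEL.get(tuple(weight_sel))
    if matrix is None:
        raise ValueError("weight_sel value not listed in the port description")
    return mode, matrix
```

For example, `decode_control(1, 1, 0, (0, 1))` selects CIM mode on the $\mathbf{K}$ weights.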

The output ports are:

  1. Data:

    a) Output data for CIM mode: [DATA_WIDTH * D_k-1:0] data_out;

    b) Output data read out in READ mode: [DATA_WIDTH * D_MODEL-1:0] mem_data_out;

  2. Control:

    a) Completion feedback for calculation/read/write: done.

The input process module consists of two parts, which are the APIM module for parameter storage and calculation, and the control module at the top for data distribution and state control. The operating process is:

  1. Writing the weight matrix: Switch to WRITE mode and select one column of a specific matrix to write each time. For the $\mathbf{Q}$ matrix, writing one column at a time requires 128 repetitions to complete, depending on the designed APIM I/O bandwidth.

  2. Multiplication of input and weight matrix: Switch to CIM mode and feed the input (token) $\mathbf{X}$ part by part, in an order that matches the stored weights/parameters. After the data is passed in, the matrix-vector multiplication of $\mathbf{X}$ with each column vector in the weight matrix is carried out. The output is a row vector of length $d_k$.

  3. Reading out the weight matrix: Used to test whether all weights are correctly written. Given the column selection signal col_sel, all rows of data for the 32 APIM modules in this column can be read.

Details of each working mode are given as follows:

WRITE mode:

  1. Decompose the input data into an array and write it in parallel to the 32 APIM modules, with the writing position specified by col_sel. All APIM modules are in the IDLE state by default.

  2. For an APIM module, if the control signal is WRITE, it enters the BUSY state. In the BUSY state, the following operation is repeated 128 times: select a row and write data at the given column address. The next row of this column is written in the next loop until all writes are completed.

  3. After completing the write, switch to the DONE state to reset the state and transmit the DONE signal.

READ mode:

  1. According to col_sel, read the weight data stored in the 32 APIM modules in parallel.

  2. For an APIM module, if the control signal is READ, it enters the BUSY state. In the BUSY state, the following operation is repeated 128 times: select a row, read data at the given column address, and store it in a temporary register. The next row of this column is read in the next loop until all reads are completed.

  3. After completing the read, switch to the DONE state to reset the state and transmit the DONE signal.
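The WRITE and READ flows for a single APIM module can be sketched as a small behavioral model in Python. It is illustrative only: the function `run_column_op`, the list-of-lists memory, and the state trace are assumptions made here, and the sketch assumes one row is handled per loop iteration as described above.

```python
APIM_ROWS = 128  # rows visited per column access, as stated in the text


def run_column_op(mem, op, col, data=None, rows=APIM_ROWS):
    """Simulate one WRITE or READ of a full column in a single APIM module.

    mem is a list of rows (mem[row][col]); data holds one value per row for WRITE.
    Returns (buffer, trace): buffer is the data read back (empty for WRITE) and
    trace is the sequence of visited states, starting and ending in IDLE.
    """
    trace = ["IDLE"]
    buffer = []
    if op not in ("WRITE", "READ"):
        return buffer, trace  # any other control value leaves the module idle
    for row in range(rows):
        trace.append("BUSY")
        if op == "WRITE":
            mem[row][col] = data[row]
        else:
            buffer.append(mem[row][col])  # temporary register in the text
    trace.append("DONE")  # the DONE signal is transmitted here
    trace.append("IDLE")  # state is reset after DONE
    return buffer, trace
```

Writing a column and then reading it back with the same `col` reproduces the READ-mode check described in the text, namely that all weights were written correctly.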

CIM mode:

  1. The output results of a single APIM cycle are organized and added together, corresponding to the external adder circuit structure.

  2. Owing to the parallelism of the APIM module itself, data can be input, calculated, and read out in parallel. For a given column address, the calculation must be repeated 8 times, because the input parallelism is 16 and each input port shares 8 row addresses. Likewise, because the column output parallelism is 16, the above operation needs to be repeated only 8 times to complete the calculation for all columns. Therefore, completing a matrix multiplication requires 8 × 8 = 64 clock cycles.

Description of the underlying APIM calculation module: $\mathbf{Q}$, $\mathbf{K}$, and $\mathbf{V}$ each consist of 32 APIM modules that store all of the parameters. Each APIM is a 128×128 square matrix used to store weights and perform calculations. The stored weights are all 8-bit data. The input parallelism is 16, and a single input port shares 8 row addresses. The output parallelism is 16, a single output port shares 8 column addresses, and the ADC precision is 6 bits. Details can be found in AttentionLego/InputProcess/src/defines.v.
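The 64-cycle figure for CIM mode follows directly from these parameters, as the short Python check below shows. The constant names are chosen here for illustration and are not taken from defines.v; the sketch also assumes that each row-group and column-group step takes one clock cycle.

```python
APIM_ROWS = 128            # rows per APIM
APIM_COLS = 128            # columns per APIM
INPUT_PARALLELISM = 16     # input ports active in parallel
OUTPUT_PARALLELISM = 16    # output ports active in parallel
APIMS_PER_MATRIX = 32      # APIM modules per weight matrix


def cim_cycles(rows=APIM_ROWS, cols=APIM_COLS,
               in_par=INPUT_PARALLELISM, out_par=OUTPUT_PARALLELISM):
    """Clock cycles for one matrix-vector multiplication in CIM mode."""
    row_steps = rows // in_par    # each input port shares 8 row addresses
    col_steps = cols // out_par   # each output port shares 8 column addresses
    return row_steps * col_steps


def weights_per_matrix():
    """Number of 8-bit weights held by the 32 APIM modules of one matrix."""
    return APIMS_PER_MATRIX * APIM_ROWS * APIM_COLS
```

With the stated values, `cim_cycles()` evaluates to 8 × 8 = 64 and `weights_per_matrix()` to 32 × 128 × 128 = 524288.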

Figure 6: State-transfer diagram for the Input Process module.

3.3 Score Module

The Score module computes the score before Softmax shown in Fig. 3, generating the square matrix $\mathbf{QK}^{\intercal}$. The AttentionLego design is intended to provide a template or starting point for fully customized Transformer accelerators. As exemplary dimensions, we choose the input to be a vector with 128 elements that represents a row of the $\mathbf{Q}$ or $\mathbf{K}^{\intercal}$ matrix, and the main output to be a vector with 2048 elements that represents a row of the calculated $\mathbf{QK}^{\intercal}$. The detailed input and output ports are described as follows:

  1. Outputs:

    • input_done: 1 bit, indicates whether a row of $\mathbf{K}^{\intercal}$ has been cached as an input vector at the inputs of this module;

    • output_done: 1 bit, indicates whether the calculation and transmission of a certain row of $\mathbf{QK}^{\intercal}$ has been done;

    • QK_output: 2048×8 bits, outputs a certain row of $\mathbf{QK}^{\intercal}$ (calculation results).

  2. Inputs:

    • clk: clock;

    • cs: chip select, the global enable for the Score module;

    • reset: reset signal for the Score module;

    • K_mode_enable: active high, starts loading $\mathbf{K}^{\intercal}$ row by row into the internal registers;

    • Q_mode_enable: active high, starts the APIM calculation of $\mathbf{QK}^{\intercal}$ row by row;

    • K_address: 11 bits, identifies which row of the $\mathbf{K}^{\intercal}$ matrix the input $\mathbf{K}^{\intercal}$ vector belongs to;

    • K_input: 128×8 bits, row vector input for $\mathbf{K}^{\intercal}$;

    • Q_input: 128×8 bits, row vector input for $\mathbf{Q}$.

Figure 7: Architecture of the Score module

This module comprises APIM modules of 32×32 dimension (each APIM can store a 32×32 matrix of weights and perform general matrix-vector multiplication with a 32×1 input vector). Four APIM modules are vertically arranged to form a 128×32 matrix-vector multiplication engine, called col_cim. We then coalesce 64 columns of col_cim to form a 128×2048 PIM module that calculates the 2048×2048 $\mathbf{QK}^{\intercal}$.
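The dimensional bookkeeping of this arrangement can be checked with a behavioral Python sketch that computes one 2048-element row of $\mathbf{QK}^{\intercal}$ from a 128-element query. Only the name col_cim comes from the text; `score_row`, the constants, and the list-based storage are hypothetical, and the sketch uses ideal integer arithmetic rather than modelling the APIM circuits.

```python
APIM_DIM = 32          # each APIM stores a 32 x 32 block
APIMS_PER_COL = 4      # vertically stacked APIMs in one col_cim
NUM_COL_CIM = 64       # col_cim engines placed side by side

D_K = APIM_DIM * APIMS_PER_COL      # 128 input elements
SEQ_LEN = APIM_DIM * NUM_COL_CIM    # 2048 output elements


def col_cim(k_block, q):
    """One 128 x 32 engine: four stacked 32 x 32 APIMs with summed outputs."""
    out = [0] * APIM_DIM
    for a in range(APIMS_PER_COL):
        for r in range(APIM_DIM):
            row = a * APIM_DIM + r
            for c in range(APIM_DIM):
                out[c] += q[row] * k_block[row][c]
    return out


def score_row(k_t, q):
    """Compute one row of Q K^T from the stored 128 x 2048 matrix k_t."""
    result = []
    for j in range(NUM_COL_CIM):
        block = [row[j * APIM_DIM:(j + 1) * APIM_DIM] for row in k_t]
        result.extend(col_cim(block, q))
    return result
```

Calling `score_row` once per query row, 2048 times in total, yields the full 2048×2048 score matrix.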

This module has three states: State0 = idle, State1 = K_mode, State2 = Q_mode. State1 and State2 are enabled by the K_mode_enable and Q_mode_enable signals, respectively. After each operation is completed in State1 or State2, the Score module returns to State0 and emits the input_done or output_done signal to indicate that the operation is done.

Figure 8: State-transfer diagram for the Score module.
Figure 9: Timing diagram for the Score module.

3.4 Softmax Module

The Softmax module computes the softmax function, which is the core operation in the attention-based Transformer architecture. The softmax function is formulated as follows: given $\vec{v}=[v_1,v_2,\cdots,v_n]^{\intercal}$, output:

$$\mathrm{softmax}(\vec{v})=[a_1,a_2,\cdots,a_n]^{\intercal},\quad \mathrm{where}\ a_i=\frac{e^{v_i}}{\sum_{j=1}^{n}e^{v_j}}$$ (2)

This operation can be divided into two steps: first, compute the exponential of each element; then, normalize. The results of the $\mathbf{QK}^{\intercal}$ block go through the Softmax block, which normalizes the attention coefficients so that they sum to 1.

Following these two steps, this module can be divided into two blocks:

  1. A look-up table implementation of the exponential function exp_function: the input is an 8-bit fixed-point number $x$, and the output is a 16-bit fixed-point number $e^x$. We use a Python look-up table generator (AttentionLego/Softmax/src/softmax.py) to generate the 256 possible cases for an 8-bit input $x$.

  2. Normalization block: here, we introduce a method that calculates the summation of all $e^x$ and the normalization in two steps (clock cycles). The first cycle loads the inputs and computes the summation of all $e^x$; the second cycle calculates the normalization.

In the example code, we provide a 32-number softmax implemented by digital logic.
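The two blocks above can be sketched in Python as follows. This is not the repository's softmax.py: the fixed-point formats (a signed input with 4 fractional bits and an unsigned 16-bit output with 4 fractional bits) are assumptions made for illustration, as are all names, and the final division uses floating point for readability where hardware would use a fixed-point divider.

```python
import math

IN_BITS, IN_FRAC = 8, 4      # assumed input format: signed, 4 fractional bits
OUT_BITS, OUT_FRAC = 16, 4   # assumed output format: unsigned, 4 fractional bits


def build_exp_lut():
    """Generate the 256-entry table mapping an 8-bit code to a 16-bit e^x."""
    lut = []
    for code in range(1 << IN_BITS):
        # Interpret the raw code as a two's-complement fixed-point value.
        signed = code - (1 << IN_BITS) if code >= (1 << (IN_BITS - 1)) else code
        x = signed / (1 << IN_FRAC)
        value = round(math.exp(x) * (1 << OUT_FRAC))
        lut.append(min(value, (1 << OUT_BITS) - 1))  # saturate to 16 bits
    return lut


EXP_LUT = build_exp_lut()


def lut_softmax(codes):
    """Softmax over raw 8-bit codes using the table, in two steps."""
    exps = [EXP_LUT[c & 0xFF] for c in codes]  # step 1: look up e^x and sum
    total = sum(exps)
    if total == 0:
        # All entries rounded to zero; fall back to a uniform result (sketch choice).
        return [1.0 / len(codes)] * len(codes)
    return [e / total for e in exps]           # step 2: normalize
```

Passing 32 codes to `lut_softmax` corresponds to the 32-number softmax mentioned above. Under the assumed formats, small exponentials round to zero, so a real design would choose the formats to match the expected score range.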

Figure 10: State-transfer diagram for the Softmax module.

3.5 Direct Memory Access (DMA) Module

AttentionLego employs a dedicated DMA module that orchestrates all data movement, including the transfer of inputs and weights between AttentionLego and external storage as well as the transfer of internal intermediate results. The DMA module has three channels:

  1. The channel between external memory and the Input Process module. The DMA module converts the data read serially on the bus into parallel data and sends it to the PIM module.

  2. The channel between the Input Process module and the Score module, which feeds the calculated $\mathbf{Q}$ or $\mathbf{K}^{\intercal}$ to the Score module.

  3. The channel between the Score module and the Softmax module.

Fig. 11 illustrates the state-transfer diagram for the DMA module. Admittedly, the data loading control for the Input Process module, the Score module, and the Softmax module could be fused into those modules; here we make the DMA module standalone so that the AttentionLego scheme can be adapted to various on-chip and off-chip data transfer bandwidths.

Figure 11: State-transfer diagram for the DMA module.

3.6 Top Controller Module

This module is mainly responsible for managing and coordinating communication, data flow, and functional operations between the different modules within the chip, so that the entire system operates according to the design requirements. This design implements the inference operation of the Transformer's attention module for a batch size of 1. The entire process includes each module taking weights from memory and calculating the $\mathbf{Q}$, $\mathbf{K}$, $\mathbf{V}$ matrices, the attention score, and the softmax. The top controller starts a module through its enable signal. After the module finishes working, it sends a done signal to the top controller to indicate that the work is complete, so that the top controller can continue to control subsequent work. The top controller implements the timing logic of the entire process through the different enable and done signals. The main outputs are the enable signal for controlling the selection and calculation of the input process module, the enable signal for controlling the selection, input, and calculation of the attention score calculation module, the enable signal for controlling the selection and calculation of the softmax calculation module, and the enable signal for controlling DMA transmission. The main inputs are the corresponding done signals.

This module consists of a two-layer nested state machine. The outer state machine has four states. State 0 is the ready state and contains four inner states; it mainly performs weight loading, input loading, calculation and loading of the $\mathbf{K}$ matrix, and calculation of the q vector of the first token. The last three states form the main part of inference, with each loop performing inference for one token. State 1 transfers the q vector of the current token to the attention score calculation module. State 2 calculates the attention score; in the same state, the q vector of the next token is calculated (not required in the last loop) and the softmax value of the previous token is calculated (not needed in the first loop). State 3 sends the calculated attention score of the current token to the softmax calculation module. The inner state machine has four states that execute sequentially. State 0.0 loads weights from memory into the input process module; state 0.1 loads the input from memory into the input process module; state 0.2 completes the calculation of the $\mathbf{K}$ matrix in the input process module; and state 0.3 loads the $\mathbf{K}$ matrix into the attention score calculation module and completes the calculation of the q vector of the first token.

In summary, the Top controller is a two-layer nested state machine: state 0 of the outer state machine is the ready state, and states 1, 2, and 3 form a loop. Each loop completes the calculation for one token. State 0 contains an inner state machine whose four states proceed sequentially to complete the preparation phase. The Top controller outputs an enable signal in each stage and enters the next stage each time it receives a done signal.
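The order in which the nested state machine visits its states can be written out as a short Python generator. This is a sketch of the sequencing only: the function name is hypothetical, the state labels follow the numbering used in the text, and each yielded state is assumed to end when the corresponding done signal arrives.

```python
PREP_STATES = ["0.0", "0.1", "0.2", "0.3"]  # inner states of the ready state 0
LOOP_STATES = ["1", "2", "3"]               # outer states repeated once per token


def top_controller_states(num_tokens):
    """Yield the states visited for an inference over num_tokens tokens."""
    for state in PREP_STATES:
        yield state          # preparation phase runs once, in order
    for _ in range(num_tokens):
        for state in LOOP_STATES:
            yield state      # one pass through states 1, 2, 3 per token
```

For two tokens, the generator yields the four preparation states followed by two passes through states 1, 2, and 3.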

The top controller uses a pipeline to achieve token-level parallelism during inference. In State 2 of each loop, the attention score calculation module processes the current token, the softmax calculation module processes the previous token, and the input process module calculates the q vector of the next token. During DMA transmission, the three calculation modules temporarily stop computing. This arrangement maximizes hardware utilization while ensuring the correctness of computation and transmission.
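The token-level overlap in State 2 can be made concrete with a short schedule generator. This is an illustrative sketch only: the function name and dictionary keys are hypothetical, tokens are indexed from 0, and the handling of the final token's softmax after the last loop is left out because the text does not describe it.

```python
def state2_schedule(num_tokens):
    """For each loop, list which token each module works on during State 2.

    None means the module is idle in that loop.
    """
    rows = []
    for t in range(num_tokens):
        rows.append({
            "loop": t,
            "attention_score": t,                                   # current token
            "softmax": t - 1 if t > 0 else None,                    # previous token
            "input_process_q": t + 1 if t < num_tokens - 1 else None,  # next token
        })
    return rows


if __name__ == "__main__":
    for row in state2_schedule(3):
        print(row)
```

In the middle loops all three modules are busy at once, while the first loop has no softmax work and the last loop has no q vector to prepare, matching the exceptions noted in the state description.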

4 Conclusion and Perspective

This paper presents the basic design of AttentionLego. This work develops a vanilla self-attention module in Verilog HDL based on the PIM macro behavioral model. Because a PIM macro provides both efficient storage and efficient computation for tensor processing, this work maps the matrix multiplications in LLMs onto a PIM-based weight-stationary data flow. In this process, the parameters of LLMs are loaded into AttentionLego only once, which is the major power- and energy-saving technique used in this work. For the moment, this work releases the initial version of the source code; more quantitative analysis of the proposed AttentionLego method and design will follow.

Acknowledgments and Disclosure of Funding

This work was supported by the National Key R&D Program of China (2023YFB4502200), the National Natural Science Foundation of China (92264201, 92364102, T2350006), and the 111 Project under Grant B18001. Part of this work resulted from an undergraduate course at Peking University (course no. 04835370) in the fall semester of 2023.

References

  • Touvron et al. [2023a] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, Aurelien Rodriguez, Armand Joulin, Edouard Grave, and Guillaume Lample. LLaMA: Open and Efficient Foundation Language Models, February 2023a. URL http://arxiv.longhoe.net/abs/2302.13971. arXiv:2302.13971 [cs].
  • Touvron et al. [2023b] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, Dan Bikel, Lukas Blecher, Cristian Canton Ferrer, Moya Chen, Guillem Cucurull, David Esiobu, Jude Fernandes, Jeremy Fu, Wenyin Fu, Brian Fuller, Cynthia Gao, Vedanuj Goswami, Naman Goyal, Anthony Hartshorn, Saghar Hosseini, Rui Hou, Hakan Inan, Marcin Kardas, Viktor Kerkez, Madian Khabsa, Isabel Kloumann, Artem Korenev, Punit Singh Koura, Marie-Anne Lachaux, Thibaut Lavril, Jenya Lee, Diana Liskovich, Yinghai Lu, Yuning Mao, Xavier Martinet, Todor Mihaylov, Pushkar Mishra, Igor Molybog, Yixin Nie, Andrew Poulton, Jeremy Reizenstein, Rashi Rungta, Kalyan Saladi, Alan Schelten, Ruan Silva, Eric Michael Smith, Ranjan Subramanian, Xiaoqing Ellen Tan, Binh Tang, Ross Taylor, Adina Williams, Jian Xiang Kuan, Puxin Xu, Zheng Yan, Iliyan Zarov, Yuchen Zhang, Angela Fan, Melanie Kambadur, Sharan Narang, Aurelien Rodriguez, Robert Stojnic, Sergey Edunov, and Thomas Scialom. Llama 2: Open Foundation and Fine-Tuned Chat Models, July 2023b. URL http://arxiv.longhoe.net/abs/2307.09288. arXiv:2307.09288 [cs].
  • Workshop et al. [2023] BigScience Workshop, Teven Le Scao, Angela Fan, Christopher Akiki, Ellie Pavlick, Suzana Ilić, Daniel Hesslow, Roman Castagné, Alexandra Sasha Luccioni, François Yvon, Matthias Gallé, Jonathan Tow, Alexander M. Rush, Stella Biderman, Albert Webson, Pawan Sasanka Ammanamanchi, Thomas Wang, Benoît Sagot, Niklas Muennighoff, Albert Villanova del Moral, Olatunji Ruwase, Rachel Bawden, Stas Bekman, Angelina McMillan-Major, Iz Beltagy, Huu Nguyen, Lucile Saulnier, Samson Tan, Pedro Ortiz Suarez, Victor Sanh, Hugo Laurençon, Yacine Jernite, Julien Launay, Margaret Mitchell, Colin Raffel, Aaron Gokaslan, Adi Simhi, Aitor Soroa, Alham Fikri Aji, Amit Alfassy, Anna Rogers, Ariel Kreisberg Nitzav, Canwen Xu, Chenghao Mou, Chris Emezue, Christopher Klamm, Colin Leong, Daniel van Strien, David Ifeoluwa Adelani, Dragomir Radev, Eduardo González Ponferrada, Efrat Levkovizh, Ethan Kim, Eyal Bar Natan, Francesco De Toni, Gérard Dupont, Germán Kruszewski, Giada Pistilli, Hady Elsahar, Hamza Benyamina, Hieu Tran, Ian Yu, Idris Abdulmumin, Isaac Johnson, Itziar Gonzalez-Dios, Javier de la Rosa, Jenny Chim, Jesse Dodge, Jian Zhu, Jonathan Chang, Jörg Frohberg, Joseph Tobing, Joydeep Bhattacharjee, Khalid Almubarak, Kimbo Chen, Kyle Lo, Leandro Von Werra, Leon Weber, Long Phan, Loubna Ben allal, Ludovic Tanguy, Manan Dey, Manuel Romero Muñoz, Maraim Masoud, María Grandury, Mario Šaško, Max Huang, Maximin Coavoux, Mayank Singh, Mike Tian-Jian Jiang, Minh Chien Vu, Mohammad A. 
Jauhar, Mustafa Ghaleb, Nishant Subramani, Nora Kassner, Nurulaqilla Khamis, Olivier Nguyen, Omar Espejel, Ona de Gibert, Paulo Villegas, Peter Henderson, Pierre Colombo, Priscilla Amuok, Quentin Lhoest, Rheza Harliman, Rishi Bommasani, Roberto Luis López, Rui Ribeiro, Salomey Osei, Sampo Pyysalo, Sebastian Nagel, Shamik Bose, Shamsuddeen Hassan Muhammad, Shanya Sharma, Shayne Longpre, Somaieh Nikpoor, Stanislav Silberberg, Suhas Pai, Sydney Zink, Tiago Timponi Torrent, Timo Schick, Tristan Thrush, Valentin Danchev, Vassilina Nikoulina, Veronika Laippala, Violette Lepercq, Vrinda Prabhu, Zaid Alyafeai, Zeerak Talat, Arun Raja, Benjamin Heinzerling, Chenglei Si, Davut Emre Taşar, Elizabeth Salesky, Sabrina J. Mielke, Wilson Y. Lee, Abheesht Sharma, Andrea Santilli, Antoine Chaffin, Arnaud Stiegler, Debajyoti Datta, Eliza Szczechla, Gunjan Chhablani, Han Wang, Harshit Pandey, Hendrik Strobelt, Jason Alan Fries, Jos Rozen, Leo Gao, Lintang Sutawika, M. Saiful Bari, Maged S. Al-shaibani, Matteo Manica, Nihal Nayak, Ryan Teehan, Samuel Albanie, Sheng Shen, Srulik Ben-David, Stephen H. 
Bach, Taewoon Kim, Tali Bers, Thibault Fevry, Trishala Neeraj, Urmish Thakker, Vikas Raunak, Xiangru Tang, Zheng-Xin Yong, Zhiqing Sun, Shaked Brody, Yallow Uri, Hadar Tojarieh, Adam Roberts, Hyung Won Chung, Jaesung Tae, Jason Phang, Ofir Press, Conglong Li, Deepak Narayanan, Hatim Bourfoune, Jared Casper, Jeff Rasley, Max Ryabinin, Mayank Mishra, Minjia Zhang, Mohammad Shoeybi, Myriam Peyrounette, Nicolas Patry, Nouamane Tazi, Omar Sanseviero, Patrick von Platen, Pierre Cornette, Pierre François Lavallée, Rémi Lacroix, Samyam Rajbhandari, Sanchit Gandhi, Shaden Smith, Stéphane Requena, Suraj Patil, Tim Dettmers, Ahmed Baruwa, Amanpreet Singh, Anastasia Cheveleva, Anne-Laure Ligozat, Arjun Subramonian, Aurélie Névéol, Charles Lovering, Dan Garrette, Deepak Tunuguntla, Ehud Reiter, Ekaterina Taktasheva, Ekaterina Voloshina, Eli Bogdanov, Genta Indra Winata, Hailey Schoelkopf, Jan-Christoph Kalo, Jekaterina Novikova, Jessica Zosa Forde, Jordan Clive, Jungo Kasai, Ken Kawamura, Liam Hazan, Marine Carpuat, Miruna Clinciu, Najoung Kim, Newton Cheng, Oleg Serikov, Omer Antverg, Oskar van der Wal, Rui Zhang, Ruochen Zhang, Sebastian Gehrmann, Shachar Mirkin, Shani Pais, Tatiana Shavrina, Thomas Scialom, Tian Yun, Tomasz Limisiewicz, Verena Rieser, Vitaly Protasov, Vladislav Mikhailov, Yada Pruksachatkun, Yonatan Belinkov, Zachary Bamberger, Zdeněk Kasner, Alice Rueda, Amanda Pestana, Amir Feizpour, Ammar Khan, Amy Faranak, Ana Santos, Anthony Hevia, Antigona Unldreaj, Arash Aghagol, Arezoo Abdollahi, Aycha Tammour, Azadeh HajiHosseini, Bahareh Behroozi, Benjamin Ajibade, Bharat Saxena, Carlos Muñoz Ferrandis, Daniel McDuff, Danish Contractor, David Lansky, Davis David, Douwe Kiela, Duong A. 
Nguyen, Edward Tan, Emi Baylor, Ezinwanne Ozoani, Fatima Mirza, Frankline Ononiwu, Habib Rezanejad, Hessie Jones, Indrani Bhattacharya, Irene Solaiman, Irina Sedenko, Isar Nejadgholi, Jesse Passmore, Josh Seltzer, Julio Bonis Sanz, Livia Dutra, Mairon Samagaio, Maraim Elbadri, Margot Mieskes, Marissa Gerchick, Martha Akinlolu, Michael McKenna, Mike Qiu, Muhammed Ghauri, Mykola Burynok, Nafis Abrar, Nazneen Rajani, Nour Elkott, Nour Fahmy, Olanrewaju Samuel, Ran An, Rasmus Kromann, Ryan Hao, Samira Alizadeh, Sarmad Shubber, Silas Wang, Sourav Roy, Sylvain Viguier, Thanh Le, Tobi Oyebade, Trieu Le, Yoyo Yang, Zach Nguyen, Abhinav Ramesh Kashyap, Alfredo Palasciano, Alison Callahan, Anima Shukla, Antonio Miranda-Escalada, Ayush Singh, Benjamin Beilharz, Bo Wang, Caio Brito, Chenxi Zhou, Chirag Jain, Chuxin Xu, Clémentine Fourrier, Daniel León Periñán, Daniel Molano, Dian Yu, Enrique Manjavacas, Fabio Barth, Florian Fuhrimann, Gabriel Altay, Giyaseddin Bayrak, Gully Burns, Helena U. Vrabec, Imane Bello, Ishani Dash, Jihyun Kang, John Giorgi, Jonas Golde, Jose David Posada, Karthik Rangasai Sivaraman, Lokesh Bulchandani, Lu Liu, Luisa Shinzato, Madeleine Hahn de Bykhovetz, Maiko Takeuchi, Marc Pàmies, Maria A. Castillo, Marianna Nezhurina, Mario Sänger, Matthias Samwald, Michael Cullan, Michael Weinberg, Michiel De Wolf, Mina Mihaljcic, Minna Liu, Moritz Freidank, Myungsun Kang, Natasha Seelam, Nathan Dahlberg, Nicholas Michio Broad, Nikolaus Muellner, Pascale Fung, Patrick Haller, Ramya Chandrasekhar, Renata Eisenberg, Robert Martin, Rodrigo Canalli, Rosaline Su, Ruisi Su, Samuel Cahyawijaya, Samuele Garda, Shlok S. Deshmukh, Shubhanshu Mishra, Sid Kiblawi, Simon Ott, Sinee Sang-aroonsiri, Srishti Kumar, Stefan Schweter, Sushil Bharati, Tanmay Laud, Théo Gigant, Tomoya Kainuma, Wojciech Kusa, Yanis Labrak, Yash Shailesh Bajaj, Yash Venkatraman, Yifan Xu, Yingxin Xu, Yu Xu, Zhe Tan, Zhongli Xie, Zifan Ye, Mathilde Bras, Younes Belkada, and Thomas Wolf. 
BLOOM: A 176B-Parameter Open-Access Multilingual Language Model, June 2023. URL http://arxiv.longhoe.net/abs/2211.05100. arXiv:2211.05100 [cs].
  • Dey et al. [2023] Nolan Dey, Gurpreet Gosal, Zhiming Chen, Hemant Khachane, William Marshall, Ribhu Pathria, Marvin Tom, and Joel Hestness. Cerebras-GPT: Open Compute-Optimal Language Models Trained on the Cerebras Wafer-Scale Cluster, April 2023. URL http://arxiv.longhoe.net/abs/2304.03208. arXiv:2304.03208 [cs].
  • Dao et al. [2022] Tri Dao, Daniel Y. Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness, June 2022. URL http://arxiv.longhoe.net/abs/2205.14135. arXiv:2205.14135 [cs].
  • Black et al. [2022] Sid Black, Stella Biderman, Eric Hallahan, Quentin Anthony, Leo Gao, Laurence Golding, Horace He, Connor Leahy, Kyle McDonell, Jason Phang, Michael Pieler, USVSN Sai Prashanth, Shivanshu Purohit, Laria Reynolds, Jonathan Tow, Ben Wang, and Samuel Weinbach. GPT-NeoX-20B: An Open-Source Autoregressive Language Model, April 2022. URL http://arxiv.longhoe.net/abs/2204.06745. arXiv:2204.06745 [cs].
  • Li et al. [2023] Yuanzhi Li, Sébastien Bubeck, Ronen Eldan, Allie Del Giorno, Suriya Gunasekar, and Yin Tat Lee. Textbooks Are All You Need II: phi-1.5 technical report, September 2023. URL http://arxiv.longhoe.net/abs/2309.05463. arXiv:2309.05463 [cs].
  • Biderman et al. [2023] Stella Biderman, Hailey Schoelkopf, Quentin Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, Aviya Skowron, Lintang Sutawika, and Oskar van der Wal. Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling, May 2023. URL http://arxiv.longhoe.net/abs/2304.01373. arXiv:2304.01373 [cs].
  • Vaswani et al. [2023] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention Is All You Need, August 2023. URL http://arxiv.longhoe.net/abs/1706.03762. arXiv:1706.03762 [cs].
  • Yan et al. [2019a] Bonan Yan, Bing Li, Ximing Qiao, Cheng-Xin Xue, Meng-Fan Chang, Yiran Chen, and Hai (Helen) Li. Resistive Memory-Based In-Memory Computing: From Device and Large-Scale Integration System Perspectives. Advanced Intelligent Systems, 1(7):1900068, November 2019a. ISSN 2640-4567. doi: 10.1002/aisy.201900068. URL https://onlinelibrary.wiley.com/doi/10.1002/aisy.201900068.
  • Yan et al. [2019b] Bonan Yan, Qing Yang, Wei-Hao Chen, Kung-Tang Chang, Jian-Wei Su, Chien-Hua Hsu, Sih-Han Li, Heng-Yuan Lee, Shyh-Shyuan Sheu, Mon-Shu Ho, Qing Wu, Meng-Fan Chang, Yiran Chen, and Hai Li. RRAM-based Spiking Nonvolatile Computing-In-Memory Processing Engine with Precision-Configurable In Situ Nonlinear Activation. In 2019 Symposium on VLSI Technology, pages T86–T87, Kyoto, Japan, June 2019b. IEEE. ISBN 978-4-86348-719-2. doi: 10.23919/VLSIT.2019.8776485. URL https://ieeexplore.ieee.org/document/8776485/.
  • Yan [2024] Bonan Yan. PISLIB, January 2024. URL https://bonany.gitlab.io/pis/.