License: CC BY-NC-ND 4.0
arXiv:2403.00232v1 [cs.AR] 01 Mar 2024

FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators

Xinyi Li Kahlert School of Computing
University of Utah
USA
[email protected]
   Ang Li Pacific Northwest National
Laboratory
USA
[email protected]
   Bo Fang Pacific Northwest National
Laboratory
USA
[email protected]
   Katarzyna Swirydowicz                Pacific Northwest National               
Laboratory
USA
[email protected]
   Ignacio Laguna Lawrence Livermore National
Laboratory
USA
[email protected]
   Ganesh Gopalakrishnan Kahlert School of Computing
University of Utah
USA
[email protected]
Abstract

NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance at low power consumption. Unfortunately, very few facts are publicly documented about some of their attributes that can affect answers computed on identical code. Examples of such features are the number of extra precision bits, accumulation order of addition, and predictable subnormal number handling during computations. We demonstrate that the lack of information on how these features differ across two matrix accelerators can make it impossible to reliably port codes across GPUs containing these differing accelerators. In response to this challenge, this paper offers a collection of tests that are based on a precise understanding of the IEEE floating-point standard as well as previously discovered formal results about the impact of floating-point features on numerical behavior. By running these tests on a large number of widely used and recent GPUs, we show that our tests can unearth feature differences that affect computed results. We exhibit these differences across five floating-point formats, four standard rounding modes and additional four feature combinations including those relating to rounding and preservation of extra precision bits. This extensive testing demonstrates the versatility of our tests in picking up salient differences that can affect numerical behavior across this space. As further proof of the discriminative power of our approach, we design a simple matrix-multiplication test with the matrix entries designed with insights gathered from our feature-tests. We executed this very simple test on five platforms, producing different answers: V100, A100, and MI250X produced 0, MI100 produced 255.875, and Hopper H100 produced 191.875. There is no prior work that shows that a simple test like this can produce three different answers on five different platforms—raising concern that one carefully understand Matrix Accelerator features before porting code across them.

Index Terms:
NVIDIA GPU, Tensor Cores, AMD GPU, Matrix Units, floating-point arithmetic, high performance computing, machine learning, Correctness Portability

I Introduction

We are in an era of rising computing hardware heterogeneity where many new CPU and GPU components are introduced in rapid succession [1], and are fueling performance advances in HPC and ML: from drug discovery to climate simulations and beyond. While no scientist aims to achieve higher performance at the expense of correctness, ensuring correctness has become a serious challenge given the sheer number of hardware units and the rapidity of their adoption. Specifically in the realm of GPU-based accelerators, programmers are interested in testing codes developed for NVIDIA GPUs on AMD GPUs that are becoming available: MI100 at first, MI250X now and soon MI300 that will be used in the upcoming El Capitan Exascale machine111https://www.llnl.gov/article/49131/llnl-scientists-eagerly-anticipate-el-capitans-potential-impact, with earlier AMD models already in use in Oakridge OLCF Frontier222https://www.olcf.ornl.gov/olcf-resources/compute-systems/frontier/. Unfortunately, documentation about many aspects of these GPUs is found seriously lacking in terms of numerical aspects. Unanswered questions not only pertain to particular behaviors such as precision loss for a specific operator but also important features such as the rounding modes supported, fused-multiply-addition (FMA) details, the number of extra precision bits held inside, the granularity of their block fused-multiply-add, etc. The presence or absence of these features significantly influences numerous high-stakes algorithms in fields like HPC and ML. At present, programmers resort to running many applications and monitoring the results; this process is not rigorous or scalable, and a new program might one day cause a serious result difference. It is highly desirable to have a set of straightforward tests that can quickly pick up salient feature differences between GPUs, but these do not exist. Our primary contribution in this paper is a rigorous methodology that has enabled us to create such discriminatory tests for NVIDIA and AMD GPUs, with the methodology generalizable and applicable to future GPUs.

Matrix Accelerators

While the general lack of information that affects reliable code porting across GPUs is well-acknowledged, matrix accelerators pose an even higher degree of difficulty because of their growing importance and even more dearth of information—hence forming the central focus of this paper. We use the term “Matrix Accelerator” as a generic term to refer to what NVIDIA calls “Tensor Cores [2]” and AMD calls “Matrix Cores [3].” Matrix Accelerators are indispensable for achieving today’s performance levels in ML. It is safe to say that language-level models (e.g. ChatGPT) will not have happened without Tensor Cores (ML training will take at least 10 times longer without the acceleration of Tensor Cores [4]). Naturally, matrix accelerators have caught the eye of HPC designers who see its 4×\times× speedup with 80% less energy consumption [5] a real avenue toward much faster and energy-efficient codes as promoted in prominent articles [6]. The work in this paper is designed to help programmers port code across matrix accelerators more reliably, based on the commonality of features that our tests help confirm.

Unfortunately, matrix accelerators are described in the literature mainly with respect to their usage and not numerical properties. With the growing number of these units in upcoming GPUs [7], this situation poses a serious impediment to those wanting to use them for HPC or port code across two different models. Additionally, designers have to keep in mind the variety of number formats supported by them, and variety of hardware features.

What makes this variety troubling is that many of these details (barring a few basic aspects of floating-point arithmetic supported) are not documented anywhere in a manner that is easily accessible to the general public. We show in this work that the undocumented feature differences can affect computational result portability in an extreme manner.

In today’s HPC software development approaches, not being able to use a new GPU or its matrix accelerator as soon as they become available is a serious handicap: one cannot plan code migration, evaluate previous applications on new machines, or provide timely feedback to vendors. Going “eye-ball result agreements” on test codes (often today’s approach) runs the risk of the unexamined cases. The community urgently needs approaches that can reveal the differences in terms of critical numerical features that are bound to affect some (future) program, thus serving as early warning. Such an approach applicable to CPUs or GPUs in general, and matrix accelerators in particular is the key goal of this paper.

To provide some more evidence, recall that if one uses 32-bit floating-point format (FP32), one can expect a result-difference in the 8th decimal fraction position; this is because FP32’s round-to-nearest rounding can guarantee 7 (fractional) digits of accuracy. The corresponding number for the 16-bit format (FP16, used internally by matrix accelerators) is three fractional digits being guaranteed. However, if one ports an FP16 implementation of an HPC routine across platforms and obtains a difference in the third decimal fraction position, then it is perhaps worth investigating what caused the error to exceed what precision arguments imply.

In fact, we could create a focused test that went much further! This test performed matrix multiplication based on our understanding of feature differences across matrix accelerators, obtaining the following answers: V100, A100, and MI250X produced 0; MI100 produced 255.875; and Hopper H100 produced 191.875. This error is clearly far more serious: not merely the fourth fractional digit but the hundredth digit has been affected—an error that is six orders of magnitude higher. Such results have never been demonstrated before, thus making the work in this paper—specifically our straightforward tests—important to put in the hands of today’s programmers (we will release all our tests upon acceptance).

Problem Addressed

The problem addressed is the design of a simple and practical approach to check whether critical feature differences lurk in execution units. The tests ought to be based on straightforward logic pertaining to basic floating-point facts, allowing them to be easily extended to newer model GPUs and Matrix Accelerators being developed by multiple companies (e.g., Google TPUs [7] models; such accelerators may be of interest to HPC designers for their power/performance advantages). Running larger test programs such as actual large-scale numerical solvers does not meet this need, as there is significant overhead associated with making programs run—especially on new hardware where libraries and compiler features might be missing.

Our feature-directed testing approach goes to a considerable distance in meeting these idealized goals. By showing the presence/absence of a feature, it helps answer how whole classes of programs can be affected, as will be shown (§V).

I-A Related Work and Improvements Over Them

Correctness of GPU numerical behavior is a vast topic; given space restrictions, we study only those that target matrix accelerators or contrast different GPUs. In [8, 9], the concept of monotonicity (applicable to vector reductions) was studied using a testing-guided approach (the mention of monotonicity as a desired property appears even earlier [10]). Monotonicity is true with respect to two vectors A and B of equal length if the result of adding all elements of A must not exceeding that produced by all the elements of B provided A𝐴Aitalic_A is position-wise less than or equal to B (i.e., A[i]B[i]𝐴delimited-[]𝑖𝐵delimited-[]𝑖A[i]\leq B[i]italic_A [ italic_i ] ≤ italic_B [ italic_i ]). Monotonicity can be violated if (1) the reduction order is not externally controllable, and (2) there aren’t enough extra precision bits (specifically, 3 bits) provided within the arithmetic unit [8]. In this sense, their work set the stage for focusing on feature differences; we advance this direction significantly in this paper.

The question of numerical portability between various GPUs including AMD GPUs has been studied in [11]. They target math function results produced by earlier generation GPUs and libraries, with no coverage of matrix accelerators. The paper [12] targets performance portability across NVIDIA and AMD MI-100 GPUs The paper [13] tests and reports the extent to which numercial errors exhibited by standard library math functions vary across platforms.

TABLE I: Floating-Point Format Comparison (TF32 is a proprietary format and can be implementation-dependent). The more the mantissa bits, the lesser the ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p. Notice how BF16 sacrifices mantissae for higher dynamic range (larger exponent size)
Format Sign Bit (S) Exponent Bits (E) Mantissa Bits (F) Min. Exponent (eminsubscript𝑒𝑚𝑖𝑛e_{min}italic_e start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT) Max. Exponent (emaxsubscript𝑒𝑚𝑎𝑥e_{max}italic_e start_POSTSUBSCRIPT italic_m italic_a italic_x end_POSTSUBSCRIPT ) ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p, i.e. ulp(1)
FP16 1 5 10 -14 15 210superscript2102^{-10}2 start_POSTSUPERSCRIPT - 10 end_POSTSUPERSCRIPT
FP32 1 8 23 -126 127 223superscript2232^{-23}2 start_POSTSUPERSCRIPT - 23 end_POSTSUPERSCRIPT
FP64 1 11 52 -1022 1023 252superscript2522^{-52}2 start_POSTSUPERSCRIPT - 52 end_POSTSUPERSCRIPT
BF16 1 8 7 -63 63 27superscript272^{-7}2 start_POSTSUPERSCRIPT - 7 end_POSTSUPERSCRIPT
TF32 1 8 10 -126 127 210superscript2102^{-10}2 start_POSTSUPERSCRIPT - 10 end_POSTSUPERSCRIPT

Summary of Contributions:

  • We offer a novel set of feature-targeted tests, with clear evidence that if such features are not preserved across the source and target platform, execution results may seriously differ.

  • Our tests form a pipeline with earlier tests confirming/denying certain features and later feature tests taking advantage of it to unambiguously confirm/deny a second feature, and so on.

  • Our tests are simple enough to be run on early access machines that may not have full-fledged libraries or runtimes, yet powerful enough to confirm (or refute) the status of higher level features supported in hardware.

  • With the availability of AMD machines, our work meets a critical need of moving code from NVIDIA to AMD for full comparisons. We also point out the dangers of doing porting in reverse, when subnormal support is missing (this relates to support for hardware trap** of exceptions).

  • This is the most extensive testing of matrix accelerators to date that we are aware of, including the first results on H100 about its numeric features.

  • Our tests indicate that AMD MI250X is closer to CPU behavior, thus perhaps requiring fewer changes during CPU-to-GPU porting.

Roadmap: We provide self-contained background crucial to understand the testing approaches in this paper (§II). The main technical part of this paper is higher-level feature testing as applied to matrix accelerators (§III). The extent to which feature differences caused the results of a basic matrix multiplication routine to jump across three values on five GPUs is then presented (§V). How our tests characterized three GPUs across five precision formats is summarized in a comprehensive results table (§IV). Conclusions follow (§VI).

II Background

Floating-point arithmetic is a vast domain, and our objective here is to provide a high level overview of facts crucial to understand how we designed our tests.

II-A Floating-point background

A floating-point number [14] x=(s,e,m)𝑥𝑠𝑒𝑚x=(s,e,m)italic_x = ( italic_s , italic_e , italic_m ) consists of a single sign bit s𝑠sitalic_s, a mantissa (also called significand) m𝑚mitalic_m (of 23 bits) representing a value in the real interval (0,2)02(0,2)( 0 , 2 ) and an exponent e𝑒eitalic_e (of 8 bits, typically presented as a biased integer). Regard m𝑚mitalic_m and e𝑒eitalic_e as the intended (i.e., ignore the bias in e𝑒eitalic_e) real-numbered. Then the value of the floating-point number is

fp_value(x)=(1)sm2efp_value𝑥superscript1𝑠𝑚superscript2𝑒\text{fp}\_\text{value}(x)=(-1)^{s}\cdot m\cdot 2^{e}fp _ value ( italic_x ) = ( - 1 ) start_POSTSUPERSCRIPT italic_s end_POSTSUPERSCRIPT ⋅ italic_m ⋅ 2 start_POSTSUPERSCRIPT italic_e end_POSTSUPERSCRIPT

(see Table I for other pertinent details). Aiming for a unique and convenient representation, the mantissa m𝑚mitalic_m remains in the range of [1,2)12[1,2)[ 1 , 2 ) whenever e>emin𝑒subscript𝑒𝑚𝑖𝑛e>e_{min}italic_e > italic_e start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT, and hence can be expressed as a fraction 1.(..23bits..)1.(..23bits..)1 . ( ..23 italic_b italic_i italic_t italic_s . . ) which is called the normalized representation. For cases where e=emin𝑒subscript𝑒𝑚𝑖𝑛e=e_{min}italic_e = italic_e start_POSTSUBSCRIPT italic_m italic_i italic_n end_POSTSUBSCRIPT, the mantissa falls within the open interval (0,1)01(0,1)( 0 , 1 ), and then represent subnormal numbers.333Both +00+0+ 0 and 00-0- 0 are supported, but neither is a subnormal number. However note that ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p and half of ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p are both normal numbers. We use ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p as an abbreviation for units in the last place and represents fp_value(x)𝑓𝑝_𝑣𝑎𝑙𝑢𝑒𝑥fp\_value(x)italic_f italic_p _ italic_v italic_a italic_l italic_u italic_e ( italic_x ) when s=0,e=0formulae-sequence𝑠0𝑒0s=0,e=0italic_s = 0 , italic_e = 0 and only the LSB of m𝑚mitalic_m is set. The IEEE standard also details the specifications for 16-bit, 32-bit, and 64-bit floating-point numbers, which are referred as FP16, FP32, and FP64 respectively. With the rise in demand for less but acceptable precision in deep learning, Google introduced the brain-float 16444https://en.wikipedia.org/wiki/Bfloat16_floating-point_format format or BF16. Additionally, NVIDIA unveiled a custom format, notably TensorFloat32555https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/, tailored for matrix multiplication. This format optimizes for both precision and range, specifically for their tensor cores.

Behavioral Portability Issues due to Subnormals

Our feature-targeted tests include testing for subnormal support. To motivate reasons for such tests in a general context, consider two floating-point normal values a𝑎aitalic_a and b𝑏bitalic_b are close together but not individually equal to 00. Suppose we now have an expression E1=c/(ab)subscript𝐸1𝑐𝑎𝑏E_{1}=c/(a-b)italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT = italic_c / ( italic_a - italic_b ) where E1subscript𝐸1E_{1}italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT is some expression and c𝑐citalic_c is a normal number. If the result of (ab)𝑎𝑏(a-b)( italic_a - italic_b ) is a subnormal number as per an infinite-precision calculation but the hardware does not provide support for subnormals, then the hardware turns the denominator into 00 causing a division-by-zero exception.666Assuming that pertinent compiler flags are applied.

The first problem posed by this situation is that some GPUs do not have hardware traps for exceptions ( [15] confirms this for NVIDIA). Now if we have another expression E2subscript𝐸2E_{2}italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT similar to E1subscript𝐸1E_{1}italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT and we have E1/E2subscript𝐸1subscript𝐸2E_{1}/E_{2}italic_E start_POSTSUBSCRIPT 1 end_POSTSUBSCRIPT / italic_E start_POSTSUBSCRIPT 2 end_POSTSUBSCRIPT; then the resulting /\infty/\infty∞ / ∞ results in a NaN (“not a number”) exception—for which also GPUs lack adequate exception-trap** support in hardware. Also note that AMD provides some support for exception trap** [16], and thus porting from AMD to NVIDIA will turn the lack of subnormal support into a significant exception-behavior difference. Thus, a subnormal-targeted test (which we provide) can meaningfully distinguish between GPU behaviors.

Rounding Mechanisms in Floating-Point Arithmetic

Given the constraints of the floating-point format, rounding becomes essential when a value surpasses its bounds. The IEEE prescribes that rounding should emulate an intermediate result that is infinitely precise and possesses an unbounded range (this is called correct rounding). To realize this ideal, supplementary bits (guard or G, rounding or R, and sticky or S, collectively called “extra bits”, see Table II) are incorporated in the design of IEEE-compliant hardware, with G, R, and S having lower significance (in that order) than the mantissa least significant bit (LSB). These bits are set when operation results are normalized via a right shift (see below for an illustration). Higham [17] notes that a single additional bit will not consistently yield the same outcome as obtaining the precise result followed by rounding. However, incorporating a second guard bit and a third sticky bit (which is the logical OR of all bits that are shifted through the S position) permits correct rounding. To realize correct rounding [17], there are two requirements: (1) employ three extra precision bits, and (2) employ round-to nearest with ties to even.

TABLE II: Rounding Rules of FP Arithmetic. To read this table, first locate the GRS bits. Then decide the result sign and the rounding mode desired. Finally, for all but truncate, add the specified bit to the mantissa least significant bit (LSB). For truncate, set the LSB as per this value.
The three extra bits GRS where (xy)=1𝑥𝑦1(x\vee y)=1( italic_x ∨ italic_y ) = 1 Result sign New value for mantissa LSB (add this bit to m𝑚mitalic_m’s LSB except for truncate it is assigned to LSB
Round
up
(tow.
++\infty+ ∞)
Round
down
(tow.
)-\infty)- ∞ )
RTN-TE
Round to zero (truncate)
0xy + 1 0 0 0
- 0 1 0 0
100 + 1 0 1111 0
- 0 1 1111 0
1xy + 1 0 1 0
- 0 1 1 0

Description of Rounding For clarity, let us walk through a simple example which will help read the rest of this paper with more assurance:

  • Align: (If necessary), make the exponent of the two numbers to be added the same by right-shifting the number with the smaller exponent.

  • Operate, normalize, set extra bits: Perform the addition, and then normalize the result; specifically, if the result mantissa is 2 or more in value, bring it within [1,2)12[1,2)[ 1 , 2 ) by right-shifting the mantissa, suitably adjusting the exponent. This right shift sets through and sets the extra bits.

  • Round as per rules, normalize again if needed: Consult Table II to round or truncate.

An Example Consider an FP scheme with one bit mantissa and suppose the result after calculation is positive 1.11001.11001.11001.1100 in binary or 1.75 in decimal (GRS=100𝐺𝑅𝑆100GRS=100italic_G italic_R italic_S = 100 is attached at the end), and let e=0𝑒0e=0italic_e = 0. This cannot be represented using one mantissa bit, and so we must round. For RTN-TE777RTN-TE stands for ”Round to Nearest, Ties to Even,” which is a rounding method commonly used in floating-point arithmetic. When a number falls exactly halfway between two possible rounded values, this method rounds the number to the nearest even value. the mantissa LSB resulting in 10.010010.010010.010010.0100. This needs normalization, and after that, the result is 1.00101.00101.00101.0010 (and e=1𝑒1e=1italic_e = 1)—i.e., 2 in decimal. The answer for truncate is 1.

Fused Multiply-Add (FMA) Operation

In contemporary computational architectures, certain machines incorporate hardware components specifically designed to facilitate the Fused Multiply-Add (FMA) operation. As per the IEEE 754 standards, this operation computes c+(ab)𝑐𝑎𝑏c+(a\cdot b)italic_c + ( italic_a ⋅ italic_b ) by ensuring two pivotal conditions: (1) computation is performed as though it has infinite precision and an unbounded range, and (2) rounding is applied only once, after the completion of both ‘*’ and ‘+’. These are referred to as FMA conventions in this paper. Matrix multiplication, represented as AB+C𝐴𝐵𝐶A\cdot B+Citalic_A ⋅ italic_B + italic_C, can be conceptualized as a series of blocked Multiply-Add Operations. This operation is natively supported by GPU architectures from both AMD and NVIDIA, as elaborated in Section II-B. Given this context, we hypothesize these matrix accelerators also adhere to these FMA conventions.

II-B Matrix Acceleration

Implementing matrix operations efficiently benefits a plethora of numerical algorithms underlying HPC and ML [6]. Acknowledging this, NVIDIA and AMD have developed specialized compute units. NVIDIA’s Tensor Cores and AMD’s Matrix Cores are designed to optimize matrix operations, enhancing computational speed and efficiency. We use the neutral term matrix accelerator when referring to either. Matrix multiplication, represented by the equation D=AB+C𝐷𝐴𝐵𝐶D=A\cdot B+Citalic_D = italic_A ⋅ italic_B + italic_C, is a foundational primitive in Linear Algebra (it is a BLAS level 3 operation). Equation 1 for all i,j𝑖𝑗i,jitalic_i , italic_j in the allowed range of matrix indices 1Size1𝑆𝑖𝑧𝑒1\dots Size1 … italic_S italic_i italic_z italic_e governs the behavior of matrix accelerators:

dij=ai1*bj1+ai2b2j++ainbnj+cij.subscript𝑑𝑖𝑗subscript𝑎𝑖1subscript𝑏𝑗1subscript𝑎𝑖2subscript𝑏2𝑗subscript𝑎𝑖𝑛subscript𝑏𝑛𝑗subscript𝑐𝑖𝑗d_{ij}=a_{i1}*b_{j1}+a_{i2}b_{2j}+...+a_{in}b_{nj}+c_{ij}.italic_d start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT * italic_b start_POSTSUBSCRIPT italic_j 1 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT italic_i 2 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 2 italic_j end_POSTSUBSCRIPT + … + italic_a start_POSTSUBSCRIPT italic_i italic_n end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT italic_n italic_j end_POSTSUBSCRIPT + italic_c start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT . (1)

II-C Block FMA

Existing public documentation on matrix accelerators [5, 6] indicates that they employ the so-called block FMA where the calculation in Equation 1 is achieved in parallel, essentially suffering rounding error comparable to doing one scalar FMA. In other words, if one unrolls Equation 1 into a serial loop and determines the rounding error, then it is clear that each add-multiply step can incur a half ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p error in RTN-TE, thus making the worst-case error grow with Size𝑆𝑖𝑧𝑒Sizeitalic_S italic_i italic_z italic_e. This is avoided in block-FMA; we assume block-FMA in the rest of this paper.

II-D Coding Matrix Acceleration

There are mainly two ways to invoke matrix accelerators:

  1. 1.

    Via High-Level APIs: One can make use of high-level C++ APIs such as nvcuda::wmma for NVIDIA and nrocmwmma:wmma for AMD. These APIs provide a structured and relatively user-friendly interface to interact with Tensor and Matrix cores, respectively.

  2. 2.

    Intermediate-Level Assembly Manipulation: For those delving deeper into the architecture, direct interaction with matrix accelerators unit is feasible through specific instruction sets. Within the NVIDIA platform, this is achieved by the PTX instruction set wmma operations. In contrast, AMD offers compiler intrinsic instructions such as __builtin_amdgcn_mfma_ are tailored for matrix operations.

In our work, we employed the high-level API directly, abiding by all the requirements for its invocation such as meeting dimensionality restrictions published by manufacturers. To double-check that matrix accelerator units will be active during operation, we check the underlying code. For NVIDIA, the presence of HMMA/DMMA in the SASS code indicates that the Tensor Cores will be activated. For AMD, spotting MFMA in the LLVM intermediate representation indicates the use of their Matrix Computing Units 888NVIDIA’s official documentation highlights the roles of HMMA and DMMA operations in Tensor Core operations [18] Similarly, AMD describes the role of MFMA in their official documentation [16]. These documents also mention the conditions to be met before these units are activated. Leaving nothing to chance, we check for these instructions explicitly. .

III Numerical Behaviors of Matrix Accelerators

Refer to caption
Figure 1: Matrix Unit Testing Approach. For the matrix accelerator under test, all the properties shown on the right are checked (all but monotonicity) or implied (monotonicity) by our tests.

Matrix computing units are now an indispensable part of GPU usage in machine learning while also attacting considerable interest from HPC developers [6]. This work aims to close significant gaps in the official documentation of NVIDIA and AMD detailing their numerical behaviors by designing tests that highlight specific differences. Our overall testing plan is illustrated in Figure 1, and detailed in subquent sections. Given our coverage of close to a dozen high-level features, this section will be hierarchically organized where sections detailing specific tests may be skipped on first reading.

III-A High-Level Testing Plan

The overall goal of a matrix accelerator is to efficiently support the calculations in producing the D𝐷Ditalic_D matrix where D=AB+C𝐷𝐴𝐵𝐶D=AB+Citalic_D = italic_A italic_B + italic_C, with A,B𝐴𝐵A,Bitalic_A , italic_B and C𝐶Citalic_C also being matrices. Since all D𝐷Ditalic_D entries are calculated in an identical manner, it suffices to focus on how one particular entry, namely d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT is calculated:

d11=a11b11+a12b21++a1nbn1+c11subscript𝑑11subscript𝑎11subscript𝑏11subscript𝑎12subscript𝑏21subscript𝑎1𝑛subscript𝑏𝑛1subscript𝑐11d_{11}=a_{11}b_{11}+a_{12}b_{21}+...+a_{1n}b_{n1}+c_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT = italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT + italic_a start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT + … + italic_a start_POSTSUBSCRIPT 1 italic_n end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT italic_n 1 end_POSTSUBSCRIPT + italic_c start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT (2)

We now focus on each aspect of Equation 2, discuss tests that target the discovery of the details hidden behind the following features (and these details are the ones that will help contrast various GPUs): The following tests will now be detailed in their own sections. An important contribution we make is to orchestrate these tests according to the order in the flow-chart in Figure 3 so that some cases are eliminated or concluded early, allowing later tests to discriminate cases without ambiguity.999It is important to reiterate that the features discovered by these tests are largely undocumented or hard to find. Our tests provide a “one-stop shop**” experience for quickly determining them and corroborating with documentation (that may change over time.)

T_si_no:

subnormal in; normal out;” i.e., if a computation unit is fed subnormal inputs, can it handle it at the input (without zeroing it), and produce a normal output?

T_ni_so:

normal in; subnormal out;” i.e., if a computation unit is provided normal inputs, and the computation resulting in a subnormal, then can this subnormal be output (or will it get zeroed)?

T_sa:

“subnormal accumulation ok;” i.e., if a set of subnormals are being added, is the accumulation successful (or is the output getting zeroed)?

T_1_bit:

at least one extra bit;” i.e., is there at least one extra precision bit in the computation unit?

T_rnd_dir:

rounding direction;” i.e., determine the rounding direction based on the test outcome: possible test outcomes are to say whether to zero (truncate), down (to -\infty- ∞), RTN-TE, or up (to ++\infty+ ∞) are happening (Figure 2).

T_3_bits_fin_rnd:

three extra bits are provided, final rounding;” i.e., tests that locate if three extra precision bits are provided. It also determines the final rounding direction followed.

T_prod:

product rounding direction;” similar to T_rnd_dir𝑇_𝑟𝑛𝑑_𝑑𝑖𝑟T\_rnd\_diritalic_T _ italic_r italic_n italic_d _ italic_d italic_i italic_r except for the product terms during block FMA.

T_blk_fma_width:

block FMA width;” i.e., what is the unit-width for the block FMA operations?

T_pres_extra_acc:

preservation of extra bit during accumulation;” i.e., are the extra bits preserved during the accumulation of a block FMA unit.

T_acc_order:

accumulation order control;” i.e., can we determine the accumulation order being followed by the block FMA during its accumulation stage?

III-B Details of Each Test

Refer to caption
Figure 2: The logic for test T_rnd_dir are presented here. By setting the a11b11subscript𝑎11subscript𝑏11a_{11}b_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT product as well as the a12b21subscript𝑎12subscript𝑏21a_{12}b_{21}italic_a start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT product (alternatively the c11subscript𝑐11c_{11}italic_c start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT value) to the indicated value, the execution is carried out (all other inputs not mentioned are set to 00). Then by examining the d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT output, we can decide which case we fall into with respect to the rounding being used. A similar logic also underlies the Tprodsubscript𝑇𝑝𝑟𝑜𝑑T_{prod}italic_T start_POSTSUBSCRIPT italic_p italic_r italic_o italic_d end_POSTSUBSCRIPT test.
Refer to caption
Figure 3: Testing workflow that sharpens each later test based on the previous ones. First settle the rounding mode of the accumulator (T_rnd_dir). Then settle the presence of an extra bit; if so then determine the initial rounding mode; then settle the use of 3 extra bits (T_1_bit, T_rnd_dir, and T_3_bits_fin_rnd); if so, check for ties and sticky bit. Having concluded the rounding mode, switch to settling FMA properties. Then the extra bits preserved. At that time, we can determine the block FMA width, accumulation order control (T_blk_fma_width, T_acc_order), and settle whether normalization happens once.

T_si_no, T_ni_so, and T_sa:

The objective here is to discern whether the matrix unit can handle subnormal numbers, both as input and output. The tests below will assign specific values to the right-hand side of Equation 2, assume full IEEE-compatible subnormal support, and check for the expected results under this assumption. The three tests are now detailed:

T_si_no: Initialize a11subscript𝑎11a_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT with an arbitrary subnormal number while ensuring that the product a11b11subscript𝑎11subscript𝑏11a_{11}b_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT yields a normal number; set all other input words in the d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT equation to 00. Now check whether d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT equals a11b11subscript𝑎11subscript𝑏11a_{11}b_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT; if so, the check passes; else, d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT is expected to emerge as zero, when the check fails.

T_ni_so: Initialize a11subscript𝑎11a_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT and b11subscript𝑏11b_{11}italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT with arbitrary normal numbers while ensuring that the product a11b11subscript𝑎11subscript𝑏11a_{11}b_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT is a subnormal number. Now examine whether d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT is a subnormal value (“pass”) or emerges as zero (“fail”).

T_sa: Assign an arbitrary subnormal to c11subscript𝑐11c_{11}italic_c start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT while kee** all other inputs at zero.The test observes whether d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT is this subnormal (“pass”) or 00 (“fail”).

T_1_bit, T_rnd_dir, and T_3_bits_fin_rnd:

These tests follow the logic in Figure 2 for the first two tests, and Figure 4 for the RTN-TE case and Figure 5 for the round to zero case. The tests are now detailed. Note that the tests can set either a12b21subscript𝑎12subscript𝑏21a_{12}b_{21}italic_a start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT or c11subscript𝑐11c_{11}italic_c start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT to 2222.

T_1_bit: We check the result of the operation 1(21×ulp)1superscript21𝑢𝑙𝑝1-(2^{-1}\times ulp)1 - ( 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT × italic_u italic_l italic_p ) to check if at least one extra bit is provided. Aligning 21×ulpsuperscript21𝑢𝑙𝑝2^{-1}\times ulp2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT × italic_u italic_l italic_p to 1 in its binary representation necessitates a shift amount equivalent to the mantissa bit length plus one. Consequently, if the resultant value remains 1(21×ulp)1superscript21𝑢𝑙𝑝1-(2^{-1}\times ulp)1 - ( 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT × italic_u italic_l italic_p ), it implies the existence of an extra bit in computations.

T_rnd_dir: Upon confirming the presence of an extra bit, the accumulator’s rounding behavior is assessed (Figure 2) .

T_3_bits_fin_rnd: We finally proceed to determine if three extra bits are present, and also determine the final rounding modes supported. Its logic and implementation are now discussed.101010Details provided for the sake of completeness; reading can be postponed. The key aspect of our testing approach was the alignment of three bits during the accumulation process. The goal was to ascertain the effects of preserving one, two, or all three extra bits on the mantissa component. This nuanced behavior was attained via subtraction operations, the intricacies of which are detailed in Figures 4 and 5.

Building upon our preliminary understanding of the rounding direction, we embarked on a series of rigorous tests. The essence of these tests is encapsulated in Figures 4 and 5, which respectively illustrate the methodologies for rounding-to-nearest and rounding-to-zero modes.111111We highlight these two modes given their adoption in matrix accelerators.

T_pres_extra_acc:

The fact that the extra bits are retained during block-FMA accumulation can be confirmed using the expression 1+21ulp+21ulp+21ulp1superscript21𝑢𝑙𝑝superscript21𝑢𝑙𝑝superscript21𝑢𝑙𝑝1+2^{-1}\cdot ulp+2^{-1}\cdot ulp+2^{-1}\cdot ulp1 + 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p (by making the accumulation of a block-FMA perform this calculation).If intermediate accumulation steps maintain these extra bits, the ultimate result will be 1+ulp1𝑢𝑙𝑝1+ulp1 + italic_u italic_l italic_p; else it will emerge as 1111.

T_acc_ord:

The significance of accumulation order control primarily arises in scenarios employing rounding to zero with the preservation of just one extra bit. Contrarily, when three extra bits are enabled (which facilitates a sticky bit with the rounding-to-nearest mode), the results remain consistent irrespective of the accumulation order. If only one extra bit is maintained, we must test to ascertain the accumulation order. For this, one can test all permutations of the terms in the equation 1+22ulp+22ulp+22ulp+22ulp1superscript22𝑢𝑙𝑝superscript22𝑢𝑙𝑝superscript22𝑢𝑙𝑝superscript22𝑢𝑙𝑝1+2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp1 + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p. Given that only one extra bit is retained in the rounding to zero case, the precision associated with the terms 22ulp+22ulp+22ulp+22ulpsuperscript22𝑢𝑙𝑝superscript22𝑢𝑙𝑝superscript22𝑢𝑙𝑝superscript22𝑢𝑙𝑝2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p tends to be lost, save for when it is computed first. That is, if we allow all the “small values” to add up first, then even with one extra bit, we will get the answer 1+ulp1𝑢𝑙𝑝1+ulp1 + italic_u italic_l italic_p. In other words, ff we can externally control the order of reduction by assigning these terms to specific positions within an FMA unit, there exists a output yields 1+ulp1𝑢𝑙𝑝1+ulp1 + italic_u italic_l italic_p and hence the reduction order is under user control; else not.

Data: Matrices a𝑎aitalic_a, b𝑏bitalic_b, c𝑐citalic_c, and d𝑑ditalic_d. assuperscript𝑎𝑠a^{\prime}sitalic_a start_POSTSUPERSCRIPT ′ end_POSTSUPERSCRIPT italic_s row’s length K𝐾Kitalic_K.
1 Initialize all values in a𝑎aitalic_a, b𝑏bitalic_b, c𝑐citalic_c to 0 c111.subscript𝑐111c_{11}\leftarrow 1.italic_c start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT ← 1 . a11×b1121×ulpsubscript𝑎11subscript𝑏11superscript21ulpa_{11}\times b_{11}\leftarrow 2^{-1}\times\text{ulp}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT × italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT ← 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT × ulp for i1normal-←𝑖1i\leftarrow 1italic_i ← 1 to K𝐾Kitalic_K do
2       if i>1𝑖1i>1italic_i > 1 then
3             a1(i1)×b(i1)10subscript𝑎1𝑖1subscript𝑏𝑖110a_{1(i-1)}\times b_{(i-1)1}\leftarrow 0italic_a start_POSTSUBSCRIPT 1 ( italic_i - 1 ) end_POSTSUBSCRIPT × italic_b start_POSTSUBSCRIPT ( italic_i - 1 ) 1 end_POSTSUBSCRIPT ← 0
4      (a1i×bi1)(21×ulp)subscript𝑎1𝑖subscript𝑏𝑖1superscript21ulp(a_{1i}\times b_{i1})\leftarrow(2^{-1}\times\text{ulp})( italic_a start_POSTSUBSCRIPT 1 italic_i end_POSTSUBSCRIPT × italic_b start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT ) ← ( 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT × ulp )
5       Call wmma(a𝑎aitalic_a, b𝑏bitalic_b, c𝑐citalic_c, d𝑑ditalic_d) if d11=1.subscript𝑑111d_{11}=1.italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT = 1 . then
6             break
7      
8if index <<< K then
9       min_preserve_uintindex𝑚𝑖𝑛_𝑝𝑟𝑒𝑠𝑒𝑟𝑣𝑒_𝑢𝑖𝑛𝑡𝑖𝑛𝑑𝑒𝑥min\_preserve\_uint\leftarrow indexitalic_m italic_i italic_n _ italic_p italic_r italic_e italic_s italic_e italic_r italic_v italic_e _ italic_u italic_i italic_n italic_t ← italic_i italic_n italic_d italic_e italic_x
10else
11       min_preserve_uint𝑚𝑖𝑛_𝑝𝑟𝑒𝑠𝑒𝑟𝑣𝑒_𝑢𝑖𝑛𝑡min\_preserve\_uintitalic_m italic_i italic_n _ italic_p italic_r italic_e italic_s italic_e italic_r italic_v italic_e _ italic_u italic_i italic_n italic_t is larger than K𝐾Kitalic_K
Algorithm 1 Test Minimum Unit for FMA property preservation. The idea is to assign a moving position the 21ulpsuperscript21𝑢𝑙𝑝2^{-1}ulp2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT italic_u italic_l italic_p value and when that position goes beyond the width of the block FMA, we get a 1111 output. That index is the FMA block width.

T_blk_fma_width:

To determine the block size of the block FMA unit, we execute a test loop given in Figure 1. Within a single FMA unit, precision remains intact throughout computation, and rounding occurs only at the concluding bit position. The key idea realized by this test is to load-up a 1-bit at a pair of moving positions denoted by a1isubscript𝑎1𝑖a_{1i}italic_a start_POSTSUBSCRIPT 1 italic_i end_POSTSUBSCRIPT and bi1subscript𝑏𝑖1b_{i1}italic_b start_POSTSUBSCRIPT italic_i 1 end_POSTSUBSCRIPT such that a1i×b_i1subscript𝑎1𝑖𝑏_𝑖1a_{1i}\times b\_{i1}italic_a start_POSTSUBSCRIPT 1 italic_i end_POSTSUBSCRIPT × italic_b _ italic_i 1 is ensured to be half a ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p (21×ulpsuperscript21𝑢𝑙𝑝2^{-1}\times ulp2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT × italic_u italic_l italic_p). We use the aforementioned equation and shift the last term, 21ulpsuperscript21𝑢𝑙𝑝2^{-1}\cdot ulp2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p, across the matrix multiplication positions as illustrated. A loss of precision at Line 1 (the half ulp𝑢𝑙𝑝ulpitalic_u italic_l italic_p vanishes) signals that the initialization occurred in a detached FMA unit where it suffers a rounding precision loss; the “end of a FMA unit” in effect gets detected when the final value is 1111.

T_prod:

Tests for the rounding mode of the product are only performed for FP64 and FP32 inputs. Assuming each multiplier has an n𝑛nitalic_n-bit mantissa, their products can occupy only 2n2𝑛2n2 italic_n bits. Taking this into consideration, the input formats such as FP16, TF32, and BF16—which respectively possess 10, 10, and 7-bit mantissa bits—do not experience precision loss when operating within an FP32 environment where a 23-bit mantissa is used.

To check product rounding, we can employ the same methodology used for rounding mode assessment during accumulation (Figure 2). Specifically, for the product a11b11subscript𝑎11subscript𝑏11a_{11}\cdot b_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT ⋅ italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT, if we set one term to be 1+2ulp+ulp12𝑢𝑙𝑝𝑢𝑙𝑝1+2\cdot ulp+ulp1 + 2 ⋅ italic_u italic_l italic_p + italic_u italic_l italic_p and the other as 1+221superscript221+2^{-2}1 + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT, the exact result is 1+22+3ulp+21ulp+22ulp1superscript223𝑢𝑙𝑝superscript21𝑢𝑙𝑝superscript22𝑢𝑙𝑝1+2^{-2}+3\cdot ulp+2^{-1}\cdot ulp+2^{-2}\cdot ulp1 + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT + 3 ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 1 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p + 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ italic_u italic_l italic_p, with a 110 suffix to the end of the mantissa bit. Similarly we can incorporate negative-number test scenarios (Figure 2) and referencing the rounding tests depicted there, we can deduce the rounding mode used in the multiplication operation.

Refer to caption
Figure 4: Binary Computation for Two Numbers Addition with Rounding to Nearest Mode. Here is how to read this figure. On the left, the situation of a11b11subscript𝑎11subscript𝑏11a_{11}b_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT (augend) with a specific input is shown. This value is aligned since the addend (a12b21subscript𝑎12subscript𝑏21a_{12}b_{21}italic_a start_POSTSUBSCRIPT 12 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 21 end_POSTSUBSCRIPT or c11subscript𝑐11c_{11}italic_c start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT) has the higher exponent. Alignment under one, two, or three extra bits is shown underneath a11b11subscript𝑎11subscript𝑏11a_{11}b_{11}italic_a start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT italic_b start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT. The result produced by “rtn” (RTN-TE) is shown as emitted by the bottom red oval.
Refer to caption
Figure 5: Binary Computation for Two Numbers Addition with Rounding to Zero Mode. Follow the reading suggestions as with Figure 4.

IV Feature Test Results

TABLE III: Results of Analyzing Matrix Accelerators. Here, “FMA unit size” is the number of words considered before rounding and normalization are performed (“Block FMA Size”). Note that V100 only support FP16. Further, FP64 is not supported in MI100. Last column: ”Case 1” is for add/accumulate, and ”Case 2” refers to the product test T_prod. ✓is yes, and ✗is no.
Inputs GPU
Subnormal
inputs
handled?
Subnormal
outputs
handled?
Extra bit
present?
How many?
Rounding
mode
exhibited
FMA
unit
width
Order
within one
FMA unit
is controllable?
Rounding mode for:
1. outputting FP16/BF16
(only for FP16/BF16
inputs)
2. product (only for
FP32/FP64 inputs)
FP16 V100 0 truncate 4 RTN-TE
A100 1 truncate 8 RTN-TE
H100 2absent2\geq 2≥ 2 truncate 16absent16\geq 16≥ 16 RTN-TE
MI100 3 RTN-TE*{}^{*}start_FLOATSUPERSCRIPT * end_FLOATSUPERSCRIPT 4 RTN-TE
MI250X 3 RTN-TE 1 N.A. RTN-TE
BF16 A100 1 truncate 8 N.A.**absent{}^{**}start_FLOATSUPERSCRIPT * * end_FLOATSUPERSCRIPT
H100 2absent2\geq 2≥ 2 truncate 16absent16\geq 16≥ 16 RTN-TE
MI100 3 RTN-TE 2 RTN-TE
MI250X 3 RTN-TE 1 N.A. RTN-TE
TF32(NVIDIA) FP32(AMD) A100 1 RTN-TE 4 N.A.
H100 2absent2\geq 2≥ 2 truncate 4 N.A.
MI100 3 RTN-TE 1 N.A. RTN-TE
MI250X 3 RTN-TE 1 N.A. RTN-TE
FP64 A100 3 RTN-TE 1 RTN-TE
H100 3 RTN-TE 1 RTN-TE
MI250X 3 RTN-TE 1 N.A. RTN-TE
* RTN-TE = round to nearest and round to even when tie.
* A100 doesn’t support BF16 output.

Table III presents our final compilation of results obtained from testing various GPUs. We discuss the results in detail below:

Subnormal Supports All the GPUs tested support subnormal numbers for inputs and outputs, with the exception of FP16 and BF16 formats of MI250X which does not. It is important to note that the absence of subnormal support could lead to the risk of generating exceptions such as division by zero as mentioned in §II.

Extra Bits for Computation AMD GPUs consistently use three extra bits for precise rounding. In contrast, NVIDIA GPUs have evolved across generations: the V100 does not include any extra bits, the A100 includes one, and the H100 includes at least two extra bits121212Due to our limited access to the H100, we can only test for more than 2 extra bits. We did not conduct further FMA unit width tests for the same reason. We can, however, easily expand our tests to include three extra bits.. For FP64 inputs, all GPUs incorporate an additional three bits.

Rounding Modes The chosen rounding mode is consistent across NVIDIA and AMD GPUs, with all models adhering to the chosen mode consistently across generations.

FMA Feature NVIDIA’s V100 has an FMA unit width of 4, and the A100 expands this to 8, as documented. The H100’s FMA unit width is suggested to be at least 16, a detail not officially confirmed. For TF32 inputs on NVIDIA GPUs, the FMA unit width is 4, which suits the 19-bit size of TF32. The AMD MI100 maintains FMA features with different widths for FP16 and BF16 inputs, but the MI250X lacks this feature. While FMA units can enhance accuracy, they may complicate the porting of CPU algorithms which do not typically support blocked FMA operations.

Rounding Mode for Outputting FP16 and BF16 We have examined the rounding mode used when GPUs output FP16 and BF16. All GPU models use the RTN-TE rounding mode. We hypothesize that the conversion to lower precision is performed after the computation at full precision.

Rounding Mode for Product For products involving FP32/FP64 inputs, all GPUs utilize the RTN-TE mode, demonstrating consistency in following IEEE floating-point arithmetic standards.

V Exhibiting Porting Danger in a Matrix Multiplication Routine

We now illustrate an example in which we perform a simple matrix multiplication to demonstrate how these subtle implementation difference in GPU architectures can vary the numerical outcomes. We analyze the matrix multiplication equation

D=αAB+βC𝐷𝛼𝐴𝐵𝛽𝐶D=\alpha\cdot A\cdot B+\beta\cdot Citalic_D = italic_α ⋅ italic_A ⋅ italic_B + italic_β ⋅ italic_C

with matrices A𝐴Aitalic_A and B𝐵Bitalic_B in FP16 format (sized mk𝑚𝑘m\cdot kitalic_m ⋅ italic_k and kn𝑘𝑛k\cdot nitalic_k ⋅ italic_n, respectively) and matrices C𝐶Citalic_C and D𝐷Ditalic_D in FP32 format (both sized mn𝑚𝑛m\cdot nitalic_m ⋅ italic_n). Here, α=1𝛼1\alpha=-1italic_α = - 1 and β=1𝛽1\beta=1italic_β = 1, and we set the matrix dimensions to m=n=k=213𝑚𝑛𝑘superscript213m=n=k=2^{13}italic_m = italic_n = italic_k = 2 start_POSTSUPERSCRIPT 13 end_POSTSUPERSCRIPT.

For matrix C𝐶Citalic_C, Cij=220subscript𝐶𝑖𝑗superscript220C_{ij}=2^{20}italic_C start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 2 start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT for all i𝑖iitalic_i and j𝑗jitalic_j. In matrix A𝐴Aitalic_A, Ai0=210subscript𝐴𝑖0superscript210A_{i0}=2^{10}italic_A start_POSTSUBSCRIPT italic_i 0 end_POSTSUBSCRIPT = 2 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT, Aij=22subscript𝐴𝑖𝑗superscript22A_{ij}=2^{-2}italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT for odd j𝑗jitalic_j, and Aij=23subscript𝐴𝑖𝑗superscript23A_{ij}=2^{-3}italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT = 2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT for even j𝑗jitalic_j (except j=0𝑗0j=0italic_j = 0). For matrix B𝐵Bitalic_B, B0j=210subscript𝐵0𝑗superscript210B_{0j}=2^{10}italic_B start_POSTSUBSCRIPT 0 italic_j end_POSTSUBSCRIPT = 2 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT, with other Bijsubscript𝐵𝑖𝑗B_{ij}italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT values set at 23superscript232^{-3}2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT. In this scenario, each element of matrix D𝐷Ditalic_D is calculated as follows (note that all Dijsubscript𝐷𝑖𝑗D_{ij}italic_D start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT will be the same):

Dijsubscript𝐷𝑖𝑗\displaystyle D_{ij}italic_D start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT =(Ai0B0j+j%2=1AijBij+j%2=0j0AijBij)+Cijabsentsubscript𝐴𝑖0subscript𝐵0𝑗subscriptpercent𝑗21subscript𝐴𝑖𝑗subscript𝐵𝑖𝑗superscriptsubscriptpercent𝑗20𝑗0subscript𝐴𝑖𝑗subscript𝐵𝑖𝑗subscript𝐶𝑖𝑗\displaystyle=-(A_{i0}\cdot B_{0j}+\sum_{j\%2=1}A_{ij}\cdot B_{ij}+\sum_{j\%2=% 0}^{j\neq 0}A_{ij}\cdot B_{ij})+C_{ij}= - ( italic_A start_POSTSUBSCRIPT italic_i 0 end_POSTSUBSCRIPT ⋅ italic_B start_POSTSUBSCRIPT 0 italic_j end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_j % 2 = 1 end_POSTSUBSCRIPT italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ⋅ italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT + ∑ start_POSTSUBSCRIPT italic_j % 2 = 0 end_POSTSUBSCRIPT start_POSTSUPERSCRIPT italic_j ≠ 0 end_POSTSUPERSCRIPT italic_A start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ⋅ italic_B start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT ) + italic_C start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT
=(210210212222321212323)+220absentsuperscript210superscript210subscriptsuperscript212superscript22superscript23subscriptsuperscript2121superscript23superscript23superscript220\displaystyle=-(2^{10}\cdot 2^{10}-\sum_{2^{12}}2^{-2}\cdot 2^{-3}-\sum_{2^{12% }-1}2^{-3}\cdot 2^{-3})+2^{20}= - ( 2 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT ⋅ 2 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT - ∑ start_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ 2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT - ∑ start_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT 12 end_POSTSUPERSCRIPT - 1 end_POSTSUBSCRIPT 2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT ⋅ 2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT ) + 2 start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT
=27+2626191.99218absentsuperscript27superscript26superscript26191.99218\displaystyle=2^{7}+2^{6}-2^{-6}\approx 191.99218= 2 start_POSTSUPERSCRIPT 7 end_POSTSUPERSCRIPT + 2 start_POSTSUPERSCRIPT 6 end_POSTSUPERSCRIPT - 2 start_POSTSUPERSCRIPT - 6 end_POSTSUPERSCRIPT ≈ 191.99218

Note that the terms 2223superscript22superscript232^{-2}\cdot 2^{-3}2 start_POSTSUPERSCRIPT - 2 end_POSTSUPERSCRIPT ⋅ 2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT and 2323superscript23superscript232^{-3}\cdot 2^{-3}2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT ⋅ 2 start_POSTSUPERSCRIPT - 3 end_POSTSUPERSCRIPT require bit shifts (25 bits or 26 bits) to align 220=210210superscript220superscript210superscript2102^{20}=2^{10}\cdot 2^{10}2 start_POSTSUPERSCRIPT 20 end_POSTSUPERSCRIPT = 2 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT ⋅ 2 start_POSTSUPERSCRIPT 10 end_POSTSUPERSCRIPT. Thus there would be precision loss for FP32 computation unit. This loss may vary depending on the number of extra bits preserved and the length of the FMA operation. This variation is what produces the sharp result-difference that we observed.

Specifically, we observed these (rather highly different) Dijsubscript𝐷𝑖𝑗D_{ij}italic_D start_POSTSUBSCRIPT italic_i italic_j end_POSTSUBSCRIPT values computed using a simple GEMM implementation on different GPUs for the very same A𝐴Aitalic_A, B𝐵Bitalic_B, and C𝐶Citalic_C matrix inputs: 00 on NVIDIA A100, V100, AMD MI250 and CPU; 255.875255.875255.875255.875 on AMD MI100; and 191.875191.875191.875191.875 on NVIDIA H100.

These discrepancies highlight the importance of understanding hardware-specific computational feature differences; ignoring these and porting across GPUs can vary results across this wide range.

Importance of This Pattern, Consequences: The pattern D=CAB𝐷𝐶𝐴𝐵D=C-A\cdot Bitalic_D = italic_C - italic_A ⋅ italic_B (where α=1𝛼1\alpha=-1italic_α = - 1 and β=1𝛽1\beta=1italic_β = 1) is closely related to trailing matrix updates Ai=AiPiTisubscript𝐴𝑖subscript𝐴𝑖subscript𝑃𝑖subscript𝑇𝑖A_{i}=A_{i}-P_{i}T_{i}italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT = italic_A start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT - italic_P start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT italic_T start_POSTSUBSCRIPT italic_i end_POSTSUBSCRIPT used in a mixed-precision GMRES (Generalized Minimal Residual Method) iterative refinement algorithm [19, 20]. This approach is embedded in cuSolvers131313https://docs.nvidia.com/cuda/cusolver/index.html#cusolverirsrefinement-t. In general, a computation of the type D=CAB𝐷𝐶𝐴𝐵D=C-A\cdot Bitalic_D = italic_C - italic_A ⋅ italic_B is part of a standard BLAS (Basic Linear Algebra Subprograms) level 3 family. It is ubiquitous in various numerical linear algebra computations. This observation spawns future directions discussed in §VI.

VI Concluding Remarks

Matrix accelerators are an important emerging component in the computational landscape, yet very little can be easily determined about their numerical behavior. We study five such accelerators in this paper (with many more available for use, but with even less documented). One can learn very little by testing these units on random inputs. We observe that by exploiting our understanding of basic IEEE floating-point semantics and whatever is published about these units (e.g., that they perform block FMA), we can devise tests that target many key attributes such as subnormal support, rounding modes chosen, the number of hidden bits, accumulation order control, and width of the basic FMA blocks. Manufacturers select these features largely based on the cost of implementation; for instance, supporting one extra bit is cheaper; and that makes more demanding rounding modes (e.g., round-to-nearest) impossible to attain. Yet manufacturers may have targeted these accelerators for their own set of priority applications. For instance, they may have designed a set of features that make machine learning fast and efficient. They might view it as the HPC developer’s “fault” for using such matrix accelerators to perform HPC. On the other hand, HPC programmers are unaware of many of these dangers such as six orders of magnitude difference in results—unless they are lucky to choose such input matrices. By designing focused feature-targeted tests, we help foresee pitfalls.

Our tests are not foolproof. All we can say is that we assume significant levels of symmetry in design. For instance, if a manufacturer changes the precision of the diagonal outputs d11subscript𝑑11d_{11}italic_d start_POSTSUBSCRIPT 11 end_POSTSUBSCRIPT, d22subscript𝑑22d_{22}italic_d start_POSTSUBSCRIPT 22 end_POSTSUBSCRIPT, …, dNNsubscript𝑑𝑁𝑁d_{NN}italic_d start_POSTSUBSCRIPT italic_N italic_N end_POSTSUBSCRIPT for some reason (say because they determine the Eigenvalues in some cases), then all bets are off! Our tests are still valuable in flagging certain porting decisions as dangerous, thus adding to the overall porting strategies.

We do have more findings than have been highlighted. For example, we have discerned that the tensor cores operates with a width of 8. This observation is in agreement with the details revealed in NVIDIA’s white paper [21] that the tensor cores’ dimensions are 8×4×88488\times 4\times 88 × 4 × 8. Our tests also reveal that the hardware unit size for AMD GPUs, which is not documented as far as we know, is merely 1111. This can perhaps help explain rounding errors that might be higher (or might point to a hardware implementation decision taken by AMD). However, this agrees with what CPUs also follow. Additionally, the rounding mode employed by the AMD matrix accelerator adheres to the IEEE 754 standard. This is significant as it suggests a good possibility of producing results on AMD Matrix Cores that are consistent with those computed on many CPUs.

Another inference from our tests is that our conclusions on monotonicity match the observations in [8] about extra bit requirements. We can show that NVIDIA Tensor Cores can violate monotonicity, and for AMD GPUs, because of 3 extra bits, monotoncity will be preserved.

One exciting direction triggered by the trailing matrix updates pattern is that many other such patterns may exist in one’s implementation of linear-algebra routines as well as core routines in other areas. This suggests develo** tests for patterns found in other application spaces. We would also like to work on generalizing our tests using formal methods [22, 23] which may allow groups to check their tests for consistency and overlaps.

References

  • [1] Ganesh Gopalakrishnan, Ignacio Laguna, Ang Li, Pavel Panchekha, Cindy Rubio-González, and Zachary Tatlock. Guarding numerics amidst rising heterogeneity. In 2021 IEEE/ACM 5th International Workshop on Software Correctness for HPC Applications, pages 9–15, 2021.
  • [2] NVIDIA. NVIDIA A100 Tensor Core GPU architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, 2020.
  • [3] AMD. Amd cdna architecture. https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf, 2020.
  • [4] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021.
  • [5] Pierre Blanchard, Nicholas J Higham, Florent Lopez, Théo Mary, and Srikara Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM Journal on Scientific Computing, 2020.
  • [6] Jack Dongarra, Laura Grigori, and Nicholas Higham. Numerical algorithms for high-performance computational science. Phil. Trans. R. Soc. A.3782019006620190066, 2020. http://doi.org/10.1098/rsta.2019.0066.
  • [7] GOOGLE. Cloud tensor processing units. https://cloud.google.com/tpu/docs/tpus, 2022.
  • [8] Mantas Mikaitis. Monotonicity of multi-term floating-point adders, 2023.
  • [9] M. Fasiand N.J. Higham and M. Mikaitis and S.Pranesh. Numerical behavior of NVIDIA tensor cores. In PeerJ Comput Sci, February 2021.
  • [10] David Defour, Guillaume Hanrot, Vincent Lefèvre, Jean-Michel Muller, Nathalie Revol, and Paul Zimmermann. Proposal for a Standardization of Mathematical Function Implementation in Floating-Point Arithmetic. Research Report RR-5406, INRIA, 2004.
  • [11] Brian Gladman, Vincenzo Innocente, and Paul Zimmermann. Accuracy of mathematical functions in single, double, double extended, and quadruple precision, 2023.
  • [12] JaeHyuk Kwack, John Tramm, Colleen Bertoni, Yasaman Ghadar, Brian Homerding, Esteban Rangel, Christopher Knight, and Scott Parker. Evaluation of performance portability of applications and mini-apps across amd, intel and nvidia gpus. In 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pages 45–56, 2021.
  • [13] Anton Rydahl, Joseph Huber, Ethan Luis Mcdonough, and Johannes Doerfert. Precision and performance analysis of c standard math library functions on gpus. In Proceedings of the SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pages 892–903, 2023.
  • [14] Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefvre, Guillaume Melquiond, Nathalie Revol, and Serge Torres. Handbook of Floating-Point Arithmetic. Birkhäuser Basel, 2nd edition, 2018.
  • [15] NVIDIA. Cuda floating point and ieee 754, 2024. https://docs.nvidia.com/cuda/floating-point/index.html.
  • [16] AMD. ”amd instinct mi100” instruction set architecture reference guide. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/instinct-mi100-cdna1-shader-instruction-set-architecture.pdf, 2020. Accessed: 2023-12-17.
  • [17] Nicholas J Higham. Accuracy and stability of numerical algorithms. SIAM, 2002.
  • [18] Nvidia. Cuda binary utilities. https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf, 2023. Accessed: 2023-12-17.
  • [19] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. Harnessing gpu tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 603–613. IEEE, 2018.
  • [20] Azzam Haidar, Harun Bayraktar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. Mixed-precision iterative refinement using tensor cores on gpus to accelerate solution of linear systems. Proceedings of the Royal Society A, 476(2243):20200110, 2020.
  • [21] CUDA NVIDIA. Nvidia a100 tensor core gpu architecture. Volume 1.0: Whitepaper, Part, 1:82, 2020.
  • [22] Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.
  • [23] Sylvie Boldo, Jacques-Henri Jourdan, Xavier Leroy, and Guillaume Melquiond. Verified Compilation of Floating-Point Computations. Journal of Automated Reasoning (JAR), 54(2):135–163, 2015.