FTTN: Feature-Targeted Testing for Numerical Properties of NVIDIA & AMD Matrix Accelerators

Xinyi Li Kahlert School of Computing
University of Utah
USA
[email protected] Ang Li Pacific Northwest National
Laboratory
USA
[email protected] Bo Fang Pacific Northwest National
Laboratory
USA
[email protected] Katarzyna Swirydowicz Pacific Northwest National
Laboratory
USA
[email protected] Ignacio Laguna Lawrence Livermore National
Laboratory
USA
[email protected] Ganesh Gopalakrishnan Kahlert School of Computing
University of Utah
USA
[email protected]

Abstract

NVIDIA Tensor Cores and AMD Matrix Cores (together called Matrix Accelerators) are of growing interest in high-performance computing and machine learning owing to their high performance at low power consumption. Unfortunately, very few facts are publicly documented about some of their attributes that can affect answers computed on identical code. Examples of such features are the number of extra precision bits, accumulation order of addition, and predictable subnormal number handling during computations. We demonstrate that the lack of information on how these features differ across two matrix accelerators can make it impossible to reliably port codes across GPUs containing these differing accelerators. In response to this challenge, this paper offers a collection of tests that are based on a precise understanding of the IEEE floating-point standard as well as previously discovered formal results about the impact of floating-point features on numerical behavior. By running these tests on a large number of widely used and recent GPUs, we show that our tests can unearth feature differences that affect computed results. We exhibit these differences across five floating-point formats, four standard rounding modes and additional four feature combinations including those relating to rounding and preservation of extra precision bits. This extensive testing demonstrates the versatility of our tests in picking up salient differences that can affect numerical behavior across this space. As further proof of the discriminative power of our approach, we design a simple matrix-multiplication test with the matrix entries designed with insights gathered from our feature-tests. We executed this very simple test on five platforms, producing different answers: V100, A100, and MI250X produced 0, MI100 produced 255.875, and Hopper H100 produced 191.875. There is no prior work that shows that a simple test like this can produce three different answers on five different platforms—raising concern that one carefully understand Matrix Accelerator features before porting code across them.

Index Terms:

NVIDIA GPU, Tensor Cores, AMD GPU, Matrix Units, floating-point arithmetic, high performance computing, machine learning, Correctness Portability

I Introduction

We are in an era of rising computing hardware heterogeneity where many new CPU and GPU components are introduced in rapid succession [1], and are fueling performance advances in HPC and ML: from drug discovery to climate simulations and beyond. While no scientist aims to achieve higher performance at the expense of correctness, ensuring correctness has become a serious challenge given the sheer number of hardware units and the rapidity of their adoption. Specifically in the realm of GPU-based accelerators, programmers are interested in testing codes developed for NVIDIA GPUs on AMD GPUs that are becoming available: MI100 at first, MI250X now and soon MI300 that will be used in the upcoming El Capitan Exascale machine¹¹1https://www.llnl.gov/article/49131/llnl-scientists-eagerly-anticipate-el-capitans-potential-impact, with earlier AMD models already in use in Oakridge OLCF Frontier²²2https://www.olcf.ornl.gov/olcf-resources/compute-systems/frontier/. Unfortunately, documentation about many aspects of these GPUs is found seriously lacking in terms of numerical aspects. Unanswered questions not only pertain to particular behaviors such as precision loss for a specific operator but also important features such as the rounding modes supported, fused-multiply-addition (FMA) details, the number of extra precision bits held inside, the granularity of their block fused-multiply-add, etc. The presence or absence of these features significantly influences numerous high-stakes algorithms in fields like HPC and ML. At present, programmers resort to running many applications and monitoring the results; this process is not rigorous or scalable, and a new program might one day cause a serious result difference. It is highly desirable to have a set of straightforward tests that can quickly pick up salient feature differences between GPUs, but these do not exist. Our primary contribution in this paper is a rigorous methodology that has enabled us to create such discriminatory tests for NVIDIA and AMD GPUs, with the methodology generalizable and applicable to future GPUs.

Matrix Accelerators

While the general lack of information that affects reliable code porting across GPUs is well-acknowledged, matrix accelerators pose an even higher degree of difficulty because of their growing importance and even more dearth of information—hence forming the central focus of this paper. We use the term “Matrix Accelerator” as a generic term to refer to what NVIDIA calls “Tensor Cores [2]” and AMD calls “Matrix Cores [3].” Matrix Accelerators are indispensable for achieving today’s performance levels in ML. It is safe to say that language-level models (e.g. ChatGPT) will not have happened without Tensor Cores (ML training will take at least 10 times longer without the acceleration of Tensor Cores [4]). Naturally, matrix accelerators have caught the eye of HPC designers who see its 4 $\times$ speedup with 80% less energy consumption [5] a real avenue toward much faster and energy-efficient codes as promoted in prominent articles [6]. The work in this paper is designed to help programmers port code across matrix accelerators more reliably, based on the commonality of features that our tests help confirm.

Unfortunately, matrix accelerators are described in the literature mainly with respect to their usage and not numerical properties. With the growing number of these units in upcoming GPUs [7], this situation poses a serious impediment to those wanting to use them for HPC or port code across two different models. Additionally, designers have to keep in mind the variety of number formats supported by them, and variety of hardware features.

What makes this variety troubling is that many of these details (barring a few basic aspects of floating-point arithmetic supported) are not documented anywhere in a manner that is easily accessible to the general public. We show in this work that the undocumented feature differences can affect computational result portability in an extreme manner.

In today’s HPC software development approaches, not being able to use a new GPU or its matrix accelerator as soon as they become available is a serious handicap: one cannot plan code migration, evaluate previous applications on new machines, or provide timely feedback to vendors. Going “eye-ball result agreements” on test codes (often today’s approach) runs the risk of the unexamined cases. The community urgently needs approaches that can reveal the differences in terms of critical numerical features that are bound to affect some (future) program, thus serving as early warning. Such an approach applicable to CPUs or GPUs in general, and matrix accelerators in particular is the key goal of this paper.

To provide some more evidence, recall that if one uses 32-bit floating-point format (FP32), one can expect a result-difference in the 8th decimal fraction position; this is because FP32’s round-to-nearest rounding can guarantee 7 (fractional) digits of accuracy. The corresponding number for the 16-bit format (FP16, used internally by matrix accelerators) is three fractional digits being guaranteed. However, if one ports an FP16 implementation of an HPC routine across platforms and obtains a difference in the third decimal fraction position, then it is perhaps worth investigating what caused the error to exceed what precision arguments imply.

In fact, we could create a focused test that went much further! This test performed matrix multiplication based on our understanding of feature differences across matrix accelerators, obtaining the following answers: V100, A100, and MI250X produced 0; MI100 produced 255.875; and Hopper H100 produced 191.875. This error is clearly far more serious: not merely the fourth fractional digit but the hundredth digit has been affected—an error that is six orders of magnitude higher. Such results have never been demonstrated before, thus making the work in this paper—specifically our straightforward tests—important to put in the hands of today’s programmers (we will release all our tests upon acceptance).

Problem Addressed

The problem addressed is the design of a simple and practical approach to check whether critical feature differences lurk in execution units. The tests ought to be based on straightforward logic pertaining to basic floating-point facts, allowing them to be easily extended to newer model GPUs and Matrix Accelerators being developed by multiple companies (e.g., Google TPUs [7] models; such accelerators may be of interest to HPC designers for their power/performance advantages). Running larger test programs such as actual large-scale numerical solvers does not meet this need, as there is significant overhead associated with making programs run—especially on new hardware where libraries and compiler features might be missing.

Our feature-directed testing approach goes to a considerable distance in meeting these idealized goals. By showing the presence/absence of a feature, it helps answer how whole classes of programs can be affected, as will be shown (§V).

I-A Related Work and Improvements Over Them

Correctness of GPU numerical behavior is a vast topic; given space restrictions, we study only those that target matrix accelerators or contrast different GPUs. In [8, 9], the concept of monotonicity (applicable to vector reductions) was studied using a testing-guided approach (the mention of monotonicity as a desired property appears even earlier [10]). Monotonicity is true with respect to two vectors A and B of equal length if the result of adding all elements of A must not exceeding that produced by all the elements of B provided $A$ is position-wise less than or equal to B (i.e., $A[i]\leq B[i]$ ). Monotonicity can be violated if (1) the reduction order is not externally controllable, and (2) there aren’t enough extra precision bits (specifically, 3 bits) provided within the arithmetic unit [8]. In this sense, their work set the stage for focusing on feature differences; we advance this direction significantly in this paper.

The question of numerical portability between various GPUs including AMD GPUs has been studied in [11]. They target math function results produced by earlier generation GPUs and libraries, with no coverage of matrix accelerators. The paper [12] targets performance portability across NVIDIA and AMD MI-100 GPUs The paper [13] tests and reports the extent to which numercial errors exhibited by standard library math functions vary across platforms.

TABLE I: Floating-Point Format Comparison (TF32 is a proprietary format and can be implementation-dependent). The more the mantissa bits, the lesser the

ulp

. Notice how BF16 sacrifices mantissae for higher dynamic range (larger exponent size)

Format	Sign Bit (S)	Exponent Bits (E)	Mantissa Bits (F)	Min. Exponent ( $e_{min}$ )	Max. Exponent ( $e_{max}$ )	$ulp$ , i.e. ulp(1)
FP16	1	5	10	-14	15	$2^{-10}$
FP32	1	8	23	-126	127	$2^{-23}$
FP64	1	11	52	-1022	1023	$2^{-52}$
BF16	1	8	7	-63	63	$2^{-7}$
TF32	1	8	10	-126	127	$2^{-10}$

Summary of Contributions:

•

We offer a novel set of feature-targeted tests, with clear evidence that if such features are not preserved across the source and target platform, execution results may seriously differ.
•

Our tests form a pipeline with earlier tests confirming/denying certain features and later feature tests taking advantage of it to unambiguously confirm/deny a second feature, and so on.
•

Our tests are simple enough to be run on early access machines that may not have full-fledged libraries or runtimes, yet powerful enough to confirm (or refute) the status of higher level features supported in hardware.
•

With the availability of AMD machines, our work meets a critical need of moving code from NVIDIA to AMD for full comparisons. We also point out the dangers of doing porting in reverse, when subnormal support is missing (this relates to support for hardware trap** of exceptions).
•

This is the most extensive testing of matrix accelerators to date that we are aware of, including the first results on H100 about its numeric features.
•

Our tests indicate that AMD MI250X is closer to CPU behavior, thus perhaps requiring fewer changes during CPU-to-GPU porting.

Roadmap: We provide self-contained background crucial to understand the testing approaches in this paper (§II). The main technical part of this paper is higher-level feature testing as applied to matrix accelerators (§III). The extent to which feature differences caused the results of a basic matrix multiplication routine to jump across three values on five GPUs is then presented (§V). How our tests characterized three GPUs across five precision formats is summarized in a comprehensive results table (§IV). Conclusions follow (§VI).

II Background

Floating-point arithmetic is a vast domain, and our objective here is to provide a high level overview of facts crucial to understand how we designed our tests.

II-A Floating-point background

A floating-point number [14] $x=(s,e,m)$ consists of a single sign bit $s$ , a mantissa (also called significand) $m$ (of 23 bits) representing a value in the real interval $(0,2)$ and an exponent $e$ (of 8 bits, typically presented as a biased integer). Regard $m$ and $e$ as the intended (i.e., ignore the bias in $e$ ) real-numbered. Then the value of the floating-point number is

\text{fp}\_\text{value}(x)=(-1)^{s}\cdot m\cdot 2^{e}

(see Table I for other pertinent details). Aiming for a unique and convenient representation, the mantissa $m$ remains in the range of $[1,2)$ whenever $e>e_{min}$ , and hence can be expressed as a fraction $1.(..23bits..)$ which is called the normalized representation. For cases where $e=e_{min}$ , the mantissa falls within the open interval $(0,1)$ , and then represent subnormal numbers.³³3Both $+0$ and $-0$ are supported, but neither is a subnormal number. However note that $ulp$ and half of $ulp$ are both normal numbers. We use $ulp$ as an abbreviation for units in the last place and represents $fp\_value(x)$ when $s=0,e=0$ and only the LSB of $m$ is set. The IEEE standard also details the specifications for 16-bit, 32-bit, and 64-bit floating-point numbers, which are referred as FP16, FP32, and FP64 respectively. With the rise in demand for less but acceptable precision in deep learning, Google introduced the brain-float 16⁴⁴4https://en.wikipedia.org/wiki/Bfloat16_floating-point_format format or BF16. Additionally, NVIDIA unveiled a custom format, notably TensorFloat32⁵⁵5https://blogs.nvidia.com/blog/2020/05/14/tensorfloat-32-precision-format/, tailored for matrix multiplication. This format optimizes for both precision and range, specifically for their tensor cores.

Behavioral Portability Issues due to Subnormals

Our feature-targeted tests include testing for subnormal support. To motivate reasons for such tests in a general context, consider two floating-point normal values $a$ and $b$ are close together but not individually equal to $0$ . Suppose we now have an expression $E_{1}=c/(a-b)$ where $E_{1}$ is some expression and $c$ is a normal number. If the result of $(a-b)$ is a subnormal number as per an infinite-precision calculation but the hardware does not provide support for subnormals, then the hardware turns the denominator into $0$ causing a division-by-zero exception.⁶⁶6Assuming that pertinent compiler flags are applied.

The first problem posed by this situation is that some GPUs do not have hardware traps for exceptions ( [15] confirms this for NVIDIA). Now if we have another expression $E_{2}$ similar to $E_{1}$ and we have $E_{1}/E_{2}$ ; then the resulting $\infty/\infty$ results in a NaN (“not a number”) exception—for which also GPUs lack adequate exception-trap** support in hardware. Also note that AMD provides some support for exception trap** [16], and thus porting from AMD to NVIDIA will turn the lack of subnormal support into a significant exception-behavior difference. Thus, a subnormal-targeted test (which we provide) can meaningfully distinguish between GPU behaviors.

Rounding Mechanisms in Floating-Point Arithmetic

Given the constraints of the floating-point format, rounding becomes essential when a value surpasses its bounds. The IEEE prescribes that rounding should emulate an intermediate result that is infinitely precise and possesses an unbounded range (this is called correct rounding). To realize this ideal, supplementary bits (guard or G, rounding or R, and sticky or S, collectively called “extra bits”, see Table II) are incorporated in the design of IEEE-compliant hardware, with G, R, and S having lower significance (in that order) than the mantissa least significant bit (LSB). These bits are set when operation results are normalized via a right shift (see below for an illustration). Higham [17] notes that a single additional bit will not consistently yield the same outcome as obtaining the precise result followed by rounding. However, incorporating a second guard bit and a third sticky bit (which is the logical OR of all bits that are shifted through the S position) permits correct rounding. To realize correct rounding [17], there are two requirements: (1) employ three extra precision bits, and (2) employ round-to nearest with ties to even.

TABLE II: Rounding Rules of FP Arithmetic. To read this table, first locate the GRS bits. Then decide the result sign and the rounding mode desired. Finally, for all but truncate, add the specified bit to the mantissa least significant bit (LSB). For truncate, set the LSB as per this value.

The three extra bits GRS where

(x\vee y)=1

Result sign

New value for mantissa LSB (add this bit to

m

’s LSB except for truncate it is assigned to LSB

Round

(tow.

+\infty

)

Round

down

(tow.

-\infty)

RTN-TE

Round to zero (truncate)

0xy

100

1

1

1xy

Description of Rounding For clarity, let us walk through a simple example which will help read the rest of this paper with more assurance:

•

Align: (If necessary), make the exponent of the two numbers to be added the same by right-shifting the number with the smaller exponent.
•

Operate, normalize, set extra bits: Perform the addition, and then normalize the result; specifically, if the result mantissa is 2 or more in value, bring it within $[1,2)$ by right-shifting the mantissa, suitably adjusting the exponent. This right shift sets through and sets the extra bits.
•

Round as per rules, normalize again if needed: Consult Table II to round or truncate.

An Example Consider an FP scheme with one bit mantissa and suppose the result after calculation is positive $1.1100$ in binary or 1.75 in decimal ( $GRS=100$ is attached at the end), and let $e=0$ . This cannot be represented using one mantissa bit, and so we must round. For RTN-TE⁷⁷7RTN-TE stands for ”Round to Nearest, Ties to Even,” which is a rounding method commonly used in floating-point arithmetic. When a number falls exactly halfway between two possible rounded values, this method rounds the number to the nearest even value. the mantissa LSB resulting in $10.0100$ . This needs normalization, and after that, the result is $1.0010$ (and $e=1$ )—i.e., 2 in decimal. The answer for truncate is 1.

Fused Multiply-Add (FMA) Operation

In contemporary computational architectures, certain machines incorporate hardware components specifically designed to facilitate the Fused Multiply-Add (FMA) operation. As per the IEEE 754 standards, this operation computes $c+(a\cdot b)$ by ensuring two pivotal conditions: (1) computation is performed as though it has infinite precision and an unbounded range, and (2) rounding is applied only once, after the completion of both ‘*’ and ‘+’. These are referred to as FMA conventions in this paper. Matrix multiplication, represented as $A\cdot B+C$ , can be conceptualized as a series of blocked Multiply-Add Operations. This operation is natively supported by GPU architectures from both AMD and NVIDIA, as elaborated in Section II-B. Given this context, we hypothesize these matrix accelerators also adhere to these FMA conventions.

II-B Matrix Acceleration

Implementing matrix operations efficiently benefits a plethora of numerical algorithms underlying HPC and ML [6]. Acknowledging this, NVIDIA and AMD have developed specialized compute units. NVIDIA’s Tensor Cores and AMD’s Matrix Cores are designed to optimize matrix operations, enhancing computational speed and efficiency. We use the neutral term matrix accelerator when referring to either. Matrix multiplication, represented by the equation $D=A\cdot B+C$ , is a foundational primitive in Linear Algebra (it is a BLAS level 3 operation). Equation 1 for all $i,j$ in the allowed range of matrix indices $1\dots Size$ governs the behavior of matrix accelerators:

d_{ij}=a_{i1}*b_{j1}+a_{i2}b_{2j}+...+a_{in}b_{nj}+c_{ij}.

(1)

II-C Block FMA

Existing public documentation on matrix accelerators [5, 6] indicates that they employ the so-called block FMA where the calculation in Equation 1 is achieved in parallel, essentially suffering rounding error comparable to doing one scalar FMA. In other words, if one unrolls Equation 1 into a serial loop and determines the rounding error, then it is clear that each add-multiply step can incur a half $ulp$ error in RTN-TE, thus making the worst-case error grow with $Size$ . This is avoided in block-FMA; we assume block-FMA in the rest of this paper.

II-D Coding Matrix Acceleration

There are mainly two ways to invoke matrix accelerators:

1.

Via High-Level APIs: One can make use of high-level C++ APIs such as nvcuda::wmma for NVIDIA and nrocmwmma:wmma for AMD. These APIs provide a structured and relatively user-friendly interface to interact with Tensor and Matrix cores, respectively.
2.

Intermediate-Level Assembly Manipulation: For those delving deeper into the architecture, direct interaction with matrix accelerators unit is feasible through specific instruction sets. Within the NVIDIA platform, this is achieved by the PTX instruction set wmma operations. In contrast, AMD offers compiler intrinsic instructions such as __builtin_amdgcn_mfma_ are tailored for matrix operations.

In our work, we employed the high-level API directly, abiding by all the requirements for its invocation such as meeting dimensionality restrictions published by manufacturers. To double-check that matrix accelerator units will be active during operation, we check the underlying code. For NVIDIA, the presence of HMMA/DMMA in the SASS code indicates that the Tensor Cores will be activated. For AMD, spotting MFMA in the LLVM intermediate representation indicates the use of their Matrix Computing Units ⁸⁸8NVIDIA’s official documentation highlights the roles of HMMA and DMMA operations in Tensor Core operations [18] Similarly, AMD describes the role of MFMA in their official documentation [16]. These documents also mention the conditions to be met before these units are activated. Leaving nothing to chance, we check for these instructions explicitly. .

III Numerical Behaviors of Matrix Accelerators

Refer to caption — Figure 1: Matrix Unit Testing Approach. For the matrix accelerator under test, all the properties shown on the right are checked (all but monotonicity) or implied (monotonicity) by our tests.

Matrix computing units are now an indispensable part of GPU usage in machine learning while also attacting considerable interest from HPC developers [6]. This work aims to close significant gaps in the official documentation of NVIDIA and AMD detailing their numerical behaviors by designing tests that highlight specific differences. Our overall testing plan is illustrated in Figure 1, and detailed in subquent sections. Given our coverage of close to a dozen high-level features, this section will be hierarchically organized where sections detailing specific tests may be skipped on first reading.

III-A High-Level Testing Plan

The overall goal of a matrix accelerator is to efficiently support the calculations in producing the $D$ matrix where $D=AB+C$ , with $A,B$ and $C$ also being matrices. Since all $D$ entries are calculated in an identical manner, it suffices to focus on how one particular entry, namely $d_{11}$ is calculated:

d_{11}=a_{11}b_{11}+a_{12}b_{21}+...+a_{1n}b_{n1}+c_{11}

(2)

We now focus on each aspect of Equation 2, discuss tests that target the discovery of the details hidden behind the following features (and these details are the ones that will help contrast various GPUs): The following tests will now be detailed in their own sections. An important contribution we make is to orchestrate these tests according to the order in the flow-chart in Figure 3 so that some cases are eliminated or concluded early, allowing later tests to discriminate cases without ambiguity.⁹⁹9It is important to reiterate that the features discovered by these tests are largely undocumented or hard to find. Our tests provide a “one-stop shop**” experience for quickly determining them and corroborating with documentation (that may change over time.)

T_si_no:: “subnormal in; normal out;” i.e., if a computation unit is fed subnormal inputs, can it handle it at the input (without zeroing it), and produce a normal output?
T_ni_so:: “normal in; subnormal out;” i.e., if a computation unit is provided normal inputs, and the computation resulting in a subnormal, then can this subnormal be output (or will it get zeroed)?
T_sa:: “subnormal accumulation ok;” i.e., if a set of subnormals are being added, is the accumulation successful (or is the output getting zeroed)?
T_1_bit:: “at least one extra bit;” i.e., is there at least one extra precision bit in the computation unit?
T_rnd_dir:: “rounding direction;” i.e., determine the rounding direction based on the test outcome: possible test outcomes are to say whether to zero (truncate), down (to $-\infty$ ), RTN-TE, or up (to $+\infty$ ) are happening (Figure 2).
T_3_bits_fin_rnd:: “three extra bits are provided, final rounding;” i.e., tests that locate if three extra precision bits are provided. It also determines the final rounding direction followed.
T_prod:: “product rounding direction;” similar to $T\_rnd\_dir$ except for the product terms during block FMA.
T_blk_fma_width:: “block FMA width;” i.e., what is the unit-width for the block FMA operations?
T_pres_extra_acc:: “preservation of extra bit during accumulation;” i.e., are the extra bits preserved during the accumulation of a block FMA unit.
T_acc_order:: “accumulation order control;” i.e., can we determine the accumulation order being followed by the block FMA during its accumulation stage?

III-B Details of Each Test

T_si_no, T_ni_so, and T_sa:

The objective here is to discern whether the matrix unit can handle subnormal numbers, both as input and output. The tests below will assign specific values to the right-hand side of Equation 2, assume full IEEE-compatible subnormal support, and check for the expected results under this assumption. The three tests are now detailed:

T_si_no: Initialize $a_{11}$ with an arbitrary subnormal number while ensuring that the product $a_{11}b_{11}$ yields a normal number; set all other input words in the $d_{11}$ equation to $0$ . Now check whether $d_{11}$ equals $a_{11}b_{11}$ ; if so, the check passes; else, $d_{11}$ is expected to emerge as zero, when the check fails.

T_ni_so: Initialize $a_{11}$ and $b_{11}$ with arbitrary normal numbers while ensuring that the product $a_{11}b_{11}$ is a subnormal number. Now examine whether $d_{11}$ is a subnormal value (“pass”) or emerges as zero (“fail”).

T_sa: Assign an arbitrary subnormal to $c_{11}$ while kee** all other inputs at zero.The test observes whether $d_{11}$ is this subnormal (“pass”) or $0$ (“fail”).

T_1_bit, T_rnd_dir, and T_3_bits_fin_rnd:

These tests follow the logic in Figure 2 for the first two tests, and Figure 4 for the RTN-TE case and Figure 5 for the round to zero case. The tests are now detailed. Note that the tests can set either $a_{12}b_{21}$ or $c_{11}$ to $2$ .

T_1_bit: We check the result of the operation $1-(2^{-1}\times ulp)$ to check if at least one extra bit is provided. Aligning $2^{-1}\times ulp$ to 1 in its binary representation necessitates a shift amount equivalent to the mantissa bit length plus one. Consequently, if the resultant value remains $1-(2^{-1}\times ulp)$ , it implies the existence of an extra bit in computations.

T_rnd_dir: Upon confirming the presence of an extra bit, the accumulator’s rounding behavior is assessed (Figure 2) .

T_3_bits_fin_rnd: We finally proceed to determine if three extra bits are present, and also determine the final rounding modes supported. Its logic and implementation are now discussed.¹⁰¹⁰10Details provided for the sake of completeness; reading can be postponed. The key aspect of our testing approach was the alignment of three bits during the accumulation process. The goal was to ascertain the effects of preserving one, two, or all three extra bits on the mantissa component. This nuanced behavior was attained via subtraction operations, the intricacies of which are detailed in Figures 4 and 5.

Building upon our preliminary understanding of the rounding direction, we embarked on a series of rigorous tests. The essence of these tests is encapsulated in Figures 4 and 5, which respectively illustrate the methodologies for rounding-to-nearest and rounding-to-zero modes.¹¹¹¹11We highlight these two modes given their adoption in matrix accelerators.

T_pres_extra_acc:

The fact that the extra bits are retained during block-FMA accumulation can be confirmed using the expression $1+2^{-1}\cdot ulp+2^{-1}\cdot ulp+2^{-1}\cdot ulp$ (by making the accumulation of a block-FMA perform this calculation).If intermediate accumulation steps maintain these extra bits, the ultimate result will be $1+ulp$ ; else it will emerge as $1$ .

T_acc_ord:

The significance of accumulation order control primarily arises in scenarios employing rounding to zero with the preservation of just one extra bit. Contrarily, when three extra bits are enabled (which facilitates a sticky bit with the rounding-to-nearest mode), the results remain consistent irrespective of the accumulation order. If only one extra bit is maintained, we must test to ascertain the accumulation order. For this, one can test all permutations of the terms in the equation $1+2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp$ . Given that only one extra bit is retained in the rounding to zero case, the precision associated with the terms $2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp+2^{-2}\cdot ulp$ tends to be lost, save for when it is computed first. That is, if we allow all the “small values” to add up first, then even with one extra bit, we will get the answer $1+ulp$ . In other words, ff we can externally control the order of reduction by assigning these terms to specific positions within an FMA unit, there exists a output yields $1+ulp$ and hence the reduction order is under user control; else not.

Data: Matrices

a

b

c

, and

d

a^{\prime}s

row’s length

K

1 Initialize all values in

a

b

c

to 0

c_{11}\leftarrow 1.

a_{11}\times b_{11}\leftarrow 2^{-1}\times\text{ulp}

for $i\leftarrow 1$ to $K$ do

2 if $i>1$ then

a_{1(i-1)}\times b_{(i-1)1}\leftarrow 0

(a_{1i}\times b_{i1})\leftarrow(2^{-1}\times\text{ulp})

5 Call wmma(

a

b

c

d

) if $d_{11}=1.$ then

6 break

8if index $<$ K then

min\_preserve\_uint\leftarrow index

10else

min\_preserve\_uint

is larger than

K

Algorithm 1 Test Minimum Unit for FMA property preservation. The idea is to assign a moving position the

2^{-1}ulp

value and when that position goes beyond the width of the block FMA, we get a

1

output. That index is the FMA block width.

T_blk_fma_width:

To determine the block size of the block FMA unit, we execute a test loop given in Figure 1. Within a single FMA unit, precision remains intact throughout computation, and rounding occurs only at the concluding bit position. The key idea realized by this test is to load-up a 1-bit at a pair of moving positions denoted by $a_{1i}$ and $b_{i1}$ such that $a_{1i}\times b\_{i1}$ is ensured to be half a $ulp$ ( $2^{-1}\times ulp$ ). We use the aforementioned equation and shift the last term, $2^{-1}\cdot ulp$ , across the matrix multiplication positions as illustrated. A loss of precision at Line 1 (the half $ulp$ vanishes) signals that the initialization occurred in a detached FMA unit where it suffers a rounding precision loss; the “end of a FMA unit” in effect gets detected when the final value is $1$ .

T_prod:

Tests for the rounding mode of the product are only performed for FP64 and FP32 inputs. Assuming each multiplier has an $n$ -bit mantissa, their products can occupy only $2n$ bits. Taking this into consideration, the input formats such as FP16, TF32, and BF16—which respectively possess 10, 10, and 7-bit mantissa bits—do not experience precision loss when operating within an FP32 environment where a 23-bit mantissa is used.

To check product rounding, we can employ the same methodology used for rounding mode assessment during accumulation (Figure 2). Specifically, for the product $a_{11}\cdot b_{11}$ , if we set one term to be $1+2\cdot ulp+ulp$ and the other as $1+2^{-2}$ , the exact result is $1+2^{-2}+3\cdot ulp+2^{-1}\cdot ulp+2^{-2}\cdot ulp$ , with a 110 suffix to the end of the mantissa bit. Similarly we can incorporate negative-number test scenarios (Figure 2) and referencing the rounding tests depicted there, we can deduce the rounding mode used in the multiplication operation.

IV Feature Test Results

TABLE III: Results of Analyzing Matrix Accelerators. Here, “FMA unit size” is the number of words considered before rounding and normalization are performed (“Block FMA Size”). Note that V100 only support FP16. Further, FP64 is not supported in MI100. Last column: ”Case 1” is for add/accumulate, and ”Case 2” refers to the product test T_prod. ✓is yes, and ✗is no.

Inputs

GPU

Subnormal

inputs

handled?

Subnormal

outputs

handled?

Extra bit

present?

How many?

Rounding

mode

exhibited

FMA

unit

width

Order

within one

FMA unit

is controllable?

Rounding mode for:

1. outputting FP16/BF16

(only for FP16/BF16

inputs)

2. product (only for

FP32/FP64 inputs)

FP16

V100

✓

truncate

✗

RTN-TE

A100

✓

truncate

✗

RTN-TE

H100

✓

\geq 2

truncate

\geq 16

✗

RTN-TE

MI100

✓

RTN-TE

{}^{*}

✗

RTN-TE

MI250X

✗

RTN-TE

N.A.

RTN-TE

BF16

A100

✓

truncate

✗

N.A.

{}^{**}

H100

✓

\geq 2

truncate

\geq 16

✗

RTN-TE

MI100

✓

RTN-TE

✗

RTN-TE

MI250X

✗

RTN-TE

N.A.

RTN-TE

TF32(NVIDIA) FP32(AMD)

A100

✓

RTN-TE

✗

N.A.

H100

✓

\geq 2

truncate

✗

N.A.

MI100

✓

RTN-TE

N.A.

RTN-TE

MI250X

✓

RTN-TE

N.A.

RTN-TE

FP64

A100

✓

RTN-TE

✗

RTN-TE

H100

✓

RTN-TE

✗

RTN-TE

MI250X

✓

RTN-TE

N.A.

RTN-TE

* RTN-TE = round to nearest and round to even when tie.

* A100 doesn’t support BF16 output.

Table III presents our final compilation of results obtained from testing various GPUs. We discuss the results in detail below:

Subnormal Supports All the GPUs tested support subnormal numbers for inputs and outputs, with the exception of FP16 and BF16 formats of MI250X which does not. It is important to note that the absence of subnormal support could lead to the risk of generating exceptions such as division by zero as mentioned in §II.

Extra Bits for Computation AMD GPUs consistently use three extra bits for precise rounding. In contrast, NVIDIA GPUs have evolved across generations: the V100 does not include any extra bits, the A100 includes one, and the H100 includes at least two extra bits¹²¹²12Due to our limited access to the H100, we can only test for more than 2 extra bits. We did not conduct further FMA unit width tests for the same reason. We can, however, easily expand our tests to include three extra bits.. For FP64 inputs, all GPUs incorporate an additional three bits.

Rounding Modes The chosen rounding mode is consistent across NVIDIA and AMD GPUs, with all models adhering to the chosen mode consistently across generations.

FMA Feature NVIDIA’s V100 has an FMA unit width of 4, and the A100 expands this to 8, as documented. The H100’s FMA unit width is suggested to be at least 16, a detail not officially confirmed. For TF32 inputs on NVIDIA GPUs, the FMA unit width is 4, which suits the 19-bit size of TF32. The AMD MI100 maintains FMA features with different widths for FP16 and BF16 inputs, but the MI250X lacks this feature. While FMA units can enhance accuracy, they may complicate the porting of CPU algorithms which do not typically support blocked FMA operations.

Rounding Mode for Outputting FP16 and BF16 We have examined the rounding mode used when GPUs output FP16 and BF16. All GPU models use the RTN-TE rounding mode. We hypothesize that the conversion to lower precision is performed after the computation at full precision.

Rounding Mode for Product For products involving FP32/FP64 inputs, all GPUs utilize the RTN-TE mode, demonstrating consistency in following IEEE floating-point arithmetic standards.

V Exhibiting Porting Danger in a Matrix Multiplication Routine

We now illustrate an example in which we perform a simple matrix multiplication to demonstrate how these subtle implementation difference in GPU architectures can vary the numerical outcomes. We analyze the matrix multiplication equation

D=\alpha\cdot A\cdot B+\beta\cdot C

with matrices $A$ and $B$ in FP16 format (sized $m\cdot k$ and $k\cdot n$ , respectively) and matrices $C$ and $D$ in FP32 format (both sized $m\cdot n$ ). Here, $\alpha=-1$ and $\beta=1$ , and we set the matrix dimensions to $m=n=k=2^{13}$ .

For matrix $C$ , $C_{ij}=2^{20}$ for all $i$ and $j$ . In matrix $A$ , $A_{i0}=2^{10}$ , $A_{ij}=2^{-2}$ for odd $j$ , and $A_{ij}=2^{-3}$ for even $j$ (except $j=0$ ). For matrix $B$ , $B_{0j}=2^{10}$ , with other $B_{ij}$ values set at $2^{-3}$ . In this scenario, each element of matrix $D$ is calculated as follows (note that all $D_{ij}$ will be the same):

	$\displaystyle D_{ij}$	$\displaystyle=-(A_{i0}\cdot B_{0j}+\sum_{j\%2=1}A_{ij}\cdot B_{ij}+\sum_{j\%2=% 0}^{j\neq 0}A_{ij}\cdot B_{ij})+C_{ij}$
		$\displaystyle=-(2^{10}\cdot 2^{10}-\sum_{2^{12}}2^{-2}\cdot 2^{-3}-\sum_{2^{12% }-1}2^{-3}\cdot 2^{-3})+2^{20}$
		$\displaystyle=2^{7}+2^{6}-2^{-6}\approx 191.99218$

Note that the terms $2^{-2}\cdot 2^{-3}$ and $2^{-3}\cdot 2^{-3}$ require bit shifts (25 bits or 26 bits) to align $2^{20}=2^{10}\cdot 2^{10}$ . Thus there would be precision loss for FP32 computation unit. This loss may vary depending on the number of extra bits preserved and the length of the FMA operation. This variation is what produces the sharp result-difference that we observed.

Specifically, we observed these (rather highly different) $D_{ij}$ values computed using a simple GEMM implementation on different GPUs for the very same $A$ , $B$ , and $C$ matrix inputs: $0$ on NVIDIA A100, V100, AMD MI250 and CPU; $255.875$ on AMD MI100; and $191.875$ on NVIDIA H100.

These discrepancies highlight the importance of understanding hardware-specific computational feature differences; ignoring these and porting across GPUs can vary results across this wide range.

Importance of This Pattern, Consequences: The pattern $D=C-A\cdot B$ (where $\alpha=-1$ and $\beta=1$ ) is closely related to trailing matrix updates $A_{i}=A_{i}-P_{i}T_{i}$ used in a mixed-precision GMRES (Generalized Minimal Residual Method) iterative refinement algorithm [19, 20]. This approach is embedded in cuSolvers¹³¹³13https://docs.nvidia.com/cuda/cusolver/index.html#cusolverirsrefinement-t. In general, a computation of the type $D=C-A\cdot B$ is part of a standard BLAS (Basic Linear Algebra Subprograms) level 3 family. It is ubiquitous in various numerical linear algebra computations. This observation spawns future directions discussed in §VI.

VI Concluding Remarks

Matrix accelerators are an important emerging component in the computational landscape, yet very little can be easily determined about their numerical behavior. We study five such accelerators in this paper (with many more available for use, but with even less documented). One can learn very little by testing these units on random inputs. We observe that by exploiting our understanding of basic IEEE floating-point semantics and whatever is published about these units (e.g., that they perform block FMA), we can devise tests that target many key attributes such as subnormal support, rounding modes chosen, the number of hidden bits, accumulation order control, and width of the basic FMA blocks. Manufacturers select these features largely based on the cost of implementation; for instance, supporting one extra bit is cheaper; and that makes more demanding rounding modes (e.g., round-to-nearest) impossible to attain. Yet manufacturers may have targeted these accelerators for their own set of priority applications. For instance, they may have designed a set of features that make machine learning fast and efficient. They might view it as the HPC developer’s “fault” for using such matrix accelerators to perform HPC. On the other hand, HPC programmers are unaware of many of these dangers such as six orders of magnitude difference in results—unless they are lucky to choose such input matrices. By designing focused feature-targeted tests, we help foresee pitfalls.

Our tests are not foolproof. All we can say is that we assume significant levels of symmetry in design. For instance, if a manufacturer changes the precision of the diagonal outputs $d_{11}$ , $d_{22}$ , …, $d_{NN}$ for some reason (say because they determine the Eigenvalues in some cases), then all bets are off! Our tests are still valuable in flagging certain porting decisions as dangerous, thus adding to the overall porting strategies.

We do have more findings than have been highlighted. For example, we have discerned that the tensor cores operates with a width of 8. This observation is in agreement with the details revealed in NVIDIA’s white paper [21] that the tensor cores’ dimensions are $8\times 4\times 8$ . Our tests also reveal that the hardware unit size for AMD GPUs, which is not documented as far as we know, is merely $1$ . This can perhaps help explain rounding errors that might be higher (or might point to a hardware implementation decision taken by AMD). However, this agrees with what CPUs also follow. Additionally, the rounding mode employed by the AMD matrix accelerator adheres to the IEEE 754 standard. This is significant as it suggests a good possibility of producing results on AMD Matrix Cores that are consistent with those computed on many CPUs.

Another inference from our tests is that our conclusions on monotonicity match the observations in [8] about extra bit requirements. We can show that NVIDIA Tensor Cores can violate monotonicity, and for AMD GPUs, because of 3 extra bits, monotoncity will be preserved.

One exciting direction triggered by the trailing matrix updates pattern is that many other such patterns may exist in one’s implementation of linear-algebra routines as well as core routines in other areas. This suggests develo** tests for patterns found in other application spaces. We would also like to work on generalizing our tests using formal methods [22, 23] which may allow groups to check their tests for consistency and overlaps.

References

[1] Ganesh Gopalakrishnan, Ignacio Laguna, Ang Li, Pavel Panchekha, Cindy Rubio-González, and Zachary Tatlock. Guarding numerics amidst rising heterogeneity. In 2021 IEEE/ACM 5th International Workshop on Software Correctness for HPC Applications, pages 9–15, 2021.
[2] NVIDIA. NVIDIA A100 Tensor Core GPU architecture. https://images.nvidia.com/aem-dam/en-zz/Solutions/data-center/nvidia-ampere-architecture-whitepaper.pdf, 2020.
[3] AMD. Amd cdna architecture. https://www.amd.com/system/files/documents/amd-cdna-whitepaper.pdf, 2020.
[4] Deepak Narayanan, Mohammad Shoeybi, Jared Casper, Patrick LeGresley, Mostofa Patwary, Vijay Anand Korthikanti, Dmitri Vainbrand, Prethvi Kashinkunti, Julie Bernauer, Bryan Catanzaro, Amar Phanishayee, and Matei Zaharia. Efficient large-scale language model training on gpu clusters using megatron-lm, 2021.
[5] Pierre Blanchard, Nicholas J Higham, Florent Lopez, Théo Mary, and Srikara Pranesh. Mixed Precision Block Fused Multiply-Add: Error Analysis and Application to GPU Tensor Cores. SIAM Journal on Scientific Computing, 2020.
[6] Jack Dongarra, Laura Grigori, and Nicholas Higham. Numerical algorithms for high-performance computational science. Phil. Trans. R. Soc. A.3782019006620190066, 2020. http://doi.org/10.1098/rsta.2019.0066.
[7] GOOGLE. Cloud tensor processing units. https://cloud.google.com/tpu/docs/tpus, 2022.
[8] Mantas Mikaitis. Monotonicity of multi-term floating-point adders, 2023.
[9] M. Fasiand N.J. Higham and M. Mikaitis and S.Pranesh. Numerical behavior of NVIDIA tensor cores. In PeerJ Comput Sci, February 2021.
[10] David Defour, Guillaume Hanrot, Vincent Lefèvre, Jean-Michel Muller, Nathalie Revol, and Paul Zimmermann. Proposal for a Standardization of Mathematical Function Implementation in Floating-Point Arithmetic. Research Report RR-5406, INRIA, 2004.
[11] Brian Gladman, Vincenzo Innocente, and Paul Zimmermann. Accuracy of mathematical functions in single, double, double extended, and quadruple precision, 2023.
[12] JaeHyuk Kwack, John Tramm, Colleen Bertoni, Yasaman Ghadar, Brian Homerding, Esteban Rangel, Christopher Knight, and Scott Parker. Evaluation of performance portability of applications and mini-apps across amd, intel and nvidia gpus. In 2021 International Workshop on Performance, Portability and Productivity in HPC (P3HPC), pages 45–56, 2021.
[13] Anton Rydahl, Joseph Huber, Ethan Luis Mcdonough, and Johannes Doerfert. Precision and performance analysis of c standard math library functions on gpus. In Proceedings of the SC’23 Workshops of The International Conference on High Performance Computing, Network, Storage, and Analysis, pages 892–903, 2023.
[14] Jean-Michel Muller, Nicolas Brunie, Florent de Dinechin, Claude-Pierre Jeannerod, Mioara Joldes, Vincent Lefvre, Guillaume Melquiond, Nathalie Revol, and Serge Torres. Handbook of Floating-Point Arithmetic. Birkhäuser Basel, 2nd edition, 2018.
[15] NVIDIA. Cuda floating point and ieee 754, 2024. https://docs.nvidia.com/cuda/floating-point/index.html.
[16] AMD. ”amd instinct mi100” instruction set architecture reference guide. https://www.amd.com/content/dam/amd/en/documents/instinct-tech-docs/instruction-set-architectures/instinct-mi100-cdna1-shader-instruction-set-architecture.pdf, 2020. Accessed: 2023-12-17.
[17] Nicholas J Higham. Accuracy and stability of numerical algorithms. SIAM, 2002.
[18] Nvidia. Cuda binary utilities. https://docs.nvidia.com/cuda/pdf/CUDA_Binary_Utilities.pdf, 2023. Accessed: 2023-12-17.
[19] Azzam Haidar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. Harnessing gpu tensor cores for fast fp16 arithmetic to speed up mixed-precision iterative refinement solvers. In SC18: International Conference for High Performance Computing, Networking, Storage and Analysis, pages 603–613. IEEE, 2018.
[20] Azzam Haidar, Harun Bayraktar, Stanimire Tomov, Jack Dongarra, and Nicholas J Higham. Mixed-precision iterative refinement using tensor cores on gpus to accelerate solution of linear systems. Proceedings of the Royal Society A, 476(2243):20200110, 2020.
[21] CUDA NVIDIA. Nvidia a100 tensor core gpu architecture. Volume 1.0: Whitepaper, Part, 1:82, 2020.
[22] Leonardo De Moura and Nikolaj Bjørner. Z3: An efficient smt solver. In International conference on Tools and Algorithms for the Construction and Analysis of Systems, pages 337–340. Springer, 2008.
[23] Sylvie Boldo, Jacques-Henri Jourdan, Xavier Leroy, and Guillaume Melquiond. Verified Compilation of Floating-Point Computations. Journal of Automated Reasoning (JAR), 54(2):135–163, 2015.