HLAT: High-quality Large Language Model Pre-trained on AWS Trainium

Haozheng Fan∗1, Hao Zhou∗2, Guangtai Huang1, Parameswaran Raman1, Xinwei Fu1, Gaurav Gupta2, Dhananjay Ram3, Yida Wang1, Jun Huan2
1Amazon Web Services, 2AWS AI Labs, 3AGI Foundations, Amazon
{fanhaozh, zhuha, guangtai, prraman, fuxinwe, gauravaz, radhna, wangyida, lukehuan}@amazon.com
Abstract.

Getting large language models (LLMs) to perform well on downstream tasks requires pre-training over trillions of tokens. This typically demands a large number of powerful computational devices, in addition to a stable distributed training framework to accelerate the training. The growing number of applications leveraging AI/ML has led to a scarcity of expensive conventional accelerators (such as GPUs), which creates a need for alternative specialized accelerators that are scalable and cost-efficient. AWS Trainium is the second-generation machine learning accelerator purpose-built for training large deep learning models. Its corresponding instance, Amazon EC2 trn1, is an alternative to GPU instances for LLM training. However, training LLMs with billions of parameters on trn1 is challenging due to its relatively nascent software ecosystem. In this paper, we showcase HLAT: a 7 billion parameter decoder-only LLM pre-trained using trn1 instances over 1.8 trillion tokens. The performance of HLAT is benchmarked against popular open-source baseline models including LLaMA and OpenLLaMA, which have been trained on NVIDIA GPUs and Google TPUs, respectively. On various evaluation tasks, we show that HLAT achieves model quality on par with the baselines. We also share best practices for using the Neuron Distributed Training Library (NDTL), a customized distributed training library for AWS Trainium, to achieve efficient training. Our work demonstrates that AWS Trainium powered by the NDTL is able to successfully pre-train state-of-the-art LLMs with high performance and cost-effectiveness.

LLMs, pre-training, AWS Trainium, distributed training
Both authors contributed equally to this research.

1. Introduction

Large language models (LLMs), based on the transformer architecture (Vaswani et al., 2017) and trained on massive text data, are the most recent breakthrough in artificial intelligence. They not only show remarkable capabilities in understanding and generating text (Li et al., 2022), but also offer immense potential across diverse downstream tasks, such as machine translation (Hendy et al., 2023), information retrieval (Zhu et al., 2023), code generation (Roziere et al., 2023), and so on (Zhao et al., 2023b).

Pre-training is the crucial first step in building LLMs because it lays the foundation for their impressive capabilities. It initializes the model with random weights and trains the model to convergence using tokens from a large text corpus. The training process is designed to be self-supervised. For decoder-only models, such as GPT (Brown et al., 2020) and LLaMA (Touvron et al., 2023a), the model is trained to predict the next token given a sequence of previous tokens. Given a large amount of training data, the model eventually learns everything from syntax and semantics to world knowledge and commonsense reasoning. Pre-training provides the raw material, namely the language skills and understanding, that facilitates subsequent fine-tuning for different downstream tasks.

Since pre-training requires a large amount of training data (trillions of tokens), it is highly demanding of computational resources. Advanced AI accelerators, such as AWS Trainium (https://aws.amazon.com/machine-learning/trainium), Google TPU (https://cloud.google.com/tpu), and NVIDIA A100/H100 GPUs (https://www.nvidia.com/en-us/data-center/a100; https://www.nvidia.com/en-us/data-center/h100), have been specifically designed for this kind of workload. These AI accelerators are often integrated with dedicated tensor processing units which offer fast matrix operations and high training throughput. They also have much larger on-chip memory (tens of GBs per accelerator) and high communication bandwidth (hundreds of Gbps) between accelerators across different machines, which allows pre-training of larger models with efficient hardware utilization.

Even with powerful AI accelerators, the sheer size and complexity of LLMs make it impractical to train them on a single device. In other words, a single accelerator simply does not have enough memory or processing power to handle the massive datasets, model parameters, and intricate calculations involved in LLM training. Practitioners rely on distributed training libraries (Zhao et al., 2023b) to orchestrate a number of accelerators to conduct the training together. Distributed libraries can shard the model parameters and optimizer states across multiple accelerators with different kinds of parallelism strategies. They also spread the workload across multiple machines, effectively tapping into a combined pool of resources, which makes it possible to train models with multiple billions of parameters. Additionally, by splitting the workload, distributed libraries significantly reduce the training time.

Although there have been many successful demonstrations of pre-training LLMs on conventional accelerators (GPUs and TPUs) using state-of-the-art distributed training libraries (Rasley et al., 2020; Shoeybi et al., 2020; FairScale authors, 2021; Zhao et al., 2023a), training LLMs with billions of parameters on AWS Trainium is still challenging. First, Trainium uses a relatively nascent software ecosystem, ranging from the runtime and compiler to the distributed training library. A training script developed for other accelerators needs to be adjusted to comply with the low-level APIs and operators supported by Trainium. Second, the training configurations that ensure stable convergence and optimal training throughput may also differ from those of other accelerators, e.g., the level of precision, the dimensions of 3D parallelism, and compiler flags, to mention a few. On the other hand, the Amazon EC2 trn1 instance, equipped with AWS Trainium accelerators, provides computation power comparable to the Amazon EC2 p4d instance, equipped with NVIDIA A100 40GB GPUs, at only about 60% of the price (details in Section 2). This makes it appealing to fully utilize the compute power of AWS Trainium for LLM pre-training.

In this paper, we provide, for the first time, an end-to-end LLM pre-training practice on top of AWS Trainium, demonstrating the effectiveness and efficiency of this accelerator. Specifically, we make the following contributions:

  • We pre-train HLAT (High-quality LLM pre-trained on AWS Trainium), a 7B model following the architecture described in (Touvron et al., 2023a, b), from scratch using a total training budget of 1.8 trillion tokens. The pre-training is performed on up to 64 Amazon EC2 trn1.32xlarge instances, totalling up to 1024 AWS Trainium accelerators.

  • We evaluate the pre-trained models on various evaluation tasks including commonsense reasoning, world knowledge, math, coding, etc. The results show that our pre-trained model provides comparable performance with 7B models trained on other AI accelerators and distributed training frameworks, including LLaMA-1, LLaMA-2, OpenLLaMA-1, and OpenLLaMA-2. We also evaluate the intermediate checkpoints during the training process for additional insights.

  • We provide some best practices for the pre-training process on AWS Trainium and the Neuron Distributed Training Library (NDTL) (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/neuronx-distributed/index.html), e.g., different sharding strategies, training precisions, and fault tolerance mechanisms. In addition, we design a novel dataloader for online sample packing which performs both tokenization and example packing during training. For pre-training on large datasets, our dataloader saves significant developer time and compute resources.

2. Background - Distributed Training on Trainium

AWS Trainium is the second-generation machine learning accelerator that AWS purpose-built for deep learning training. Each Trainium accelerator includes two NeuronCores. Each NeuronCore has 16 GB of high-bandwidth memory and delivers up to 95 TFLOPS of FP16/BF16 compute power. In this study, we trained our model on Amazon EC2 trn1.32xlarge instances: each instance is equipped with 16 Trainium accelerators and supports 800 Gbps intra-instance network bandwidth through NeuronLink. The aggregate compute power of Amazon EC2 trn1.32xlarge is 3040 TFLOPS in FP16/BF16, slightly higher than that of its GPU instance counterpart, Amazon EC2 p4d.24xlarge, at 2496 TFLOPS, but at a much lower price (on-demand hourly rate: trn1.32xlarge $21.50 vs. p4d.24xlarge $32.77).
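The aggregate compute and price figures quoted above can be cross-checked with a few lines of arithmetic. The sketch below uses only numbers stated in this section; the variable names are illustrative, and the TFLOPS values are peak figures, not measured training throughput.

```python
# Figures quoted in Section 2 (peak FP16/BF16 numbers and on-demand prices).
NEURONCORES_PER_TRAINIUM = 2
TFLOPS_PER_NEURONCORE = 95
TRAINIUM_PER_TRN1_32XLARGE = 16
TRN1_HOURLY_USD = 21.50

P4D_TFLOPS = 2496
P4D_HOURLY_USD = 32.77

# Aggregate peak compute of one trn1.32xlarge instance.
trn1_tflops = (
    TRAINIUM_PER_TRN1_32XLARGE * NEURONCORES_PER_TRAINIUM * TFLOPS_PER_NEURONCORE
)  # 3040, matching the figure in the text

# Relative price and peak compute per dollar-hour.
price_ratio = TRN1_HOURLY_USD / P4D_HOURLY_USD
trn1_tflops_per_usd = trn1_tflops / TRN1_HOURLY_USD
p4d_tflops_per_usd = P4D_TFLOPS / P4D_HOURLY_USD

print(f"trn1.32xlarge peak: {trn1_tflops} TFLOPS")
print(f"trn1 / p4d hourly price: {price_ratio:.2f}")
print(f"peak TFLOPS per USD-hour: trn1 {trn1_tflops_per_usd:.1f}, "
      f"p4d {p4d_tflops_per_usd:.1f}")
```

With the quoted on-demand rates, the price ratio comes out at roughly two thirds, in line with the approximate figure given in the introduction.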

AWS Neuron is a software development kit (SDK) (https://github.com/aws-neuron/aws-neuron-sdk) with a compiler, runtime, and profiling tools that unlocks high-performance and cost-effective deep learning acceleration on AWS Trainium. Neuron is natively integrated with PyTorch (Zhao et al., 2023a) and TensorFlow (Abadi et al., 2015), and offers features such as FP32 autocasting, stochastic rounding, collective communication, custom operators, and so on.

Neuron Distributed Training Library (NDTL), as part of Neuron SDK, is developed to support high-efficiency distributed training on Trainium:

  • NDTL supports a variety of existing distributed training techniques, such as 3D parallelism (Shoeybi et al., 2020), i.e., Tensor Parallelism (TP), Pipeline Parallelism (PP), and Data Parallelism (DP). To reduce the activation memory during training, activation checkpointing (Chen et al., 2016) and sequence parallelism (Korthikanti et al., 2022) are naturally supported with the 3D parallelism. NDTL also supports Zero Redundancy Optimizer Stage 1 (ZeRO-1) (Rajbhandari et al., 2020) to shard optimizer states. 3D parallelism and ZeRO-1 can be applied simultaneously during training.

  • NDTL provides unified interfaces to port users’ models and training scripts to Trainium accelerators. The NDTL interfaces are compatible with the Huggingface transformers library (Wolf et al., 2020). Training models from Huggingface transformers on Trainium accelerators with NDTL requires only simple code changes in certain layers of the model.

  • NDTL supports TP with mixed degrees, i.e., users can use more than one TP degree to shard different model parameters. TP with mixed degrees handles the case where some parts of the LLM are not compatible with a single large TP degree, e.g., when Grouped Query Attention (GQA) (Hudson and Manning, 2019) is applied, which prevents the use of a unified TP degree for the whole model.

  • NDTL supports automatic fault recovery and checkpointing. Checkpointing is optimized to save and load quickly on different machines at the same time, and can even save asynchronously. In case of hardware failures or communication timeouts, NDTL can automatically restart training from the latest saved checkpoint without manual intervention, which is critical for maintaining system uptime and stability.

3. Method

3.1. Model Architecture

HLAT adopts the decoder-only transformer architecture and applies the same modifications used in LLaMA 7B (Touvron et al., 2023a, b), including pre-normalization with RMSNorm, the SwiGLU activation function, and Rotary Embeddings. The model is trained with a maximum sequence length of 4096.

3.2. Training Hyperparameters

We adopt the same training hyperparameters as the LLaMA 7B (Touvron et al., 2023a, b) models. Specifically, we use a cosine learning rate scheduler with a maximum learning rate of $3 \times 10^{-4}$ and a minimum learning rate of $3 \times 10^{-5}$. We use a linear warmup of 2000 steps. The overall learning rate schedule is plotted in Figure 1. We use the AdamW optimizer with $\beta_1 = 0.9$ and $\beta_2 = 0.95$. We use a weight decay value of 0.1 for all parameters, including normalization weights. Gradient-norm clipping of 1.0 is applied for training stability.
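The learning rate schedule described above can be sketched as a small function. This is a minimal illustration, not the training code: the maximum and minimum learning rates and the 2000 warmup steps come from the text, while the function name, the exact cosine form, and the total step count (roughly 1.8T tokens divided by about 4M tokens per step) are assumptions.

```python
import math


def lr_at_step(
    step: int,
    max_lr: float = 3e-4,         # from the text
    min_lr: float = 3e-5,         # from the text
    warmup_steps: int = 2000,     # from the text
    total_steps: int = 430_000,   # assumption: ~1.8T tokens / ~4M tokens per step
) -> float:
    """Linear warmup to max_lr, then cosine decay down to min_lr."""
    if step < warmup_steps:
        # Linear ramp; reaches max_lr on the last warmup step.
        return max_lr * (step + 1) / warmup_steps
    # Fraction of the decay phase completed, clamped to [0, 1].
    progress = min(1.0, (step - warmup_steps) / max(1, total_steps - warmup_steps))
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (max_lr - min_lr) * cosine


for s in (0, 1999, 2000, 215_000, 430_000):
    print(s, f"{lr_at_step(s):.2e}")
```

The schedule peaks at the end of warmup and decays to the minimum learning rate at the final step, which is the qualitative shape described for Figure 1(d).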

We use 64 nodes with tensor parallel degree of 8, pipeline parallel degree of 1, and data parallel degree of 256. We use global batch size of 1024 sequences with maximum sequence length of 4096 tokens, so each step covers about 4 million tokens. We train for a total of 1.8 trillion tokens.
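The parallelism and batch figures above fit together as follows. The sketch assumes one training worker per NeuronCore (32 per trn1.32xlarge node, from Section 2), which is an inference from the stated degrees and not something the text says explicitly; all variable names are illustrative.

```python
# Cluster size (Section 2: 16 Trainium accelerators x 2 NeuronCores per node).
nodes = 64
neuroncores_per_node = 16 * 2
world_size = nodes * neuroncores_per_node  # assumption: one worker per NeuronCore

# Parallel degrees from the text.
tp_degree, pp_degree, dp_degree = 8, 1, 256
assert tp_degree * pp_degree * dp_degree == world_size

# Batch geometry from the text.
global_batch_sequences = 1024
max_seq_len = 4096
tokens_per_step = global_batch_sequences * max_seq_len       # about 4 million
sequences_per_dp_rank = global_batch_sequences // dp_degree  # per optimizer step

# Number of optimizer steps needed for the full token budget.
total_tokens = 1.8e12
total_steps = total_tokens / tokens_per_step

print(f"world size: {world_size}")
print(f"tokens per step: {tokens_per_step:,}")
print(f"sequences per data-parallel rank per step: {sequences_per_dp_rank}")
print(f"approximate optimizer steps: {total_steps:,.0f}")
```

Under these assumptions the run takes on the order of 430 thousand optimizer steps, so the 2000-step warmup covers well under 1% of training.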

3.3. Training Dataset and Dataloader

Our pre-training dataset includes a mix of data from various publicly available sources. It does not contain any data from Amazon or AWS customers, products, or services.

We designed a novel dataloader for online example packing to process our datasets for training. It performs both tokenization and packing online during training. In contrast to the offline packing approach, this saves considerable developer time and compute resources, especially for large datasets and for multiple experiments with different datasets and tokenizers. Note that online tokenization has no impact on training throughput: tokenization of future batches happens on the CPU during the forward-backward pass of the current batch on the Trainium devices, so the two computations run in parallel.

The online sample packing dataloader requires one or more dataset files in Apache Arrow format (Arrow, 2020). First, all samples in the dataset are shuffled randomly and split into subsets, one per data parallel rank. Each data split is treated as an independent data stream. For training efficiency, we use sample concatenation, i.e., if a sample is shorter than the maximum sequence length of the model, we concatenate it with the following sample(s) to form a sequence whose total length is equal to or greater than the maximum sequence length. Any leftover tokens from the current sample are used in the following sequence of the batch. The samples within a sequence are separated by a special end-of-sentence token. This gives the model the information necessary to infer that texts separated by the end-of-sentence token are unrelated (Brown et al., 2020). Note that the concatenated samples may come from very different sources or be of different formats (e.g., natural language and code). Finally, each batch of samples is tokenized on the fly.

This online dataloader can also save its internal state to a checkpoint, which enables us to restore the state from a loaded checkpoint and continue training in case of unexpected failures.
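The packing logic described above can be sketched as a Python generator. This is a simplified illustration under stated assumptions, not the authors' dataloader: `pack_samples`, the character-level toy tokenizer, and the EOS id are hypothetical, and shuffling, per-rank splitting, Arrow I/O, CPU prefetching, and state checkpointing are omitted.

```python
from typing import Callable, Iterable, Iterator, List


def pack_samples(
    samples: Iterable[str],
    tokenize: Callable[[str], List[int]],
    max_len: int,
    eos_id: int,
) -> Iterator[List[int]]:
    """Tokenize samples lazily and pack them into fixed-length sequences.

    Samples are joined with an end-of-sentence token; tokens that do not fit
    into the current sequence are carried over to the next one.
    """
    buffer: List[int] = []
    for text in samples:
        buffer.extend(tokenize(text))  # tokenization happens online, per sample
        buffer.append(eos_id)          # separator between unrelated samples
        while len(buffer) >= max_len:
            yield buffer[:max_len]
            buffer = buffer[max_len:]  # leftover tokens start the next sequence
    # Trailing tokens shorter than max_len are dropped in this sketch.


# Toy usage: a character-level "tokenizer" stands in for a real one.
def toy_tokenize(text: str) -> List[int]:
    return [ord(ch) for ch in text]


packed = list(pack_samples(["ab", "cde", "f"], toy_tokenize, max_len=4, eos_id=0))
print(packed)
```

Because the generator only looks at one sample at a time, the same loop can run on the host CPU ahead of the accelerator, which is the property the text relies on for hiding tokenization cost.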

3.4. Orchestration

For model training, we utilize a cluster of 64 trn1.32xlarge instances (nodes), totalling 1024 AWS Trainium accelerators. Accelerators within the same instance are connected with NeuronLink. The nodes within the cluster are interconnected through Elastic Fabric Adapter (EFA) (https://aws.amazon.com/hpc/efa/). EFA is a network interface with a custom-built operating system (OS) bypass hardware interface that significantly enhances performance for inter-node communications, a critical factor for collective operations in distributed training.

We manage the cluster orchestration using Amazon EKS (https://aws.amazon.com/eks/), a fully-managed Kubernetes service. Amazon EKS simplifies the deployment of Kubernetes both on AWS and in on-premises environments. It autonomously manages the availability and scalability of Kubernetes control plane nodes, which are essential for tasks such as container scheduling, application availability management, cluster data storage, and other fundamental operations.

3.5. Training Efficiency

The LLaMA (Touvron et al., 2023a) model uses several efficient implementation features for pre-training on GPUs, including the xformers library, activation checkpointing, model parallelism, and computation/communication overlapping. Similar features are also supported by Trainium and the Neuron SDK, along with some unique enhancements such as BF16 with stochastic rounding. Below, we list the key features and configurations used in our model pre-training to improve efficiency.

Model Parallelism: We adopt tensor parallelism (TP) to shard the model across 8 TP ranks, and sequence parallelism (SP) to shard the 4096-token sequence length across 8 SP ranks. We observed the highest throughput with this sharding configuration.

Selective Activation Checkpointing: We use selective activation checkpointing (Korthikanti et al., 2022) to improve the training efficiency. It has a slightly higher memory cost than full activation checkpointing, but increases the overall training throughput.

BF16 with Stochastic Rounding: Pre-training with full precision (FP32) is inefficient for large LLMs, but generic half-precision training (BF16 or FP16) often has numerical stability issues (Micikevicius et al., 2017). AWS Trainium features BF16 with stochastic rounding (SR) (Gupta et al., 2015), which is used in our training. Stochastic rounding, which theoretically provides an unbiased estimate of the input, mitigates the precision loss of BF16 computation by performing rounding operations in a probabilistic manner. The chances of rounding up or down are determined by the relative distance from the two nearest representable values. This allows small increments to accumulate over time, even when added to numbers of significantly higher magnitude, which leads to more precise gradient updates. Empirically, we found that BF16 with SR shows the same convergence behavior as mixed precision training (Micikevicius et al., 2017) for HLAT, with higher training throughput and a lower memory footprint.
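The accumulation effect described above can be reproduced with a small software emulation. The sketch below is illustrative only and does not reflect how Trainium implements stochastic rounding in hardware: it emulates BF16 by keeping the top 16 bits of an FP32 value, and all function names are hypothetical. Adding a uniform 16-bit random integer before truncation rounds up with probability equal to the discarded fraction, which makes the rounding unbiased.

```python
import random
import struct


def _f32_to_bits(x: float) -> int:
    return struct.unpack("<I", struct.pack("<f", x))[0]


def _bits_to_f32(bits: int) -> float:
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


def bf16_nearest(x: float) -> float:
    """Round an FP32 value to BF16 precision with round-to-nearest-even."""
    bits = _f32_to_bits(x)
    bias = 0x7FFF + ((bits >> 16) & 1)
    return _bits_to_f32((bits + bias) & 0xFFFF0000)


def bf16_stochastic(x: float, rng: random.Random) -> float:
    """Round an FP32 value to BF16 precision with stochastic rounding."""
    bits = _f32_to_bits(x)
    # A carry into the kept bits happens with probability (low 16 bits) / 2**16.
    return _bits_to_f32((bits + rng.getrandbits(16)) & 0xFFFF0000)


def accumulate(round_fn, steps: int = 1000, increment: float = 1e-3,
               start: float = 1.0) -> float:
    """Repeatedly add a small increment, rounding to BF16 after every add."""
    total = start
    for _ in range(steps):
        total = round_fn(total + increment)
    return total


rng = random.Random(0)
nearest_total = accumulate(bf16_nearest)
stochastic_total = accumulate(lambda v: bf16_stochastic(v, rng))

# The exact sum is 2.0. Round-to-nearest never leaves 1.0, because the
# increment is smaller than half the BF16 spacing near 1.0 (2**-7 / 2).
print("round-to-nearest:", nearest_total)
print("stochastic rounding:", stochastic_total)
```

With round-to-nearest the small updates are silently lost, whereas with stochastic rounding the accumulated value tracks the exact sum in expectation, which mirrors the argument made for weight updates in the text.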

Coalescing Layers with Same Inputs: We coalesce linear layers that share the same inputs to reduce the communication in tensor and sequence parallelism and to increase the efficiency of matrix operations. Specifically, the Q, K, and V layers in an attention block are coalesced, and the two linear projection layers in SwiGLU (Shazeer, 2020) are also coalesced.
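The idea behind coalescing can be illustrated on tiny matrices: concatenating the Q, K, and V weight matrices along the output dimension turns three matrix multiplications on the same input into one, and the result can be split back afterwards. The sketch uses plain Python lists to stay self-contained; the helper names are hypothetical and the example ignores sharding across TP ranks.

```python
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, w: Matrix) -> Matrix:
    """Multiply x of shape [n, d_in] by w of shape [d_in, d_out]."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def concat_columns(*weights: Matrix) -> Matrix:
    """Concatenate weight matrices along the output (column) dimension."""
    return [sum((w[i] for w in weights), []) for i in range(len(weights[0]))]


x = [[1.0, 2.0], [3.0, 4.0]]    # two tokens, hidden size 2
w_q = [[1.0, 0.0], [0.0, 1.0]]
w_k = [[2.0, 0.0], [0.0, 2.0]]
w_v = [[0.0, 1.0], [1.0, 0.0]]

# One fused projection instead of three separate ones.
fused = matmul(x, concat_columns(w_q, w_k, w_v))
q = [row[0:2] for row in fused]
k = [row[2:4] for row in fused]
v = [row[4:6] for row in fused]

print(q, k, v)
```

In a tensor-parallel setting, a single fused projection means the shared input is gathered once instead of once per layer, which is where the communication saving comes from.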

Constant Attention Mask: As a decoder-only model, HLAT pre-training uses a constant (lower-triangular) attention mask matrix. Instead of passing the attention mask as an input tensor during model training, AWS Trainium supports creating attention masks directly on the NeuronCores before use. This saves host memory, avoids redundant computation, and increases training throughput. To enable this feature in the training script, the attention mask tensor is defined directly in the attention block using the torch.triu function and mapped to device=’xla’. The Neuron compiler then enables the constant attention mask optimization during compilation.

Compiler Optimization: We use the compiler flag --distribution-strategy=llm-training to enable the compiler to perform optimizations applicable to LLM training runs that shard parameters, gradients, and optimizer states across data-parallel workers. We also use --model-type=transformer, which performs optimizations specific to transformer models. We set the Neuron environment variable NEURON_FUSE_SOFTMAX=1 to enable compiler optimizations on the custom lowering for the Softmax operation. Finally, we use NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS=3 to reduce training latency with asynchronous execution. This overlaps some executions of the accelerators and the host (CPU).
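A launcher script could apply these settings as environment variables before the Neuron runtime and compiler are initialized. The flag and variable values below are the ones named in the text; passing the compiler flags through the `NEURON_CC_FLAGS` environment variable is an assumption about how the flags are supplied and is not stated in the text.

```python
import os

# Compiler flags named in the text. Carrying them in NEURON_CC_FLAGS is an
# assumption about the launch setup.
os.environ["NEURON_CC_FLAGS"] = " ".join(
    [
        "--distribution-strategy=llm-training",
        "--model-type=transformer",
    ]
)

# Environment variables named in the text.
os.environ["NEURON_FUSE_SOFTMAX"] = "1"
os.environ["NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS"] = "3"

for name in (
    "NEURON_CC_FLAGS",
    "NEURON_FUSE_SOFTMAX",
    "NEURON_RT_ASYNC_EXEC_MAX_INFLIGHT_REQUESTS",
):
    print(f"{name}={os.environ[name]}")
```

Settings of this kind generally need to be in place before the first graph is compiled, so they are usually set at the very top of the training entry point or in the job's container environment.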

4. Training Process

4.1. Training Curves

Figure 1. HLAT training progress. (a) Training loss vs. number of tokens (in billions) seen by the model during training. (b) Gradient norm and (c) parameter norm vs. number of tokens. (d) Learning rate schedule vs. number of tokens. The warm-up lasts 2000 iterations (about 8 billion tokens, see Section 3.2).

During the training process, we monitor the training loss, as well as the $l_2$ norm of the gradients and the $l_2$ norm of the parameters, to debug training stability. Figure 1(a) shows the training loss over global batches, reduced over all data parallel ranks. The training loss decreases quickly for the initial $\sim$250B tokens and enters a log-linear decrease afterwards. Similar trends are observed in other LLM training runs (Touvron et al., 2023a, b; Geng and Liu, 2023).

In Figure 1(b), we show the gradient $l_2$ norm during training. Overall, the gradient norm is stable throughout training, without divergence. Note that gradient spikes are common in LLM pre-training when using layer normalization, or even RMSNorm (Takase et al., 2023), and are sometimes caused by overflow in low-precision formats such as 16-bit floats. Figure 1(b) shows a reassuring trend even with a 16-bit format such as BF16. Furthermore, we clip the gradient norm to 1.0 (see Section 3.2), which also helps stabilize its magnitude. Note that sustained spikes in the gradient norm lead to training divergence due to improper weight updates, even after gradient normalization through clipping. In Figure 2, we show that the gradient spikes often last for a single step and did not lead to training divergence. Specifically, we first track a running average ($r$) of the gradient norm over a window of 20 steps to smooth out the natural fluctuations due to batching. We define a gradient spike as occurring when the current gradient norm is higher than $r + 0.1$. Next, we track the number of steps it takes for the gradient norm to return below $r + 0.1$. In over 86% of cases, the spike deviates from the running average for only a single step.
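The spike analysis described above can be sketched as follows. The window of 20 steps and the margin of 0.1 come from the text; the function name is hypothetical, and the choice to exclude spike steps from the running average (so that $r$ stays fixed until the norm returns) is an assumption, since the text does not specify how the average is updated during a spike.

```python
from collections import deque
from typing import Iterable, List


def spike_durations(
    grad_norms: Iterable[float], window: int = 20, margin: float = 0.1
) -> List[int]:
    """Return the length, in steps, of each contiguous gradient-norm spike."""
    history: deque = deque(maxlen=window)  # recent non-spike gradient norms
    durations: List[int] = []
    current = 0
    for g in grad_norms:
        r = sum(history) / len(history) if history else g
        if g > r + margin:
            current += 1                   # still above the running average
        else:
            if current:
                durations.append(current)  # spike ended on the previous step
                current = 0
            history.append(g)              # assumption: only non-spike steps update r
    if current:
        durations.append(current)
    return durations


# Toy trace: a flat norm of 0.3 with one single-step and one two-step spike.
trace = [0.3] * 25 + [0.9] + [0.3] * 5 + [0.8, 0.7] + [0.3] * 5
durations = spike_durations(trace)
single_step_share = sum(1 for d in durations if d == 1) / len(durations)
print(durations, single_step_share)
```

A histogram of the returned durations corresponds to the kind of plot shown in Figure 2, and the share of length-one entries corresponds to the single-step percentage quoted in the text.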

Figure 2. Number of occurrences of sustained gradient spikes vs. contiguous length of appearance. In over 86% of cases, the spike lasts for only a single step.

Finally, we show the parameter $l_2$ norm in Figure 1(c). During the first $\sim$250B tokens, the parameter norm increases consistently. This phase coincides with the fast-decreasing phase of the training loss, in which the model parameters converge from random initialization to a structured distribution. After that, the parameter norm consistently decreases, since AdamW applies weight decay for regularization (Loshchilov and Hutter, 2019).

4.2. Hardware and System Failures

The pre-training process can be interrupted by hardware failures, communication timeouts, etc. (Borzunov et al., 2024). We monitor the node active time and the frequency of training interruptions. As listed in Table 1, the overall system uptime of HLAT pre-training is 98.81%. With the automatic fault recovery mechanism in NDTL (see Section 2), training recovers from failures quickly and automatically, without wasting much time. For comparison, we also list another experimental training run (over 600 billion tokens) without automatic fault recovery, for which we observed a system uptime about 21 percentage points lower.

Table 1. Percentage of system uptime with vs. without automatic fault recovery.
| With Fault Recovery | Without Fault Recovery |
|---|---|
| 98.81% | 77.83% |

4.3. Training Instability

We describe a few changes we made before and during the training process for convergence and training stability.

Initialization: We use a scaled initialization strategy for initializing model parameters. Specifically, the initial standard deviation of the output layers in attention blocks and MLP layers is scaled by $1/\sqrt{2l}$, where $l$ is the layer index. Similar to the discussion in (Takase et al., 2023), we found better numerical stability and convergence with smaller initial variance in deeper layers. In addition, all parameters are initialized on the CPU and then moved to Trainium.
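The depth-dependent scaling described above amounts to a one-line formula. In the sketch below, the $1/\sqrt{2l}$ factor is from the text, while the function name, the base standard deviation of 0.02, and the 1-based layer indexing are assumptions made for illustration.

```python
import math


def output_layer_init_std(layer_index: int, base_std: float = 0.02) -> float:
    """Initial std for attention/MLP output projections in layer `layer_index`.

    Applies the 1/sqrt(2*l) scaling from the text. The base std of 0.02 and
    the 1-based layer index are assumptions of this sketch.
    """
    if layer_index < 1:
        raise ValueError("layer_index is assumed to start at 1")
    return base_std / math.sqrt(2 * layer_index)


# Deeper layers get a smaller initial standard deviation.
for l in (1, 2, 8, 32):
    print(l, f"{output_layer_init_std(l):.5f}")
```

For a 32-layer model, the deepest layer starts with a standard deviation eight times smaller than the unscaled base value under these assumptions.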

Normalization: We use tensor parallelism to shard the model parameter matrices, except for the normalization layers. The normalization layer weights, however, are slightly different across TP ranks due to stochastic rounding. Empirically, we found that the differences are small, and the RMSNorm weight values are all close to 1. OLMo (Groeneveld et al., 2024) used a non-parametric layer norm, which is equivalent to all weights being equal to 1. Both HLAT and OLMo show quality similar to LLaMA, despite the differences in normalization weights.

Gradient Synchronization: Since the Neuron SDK uses a different underlying collective communication library from other accelerators, careful attention is needed when writing a distributed training library such as NDTL. Mis-referencing APIs may cause training instability.

Neuron Persistent Cache on Local Worker: In HLAT training, all instances share the same file system, using Amazon FSx (https://aws.amazon.com/fsx/), for storing data, checkpoints, logs, etc. However, we found that storing the Neuron Persistent Cache (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/general/arch/neuron-features/neuron-caching.html) on FSx may cause a communication bottleneck, because the cached graphs are frequently accessed by all Trainium devices in the cluster. Such a bottleneck may lead to communication timeouts and affect training stability. Therefore, we instead store the Neuron Persistent Cache in the file system of each local worker.

5. Evaluation

Baselines: We evaluate HLAT against several open-source baseline models. Since the structure of HLAT is similar to that of the LLaMA model, we include LLaMA-1 7B (Touvron et al., 2023a), LLaMA-2 7B (Touvron et al., 2023b), OpenLLaMA-1 7B, and OpenLLaMA-2 7B (Geng and Liu, 2023). The model architecture and the number of training tokens of the models being compared are listed in Table 2. The OpenLLaMA-1 model is trained on the RedPajama (Computer, 2023) dataset. The OpenLLaMA-2 model shares the same structure as OpenLLaMA-1, but is trained on a different data mixture which includes data from Falcon-RefinedWeb (Penedo et al., 2023), StarCoder (Li et al., 2023), and RedPajama (Computer, 2023).

Table 2. Comparison of model architectures and number of training tokens.
| Model Name | Size | Sequence length | Tokens |
|---|---|---|---|
| HLAT | 7B | 4096 | 1.8T |
| LLaMA-1 | 7B | 2048 | 1T |
| LLaMA-2 | 7B | 4096 | 2T |
| OpenLLaMA-1 | 7B | 2048 | 1T |
| OpenLLaMA-2 | 7B | 2048 | 1T |

Table 3. Evaluation of HLAT against 4 open-source models on 7 groups of tasks described in Section 5. For HLAT, the results are reported using the final (1.8T token) checkpoint.
| Task | Shots | Metric | OpenLLaMA-1 | OpenLLaMA-2 | LLaMA-1 | LLaMA-2 | HLAT |
|---|---|---|---|---|---|---|---|
| MMLU | 5 | accuracy | 30.552 (3.432) | 41.075 (3.611) | 35.1 | 45.3 | 41.318 (3.602) |
| BBH | 3 | multiple choice grade | 35.535 (1.864) | 35.502 (1.861) | 30.3 | 32.6 | 36.565 (1.845) |
| Commonsense Reasoning | 0 | accuracy | 55.587 (1.203) | 56.893 (1.195) | - | - | 56.152 (1.194) |
| Commonsense Reasoning | 0 | accuracy (norm) | 58.411 (1.201) | 61.262 (1.19) | 67.3* | 67.5* | 59.455 (1.206) |
| World Knowledge | 5 | exact match | 38.942 (0.532) | 37.023 (0.52) | 46.2* | 48.9* | 38.846 (0.534) |
| Reading Comprehension | 0 | accuracy | 70.459 (0.798) | 72.416 (0.782) | 76.5 | 77.4 | 72.508 (0.781) |
| Math | 8 | accuracy | 5.08 (0.605) | 5.231 (0.613) | 11.0 | 14.6 | 9.401 (0.804) |
| Code | 0 | pass@1 | 4.77 | 9.06 | 10.5 | 12.8 | 7.62 |
| Code | 0 | pass@10 | 12.83 | 23.58 | - | - | 19.83 |
| Code | 0 | pass@100 | 23.78 | 40.24 | 36.5 | 45.6 | 34.15 |

Evaluation Tasks: We evaluate the pre-trained model on 7 groups of tasks including both zero-shot and few-shot evaluations (Chang et al., 2023). We use Language Model Evaluation Harness (Gao et al., 2021) for natural language tasks, and use HumanEval (Chen et al., 2021) for coding tasks.

  • MMLU (Massive Multitask Language Understanding) (Hendrycks et al., 2021a, b) contains 57 test tasks, spanning STEM, humanities, social sciences, and other subjects. The difficulty ranges from elementary to professional levels. The breadth of the dataset makes it suitable for testing a model’s overall problem-solving and knowledge ability.

  • BIG-Bench Hard (BBH) (Suzgun et al., 2022) is a suite of 23 challenging BIG-Bench tasks (Srivastava et al., 2022) on which prior language models did not outperform the average human rater. It evaluates the model’s ability on challenging tasks.

  • Commonsense Reasoning consists of 6 commonly-used datasets: PIQA (Bisk et al., 2020), HellaSwag (Zellers et al., 2019), WinoGrande (Sakaguchi et al., 2021), ARC easy and challenge (Clark et al., 2018), and OpenBookQA (Geng and Liu, 2023). Those multi-choice tasks include carefully crafted riddles, puzzles, and scenarios designed to probe a model’s ability to leverage implicit knowledge, make logical inferences, and navigate the unsaid rules of our physical and social worlds.

  • World Knowledge includes NaturalQuestions (Kwiatkowski et al., 2019) and TriviaQA (Joshi et al., 2017). Both tasks are designed to test a model’s question-answering ability in a closed-book setting. The models are not provided with documents that may contain information about the question, and have to rely on information learnt or memorized from the pre-training data.

  • Reading Comprehension uses BoolQ (Clark et al., 2019) to test model’s open book comprehension ability. BoolQ is a question answering dataset for yes/no questions. Each example is a triplet of (question, passage, answer), with the title of the page as optional additional context. The model is required to answer the question based on the given context in passage.

  • Math ability is evaluated with GSM8K (Grade School Math 8K) (Cobbe et al., 2021). GSM8K contains 8,500 grade school math problems. Both problems and answers are provided in natural language. These problems take between 2 and 8 steps to solve, which is ideal for testing basic multi-step reasoning ability of the model.

  • Code evaluation uses the HumanEval (Chen et al., 2021) dataset, which includes 164 programming problems, each with a function signature, docstring, body, and several unit tests. The problems were handwritten to ensure that they are not included in the training sets of code generation models.
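The pass@k metrics reported for the Code task are commonly computed with the unbiased estimator introduced alongside HumanEval (Chen et al., 2021): for a problem with $n$ generated samples of which $c$ pass the unit tests, pass@k is $1 - \binom{n-c}{k} / \binom{n}{k}$. The sketch below implements that estimator; whether this exact estimator and which value of $n$ were used for the results in Table 3 is an assumption, as the text does not say.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of generated samples, c: number of samples passing all tests,
    k: evaluation budget. Computes 1 - C(n - c, k) / C(n, k).
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    if n - c < k:
        return 1.0  # every size-k subset contains at least one passing sample
    return 1.0 - comb(n - c, k) / comb(n, k)


# Example: 3 of 10 samples pass, so a single draw passes 30% of the time.
print(pass_at_k(n=10, c=3, k=1))
print(pass_at_k(n=200, c=1, k=100))
```

The benchmark-level score is the mean of this per-problem estimate over all 164 problems.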

5.1. Performance against Open-source Models

We compare the performance of HLAT, using the final (1.8T tokens) checkpoint, with other open-source baselines in Table 3. For OpenLLaMA-1 and OpenLLaMA-2, we use the available pre-trained weights and evaluate them using the same evaluation pipeline as for our pre-trained model. For LLaMA-1 and LLaMA-2, we directly use the results from the corresponding papers (Touvron et al., 2023a, b). For all evaluation tasks, we adopt the same hyper-parameters (number of shots) as used in (Touvron et al., 2023b). We include both the average performance and the variance (in parentheses, if available). All numbers are reported as percentages.

On MMLU tasks, HLAT performs better than OpenLLaMA-1 and LLaMA-1, and worse than LLaMA-2. As discussed in (Touvron et al., 2023a), this may be due to the difference in training datasets. We also compare the detailed performance of the models on each group of MMLU subjects, including STEM, humanities, social science, and others (see Table 5 in Appendix). The performance is slightly worse in STEM, but a similar trend is observed in OpenLLaMA-2. Compared with LLaMA-2, HLAT is comparable on STEM and humanities. The gaps are mainly in social science and others.

On BBH, Commonsense Reasoning, and World Knowledge tasks, HLAT performs similarly to the OpenLLaMA-1 and OpenLLaMA-2 models. Looking at the performance on each individual task (see Table 5 in Appendix), HLAT performs better in 19/29 tasks compared with OpenLLaMA-1, and in 15/29 tasks compared with OpenLLaMA-2. Both HLAT and the OpenLLaMA models have some gaps with the LLaMA-1 and LLaMA-2 models. This might be due to the use of different evaluation libraries and slight differences in datasets. Specifically, the gaps in Commonsense Reasoning and World Knowledge come mainly from OpenBookQA and TriviaQA, respectively. On TriviaQA, the results of the LLaMA models (Touvron et al., 2023b) are reported on the Wikipedia subset, whereas HLAT and OpenLLaMA are evaluated on the entire dataset.

On Math problems (GSM8K), HLAT performs significantly better than OpenLLaMA-1 and OpenLLaMA-2. As discussed in the next section, the math ability of HLAT improves substantially in the later training stages.

On Coding problems, HLAT performs better than OpenLLaMA-1, comparably to LLaMA-1, and worse than OpenLLaMA-2 and LLaMA-2. For OpenLLaMA-1, the tokenizer merges consecutive spaces, which negatively affects coding performance because it eliminates important information such as indentation and line breaks. This issue was subsequently fixed in OpenLLaMA-2, which explains its better performance. In addition, OpenLLaMA-2 is trained with additional code data from StarCoder, which also contributes to the performance improvement.

5.2. Intermediate Model Performance

Figure 3. Intermediate model performance with number of seen tokens. (a) Commonsense Reasoning, (b) Math, (c) World Knowledge. We observe different trends of learning curves for different tasks. Detailed results of other tasks are in Figure 5.

During the model training, we also evaluate the intermediate checkpoints every 200 billion tokens. On most benchmarks, the performance improves steadily, and correlates with the training loss of the model (see Figure 1). Detailed evaluation results are shown in Appendix B (Table 6 and Figure 5).

We found that the model converges at different rates on different tasks. Figure 3 plots the model performance on three sets of evaluation tasks with respect to the number of seen training tokens (in billions).

For Commonsense Reasoning, the model accuracy improves quickly at the beginning of training and starts to saturate at later training stages. This is similar to the trends observed in the training of other LLMs (Touvron et al., 2023a; Groeneveld et al., 2024).

However, for the Math task (GSM8K) shown in Figure 3, the learning curve shows an exponentially increasing trend. It increases very gradually for the initial $\sim$1 trillion tokens and begins to improve significantly during the later stages of training. Intuitively, this seems to indicate that the model is able to grasp more logical abilities after entering a relatively stable training plateau. We defer further research into this behavior to future work.

For the World Knowledge task shown in Figure 3, the performance increases almost linearly with the number of training tokens. Since this is a closed-book test that mainly evaluates the model’s ability to memorize facts in the pre-training data, the model seems to consistently improve in this domain with more training steps and epochs. In addition, we also tested whether the trend is related to the number of shots used in evaluation. The trends are very similar for zero-shot, 3-shot, and 5-shot tests.

These observations indicate the necessity of a set of evaluation tasks covering a wide range of domains for LLM pre-training. A single validation set, or evaluation tasks from narrow domains, may not comprehensively reflect the actual over- or under-fitting of the model on general downstream tasks.

5.3. Truthfulness and Bias

Table 4. Model Truthfulness and Bias evaluation.

| Dataset | Domain | Metric | OpenLLaMA-1 | OpenLLaMA-2 | LLaMA-1 | HLAT |
|---|---|---|---|---|---|---|
| CrowS-Pairs | age | pct_stereotype | 62.93 (4.997) | 57.399 (5.1) | 70.1 | 61.819 (5.002) |
| CrowS-Pairs | physical appearance | pct_stereotype | 66.667 (5.495) | 73.611 (5.177) | 77.8 | 62.5 (5.62) |
| CrowS-Pairs | race/color | pct_stereotype | 47.022 (2.226) | 50.282 (2.256) | 57.0 | 51.907 (2.271) |
| CrowS-Pairs | disability | pct_stereotype | 69.499 (5.702) | 66.48 (5.757) | 66.7 | 65.676 (5.893) |
| CrowS-Pairs | gender | pct_stereotype | 54.925 (2.756) | 57.887 (2.742) | 70.6 | 61.788 (2.696) |
| CrowS-Pairs | nationality | pct_stereotype | 48.969 (3.174) | 45.502 (3.174) | 64.2 | 51.273 (3.22) |
| CrowS-Pairs | sexual orientation | pct_stereotype | 77.207 (4.38) | 80.946 (4.1) | 81.0 | 82.607 (3.974) |
| CrowS-Pairs | religion | pct_stereotype | 72.652 (4.179) | 70.043 (4.257) | 79.0 | 68.723 (4.305) |
| CrowS-Pairs | socioeconomic | pct_stereotype | 68.649 (3.348) | 72.309 (3.225) | 71.5 | 66.88 (3.391) |
| TruthfulQA | - | multiple choice 1 | 23.133 (1.476) | 22.644 (1.465) | - | 23.623 (1.487) |
| TruthfulQA | - | multiple choice 2 | 35.141 (1.355) | 34.576 (1.348) | - | 37.188 (1.349) |

We report the model’s truthfulness and bias using TruthfulQA (Lin et al., 2022) and CrowS-Pairs (Nangia et al., 2020). TruthfulQA presents a collection of meticulously crafted questions spanning diverse domains such as health, law, finance, and even politics. These queries deliberately target areas where human intuition and personal biases can lead to incorrect responses, and measure an LLM’s resistance to misinformed or erroneous knowledge. CrowS-Pairs is a benchmark designed to probe LLMs for social biases across nine categories: gender, religion, race/color, sexual orientation, age, nationality, disability, physical appearance, and socioeconomic status. Each example is composed of a stereotype and an anti-stereotype.

We present the results in Table 4 with zero-shot inference. For TruthfulQA, we measure the multiple-choice score, where a higher score indicates better truthfulness. For CrowS-Pairs, the metric is the percentage of examples for which the model chooses the stereotypical answer, so a lower score indicates smaller bias. Overall, HLAT performs similarly to other open-source models.
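As a rough illustration of the CrowS-Pairs metric, the following Python sketch computes pct_stereotype from per-pair sentence scores. The function name, the input format, and the example scores are hypothetical; the sketch assumes each pair has already been scored by the model with a log-likelihood, and it is not the evaluation pipeline used in this paper.

```python
def pct_stereotype(pairs):
    """Percentage of pairs where the stereotypical sentence scores higher.

    pairs: list of (score_stereo, score_anti) tuples, where each score is a
           model log-likelihood for the corresponding sentence (assumed input).
    """
    if not pairs:
        raise ValueError("need at least one sentence pair")
    # Count the pairs in which the model prefers the stereotypical sentence.
    preferred = sum(1 for stereo, anti in pairs if stereo > anti)
    return 100.0 * preferred / len(pairs)


# Hypothetical scores for four sentence pairs (not real model outputs).
example_pairs = [(-12.3, -13.1), (-20.5, -19.8), (-8.4, -9.0), (-15.2, -15.9)]
score = pct_stereotype(example_pairs)
print(f"pct_stereotype = {score:.1f}")
```

In this toy input the stereotypical sentence is preferred in three of the four pairs.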

5.4. Efficiency and Scalability

We describe the training efficiency in terms of cost per 4 million tokens (CPT) and the scalability reported in (N.A., [n. d.]). For comparison, the GPU baseline is established using p4d.24xlarge instances and the NeMo 23.08 (Harper et al., [n. d.]) software stack (available inside the NeMo Docker container with tag 23.08). Figure 4 plots the normalized CPT of training on Trainium as the number of nodes scales. For reference, we include both a 7B model and a 70B model (with an architecture similar to LLaMA-70B (Touvron et al., 2023b)). The Trainium CPT is normalized such that the CPT of the GPU baseline is 100%. On 64 nodes, training the HLAT 7B model costs about 54% of the GPU baseline.

Figure 4. Normalized cost per 4 million tokens (CPT) for 7B and 70B models on AWS Trainium with varying numbers of nodes. The CPT of the GPU baseline is normalized to 100%. The 70B model ran out of memory on 4 nodes.
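To make the normalization concrete, the sketch below shows one plausible way to compute a CPT value and normalize it against a baseline. It assumes that cost is the hourly instance price multiplied by the time needed to process 4 million tokens at a measured cluster throughput; this formula, the function names, and all numeric inputs are illustrative assumptions, not the prices or throughputs behind Figure 4.

```python
TOKENS_PER_UNIT = 4_000_000  # CPT is defined per 4 million tokens


def cost_per_4m_tokens(num_nodes, price_per_node_hour, tokens_per_second):
    """Cost of processing 4 million tokens on a cluster (assumed formula)."""
    seconds = TOKENS_PER_UNIT / tokens_per_second
    return num_nodes * price_per_node_hour * seconds / 3600.0


def normalized_cpt(cpt, baseline_cpt):
    """Express a CPT as a percentage of the baseline CPT."""
    return 100.0 * cpt / baseline_cpt


# Placeholder inputs chosen for easy arithmetic; they are not real prices
# or measured throughputs.
baseline = cost_per_4m_tokens(num_nodes=64, price_per_node_hour=20.0,
                              tokens_per_second=1_000_000)
candidate = cost_per_4m_tokens(num_nodes=64, price_per_node_hour=10.0,
                               tokens_per_second=1_000_000)
relative = normalized_cpt(candidate, baseline)
print(f"normalized CPT = {relative:.1f}%")
```

With these placeholder inputs the candidate has the same throughput at half the hourly price, so its normalized CPT is 50% of the baseline.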

5.5. Model Limitation

We note some limitations of HLAT in this section. Like other LLMs, HLAT suffers from limitations such as hallucinations, potentially non-factual generations, biases, and toxicity (Zhang et al., 2023). For example, although comparable with other open-source pre-trained models, the bias of HLAT is still relatively high in some categories, such as sexual orientation, physical appearance, religion, and socioeconomic status (see Table 4). This is partially due to the use of publicly available datasets. More importantly, as a pre-trained model, HLAT has not gone through supervised fine-tuning or human preference alignment. Those fine-tuning methods have been shown to alleviate some limitations of pre-trained LLMs (Touvron et al., 2023b). Another limitation is that our training stopped after 1.8 trillion tokens. As suggested by Figure 3, HLAT may be able to further improve on certain tasks, such as math and world knowledge, with more training tokens.

6. Best Practices & Future Directions

In this section, we share some best practices we observed for training on AWS Trainium, and raise open questions for future research.

Parallelism: NDTL supports tensor parallelism (TP) with a degree of up to 32, as well as pipeline parallelism (PP). For a 7B model, we found that the combination of TP=8 and PP=1 provides the highest training throughput. However, for models of other sizes and architectures, the optimal parallelism configuration could vary. To achieve the highest training throughput, the parallelism configuration needs to be jointly optimized with the choice of activation checkpointing method, gradient accumulation steps, and training precision, to avoid running out of memory on Trainium.
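The relationship between the parallelism degrees can be sketched as follows. The helper below is a hypothetical illustration, not part of NDTL; it assumes the usual factorization of the total number of workers into TP, PP, and data-parallel (DP) degrees, and it assumes 32 workers (NeuronCores) per trn1.32xlarge node.

```python
def data_parallel_degree(num_nodes, workers_per_node, tp, pp):
    """Derive the data-parallel degree from the cluster size and TP/PP degrees."""
    world_size = num_nodes * workers_per_node
    model_parallel_size = tp * pp
    if world_size % model_parallel_size != 0:
        raise ValueError("world size must be divisible by TP * PP")
    return world_size // model_parallel_size


# 64 nodes as reported for HLAT; 32 workers per node is an assumption.
dp = data_parallel_degree(num_nodes=64, workers_per_node=32, tp=8, pp=1)
print(f"DP degree = {dp}")
```

Under these assumptions, TP=8 and PP=1 leave a data-parallel degree of 256; raising TP or PP shrinks the data-parallel degree and therefore changes the per-replica batch and memory budget.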

Training Precision: NDTL supports various training precision settings, including full precision (FP32), AMP BF16 mixed precision, BF16 with and without SR, etc. Full precision training is often infeasible memory-wise for multi-billion-parameter LLMs. We compared three training strategies for HLAT: pure BF16, BF16 with SR, and mixed precision training. Empirically, we found that the training loss of pure BF16 diverges, while BF16 with SR and mixed precision achieve similar training loss. We finally chose BF16 with SR for its higher throughput. However, for models of other sizes and architectures, BF16 with SR may not guarantee the same convergence as mixed precision. Usually, divergence can be observed within the first few thousand steps.
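One way to see why SR can help is to emulate it in software. The NumPy sketch below assumes that SR denotes stochastic rounding (as in Gupta et al., 2015) and emulates rounding FP32 values to BF16 precision by manipulating the low 16 bits. It illustrates the general technique only, not how Trainium or NDTL implements it; the function names are hypothetical, and NaN or infinite inputs are not handled.

```python
import numpy as np

BF16_MASK = np.uint32(0xFFFF0000)  # keep sign, exponent, and 7 mantissa bits


def to_bf16_nearest(x):
    """Round FP32 values to BF16 precision with round-to-nearest-even."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    bias = ((bits >> 16) & 1) + np.uint32(0x7FFF)
    return ((bits + bias) & BF16_MASK).view(np.float32)


def to_bf16_stochastic(x, rng):
    """Round FP32 values to BF16 precision with stochastic rounding."""
    bits = np.asarray(x, dtype=np.float32).view(np.uint32)
    # Adding uniform noise to the 16 discarded bits rounds the magnitude up
    # with probability equal to the discarded fraction.
    noise = rng.integers(0, 1 << 16, size=bits.shape, dtype=np.uint32)
    return ((bits + noise) & BF16_MASK).view(np.float32)


rng = np.random.default_rng(0)
# A weight of 1.0 plus an update of 1e-3, which is below half the BF16
# spacing near 1.0 (2**-7 = 0.0078125).
updated = np.full(100_000, 1.0 + 1e-3, dtype=np.float32)
nearest = to_bf16_nearest(updated)
stochastic = to_bf16_stochastic(updated, rng)
print("nearest mean:   ", nearest.astype(np.float64).mean())
print("stochastic mean:", stochastic.astype(np.float64).mean())
```

Round-to-nearest drops the small update every time, whereas stochastic rounding keeps it on average, which is consistent with pure BF16 diverging while BF16 with SR tracks mixed precision.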

Choice of $\beta_2$: We observed that using $\beta_2 = 0.99$ causes training instability and slower convergence. This is related to the choice of BF16 with SR as the training precision. A large $\beta_2$ fails to capture the gradient explosion at the current and recent steps, and hence does not effectively reduce the gradients when a gradient explosion occurs. Switching to $\beta_2 = 0.95$ addresses this problem.
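The effect of the second-moment decay rate on a gradient spike can be illustrated with a small calculation. The sketch below is a deliberate simplification of an Adam-style update: it ignores the first-moment average, bias correction, and the epsilon term, and it assumes the second-moment estimate has converged to the squared steady-state gradient before the spike. The function name and the spike size are hypothetical.

```python
import math


def normalized_spike_update(beta2, spike_ratio, steady_grad=1.0):
    """Update magnitude g / sqrt(v) at the step where the gradient spikes."""
    v_prev = steady_grad ** 2          # second moment before the spike
    g = spike_ratio * steady_grad      # spiking gradient
    v_new = beta2 * v_prev + (1.0 - beta2) * g ** 2
    return g / math.sqrt(v_new)


results = {}
for beta2 in (0.99, 0.95):
    results[beta2] = normalized_spike_update(beta2, spike_ratio=100.0)
    print(f"beta2={beta2}: normalized update = {results[beta2]:.2f}")
```

For a 100x spike, the normalized update is roughly 9.95 with a decay rate of 0.99 and roughly 4.47 with 0.95, so the smaller value damps the spike more strongly in this simplified setting.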

Weight decay: We applied weight decay to all layers. Conventionally, weight decay is not applied to normalization and bias layers (Devlin et al., 2019). In our experiments, we did not find much difference in performance between the two methods.
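The two weight-decay schemes can be expressed as optimizer parameter groups. The framework-agnostic sketch below is hypothetical: the parameter names, the shapes, the decay value of 0.1, and the heuristic of treating 1-D parameters as bias or normalization parameters are all assumptions made for illustration.

```python
def build_weight_decay_groups(param_shapes, weight_decay, decay_all=False):
    """Split parameter names into decayed and non-decayed groups.

    param_shapes: dict mapping parameter names to shape tuples (hypothetical).
    decay_all:    if True, apply weight decay to every parameter.
    """
    decay, no_decay = [], []
    for name, shape in param_shapes.items():
        # Heuristic: biases and normalization scales are 1-D tensors.
        is_bias_or_norm = len(shape) <= 1
        if decay_all or not is_bias_or_norm:
            decay.append(name)
        else:
            no_decay.append(name)
    return [
        {"params": decay, "weight_decay": weight_decay},
        {"params": no_decay, "weight_decay": 0.0},
    ]


toy_shapes = {
    "layers.0.attention.wq.weight": (4096, 4096),
    "layers.0.attention_norm.weight": (4096,),
    "layers.0.mlp.bias": (4096,),
}
conventional = build_weight_decay_groups(toy_shapes, weight_decay=0.1)
all_layers = build_weight_decay_groups(toy_shapes, weight_decay=0.1,
                                       decay_all=True)
print(conventional)
print(all_layers)
```

The `decay_all=True` call corresponds to applying weight decay to all layers, while the default corresponds to the conventional exclusion of normalization and bias parameters.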

Dataloader: The dataloader is another key factor in model performance. We compared our dataloader described in Section 3.3 with the NeMo dataloader (Harper et al., [n. d.]) and observed better performance with ours. The NeMo dataloader tends to overfit due to repetition within sequences.

Pre-compilation: Trainium requires pre-compiling the scripts to graphs. The compilation takes some time, especially for large models. Debugging training scripts (e.g., printing out intermediate tensors) may require re-compilation. Instead of directly developing on a large model, we found it more efficient to develop and test on a smaller model and scale up afterwards.

7. Related Work

LLM pre-training: After the Transformer architecture (Vaswani et al., 2017) was introduced, BERT (Devlin et al., 2019) was proposed to pre-train a language model on a large corpus of unlabeled data. Following the success of the BERT model on various NLP tasks, many pre-trained language models were later introduced with different architectures and training methods, such as GPT-2 (Radford et al., 2019), RoBERTa (Liu et al., 2019), BART (Lewis et al., 2020), and so on (Zhao et al., 2023b). Later studies observed significant performance improvements in language models from increasing the model size and the amount of training data (Hoffmann et al., 2022). Such abilities are further demonstrated in LLMs such as GPT-3 (Brown et al., 2020), PaLM (Anil et al., 2023), LLaMA (Touvron et al., 2023a), Falcon (Almazrouei et al., 2023), etc. Pre-trained on trillions of tokens, LLMs with tens or hundreds of billions of parameters show remarkable ability in generating creative text content, as well as in a variety of downstream tasks, such as question answering, summarization, machine translation, programming, etc. (Zhao et al., 2023b).

AI accelerators: Most models are trained on NVIDIA GPU accelerators, such as GPT (Brown et al., 2020; OpenAI, 2023) and LLaMA (Touvron et al., 2023a, b). Falcon-180B (Almazrouei et al., 2023) was trained on AWS SageMaker, with up to 4,096 A100 40GB GPUs using p4d instances. However, the landscape of hardware accelerators for deep learning training has blossomed in recent years, with established players like NVIDIA GPUs facing fierce competition from custom offerings like Google’s TPU and AWS Trainium. PaLM-2 (Anil et al., 2023) and OpenLLaMA (Geng and Liu, 2023) have demonstrated successful LLM pre-training on Google TPU. More recently, OLMo (Groeneveld et al., 2024), an open-source model developed by AI2, has two variants trained separately on AMD and NVIDIA GPUs. The two variants have nearly identical performance on their evaluation suite by 2T tokens. AWS Trainium is a machine learning accelerator developed for deep learning training with high performance and cost-competitiveness. Our work is the first demonstration of an end-to-end multi-billion-parameter LLM pre-trained on AWS Trainium. Ultimately, the optimal choice depends on the specific needs of the training task, with further research required to fully explore the potential of each accelerator and their possible convergence in future architectures.

8. Conclusion

In this paper, we pre-train HLAT, a 7 billion parameter large language model, using AWS Trainium over 1.8 trillion tokens. HLAT follows the decoder-only architecture and is trained with 64 AWS trn1.32xlarge instances. We evaluate the performance of HLAT against popular open-source baseline models, including LLaMA and OpenLLaMA, on a variety of popular benchmarking tasks. We find that HLAT achieves model quality on par with these baseline models. This work demonstrates for the first time that AWS Trainium with NDTL is able to successfully pre-train high-quality LLMs with high efficiency and low cost.

References

  • Abadi et al. (2015) Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. 2015. TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems. https://www.tensorflow.org/ Software available from tensorflow.org.
  • Almazrouei et al. (2023) Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Mérouane Debbah, Étienne Goffinet, Daniel Hesslow, Julien Launay, Quentin Malartic, Daniele Mazzotta, Badreddine Noune, Baptiste Pannier, and Guilherme Penedo. 2023. The Falcon Series of Open Language Models. arXiv:2311.16867 [cs.CL]
  • Anil et al. (2023) Rohan Anil, Andrew M. Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, Eric Chu, Jonathan H. Clark, Laurent El Shafey, Yanping Huang, Kathy Meier-Hellstern, Gaurav Mishra, Erica Moreira, Mark Omernick, Kevin Robinson, Sebastian Ruder, Yi Tay, Kefan Xiao, Yuanzhong Xu, Yujing Zhang, Gustavo Hernandez Abrego, Junwhan Ahn, Jacob Austin, Paul Barham, Jan Botha, James Bradbury, Siddhartha Brahma, Kevin Brooks, Michele Catasta, Yong Cheng, Colin Cherry, Christopher A. Choquette-Choo, Aakanksha Chowdhery, Clément Crepy, Shachi Dave, Mostafa Dehghani, Sunipa Dev, Jacob Devlin, Mark Díaz, Nan Du, Ethan Dyer, Vlad Feinberg, Fangxiaoyu Feng, Vlad Fienber, Markus Freitag, Xavier Garcia, Sebastian Gehrmann, Lucas Gonzalez, Guy Gur-Ari, Steven Hand, Hadi Hashemi, Le Hou, Joshua Howland, Andrea Hu, Jeffrey Hui, Jeremy Hurwitz, Michael Isard, Abe Ittycheriah, Matthew Jagielski, Wenhao Jia, Kathleen Kenealy, Maxim Krikun, Sneha Kudugunta, Chang Lan, Katherine Lee, Benjamin Lee, Eric Li, Music Li, Wei Li, YaGuang Li, Jian Li, Hyeontaek Lim, Hanzhao Lin, Zhongtao Liu, Frederick Liu, Marcello Maggioni, Aroma Mahendru, Joshua Maynez, Vedant Misra, Maysam Moussalem, Zachary Nado, John Nham, Eric Ni, Andrew Nystrom, Alicia Parrish, Marie Pellat, Martin Polacek, Alex Polozov, Reiner Pope, Siyuan Qiao, Emily Reif, Bryan Richter, Parker Riley, Alex Castro Ros, Aurko Roy, Brennan Saeta, Rajkumar Samuel, Renee Shelby, Ambrose Slone, Daniel Smilkov, David R. So, Daniel Sohn, Simon Tokumine, Dasha Valter, Vijay Vasudevan, Kiran Vodrahalli, Xuezhi Wang, Pidong Wang, Zirui Wang, Tao Wang, John Wieting, Yuhuai Wu, Kelvin Xu, Yunhan Xu, Linting Xue, Pengcheng Yin, Jiahui Yu, Qiao Zhang, Steven Zheng, Ce Zheng, Weikang Zhou, Denny Zhou, Slav Petrov, and Yonghui Wu. 2023. PaLM 2 Technical Report. arXiv:2305.10403 [cs.CL]
  • Arrow (2020) Apache Arrow. 2020. Apache Arrow, a crosslanguage development platform for in-memory analytics. https://arrow.apache.org/.
  • Bisk et al. (2020) Yonatan Bisk, Rowan Zellers, Jianfeng Gao, Yejin Choi, et al. 2020. PIQA: Reasoning about physical commonsense in natural language. In Proceedings of the AAAI conference on artificial intelligence, Vol. 34. 7432–7439.
  • Borzunov et al. (2024) Alexander Borzunov, Max Ryabinin, Artem Chumachenko, Dmitry Baranchuk, Tim Dettmers, Younes Belkada, Pavel Samygin, and Colin A Raffel. 2024. Distributed Inference and Fine-tuning of Large Language Models Over The Internet. Advances in Neural Information Processing Systems 36 (2024).
  • Brown et al. (2020) Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott Gray, Benjamin Chess, Jack Clark, Christopher Berner, Sam McCandlish, Alec Radford, Ilya Sutskever, and Dario Amodei. 2020. Language Models are Few-Shot Learners. In Advances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin (Eds.), Vol. 33. Curran Associates, Inc., 1877–1901. https://proceedings.neurips.cc/paper_files/paper/2020/file/1457c0d6bfcb4967418bfb8ac142f64a-Paper.pdf
  • Chang et al. (2023) Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. 2023. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (2023).
  • Chen et al. (2021) Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, Alex Ray, Raul Puri, Gretchen Krueger, Michael Petrov, Heidy Khlaaf, Girish Sastry, Pamela Mishkin, Brooke Chan, Scott Gray, Nick Ryder, Mikhail Pavlov, Alethea Power, Lukasz Kaiser, Mohammad Bavarian, Clemens Winter, Philippe Tillet, Felipe Petroski Such, Dave Cummings, Matthias Plappert, Fotios Chantzis, Elizabeth Barnes, Ariel Herbert-Voss, William Hebgen Guss, Alex Nichol, Alex Paino, Nikolas Tezak, Jie Tang, Igor Babuschkin, Suchir Balaji, Shantanu Jain, William Saunders, Christopher Hesse, Andrew N. Carr, Jan Leike, Josh Achiam, Vedant Misra, Evan Morikawa, Alec Radford, Matthew Knight, Miles Brundage, Mira Murati, Katie Mayer, Peter Welinder, Bob McGrew, Dario Amodei, Sam McCandlish, Ilya Sutskever, and Wojciech Zaremba. 2021. Evaluating Large Language Models Trained on Code. (2021). arXiv:2107.03374 [cs.LG]
  • Chen et al. (2016) Tianqi Chen, Bing Xu, Chiyuan Zhang, and Carlos Guestrin. 2016. Training Deep Nets with Sublinear Memory Cost. arXiv:1604.06174 [cs.LG]
  • Clark et al. (2019) Christopher Clark, Kenton Lee, Ming-Wei Chang, Tom Kwiatkowski, Michael Collins, and Kristina Toutanova. 2019. BoolQ: Exploring the Surprising Difficulty of Natural Yes/No Questions. In NAACL.
  • Clark et al. (2018) Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. 2018. Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge. ArXiv abs/1803.05457 (2018). https://api.semanticscholar.org/CorpusID:3922816
  • Cobbe et al. (2021) Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, Christopher Hesse, and John Schulman. 2021. Training Verifiers to Solve Math Word Problems. arXiv preprint arXiv:2110.14168 (2021).
  • Computer (2023) Together Computer. 2023. RedPajama-Data: An Open Source Recipe to Reproduce LLaMA training dataset. https://github.com/togethercomputer/RedPajama-Data
  • Devlin et al. (2019) Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. 2019. BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), Jill Burstein, Christy Doran, and Thamar Solorio (Eds.). Association for Computational Linguistics, Minneapolis, Minnesota, 4171–4186. https://doi.org/10.18653/v1/N19-1423
  • FairScale authors (2021) FairScale authors. 2021. FairScale: A general purpose modular PyTorch library for high performance and large scale training. https://github.com/facebookresearch/fairscale.
  • Gao et al. (2021) Leo Gao, Jonathan Tow, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Kyle McDonell, Niklas Muennighoff, et al. 2021. A framework for few-shot language model evaluation. Version v0. 0.1. Sept (2021).
  • Geng and Liu (2023) Xinyang Geng and Hao Liu. 2023. OpenLLaMA: An Open Reproduction of LLaMA. https://github.com/openlm-research/open_llama
  • Groeneveld et al. (2024) Dirk Groeneveld, Iz Beltagy, Pete Walsh, Akshita Bhagia, Rodney Kinney, Oyvind Tafjord, Ananya Harsh Jha, Hamish Ivison, Ian Magnusson, Yizhong Wang, Shane Arora, David Atkinson, Russell Authur, Khyathi Raghavi Chandu, Arman Cohan, Jennifer Dumas, Yanai Elazar, Yuling Gu, Jack Hessel, Tushar Khot, William Merrill, Jacob Morrison, Niklas Muennighoff, Aakanksha Naik, Crystal Nam, Matthew E. Peters, Valentina Pyatkin, Abhilasha Ravichander, Dustin Schwenk, Saurabh Shah, Will Smith, Emma Strubell, Nishant Subramani, Mitchell Wortsman, Pradeep Dasigi, Nathan Lambert, Kyle Richardson, Luke Zettlemoyer, Jesse Dodge, Kyle Lo, Luca Soldaini, Noah A. Smith, and Hannaneh Hajishirzi. 2024. OLMo: Accelerating the Science of Language Models. arXiv:2402.00838 [cs.CL]
  • Gupta et al. (2015) Suyog Gupta, Ankur Agrawal, Kailash Gopalakrishnan, and Pritish Narayanan. 2015. Deep learning with limited numerical precision. In Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37 (Lille, France) (ICML’15). JMLR.org, 1737–1746.
  • Harper et al. ([n. d.]) Eric Harper, Somshubra Majumdar, Oleksii Kuchaiev, Li Jason, Yang Zhang, Evelina Bakhturina, Vahid Noroozi, Sandeep Subramanian, Koluguri Nithin, Huang Jocelyn, Fei Jia, Jagadeesh Balam, Xuesong Yang, Micha Livne, Yi Dong, Sean Naren, and Boris Ginsburg. [n. d.]. NeMo: a toolkit for Conversational AI and Large Language Models. https://github.com/NVIDIA/NeMo
  • Hendrycks et al. (2021a) Dan Hendrycks, Collin Burns, Steven Basart, Andrew Critch, Jerry Li, Dawn Song, and Jacob Steinhardt. 2021a. Aligning AI With Shared Human Values. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  • Hendrycks et al. (2021b) Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2021b. Measuring Massive Multitask Language Understanding. Proceedings of the International Conference on Learning Representations (ICLR) (2021).
  • Hendy et al. (2023) Amr Hendy, Mohamed Abdelrehim, Amr Sharaf, Vikas Raunak, Mohamed Gabr, Hitokazu Matsushita, Young Jin Kim, Mohamed Afify, and Hany Hassan Awadalla. 2023. How good are gpt models at machine translation? a comprehensive evaluation. arXiv preprint arXiv:2302.09210 (2023).
  • Hoffmann et al. (2022) Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, and Laurent Sifre. 2022. Training Compute-Optimal Large Language Models. arXiv:2203.15556 [cs.CL]
  • Hudson and Manning (2019) Drew A. Hudson and Christopher D. Manning. 2019. GQA: A New Dataset for Real-World Visual Reasoning and Compositional Question Answering. arXiv:1902.09506 [cs.CL]
  • Joshi et al. (2017) Mandar Joshi, Eunsol Choi, Daniel S Weld, and Luke Zettlemoyer. 2017. Triviaqa: A large scale distantly supervised challenge dataset for reading comprehension. arXiv preprint arXiv:1705.03551 (2017).
  • Korthikanti et al. (2022) Vijay Korthikanti, Jared Casper, Sangkug Lym, Lawrence McAfee, Michael Andersch, Mohammad Shoeybi, and Bryan Catanzaro. 2022. Reducing Activation Recomputation in Large Transformer Models. arXiv:2205.05198 [cs.LG]
  • Kwiatkowski et al. (2019) Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Matthew Kelcey, Jacob Devlin, Kenton Lee, Kristina N. Toutanova, Llion Jones, Ming-Wei Chang, Andrew Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. 2019. Natural Questions: a Benchmark for Question Answering Research. Transactions of the Association of Computational Linguistics (2019).
  • Lewis et al. (2020) Mike Lewis, Yinhan Liu, Naman Goyal, Marjan Ghazvininejad, Abdelrahman Mohamed, Omer Levy, Veselin Stoyanov, and Luke Zettlemoyer. 2020. BART: Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Dan Jurafsky, Joyce Chai, Natalie Schluter, and Joel Tetreault (Eds.). Association for Computational Linguistics, Online, 7871–7880. https://doi.org/10.18653/v1/2020.acl-main.703
  • Li et al. (2022) Junyi Li, Tianyi Tang, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. 2022. Pretrained language models for text generation: A survey. arXiv preprint arXiv:2201.05273 (2022).
  • Li et al. (2023) Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Akiki, Jia Li, Jenny Chim, Qian Liu, Evgenii Zheltonozhskii, Terry Yue Zhuo, Thomas Wang, Olivier Dehaene, Mishig Davaadorj, Joel Lamy-Poirier, João Monteiro, Oleh Shliazhko, Nicolas Gontier, Nicholas Meade, Armel Zebaze, Ming-Ho Yee, Logesh Kumar Umapathi, Jian Zhu, Benjamin Lipkin, Muhtasham Oblokulov, Zhiruo Wang, Rudra Murthy, Jason Stillerman, Siva Sankalp Patel, Dmitry Abulkhanov, Marco Zocca, Manan Dey, Zhihan Zhang, Nour Fahmy, Urvashi Bhattacharyya, Wenhao Yu, Swayam Singh, Sasha Luccioni, Paulo Villegas, Maxim Kunakov, Fedor Zhdanov, Manuel Romero, Tony Lee, Nadav Timor, Jennifer Ding, Claire Schlesinger, Hailey Schoelkopf, Jan Ebert, Tri Dao, Mayank Mishra, Alex Gu, Jennifer Robinson, Carolyn Jane Anderson, Brendan Dolan-Gavitt, Danish Contractor, Siva Reddy, Daniel Fried, Dzmitry Bahdanau, Yacine Jernite, Carlos Muñoz Ferrandis, Sean Hughes, Thomas Wolf, Arjun Guha, Leandro von Werra, and Harm de Vries. 2023. StarCoder: may the source be with you! (2023). arXiv:2305.06161 [cs.CL]
  • Lin et al. (2022) Stephanie Lin, Jacob Hilton, and Owain Evans. 2022. TruthfulQA: Measuring How Models Mimic Human Falsehoods. arXiv:2109.07958 [cs.CL]
  • Liu et al. (2019) Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. 2019. RoBERTa: A Robustly Optimized BERT Pretraining Approach. arXiv:1907.11692 [cs.CL]
  • Loshchilov and Hutter (2019) Ilya Loshchilov and Frank Hutter. 2019. Decoupled Weight Decay Regularization. In International Conference on Learning Representations. https://openreview.net/forum?id=Bkg6RiCqY7
  • Micikevicius et al. (2017) Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, et al. 2017. Mixed precision training. arXiv preprint arXiv:1710.03740 (2017).
  • N.A. ([n. d.]) N.A. [n. d.]. Paper under review.
  • Nangia et al. (2020) Nikita Nangia, Clara Vania, Rasika Bhalerao, and Samuel R. Bowman. 2020. CrowS-Pairs: A Challenge Dataset for Measuring Social Biases in Masked Language Models. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing. Association for Computational Linguistics, Online.
  • OpenAI (2023) OpenAI. 2023. GPT-4 Technical Report. ArXiv abs/2303.08774 (2023). https://arxiv.org/abs/2303.08774
  • Penedo et al. (2023) Guilherme Penedo, Quentin Malartic, Daniel Hesslow, Ruxandra Cojocaru, Alessandro Cappelli, Hamza Alobeidli, Baptiste Pannier, Ebtesam Almazrouei, and Julien Launay. 2023. The RefinedWeb dataset for Falcon LLM: outperforming curated corpora with web data, and web data only. arXiv preprint arXiv:2306.01116 (2023). https://arxiv.org/abs/2306.01116
  • Radford et al. (2019) Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. 2019. Language models are unsupervised multitask learners. OpenAI blog 1, 8 (2019), 9.
  • Rajbhandari et al. (2020) Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, and Yuxiong He. 2020. ZeRO: Memory Optimizations Toward Training Trillion Parameter Models. ArXiv. https://www.microsoft.com/en-us/research/publication/zero-memory-optimizations-toward-training-trillion-parameter-models/
  • Rasley et al. (2020) Jeff Rasley, Samyam Rajbhandari, Olatunji Ruwase, and Yuxiong He. 2020. Deepspeed: System optimizations enable training deep learning models with over 100 billion parameters. In Proceedings of the 26th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 3505–3506.
  • Roziere et al. (2023) Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Tal Remez, Jérémy Rapin, et al. 2023. Code llama: Open foundation models for code. arXiv preprint arXiv:2308.12950 (2023).
  • Sakaguchi et al. (2021) Keisuke Sakaguchi, Ronan Le Bras, Chandra Bhagavatula, and Yejin Choi. 2021. Winogrande: An adversarial winograd schema challenge at scale. Commun. ACM 64, 9 (2021), 99–106.
  • Shazeer (2020) Noam Shazeer. 2020. Glu variants improve transformer. arXiv preprint arXiv:2002.05202 (2020).
  • Shoeybi et al. (2020) Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. 2020. Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. arXiv:1909.08053 [cs.CL]
  • Srivastava et al. (2022) Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. 2022. Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615 (2022).
  • Suzgun et al. (2022) Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, and Jason Wei. 2022. Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them. arXiv preprint arXiv:2210.09261 (2022).
  • Takase et al. (2023) Sho Takase, Shun Kiyono, Sosuke Kobayashi, and Jun Suzuki. 2023. Spike No More: Stabilizing the Pre-training of Large Language Models. arXiv:2312.16903 [cs.CL]
  • Touvron et al. (2023a) Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. 2023a. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971 (2023).
  • Touvron et al. (2023b) Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. 2023b. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023).
  • Vaswani et al. (2017) Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In Advances in Neural Information Processing Systems. 5998–6008.
  • Wolf et al. (2020) Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. 2020. Transformers: State-of-the-Art Natural Language Processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations. Association for Computational Linguistics, Online, 38–45. https://www.aclweb.org/anthology/2020.emnlp-demos.6
  • Zellers et al. (2019) Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. 2019. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830 (2019).
  • Zhang et al. (2023) Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, Longyue Wang, Anh Tuan Luu, Wei Bi, Freda Shi, and Shuming Shi. 2023. Siren’s Song in the AI Ocean: A Survey on Hallucination in Large Language Models. arXiv preprint arXiv:2309.01219 (2023).
  • Zhao et al. (2023b) Wayne Xin Zhao, Kun Zhou, Junyi Li, Tianyi Tang, Xiaolei Wang, Yupeng Hou, Yingqian Min, Beichen Zhang, Junjie Zhang, Zican Dong, Yifan Du, Chen Yang, Yushuo Chen, Zhipeng Chen, Jinhao Jiang, Ruiyang Ren, Yifan Li, Xinyu Tang, Zikang Liu, Peiyu Liu, Jian-Yun Nie, and Ji-Rong Wen. 2023b. A Survey of Large Language Models. arXiv preprint arXiv:2303.18223 (2023). https://arxiv.org/abs/2303.18223
  • Zhao et al. (2023a) Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. 2023a. Pytorch FSDP: experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023).
  • Zhu et al. (2023) Yutao Zhu, Huaying Yuan, Shuting Wang, Jiongnan Liu, Wenhan Liu, Chenlong Deng, Zhicheng Dou, and Ji-Rong Wen. 2023. Large language models for information retrieval: A survey. arXiv preprint arXiv:2308.07107 (2023).

Appendix A Detailed Evaluation Results

In Table 5 we show the evaluation results on all individual tasks described in Section 5. The number of shots is the same as listed in Table 3.

Table 5. Evaluation results on all tasks of HLAT (using the final 1.8 trillion token checkpoint) against the open-source models OpenLLaMA-1 and OpenLLaMA-2.
| Task | Metric | HLAT | OpenLLaMA-1 | OpenLLaMA-2 |
| --- | --- | --- | --- | --- |
| MMLU_STEM | accuracy (norm) | 34.833 (3.895) | 28.5 (3.735) | 33.895 (3.869) |
| MMLU_humanities | accuracy (norm) | 43.632 (3.266) | 30.909 (3.099) | 42.328 (3.288) |
| MMLU_other (business, health, misc.) | accuracy (norm) | 43.691 (3.669) | 31.738 (3.463) | 44.766 (3.671) |
| MMLU_social sciences | accuracy (norm) | 45.773 (3.447) | 31.858 (3.302) | 46.179 (3.503) |
| bigbench_causal_judgement | multiple_choice_grade | 52.632 (3.632) | 56.316 (3.608) | 53.158 (3.63) |
| bigbench_date_understanding | multiple_choice_grade | 61.518 (2.536) | 58.537 (2.568) | 56.911 (2.581) |
| bigbench_disambiguation_qa | multiple_choice_grade | 35.271 (2.981) | 29.845 (2.854) | 36.434 (3.002) |
| bigbench_dyck_languages | multiple_choice_grade | 21.7 (1.304) | 17.5 (1.202) | 19.5 (1.254) |
| bigbench_formal_fallacies_syllogisms_negation | multiple_choice_grade | 50.592 (0.42) | 49.761 (0.42) | 49.965 (0.42) |
| bigbench_geometric_shapes | multiple_choice_grade | 15.32 (1.904) | 24.791 (2.282) | 18.663 (2.059) |
| bigbench_hyperbaton | multiple_choice_grade | 50.242 (0.224) | 49.358 (0.224) | 51.064 (0.224) |
| bigbench_logical_deduction_five_objects | multiple_choice_grade | 23.0 (1.884) | 28.4 (2.019) | 26.8 (1.983) |
| bigbench_logical_deduction_seven_objects | multiple_choice_grade | 17.714 (1.444) | 15.714 (1.377) | 19.0 (1.484) |
| bigbench_logical_deduction_three_objects | multiple_choice_grade | 36.333 (2.781) | 36.0 (2.776) | 39.333 (2.825) |
| bigbench_movie_recommendation | multiple_choice_grade | 63.2 (2.159) | 46.8 (2.234) | 44.6 (2.225) |
| bigbench_navigate | multiple_choice_grade | 51.9 (1.581) | 48.8 (1.581) | 50.3 (1.582) |
| bigbench_reasoning_about_colored_objects | multiple_choice_grade | 37.95 (1.085) | 35.1 (1.068) | 37.4 (1.082) |
| bigbench_ruin_names | multiple_choice_grade | 35.491 (2.263) | 31.696 (2.201) | 31.473 (2.197) |
| bigbench_salient_translation_error_detection | multiple_choice_grade | 18.737 (1.236) | 22.044 (1.313) | 16.934 (1.188) |
| bigbench_snarks | multiple_choice_grade | 46.409 (3.717) | 49.724 (3.727) | 43.094 (3.691) |
| bigbench_sports_understanding | multiple_choice_grade | 58.519 (1.57) | 50.71 (1.593) | 56.187 (1.581) |
| bigbench_temporal_sequences | multiple_choice_grade | 22.6 (1.323) | 26.6 (1.398) | 24.5 (1.361) |
| bigbench_tracking_shuffled_objects_five_objects | multiple_choice_grade | 18.64 (1.102) | 19.52 (1.122) | 18.16 (1.091) |
| bigbench_tracking_shuffled_objects_seven_objects | multiple_choice_grade | 13.771 (0.824) | 13.029 (0.805) | 12.743 (0.797) |
| bigbench_tracking_shuffled_objects_three_objects | multiple_choice_grade | 36.333 (2.781) | 36.0 (2.776) | 39.333 (2.825) |
| ARC Challenge | accuracy | 37.884 (1.418) | 37.628 (1.416) | 38.737 (1.424) |
| ARC Challenge | accuracy (norm) | 40.188 (1.433) | 38.567 (1.422) | 41.382 (1.439) |
| ARC Easy | accuracy | 72.18 (0.92) | 71.17 (0.929) | 72.433 (0.917) |
| ARC Easy | accuracy (norm) | 66.709 (0.967) | 68.855 (0.95) | 69.739 (0.943) |
| HellaSwag | accuracy | 53.436 (0.498) | 52.549 (0.498) | 55.756 (0.496) |
| HellaSwag | accuracy (norm) | 71.251 (0.452) | 69.189 (0.461) | 74.517 (0.435) |
| OpenBookQA | accuracy | 28.8 (2.027) | 29.8 (2.048) | 29.8 (2.048) |
| OpenBookQA | accuracy (norm) | 41.6 (2.206) | 39.0 (2.183) | 40.8 (2.2) |
| PIQA | accuracy | 76.496 (0.989) | 75.68 (1.001) | 78.727 (0.955) |
| PIQA | accuracy (norm) | 77.53 (0.974) | 76.442 (0.99) | 79.869 (0.936) |
| WinoGrande | accuracy | 68.114 (1.31) | 66.693 (1.325) | 65.904 (1.332) |
| NaturalQuestions | exact match | 22.632 (0.697) | 22.327 (0.693) | 20.139 (0.668) |
| TriviaQA | exact match | 55.06 (0.371) | 55.556 (0.371) | 53.912 (0.372) |
| BoolQ | accuracy | 72.508 (0.781) | 70.459 (0.798) | 72.416 (0.782) |
| GSM8K | accuracy | 9.401 (0.804) | 5.08 (0.605) | 5.231 (0.613) |
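The values in parentheses in Table 5 are not defined in the text; they look like standard errors of the reported scores, and the sketch below treats them that way as an assumption. For a task scored as correct or incorrect on n examples, the binomial standard error is one plausible source of such a value. The test-set size of 1319 for GSM8K is also an assumption, taken from the public test split rather than from this paper, and the helper names are hypothetical.

```python
import math


def accuracy_stderr(acc_percent: float, n_examples: int) -> float:
    """Binomial standard error of an accuracy, in percentage points."""
    p = acc_percent / 100.0
    return 100.0 * math.sqrt(p * (1.0 - p) / n_examples)


def z_score(score_a: float, se_a: float, score_b: float, se_b: float) -> float:
    """Gap between two scores in units of their combined standard error.

    Treats the two scores as independent, which is only a rough
    approximation when both models are evaluated on the same test set.
    """
    return (score_a - score_b) / math.sqrt(se_a ** 2 + se_b ** 2)


# GSM8K row of Table 5: HLAT 9.401 (0.804), OpenLLaMA-2 5.231 (0.613).
gsm8k_se = accuracy_stderr(9.401, 1319)  # 1319 is an assumed test-set size
gsm8k_z = z_score(9.401, 0.804, 5.231, 0.613)
print(f"stderr: {gsm8k_se:.3f}, z: {gsm8k_z:.2f}")
```

Under these assumptions the computed standard error agrees with the 0.804 listed for HLAT on GSM8K, which supports reading the parenthesized values as standard errors. A gap of several combined standard errors, as on GSM8K, is unlikely to be noise, while gaps below one (for example on ARC Challenge) should be read as a tie.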
Table 6. Evaluation results of HLAT on intermediate checkpoints. For Commonsense Reasoning (CR), we report the normalized accuracy. For Code, we report pass@100.
| Tokens (B) | MMLU | BBH | CR | WK | RC | Math | Code |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 120 | 28.49 (3.357) | 32.824 (1.835) | 50.761 (1.211) | 21.405 (0.431) | 61.468 (0.851) | 1.365 (0.32) | 15.85 |
| 200 | 24.759 (3.217) | 33.919 (1.843) | 53.14 (1.217) | 23.572 (0.448) | 64.098 (0.839) | 2.426 (0.424) | 18.90 |
| 400 | 26.066 (3.278) | 33.948 (1.835) | 55.208 (1.212) | 27.89 (0.481) | 56.575 (0.867) | 2.274 (0.411) | 20.12 |
| 600 | 27.936 (3.331) | 33.904 (1.835) | 55.489 (1.208) | 28.434 (0.485) | 67.187 (0.821) | 2.35 (0.417) | 21.95 |
| 800 | 27.947 (3.336) | 34.701 (1.835) | 56.17 (1.212) | 30.183 (0.49) | 68.257 (0.814) | 2.881 (0.461) | 23.78 |
| 1000 | 29.034 (3.391) | 34.571 (1.842) | 56.872 (1.213) | 32.713 (0.504) | 67.859 (0.817) | 3.867 (0.531) | 28.05 |
| 1200 | 33.367 (3.503) | 34.991 (1.853) | 57.458 (1.212) | 33.476 (0.507) | 69.174 (0.808) | 4.17 (0.551) | 27.44 |
| 1400 | 35.54 (3.546) | 35.495 (1.849) | 58.279 (1.208) | 35.674 (0.521) | 72.202 (0.784) | 4.776 (0.587) | 32.93 |
| 1500 | 37.701 (3.563) | 35.823 (1.853) | 58.862 (1.203) | 35.878 (0.525) | 71.346 (0.791) | 5.762 (0.642) | 32.31 |
| 1600 | 36.82 (3.532) | 35.334 (1.835) | 59.616 (1.204) | 36.758 (0.526) | 71.651 (0.788) | 6.368 (0.673) | 32.93 |
| 1800 | 41.318 (3.602) | 36.565 (1.845) | 59.455 (1.206) | 38.846 (0.534) | 72.508 (0.781) | 9.401 (0.804) | 34.15 |

Appendix B Evaluation Results on Intermediate Checkpoints

In Table 6 we list the evaluation results of HLAT on intermediate checkpoints, taken roughly every 200 billion training tokens. For Commonsense Reasoning, we report the normalized accuracy where it exists. For Code, we report pass@100. We also plot the trends in Figure 5.
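The text does not say how pass@100 is computed. A minimal sketch, assuming the unbiased pass@k estimator commonly used with HumanEval-style benchmarks, is shown below: n completions are sampled per problem, c of them pass the unit tests, and the estimate is 1 - C(n-c, k) / C(n, k). The function names and the per-problem counts are hypothetical and are not taken from the HLAT evaluation.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which pass the tests."""
    if k > n:
        raise ValueError("k cannot exceed the number of samples n")
    if n - c < k:
        # Every size-k subset must contain at least one passing sample.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


def benchmark_pass_at_k(results: list[tuple[int, int]], k: int) -> float:
    """Mean pass@k over problems, in percent; results holds (n, c) pairs."""
    return 100.0 * sum(pass_at_k(n, c, k) for n, c in results) / len(results)


# Hypothetical counts for three problems with 200 samples each.
example_results = [(200, 0), (200, 1), (200, 50)]
score = benchmark_pass_at_k(example_results, k=100)
print(f"pass@100: {score:.2f}")
```

Because pass@100 rewards a single passing sample among 100 attempts, it can rise steadily across checkpoints even while single-sample accuracy stays low.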

Figure 5 (panels a-m). Performance progress of HLAT on intermediate checkpoints over individual evaluation tasks.

Appendix C Neuron Nemo Megatron Package

We also experimented with the AWS Neuron Nemo Megatron package (https://awsdocs-neuron.readthedocs-hosted.com/en/latest/libraries/nemo-megatron/index.html), which is adapted from Nemo (Harper et al., [n. d.]) to run on AWS Trainium accelerators. In Figure 6, we plot the training loss and gradient norm of two runs with the same configuration, one on NDTL and one on Neuron Nemo Megatron. Neuron Nemo Megatron provides the same convergence as NDTL.

Figure 6. Convergence comparison of Neuron Nemo Megatron and NDTL. (a) Training loss versus the number of tokens (in billions) seen by the model during training. (b) Gradient norm.
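The comparison in Figure 6 is visual. One way to put a number on "same convergence" is to smooth both loss curves and report their largest relative gap. The sketch below is an illustration under assumptions, not the procedure used in this paper: the helper names are hypothetical, and the two curves are synthetic rather than measurements from either library.

```python
def moving_average(values: list[float], window: int) -> list[float]:
    """Trailing moving average; early points average over fewer values."""
    smoothed, running = [], 0.0
    for i, value in enumerate(values):
        running += value
        if i >= window:
            running -= values[i - window]
        smoothed.append(running / min(i + 1, window))
    return smoothed


def max_relative_gap(loss_a: list[float], loss_b: list[float], window: int = 5) -> float:
    """Largest relative difference between two smoothed loss curves."""
    if len(loss_a) != len(loss_b):
        raise ValueError("curves must be logged at the same steps")
    smooth_a = moving_average(loss_a, window)
    smooth_b = moving_average(loss_b, window)
    return max(abs(a - b) / max(abs(a), abs(b)) for a, b in zip(smooth_a, smooth_b))


# Synthetic curves for illustration only (not data from the paper).
run_a = [10.0 / (1 + 0.1 * step) + 2.0 for step in range(100)]
run_b = [loss * (1.01 if step % 2 else 0.99) for step, loss in enumerate(run_a)]
gap = max_relative_gap(run_a, run_b)
print(f"max relative gap: {gap:.4f}")
```

Smoothing before comparing keeps step-to-step noise from dominating the result. The same function can be applied to gradient-norm logs, provided both runs are logged at the same steps.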