License: arXiv.org perpetual non-exclusive license
arXiv:2304.06875v4 [cs.CL] 06 Apr 2024

nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales

Yiqun Yao    Siqi Fan    Xiusheng Huang    Xuezhi Fang    Xiang Li    Ziyi Ni    Xin Jiang    Xuying Meng    Peng Han    Shuo Shang    Kang Liu    Aixin Sun    Yequan Wang
Abstract

As language models scale up, it becomes increasingly expensive to verify research ideas because conclusions on small models do not trivially transfer to large ones. A possible solution is to establish a generic system that accurately predicts certain metrics for large models without training them. Existing scaling laws require hyperparameter search on the largest models, limiting their predictive capability. In this paper, we present an approach (namely μScaling) to predict the pre-training loss, based on our observations that Maximal Update Parametrization (μP) enables accurate fitting of scaling laws close to common loss basins in hyperparameter space. With μScaling, different model designs can be compared on large scales by training only their smaller counterparts. Further, we introduce nanoLM: an affordable LLM pre-training benchmark that facilitates this new research paradigm. With around 14% of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.

Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models. We also aspire for our benchmark to serve as a bridge between the academic community and the industry.

Machine Learning, ICML

1 Introduction

Table 1: Current LLMs. We summarize five popular Transformer models. GPT-3 and MT-NLG have a model size exceeding 100B parameters, with training data around 300B tokens. In contrast, other models are smaller in size but trained on data surpassing a trillion tokens.

| Model | Params | Tokens |
|---|---|---|
| GPT-3 | 175 B | 300 B |
| MT-NLG | 530 B | 270 B |
| Chinchilla | 70 B | 1.4 T |
| Llama 2 | 7/13/70 B | 2 T |
| Falcon | 7/40/180 B | 1.5/1 T |

Large Language Models (LLMs) pre-trained on Web-scale data have demonstrated impressive performance on various downstream tasks under a variety of evaluation protocols such as zero-shot, few-shot, and fine-tuning.

Modern LLMs are based on the Transformer architecture (Vaswani et al., 2017), and can be trained with unsupervised objectives including causal language modeling (Brown et al., 2020) and masked language modeling, among others (Wang et al., 2022). Since research on scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) revealed the potential of improving model performance by increasing the total computation, the community has been scaling up both model sizes and training data (Brown et al., 2020; Smith et al., 2022; Touvron et al., 2023b; Almazrouei et al., 2023), as briefly summarized in Table 1.

However, this trend makes it increasingly difficult for average researchers to verify research ideas or search for hyperparameters (HPs): first, conclusions achieved with small models do not trivially transfer to larger ones; second, directly training large models multiple times is a costly endeavor. For instance, as reported by Llama-2 (Touvron et al., 2023b), the time to train the 7B, 13B, and 70B models on roughly 2 trillion tokens is 184k, 368k, and 1.7M GPU hours with A100-80GB, respectively. Such an entry barrier prevents the community from swiftly making improvements that are reliable on large scales, and results in an enormous waste of computational resources and intellectual effort. Thus, it is necessary to establish a pipeline to compare different LLM structures, algorithms, and hyperparameters with limited computational resources (i.e., small models), while making sure the results are instructive for any model scale. This is the goal of this paper.

As a possible solution to this issue, the technical report of GPT-4 (OpenAI, 2023) showed that some behaviors of large models can be predicted before the training starts (with unpublished methods). In this paper, we start by proposing a method that yields accurate loss prediction, namely μScaling (a compound word of μP (Yang et al., 2022) and Scaling Laws (Kaplan et al., 2020)), with experimental results supporting its correctness. Based on μScaling, we establish a new paradigm for meaningful research on large models without actually training them. Finally, we propose an affordable benchmark for LLM pre-training studies, namely nanoLM, which facilitates this research paradigm for the community.

Contributions.

We substantiate our contributions as follows:

  • We propose μScaling, a loss prediction method based on μP and modified scaling laws. For hyperparameters (HPs) in the common loss basins, the training loss can be accurately predicted by a power-law function w.r.t. model sizes, which include embedding sizes, in contrast to existing methods. This method requires searching for the optimal HPs only once to predict the loss at an arbitrary model scale.

  • We unlock a new LLM study paradigm that can directly compare the loss for different model designs on large scales without direct training. We facilitate this paradigm by proposing nanoLM, an affordable benchmark. nanoLM is compatible with mainstream Transformer architectures, including decoder-only structures (e.g., GPT, Llama), encoder-only structures (e.g., BERT), and encoder-decoder structures (e.g., T5), and supports data parallelism strategies (Section 3.2.1). For benchmark evaluation, we publicly release a pre-training dataset with 100B/400B/1T/2T tokens, chosen from existing sources and categorized into various specialized domains (Section 3.2.2).

  • Effectiveness: Empirically, we successfully use our method to forecast the loss for 12-layer GPT, Llama, BERT, and T5 models on the C4 and MC4 datasets. For larger models, we experiment on GPT models with 32 and 64 layers, culminating in sizes of 26B and 52B, respectively. Results indicate that the actual loss remains predictable (Section 4.2).

  • Efficiency: By generating a series of small proxy models with sizes ranging from 38M to 3.4B, predicting loss using μScaling incurs only 13.1% and 14.2% of the one-time pre-training cost for 26B and 52B models, respectively. This demonstrates that nanoLM can help researchers make affordable and meaningful comparisons between different model designs and serve as a new benchmark for LLM study (Section 4.3).

To foster reproducibility, we will open-source all our code and data of nanoLM benchmark. Part of our code is attached in “Supplementary Material”.

Figure 1: Pre-training loss w.r.t. learning rate for different Transformer architectures across widths under μP. Panels: (a) GPT; (b) BERT; (c) T5. The results show that the loss landscapes are aligned for models with different widths.

2 Preliminaries

We first introduce our task setting (Section 2.1), and then discuss the pros and cons of current hyperparameter estimation methods (Section 2.2) and scaling laws (Section 2.3), as well as other related work (Section 2.4).

2.1 Task Setting

Our task is to do research with small models and safely generalize the conclusions to large-scale models. This is not a trivial problem. Suppose a small model $M_1$ can be trained with an optimal HP $H_1$ to achieve a loss $L_1$, while research suggests another model design, $M_2$, has a better $L_2 < L_1$ with hyperparameter $H_2$. Then, for a larger version of both models ($M_1'$ and $M_2'$), typically, we have $H_1' \neq H_1$ and $H_2' \neq H_2$. Thus, both $H_1'$ and $H_2'$ must be re-searched by training the large models. Moreover, there is also no guarantee that $L_2' < L_1'$ still holds. We use these notations throughout our paper.

2.2 Estimating HPs for Large Models

Maximal Update Parametrization (μP; Yang et al., 2022) is a transfer function for certain classes of HPs (i.e., μTransferable HPs, including learning rate, initialization variance, and multipliers) w.r.t. model widths (e.g., the hidden size of Transformers). Theories (Golikov & Yang, 2022; Littwin & Yang, 2023; Yang & Hu, 2021) suggest that for two models with the only difference being their widths $w$ and $w'$, their optimal μTransferable HPs $H$ and $H'$ satisfy $H' = \mu P(H, r)$, where $r = w'/w$. In other words, under such a transfer function, the loss landscapes (w.r.t. HPs) of models with different widths are aligned. See Figure 1 for a demonstration based on our reproduction. Please refer to the original papers for theoretical derivations.

For the μTransferable HPs, μP enables grid-searching on small models and directly transferring to large ones. However, it only provides insights on which HPs are better, and cannot predict the loss values themselves under these HPs. Meanwhile, many studies actually care about different model structures (e.g., GPT vs. Llama) and non-μTransferable HPs (e.g., layer numbers). Since μP does not cover these cases, there are very few reliable approaches to answer “how many layers should I use for a 100B model” unless we could somehow know the final training loss for each layer number. These cases are what we actually mean by model designs in Section 2.1 and elsewhere.

2.3 Estimating Loss for Large Models

The term “scaling laws” in modern deep learning stands for the relations between the performance of models (usually the test loss or some other performance metrics) and computational scales (like model size, dataset size, or training compute) (Hestness et al., 2017; Kaplan et al., 2020; Villalobos, 2023; Rosenfeld et al., 2020). These relations give researchers insight into trade-offs between model capabilities and computational budgets. However, these relations must be established through HP tuning across all model scales. Often, the optimal HP choices are unknown before the training of large models finishes. Thus, existing scaling laws have limited power for “loss prediction”.

2.4 Other Related Work

Some existing work explores HP transfer learning among different tasks or datasets (Perrone et al., 2018; Stoll et al., 2020; Salinas et al., 2020). In contrast, we focus on implementing loss prediction via HP transfer across different model widths, under the same task or data.

Our nanoLM benchmark is also related to current evaluation datasets for LLMs (Srivastava et al., 2022; Hendrycks et al., 2020). A critical difference is that nanoLM does not aim at directly evaluating pre-trained model checkpoints, but rather encourages users to pre-train new models from scratch and use the pre-training loss as the metric for comparison. Thus, nanoLM contains pre-training code with implementations of our proposed μScaling on different models, in addition to data processing functions. Pre-training on nanoLM is made affordable via loss prediction, which avoids running the largest models. This is orthogonal to the optimization of parallelism mechanics and hardware usage (Shoeybi et al., 2019; Dao, 2023; Rajbhandari et al., 2020).

Table 2: μP function for a model $M'$ that is $r$ times the width of $M$. If a parameter tensor has 2 dimensions that go to infinity when the model width goes to infinity, it is “matrix-like” (e.g., a fully-connected hidden layer); if the number is 1 or 0, it belongs to the “others” class. Note that embedding layers are “others”. “Output” means the layer that maps an infinite dimension to a finite dimension, which is the word decoding layer (lm_head) in Transformers. A multiplier is a constant multiplied by a parameter tensor, which has a similar function to softmax temperature.

| Hyperparameter (weight) | $M$ | $M' \sim r$ |
|---|---|---|
| AdamW learning rate (matrix-like) | $l$ | $l/r$ |
| AdamW learning rate (others) | $l$ | $l$ |
| Initialization variance (matrix-like) | $\sigma$ | $\sigma/r$ |
| Initialization variance (others) | $\sigma$ | $\sigma$ |
| Multiplier (output) | $\tau$ | $\tau/r$ |
| Multiplier (others) | $\tau$ | $\tau$ |
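
The mapping in Table 2 can be sketched in a few lines of Python. This is only an illustration of the table, not the nanoLM implementation: the class name `MuTransferableHPs`, its field names, and the numeric values are hypothetical and chosen for readability.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MuTransferableHPs:
    """Hypothetical container for the six entries of Table 2."""
    lr_matrix: float        # AdamW learning rate, matrix-like weights
    lr_others: float        # AdamW learning rate, all other weights
    init_var_matrix: float  # initialization variance, matrix-like weights
    init_var_others: float  # initialization variance, all other weights
    mult_output: float      # multiplier on the output (lm_head) layer
    mult_others: float      # multiplier on all other weights


def mup_transfer(base: MuTransferableHPs, r: float) -> MuTransferableHPs:
    """Map HPs tuned at the base width to a model r times as wide (Table 2)."""
    if r <= 0:
        raise ValueError("width ratio r must be positive")
    return MuTransferableHPs(
        lr_matrix=base.lr_matrix / r,              # l -> l / r
        lr_others=base.lr_others,                  # unchanged
        init_var_matrix=base.init_var_matrix / r,  # sigma -> sigma / r
        init_var_others=base.init_var_others,      # unchanged
        mult_output=base.mult_output / r,          # tau -> tau / r
        mult_others=base.mult_others,              # unchanged
    )


# Illustrative values only; they are not taken from the paper.
base = MuTransferableHPs(1e-2, 1e-2, 0.02, 0.02, 1.0, 1.0)
wide = mup_transfer(base, r=1024 / 256)  # base width 256 -> width 1024
print(wide)
```

Only the matrix-like learning rate, the matrix-like initialization variance, and the output multiplier depend on the width ratio; every other entry is copied unchanged.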

3 nanoLM with μScaling

In this section, we introduce our generic benchmark for LLM study without full-scale training, namely nanoLM. Figure 2 provides an overview of our benchmark. Section 3.1 contains details of our μScaling method and the resulting new research paradigm, and Section 3.2 introduces the components of our benchmark.

Figure 2: Illustration of standard LLM pre-training vs. nanoLM. Top: Directly pre-training LLMs with high computational cost, large data, and distributed training. Bottom: Loss prediction with μScaling. In this scenario, two different model designs, $M$ and $M'$, are being compared. The process unfolds in four phases: 1) Shrink the widths of the two models to base dimensions, denoted by $w_1$ and $w_1'$, for grid search on μTransferable HPs. 2) Choose a series of proxy models with small widths, then apply μP for zero-shot HP transfer to each model. 3) Train these proxy models, record the losses, and fit the scaling law. 4) Directly predict the loss at any given width without training large LLMs.

3.1 Unlocking a New LLM Research Paradigm

Pre-training Loss as an Indicator of Model Performance.

LLM training involves cross-entropy objectives such as predicting subsequent or masked tokens. Since the training data is typically processed for less than 1 epoch, this loss also strongly correlates with the perplexity on the validation set. Besides, recent studies (Touvron et al., 2023b; Schaeffer et al., 2023b) suggest a link between pre-training loss and downstream task performance, although this is challenging to assess comprehensively. In short, for the same data, pre-training loss is a reasonable indicator of model capabilities.

μScaling for Loss Prediction.

We explore the problem of predicting the loss of a target (large) model with a series of small proxy models. Based on the discussions in Sections 2.2 and 2.3, we observe that in this problem setting, μP and scaling laws are complementary: first, a scaling law offers a formal function to directly calculate the loss values from model scales, but the optimal HPs of each model must be searched separately; second, μP provides unified HPs for multiple models varying only in width, but does not give the exact loss values beyond “the wider, the better”. Thus, we develop a loss prediction approach, namely μScaling, as follows (Figure 2, bottom):

First, we generate a series of proxy models $\{M(w_0), M(w_1), M(w_2), \ldots, M(w_n)\}$ with $M_B = M(w_0)$ being the base model and $M_T = M(w_n)$ being the target large model. They differ only in widths $\{w_0, w_1, \ldots, w_n\}$.

Second, we grid-search for the optimal μTransferable HPs $H$ = (learning rate, init variance, multipliers) on $M_B$. μP concludes that $H(i) = \mu P(H(0), w_i/w_0)$ are also the optimal HPs for each $M(w_i)$, $i \in [0, n]$. We illustrate the μP function we use in Table 2, which corresponds to Table 8 in (Yang et al., 2022).

Third, we select some small $w_i$-s, train the models $M(w_i)$ with $H(i)$, and obtain the losses $L(i)$. Based on these points, we fit a power-law function $L(C) = aC^b + c$ w.r.t. the number of parameters $C$, and expect that this function accurately predicts $L_T = L(n)$ that could be achieved by training $M_T = M(w_n)$ with HP $H_T = H(n)$.

The correctness of the third step is critical and has not been studied yet, but we show by our comprehensive experiments (Section 4) that it is effective for HPs within the common loss basins in the HP space, and with our modification on scaling laws (below). Note that the formulation of scaling law is empirical and should be validated by experiments (Kaplan et al., 2020; Hoffmann et al., 2022; OpenAI, 2023).
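
The fitting step can be sketched with the standard library alone. The text does not say which optimizer is used to fit $L(C) = aC^b + c$, so the procedure below is an assumption: scan the exponent $b$ over a grid and, for each $b$, solve for $a$ and $c$ in closed form by ordinary least squares. The function names and the grid range are hypothetical. The sample points are the 12-layer GPT proxy sizes (in millions) and losses at 20k steps from Table 3(a).

```python
def fit_power_law(sizes, losses, b_grid=None):
    """Fit L(C) = a * C**b + c by scanning b and solving (a, c) in closed form."""
    if b_grid is None:
        # Assumed search range for the exponent: b in [-2.00, -0.01].
        b_grid = [-k / 100 for k in range(1, 201)]
    n = len(sizes)
    mean_y = sum(losses) / n
    best = None
    for b in b_grid:
        x = [s ** b for s in sizes]           # for fixed b the model is linear in x
        mean_x = sum(x) / n
        sxx = sum((xi - mean_x) ** 2 for xi in x)
        if sxx == 0.0:
            continue
        a = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, losses)) / sxx
        c = mean_y - a * mean_x
        sse = sum((a * xi + c - yi) ** 2 for xi, yi in zip(x, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


def predict_loss(params, size):
    """Evaluate the fitted power law at a (possibly much larger) model size."""
    a, b, c = params
    return a * size ** b + c


# Proxy points: 12-layer GPT, widths 128..896, from Table 3(a).
proxy_sizes = [8.82, 22.36, 40.61, 63.59, 91.28, 123.69, 160.82]
proxy_losses = [4.45, 4.20, 4.05, 3.94, 3.90, 3.87, 3.85]
params = fit_power_law(proxy_sizes, proxy_losses)
print("fitted (a, b, c):", params)
print("predicted loss at 202.67M parameters:", predict_loss(params, 202.67))
```

Because this fitting routine is a stand-in, its prediction need not reproduce the value reported in Table 3; the point is only that the target model never has to be trained to obtain an estimate.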

Embedding counts as model size. An interesting difference between μScaling and mainstream scaling laws is that μScaling fits better if the size of the embedding is counted in the model size, while methods like (Kaplan et al., 2020) benefit from excluding it. We explain this in Section 4.4.
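
To see how much this choice can matter for small proxy models, the sketch below counts parameters with and without the embedding. It rests on assumptions that are not stated in the text: a GPT-style block with roughly $12 \cdot \text{width}^2$ weights, no biases, LayerNorm, or positional embeddings, and a hypothetical vocabulary of 50,257 tokens. The function name is also hypothetical.

```python
def approx_gpt_params(width: int, n_layers: int, vocab_size: int = 50257,
                      include_embedding: bool = True) -> int:
    """Rough parameter count for a GPT-style decoder.

    Assumes about 12 * width**2 weights per block (4 * width**2 for attention,
    8 * width**2 for a 4x-wide MLP) and ignores biases, LayerNorm, and
    positional embeddings. vocab_size is an assumed value.
    """
    block_params = 12 * n_layers * width * width
    embedding_params = vocab_size * width
    return block_params + (embedding_params if include_embedding else 0)


for w in (128, 1024, 8192):
    with_emb = approx_gpt_params(w, n_layers=12)
    without_emb = approx_gpt_params(w, n_layers=12, include_embedding=False)
    share = 1 - without_emb / with_emb
    print(f"width={w:5d}  with={with_emb:>13,d}  without={without_emb:>13,d}  "
          f"embedding share={share:.1%}")
```

Under these assumptions the embedding dominates the narrowest proxy models and becomes a small fraction at large widths, so including or excluding it changes the horizontal axis of the fit most where the proxy points lie. The counts for 12 layers come out near the sizes listed in Table 3(a) (about 8.79M at width 128 and about 202.46M at width 1024), though the exact nanoLM configuration may differ.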

New Paradigm: Research without Re-search.

Based on μScaling, the μTransferable HPs act as “tunnels” towards directly comparing different model designs on large scales via loss prediction. For each model, we predict its loss via μScaling by shrinking its width into a series of small ones, so we do not need to actually train the large model at full width. In this process, only one HP search is needed for each model design. We formalize this new paradigm in Algorithm 1, and highlight its feature as “research without re-search”. This makes LLM studies more affordable, while still reaching meaningful conclusions for model scales beyond 50B.

Algorithm 1: New Paradigm: Research without Re-search

$\mathcal{H}$: all combinations of μTransferable HPs; $\mathcal{W}$: all possible widths; $\mathcal{P}$: all possible model designs, which are the objectives of research.
$p^* \in \mathcal{P}$: the best model found; $h^* \in \mathcal{H}$: the optimal μTransferable HPs; $w_T \in \mathcal{W}$: the target width to predict; $\mathcal{D}$: training data.

for $p$ in $\mathcal{P}$ do
    Select the base and target widths $w_0, w_T \in \mathcal{W}$;
    $h_0^* \leftarrow \mathrm{grid\_search}(w_0, p, \mathcal{H})$;
    for some small $w_i$ in $\mathcal{W}$ do
        $h_i^* \leftarrow \mu P(h_0^*, w_i/w_0)$;
        $L_i \leftarrow \mathrm{train}(w_i, p, h_i^*, \mathcal{D})$;
        $C_i \leftarrow \mathrm{count\_params}(w_i)$;
    end for
    Fit $L = aC^b + c$ with all $(L_i, C_i)$ pairs;
    $C_T \leftarrow \mathrm{count\_params}(w_T)$;
    $L_T(p) \leftarrow aC_T^b + c$;  # Loss Prediction
    $h_T^* \leftarrow \mu P(h_0^*, w_T/w_0)$;
    if $L_T(p) < L(p^*)$ then
        $p^* \leftarrow p$;
    end if
end for
Conclusion: model $p^*$ is better according to the predicted loss on ($w_T$, $h_T^*$).
Optional: $\mathrm{train}(w_T, p^*, h_T^*, \mathcal{D})$.
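
The control flow of Algorithm 1 can be sketched as a small Python driver. This is not the nanoLM code: every callable passed in (`grid_search`, `mup`, `train`, `count_params`, `fit_predict`) is a hypothetical placeholder that the user would supply, and the training data is assumed to be captured inside `train`. Two small deviations from the listing are deliberate assumptions: the best loss starts at infinity so that the first comparison is well defined, and `count_params` also receives the design, since different designs may have different sizes at the same width.

```python
import math


def compare_designs(designs, base_width, proxy_widths, target_width,
                    grid_search, mup, train, count_params, fit_predict):
    """Pick the design with the lowest predicted loss at target_width.

    grid_search(w0, design)          -> best HPs at the base width
    mup(hps, ratio)                  -> HPs transferred to a width ratio
    train(width, design, hps)        -> final training loss of a proxy model
    count_params(width, design)      -> parameter count
    fit_predict(sizes, losses, size) -> predicted loss at the target size
    """
    best_design, best_loss = None, math.inf
    predicted, target_hps = {}, {}
    for p in designs:
        h0 = grid_search(base_width, p)          # the only HP search per design
        sizes, losses = [], []
        for w in proxy_widths:                   # small proxy models only
            h = mup(h0, w / base_width)
            losses.append(train(w, p, h))
            sizes.append(count_params(w, p))
        predicted[p] = fit_predict(sizes, losses, count_params(target_width, p))
        target_hps[p] = mup(h0, target_width / base_width)
        if predicted[p] < best_loss:
            best_design, best_loss = p, predicted[p]
    return best_design, predicted, target_hps


# Toy stand-ins so the sketch runs; they are NOT real training code.
toy = dict(
    grid_search=lambda w0, p: 0.01,
    mup=lambda h, r: h / r,
    train=lambda w, p, h: (3.0 if p == "design_A" else 2.5) + 1.0 / w,
    count_params=lambda w, p: 12 * w * w,
    fit_predict=lambda sizes, losses, target: min(losses) - 0.01,
)
best, predicted, target_hps = compare_designs(
    ["design_A", "design_B"], 256, [256, 384, 512], 1024, **toy)
print(best, predicted, target_hps)
```

The target model is never trained inside the loop; only the optional final step of Algorithm 1 would train it, using the transferred HPs returned in `target_hps`.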

3.2 The nanoLM benchmark

We introduce nanoLM, an implementation of our research paradigm on different Transformer-based architectures, as well as a curated dataset serving as a benchmark for training and loss comparison.

3.2.1 Supported Architectures

nanoLM is based on PyTorch (https://pytorch.org/) and supports three popular architecture families: decoder-only structures (e.g., GPT (Brown et al., 2020) and Llama (Touvron et al., 2023a)), encoder-only structures (e.g., BERT (Devlin et al., 2019)), and encoder-decoder structures (e.g., T5 (Raffel et al., 2020)). Furthermore, considering the potential GPU memory overflow due to excessive model parameters and sequence lengths, nanoLM integrates Fully Sharded Data Parallel (FSDP; Zhao et al., 2023), which shards the optimizer states, gradients, and parameters across multiple GPUs.

3.2.2 Pre-training Data

To help researchers use nanoLM for comparative analysis across different model designs, we carefully construct a curated pre-training dataset from those of existing large-scale models (i.e., Llama, Falcon, GPT-3). It covers diverse domains to improve the generalization capabilities of the resultant models.

Data Statistics.

Our pre-training dataset, as detailed in Appendix Table 5, comprises a rich blend of various open-source materials that span a wide range of domains. Largely, we repurpose data sources previously utilized for training other LLMs, adhering strictly to the criteria that the data must be publicly accessible and conducive to open-sourcing. Our training data contains 4 tracks with 100B, 400B, 1T, or 2T tokens, representing different settings of data scales. All versions are sampled according to the proportions in Table 5.

Categorized Domains.

We add data from a wide range of domains to increase diversity. We categorize the domains as follows.

WebText: We include Falcon RefinedWeb (Penedo et al., 2023), a substantial English web dataset derived from CommonCrawl, featuring rigorous filtering and extensive deduplication. We also include OpenWebText2 (Gao et al., 2020), a sizeable dataset of filtered text documents, collected from URLs in Reddit submissions.

Professional Knowledge: We select Books3 (Gao et al., 2020), which consists of a mix of fiction and nonfiction books.

World Knowledge: We add English Wikipedia to our training dataset, which is a standard source of high-quality text for language modeling.

Code: We include the public GitHub dataset and hope to improve downstream performance on code-related tasks.

Academic: For scientific knowledge, we include arXiv for the training dataset, which consists of preprint research papers (Lewkowycz et al., 2022). These papers are mainly in the fields of math, computer science, and physics.

Question Answering: We include a dump of Stack Exchange, a website of high-quality questions and answers that covers diverse domains ranging from computer science to chemistry.

4 Experiments

First, we outline our experimental setup in Section 4.1. Then, we present empirical loss prediction results in Section 4.2. Last, we delve into cost analysis in Section 4.3. For additional analysis, please refer to Section 4.4 and Appendix A.

Table 3: Model parameters and loss across various widths and architectures. Note: loss values in parentheses represent predicted values.

(a) 12-layer models: parameter count (in millions) and loss value.

| Model | Metric | 128 | 256 | 384 | 512 | 640 | 768 | 896 | 1024 |
|---|---|---|---|---|---|---|---|---|---|
| GPT | Size / M | 8.82 | 22.36 | 40.61 | 63.59 | 91.28 | 123.69 | 160.82 | 202.67 |
| GPT | Loss@20k | 4.45 | 4.20 | 4.05 | 3.94 | 3.90 | 3.87 | 3.85 | 3.84 (3.810) |
| Llama | Size / M | 9.59 | 25.47 | 47.64 | 76.10 | 110.85 | 151.90 | 199.24 | 252.86 |
| Llama | Loss@20k | 4.49 | 4.31 | 4.24 | 4.20 | 4.18 | 4.17 | 4.11 | 4.10 (4.112) |
| BERT | Size / M | 22.52 | 51.28 | 86.33 | 127.67 | 175.31 | 229.24 | 289.45 | 355.96 |
| BERT | Loss@20k | 3.99 | 3.15 | 2.95 | 2.88 | 2.83 | 2.81 | 2.78 | 2.77 (2.782) |
| T5 | Size / M | 26.46 | 67.03 | 121.75 | 190.63 | 273.66 | 370.85 | 482.20 | 607.70 |
| T5 | Loss@20k | 4.75 | 4.66 | 4.60 | 4.58 | 4.52 | 4.47 | 4.42 | 4.40 (4.415) |

(b) 32-layer and 64-layer models: parameter count (in billions) and loss value.

| Model | Metric | 256 | 384 | 512 | 640 | 768 | 896 | 1024 | 2048 | 8192 |
|---|---|---|---|---|---|---|---|---|---|---|
| 32-layer GPT | Size / B | 0.038 | 0.076 | 0.126 | 0.189 | 0.265 | 0.353 | 0.454 | 1.714 | 26.185 |
| 32-layer GPT | Loss@7k | 3.92 | 3.76 | 3.65 | 3.59 | 3.54 | 3.49 | 3.47 | 3.45 | 3.41 (3.381) |
| 64-layer GPT | Size / B | 0.077 | 0.153 | 0.254 | 0.381 | 0.532 | 0.709 | 0.911 | 3.432 | 52.385 |
| 64-layer GPT | Loss@10k | 3.656 | 3.389 | 3.298 | 3.215 | 3.198 | 3.087 | 3.080 | 2.958 | 2.883 (2.861) |
Figure 3: Fitting results with and without μP. Panels: (a) 12-layer GPT and Llama; (b) 12-layer BERT; (c) 12-layer T5; (d) training loss on the MC4 dataset. The dots show the training loss across different small widths with μP. In contrast, the yellow crosses show the training loss at the same widths without μP. We fit the scaling law to these dots to check whether the loss of the final model is consistent with this trend. The red star denotes the actual loss from our training of the predicted wider model.
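
The gap between predicted and actual loss at the target widths can be computed directly from Table 3. The sketch below assumes that, in the last column of each loss row, the first number is the loss actually obtained and the second is the prediction; the dictionary keys are informal labels, not names used in the paper.

```python
# (actual loss, predicted loss) at the target width, read from Table 3.
table3_targets = {
    "GPT-12L": (3.84, 3.810),
    "Llama-12L": (4.10, 4.112),
    "BERT-12L": (2.77, 2.782),
    "T5-12L": (4.40, 4.415),
    "GPT-32L": (3.41, 3.381),
    "GPT-64L": (2.883, 2.861),
}


def prediction_errors(actual: float, predicted: float):
    """Return the absolute and relative gap between predicted and actual loss."""
    abs_err = abs(predicted - actual)
    return abs_err, abs_err / actual


for name, (actual, predicted) in table3_targets.items():
    abs_err, rel_err = prediction_errors(actual, predicted)
    print(f"{name:10s} abs={abs_err:.3f} rel={rel_err:.2%}")
```

Under this reading of the table, every relative gap is below 1% of the actual loss.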

4.1 Setup

Model and Training Details.

Our main goal is to justify the third step in μScaling (Section 3.1), as well as the usability of the nanoLM benchmark. To validate the loss prediction capability with different data scales, models, and computational resources, we experiment with the following three separate settings:

Single GPU. Firstly, we test loss prediction on Transformer architectures (i.e., GPT, BERT, T5) with 12 layers using the C4 (Raffel et al., 2020) dataset under a single-GPU setup. We utilize a base width of 256 to conduct a grid search for the μTransferable HPs, primarily targeting the learning rate, with values ranging from 5e-4 to 1e-1. (Footnote: The μTransferable HPs include learning rate, initialization standard deviation, and multipliers. Due to limited computational resources, we focus on the learning rate. However, researchers can still explore the other μTransferable HPs on nanoLM to achieve even better results.) The number of parameters for the model series ranges from 8M to 700M with different widths, as detailed in Table 3(a). The target width for prediction is 1024. (Footnote: Since this group of experiments targets a single GPU, both BERT and T5 ran out of memory when expanded to a width of 2048. We leave these results for future work.) The batch size was set to 16, and the sequence length to 2048. According to Yang et al. (2022), μP transfer tends to stabilize once training exceeds 5k steps. We therefore fit the loss at 7k and 20k steps and study how this choice affects predictive performance.

Multi-GPU with FSDP. Secondly, to experiment with larger models and more data, we employ Fully Sharded Data Parallel (FSDP) (Zhao et al., 2023) to address the training challenges. We conduct loss prediction with 32-layer GPTs on nanoLM pre-training data (Section 3.2). We utilize a base width of 256 to conduct a grid search for the learning rate, ranging from 5e-4 to 1e-1. The number of parameters ranges from 38M to 26B (Table 3(b)). The batch size was set to 512, and the sequence length to 512. We use proxy models with widths up to 2048 to predict the loss of the target model with width 8192.

Multiple Machines with Megatron. Lastly, we further validate the feasibility of μP and μScaling with extremely large widths. We conduct the experiments for the 64-layer GPTs on Megatron (Narayanan et al., 2021), which provides efficient tensor, pipeline, and sequence-based model parallelism for pre-training LLMs. We utilize a base width of 256 to conduct a grid search for the learning rate, ranging from 1e-4 to 1e-2. The number of parameters ranges from 77M to 52B (Table 3(b)). The batch size was set to 512, and the sequence length to 2048. We exit training and fit the loss at 10k steps, with 10.49B tokens consumed (10% of the 100B track of nanoLM pre-training data, Section 3.2).
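As a quick arithmetic check of the token budgets implied by the three settings, the sketch below multiplies steps, batch size, and sequence length. The step counts are the ones quoted in this section (20k for the single-GPU runs, 7k for FSDP, 10k for Megatron); the helper name `tokens_consumed` is only illustrative.

```python
def tokens_consumed(steps, batch_size, seq_len):
    """Tokens seen during training: one batch of `batch_size` sequences per step."""
    return steps * batch_size * seq_len


# (steps, batch size, sequence length) as stated in the setup paragraphs.
settings = {
    "single GPU, 12-layer": (20_000, 16, 2048),
    "FSDP, 32-layer": (7_000, 512, 512),
    "Megatron, 64-layer": (10_000, 512, 2048),
}

budgets = {name: tokens_consumed(*cfg) for name, cfg in settings.items()}
for name, tokens in budgets.items():
    print(f"{name}: {tokens / 1e9:.2f}B tokens")

megatron_tokens = budgets["Megatron, 64-layer"]
# Fraction of the 100B-token track used by the Megatron run.
megatron_fraction = megatron_tokens / 100e9
```

The Megatron setting gives 10,485,760,000 tokens, which is the 10.49B (about 10% of the 100B track) quoted above.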

All of our experiments are run on A100 GPUs. For each series of models used for loss prediction, the batches are fed into the models in the same order. More hyperparameters can be found in Appendix C.

μP and μScaling Settings.

All of our models are trained from scratch. Additionally, we adhere to the recommendation by Yang et al. (2022) to initialize the output word embeddings and query parameters as all-zero to avoid the Gaussian Process. (Footnote: Such a Gaussian Process may cause misalignment of landscapes between small and large models.) We maintain a consistent head dimension of 64 for each attention head and scale up the number of heads with model width. The dropout and weight decay are set to 0. We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with its default configurations. The coefficients $\{a, b, c\}$ of the power law $L = aC^{b} + c$, as well as their standard deviations, are computed with scipy.optimize.curve_fit() (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html).
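A minimal sketch of such a fit is given below, using the 64-layer GPT proxy losses from Table 3(b). Two things are assumptions rather than statements from the text: that $C$ is the parameter count in billions, and the starting point `p0` (the default of all ones has the wrong sign for $b$). The fitted values may therefore differ slightly from the coefficients reported later.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(C, a, b, c):
    """Power law L = a * C**b + c, with C the model size (assumed: billions of parameters)."""
    return a * np.power(C, b) + c


# 64-layer GPT proxy models, widths 256..2048 (Table 3(b), loss at 10k steps).
sizes_b = np.array([0.077, 0.153, 0.254, 0.381, 0.532, 0.709, 0.911, 3.432])
losses = np.array([3.656, 3.389, 3.298, 3.215, 3.198, 3.087, 3.080, 2.958])

# Hypothetical starting point, chosen only so that the optimizer starts with b < 0.
p0 = (0.3, -0.5, 2.8)
popt, pcov = curve_fit(power_law, sizes_b, losses, p0=p0, maxfev=10000)
perr = np.sqrt(np.diag(pcov))  # one-sigma uncertainties of a, b, c

# Extrapolate to the target model (width 8192, 52.385B parameters).
predicted = float(power_law(52.385, *popt))
print("a, b, c =", popt)
print("std     =", perr)
print("predicted loss at 52B:", round(predicted, 3), "(measured: 2.883)")
```

The square roots of the diagonal of `pcov` are the standard deviations of the coefficients, which is how `curve_fit` exposes the uncertainty of a fit.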

4.2 Main results: Fits & Extrapolation of μScaling

We use a base width of 256 to grid-search for the optimal μTransferable HPs. For 12-layer models, we found the optimal (learning rate, initialization standard deviation, output multiplier) to be (3e-2, 0.05, 5.0). For 32-layer and 64-layer GPTs, the optimal HPs are (3e-2, 0.04, 4.0) and (1e-3, 0.01, 1.0), respectively. These HPs are μTransferred to other widths by Table 2. Some specific loss values can be found in Table 3(b), while the corresponding fitted curves are illustrated in Figure 3. For more numerical loss values, please refer to Appendix E. The grid-search results can be found in Appendix D.

12-layer Models on C4/MC4 with a Single GPU.

For GPT, Llama, BERT, and T5, proxy models with widths $\leq 896$ are used to fit the power-law curves and predict the loss for widths $> 896$. (Footnote: We found a positive correlation between the number of data points and the accuracy of the fitted loss curve. We choose seven points due to limited computational resources.) For the 12-layer GPT and Llama, the fitted power laws are $L = 2.07 \cdot C^{-0.52} + 3.85$ and $L = 1.34 \cdot C^{-0.49} + 4.02$, respectively. For the 12-layer BERT, the result is $L = 61.18 \cdot C^{-1.31} + 3.43$; for the 12-layer T5, it is $L = 0.25 \cdot C^{-0.47} + 2.82$. As Figure 3 shows, the losses of the narrow proxy models successfully predict the loss of the wider models. This holds for both 7k and 20k training steps and for all four Transformer variants, supporting the effectiveness of μScaling. It also extends the μP conclusion (Yang et al., 2022) from “the wider, the better” to “wider is better, adherent to a power law”. In contrast, the yellow points, whose HPs follow standard parametrization rather than being transferred by μP, are not well fitted by the scaling law. They also underperform the same-width networks trained with μP. Additionally, Figure 3(d) demonstrates that the conclusion still holds on diverse multilingual datasets, such as MC4 (Raffel et al., 2020).

32-layer GPTs with FSDP.

To validate μScaling for larger-scale models, we expanded the GPT to 32 layers. We train with the data parallel strategy (FSDP) on the benchmark pre-training data, as supported by nanoLM (Section 3.2). Based on the proxy models with widths ranging from 256 to 2048, we predict the loss for a model with a width of 8192. The results are presented in Figure 4, with the power law being $L = 0.077 \cdot C^{-0.61} + 3.37$. Notably, μScaling successfully predicts the loss of a 26B model at 7k steps based on the losses from models ranging from 38M to 1.7B. This demonstrates that nanoLM is an effective paradigm to study LLMs of billion-scale magnitude.

64-layer GPTs with Megatron.

We further validate nanoLM with extremely large models, exemplified by 64-layer GPTs implemented with Megatron (Shoeybi et al., 2019). We train proxy models with widths ranging from 256 to 2048 using nanoLM pre-training data and forecast the loss for a model with a width of 8192. Results are presented in Figure 4, with the power law being $L = 0.249 \cdot C^{-0.47} + 2.82$. The results show that we can successfully predict the loss of a 52B model at 10k steps based on proxy models ranging from 77M to 3.4B, further affirming the accuracy of μScaling and the usability of nanoLM.
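To make the extrapolation concrete, the sketch below evaluates the two reported power laws at the target model sizes from Table 3(b) and compares the result with the measured loss. It assumes that $C$ is the parameter count in billions, which is an inference from the table rather than something stated in the text. Because the coefficients are quoted with rounding, the values can differ from the table's predictions in the third decimal.

```python
def power_law(size_b, a, b, c):
    """Evaluate L = a * C**b + c for a model size C (assumed: billions of parameters)."""
    return a * size_b ** b + c


# Reported coefficients (a, b, c), target size, and measured target loss (Table 3(b)).
reported = {
    "32-layer GPT": {"coeffs": (0.077, -0.61, 3.37), "size_b": 26.185, "measured": 3.41},
    "64-layer GPT": {"coeffs": (0.249, -0.47, 2.82), "size_b": 52.385, "measured": 2.883},
}

predictions = {}
for name, row in reported.items():
    predictions[name] = power_law(row["size_b"], *row["coeffs"])
    gap = abs(predictions[name] - row["measured"])
    print(f"{name}: predicted {predictions[name]:.3f}, "
          f"measured {row['measured']}, absolute gap {gap:.3f}")
```

Under this assumption both extrapolations land within about 0.03 of the measured target losses.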

Figure 4: μScaling results for large models. The blue/green dots signify the loss values of the proxy models. The red star denotes the actual loss values from our training of the target models.

4.3 Efficiency

Cost Ratio.

We calculate the number of floating point operations (FLOPs) of a model following Narayanan et al. (2021). Given a Transformer model with $l$ layers, width (a.k.a. hidden size) $w$, sequence length $s$, vocabulary size $V$, and training batch size $B$, the total number of floating-point operations is $M(l, w, s, V, B) = 96Bslw^{2}\left(1 + \frac{s}{6w} + \frac{V}{16lw}\right)$. Thus, the FLOPs ratio of the whole nanoLM process to the training of the full target model can be approximated as:

$$\frac{M(l, w_{1}, s, V, B) \cdot t + \sum_{i=2}^{n} M(l, w_{i}, s, V, B)}{M(l, w_{t}, s, V, B)}, \qquad (1)$$

where $w_{1}$ is the grid-search width and $t$ denotes the corresponding number of grid-search trials on μTransferable HPs (as detailed in Table 2). $n$ signifies the number of distinct proxy models used for fitting the loss function, while $w_{t}$ is the target model width. In our experiments, the 32-layer GPT comes with a sequence length of 512 and a batch size of 512. The vocabulary contains 100,256 tokens. Thus, the cost ratio is approximately 0.131. By a similar calculation, the cost ratio for our 64-layer GPT is 0.142. In short, with just 13.1% and 14.2% of the one-time pre-training cost, we can predict the loss for the 26B and 52B models, respectively. Note that without μScaling, grid search is needed on the large models, which would enlarge the advantage of our method several times over.
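The cost ratio of Equation (1) can be sketched in a few lines. The trial count `trials=8` (the size of the learning-rate grid) and the exact list of proxy widths are assumptions made for illustration, and the sketch charges every run the same number of steps, so its output is not expected to reproduce the reported 0.131 exactly.

```python
def train_flops(l, w, s, V, B):
    """FLOPs formula M(l, w, s, V, B) = 96*B*s*l*w^2 * (1 + s/(6w) + V/(16*l*w))."""
    return 96 * B * s * l * w**2 * (1 + s / (6 * w) + V / (16 * l * w))


def cost_ratio(l, s, V, B, grid_width, trials, proxy_widths, target_width):
    """Equation (1): (grid-search cost + proxy-model cost) / target-model cost."""
    grid_cost = trials * train_flops(l, grid_width, s, V, B)
    proxy_cost = sum(train_flops(l, w, s, V, B) for w in proxy_widths)
    return (grid_cost + proxy_cost) / train_flops(l, target_width, s, V, B)


# 32-layer GPT setting from the text; `trials` and `proxy_widths` are assumed.
ratio = cost_ratio(
    l=32, s=512, V=100_256, B=512,
    grid_width=256, trials=8,
    proxy_widths=[384, 512, 640, 768, 896, 1024, 2048],
    target_width=8192,
)
print(f"cost ratio: {ratio:.3f}")
```

Since the factor $96Bsl$ cancels between numerator and denominator, the ratio depends only on the widths, $s$, $V$, $l$, and $t$. The quadratic dependence on width is what keeps the proxies cheap relative to the target.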

4.4 Analytical Studies

Error Comparison with Related Work.

We compare the fitting error of nanoLM with related work that also provided loss numbers, including Cerebras-GPT (Dey et al., 2023) (in their Table 8) and Pythia (Biderman et al., 2023). As shown in Table 4, nanoLM achieves slightly lower fitting error than related work, even with larger models (26B and 52B). Moreover, the covariances of the fitted coefficients $\{a, b, c\}$ are significantly lower than those of related work. This indicates that nanoLM is more accurate and confident in its scaling laws, supporting its reliability.

Impact of Embedding Weights.

The ablation study in Figure 5 indicates that our scaling law fits worse if embedding parameters are not counted in the model size, in contrast to (Kaplan et al., 2020; Henighan et al., 2020). This is potentially because μP concludes that the learning rate of embedding layers should not be scaled down with width, while most existing methods grid-search for a unified learning rate for all layers at each width, which makes embeddings learn more slowly on large models and lets the matrix-like parameters dominate the training dynamics. This analysis is conducted with a 12-layer GPT-2 trained on OpenWebText (Gokaslan & Cohen, 2019) for 20k steps.

For other discussions, please see Appendix A.

5 Conclusion and Future Work

We present nanoLM, a cost-efficient benchmark for Large Language Model (LLM) studies, designed to predict losses accurately across various scales without direct training. This benchmark allows for comprehensive comparisons of different model architectures and algorithms, works well with FSDP, and contains curated pre-training datasets ranging from 100B to 2T tokens. Our loss prediction method, μScaling, is tested through comprehensive experiments, which demonstrate its effectiveness across different scales, enabling researchers with limited resources to conduct meaningful research on large models. Moving forward, we aim to extend nanoLM to more frameworks and tasks (e.g., vision), reduce resource waste from non-transferable findings, and enhance collaboration between academia and industry.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.

References

  • Almazrouei et al. (2023) Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, E., Heslow, D., Launay, J., Malartic, Q., Noune, B., Pannier, B., and Penedo, G. Falcon-40B: an open large language model with state-of-the-art performance. 2023.
  • Biderman et al. (2023) Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp.  2397–2430. PMLR, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.
  • Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
  • Dao (2023) Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
  • Devlin et al. (2019) Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp.  4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.
  • Dey et al. (2023) Dey, N., Gosal, G., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., and Hestness, J. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. CoRR, abs/2304.03208, 2023. doi: 10.48550/ARXIV.2304.03208. URL https://doi.org/10.48550/arXiv.2304.03208.
  • Gao et al. (2020) Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
  • Gokaslan & Cohen (2019) Gokaslan, A. and Cohen, V. Openwebtext corpus. 2019.
  • Golikov & Yang (2022) Golikov, E. and Yang, G. Non-gaussian tensor programs. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=AcHUIG2wA8-.
  • Hendrycks et al. (2020) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
  • Henighan et al. (2020) Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
  • Hestness et al. (2017) Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.longhoe.net/abs/1712.00409.
  • Hoffmann et al. (2022) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022. doi: 10.48550/arXiv.2203.15556. URL https://doi.org/10.48550/arXiv.2203.15556.
  • Kaplan et al. (2020) Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.longhoe.net/abs/2001.08361.
  • Lewkowycz et al. (2022) Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V. V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. Solving quantitative reasoning problems with language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/18abbeef8cfe9203fdf9053c9c4fe191-Abstract-Conference.html.
  • Littwin & Yang (2023) Littwin, E. and Yang, G. Adaptive optimization in the $\infty$-width limit. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zgVDqw9ZUES.
  • Loshchilov & Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
  • Narayanan et al. (2021) Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. Efficient large-scale language model training on GPU clusters using megatron-lm. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, pp.  58. ACM, 2021. doi: 10.1145/3458817.3476209. URL https://doi.org/10.1145/3458817.3476209.
  • OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
  • Penedo et al. (2023) Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data, and web data only. CoRR, abs/2306.01116, 2023. doi: 10.48550/arXiv.2306.01116. URL https://doi.org/10.48550/arXiv.2306.01116.
  • Perrone et al. (2018) Perrone, V., Jenatton, R., Seeger, M. W., and Archambeau, C. Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp.  6846–6856, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/14c879f3f5d8ed93a09f6090d77c2cc3-Abstract.html.
  • Raffel et al. (2020) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
  • Rajbhandari et al. (2020) Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp.  1–16. IEEE, 2020.
  • Rosenfeld et al. (2020) Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. A constructive prediction of the generalization error across scales. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryenvpEKDr.
  • Salinas et al. (2020) Salinas, D., Shen, H., and Perrone, V. A quantile-based approach for hyperparameter transfer learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp.  8438–8448. PMLR, 2020. URL http://proceedings.mlr.press/v119/salinas20a.html.
  • Schaeffer et al. (2023a) Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023a.
  • Schaeffer et al. (2023b) Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023b.
  • Shoeybi et al. (2019) Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv.longhoe.net/abs/1909.08053.
  • Smith et al. (2022) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., Zheng, E., Child, R., Aminabadi, R. Y., Bernauer, J., Song, X., Shoeybi, M., He, Y., Houston, M., Tiwary, S., and Catanzaro, B. Using deepspeed and megatron to train megatron-turing NLG 530b, A large-scale generative language model. CoRR, abs/2201.11990, 2022. URL https://arxiv.longhoe.net/abs/2201.11990.
  • Srivastava et al. (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
  • Stoll et al. (2020) Stoll, D., Franke, J. K. H., Wagner, D., Selg, S., and Hutter, F. Hyperparameter transfer across developer adjustments. CoRR, abs/2010.13117, 2020. URL https://arxiv.longhoe.net/abs/2010.13117.
  • Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi: 10.48550/arXiv.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
  • Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
  • Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp.  5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
  • Villalobos (2023) Villalobos, P. Scaling laws literature review, 2023. URL https://epochai.org/blog/scaling-laws-literature-review. Accessed: 2023-9-4.
  • Wang et al. (2022) Wang, T., Roberts, A., Hesslow, D., Scao, T. L., Chung, H. W., Beltagy, I., Launay, J., and Raffel, C. What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp.  22964–22984. PMLR, 2022. URL https://proceedings.mlr.press/v162/wang22u.html.
  • Wei et al. (2022) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models. CoRR, abs/2206.07682, 2022. doi: 10.48550/arXiv.2206.07682. URL https://doi.org/10.48550/arXiv.2206.07682.
  • Yang & Hu (2021) Yang, G. and Hu, E. J. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pp.  11727–11737. PMLR, 2021.
  • Yang et al. (2022) Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. Tensor programs V: tuning large neural networks via zero-shot hyperparameter transfer. CoRR, abs/2203.03466, 2022. doi: 10.48550/arXiv.2203.03466. URL https://doi.org/10.48550/arXiv.2203.03466.
  • Zhao et al. (2023) Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Nguyen, B., Chauhan, G., Hao, Y., and Li, S. Pytorch FSDP: experiences on scaling fully sharded data parallel. CoRR, abs/2304.11277, 2023. doi: 10.48550/arXiv.2304.11277. URL https://doi.org/10.48550/arXiv.2304.11277.

Appendix A More analysis

Table 4: Comparison of fitted results. “Coeff Values & Cov” denotes the coefficients $\{a, b, c\}$ and the covariance of the power-law function $y = aC^{b} + c$.
Model Size in billions & Loss(Error) Coeff Values & Cov
Cerebras-GPT 0.111 0.256 0.59 1.30 2.700 6.700 13.00 - - 6.76e1 -8.45e-2 7.25e1
2.608 2.349 2.181 1.997 1.834 1.704(0.034) 1.572(0.025) - - 4.84e1 2.10e-2 3.44e-1
Cerebras-GPT+μP 0.111 0.256 0.59 1.30 2.700 - - - - 4.73e1 -7.37e-2 5.12e1
2.588 2.359 2.155 1.984 1.846(0.004) - - - - 1.88e1 1.28e-2 3.05e-1
Pythia 0.070 0.160 0.410 1.000 1.400 2.800 6.900 12.00 - 9.67e6 -0.34 1.42
2.549 2.204 1.989 1.858 1.889 1.724 1.644(0.049) 1.601(0.019) - 3.89e7 8.89e-2 1.50e-1
nanoLM 0.077 0.153 0.254 0.381 0.532 0.709 0.911 3.432 5.24e1 0.25 -0.47 2.82
3.656 3.389 3.298 3.215 3.198 3.087 3.080 2.958(0.018) 2.883(0.022) 7.33e-2 8.50e-2 7.66e-2
Figure 5: A comparison of μScaling w/ and w/o embedding size, and the impact of HP loss basins. Panels: (a) scaling laws under μP for different HPs; (b) scaling laws without embedding parameters.

Scaling law fails outside the loss basins.

Both vanilla scaling laws and μScaling require at least one hyperparameter search. We define a loss basin as an area in the HP space with locally minimal loss, where $\partial L/\partial H \approx 0$. Does the scaling law still hold if we randomly pick an $H$ outside the loss basin and μTransfer it to all the widths? According to our results in Figure 5(a), although “the wider, the better” roughly holds as guaranteed by the μP theory, the numerical scaling law does not generalize well outside the basin (blue and green lines). This observation persists when models are trained longer, for 20k steps. Thus, we suggest searching for the best HPs first in any case.

Take average for extremely small proxy models.

We find that for extremely small proxy models (e.g., GPTs with 256 to 512 width and 6 layers), the fitting results are more vulnerable to slight misalignment of the loss landscapes in μTransfer. This issue can be addressed by sampling more HP points for each small width and using the average loss for fitting. Specifically, when we grid-search for the best HPs for 6-layer models with batch size 32 and width 256, we found (5e-4, 0.02, 3.5) to be inside the loss basin. As shown in Figure 6(a) (red line), nanoLM works perfectly for this single point. Then, we explored several other points around it and found that the scaling laws have larger deviations than for 12-layer models (Figure 6(a), other lines). We offset this deviation by fitting scaling laws with the average results across all these HPs near the loss basin. This works perfectly, as shown in Figure 6(b), and is practical when applying nanoLM because we observe in Figure 6(a) that larger widths (e.g., 2048, 3072) have significantly lower variance in loss w.r.t. different HPs and do not need multiple runs for averaging.

Figure 6: Results with 6-layer models. Panels: (a) scaling law for training loss with different HPs for 6-layer models; (b) scaling law for average training loss in the loss basin of 6-layer models.

General conditions for scaling laws.

Previous scaling laws directly search for HPs on each scale, and the resulting optimal HPs do not satisfy the μP function. It is widely known that these scaling laws still succeed over a wide range of model sizes. This indicates that μP is a sufficient but not necessary condition for scaling laws, and that the scaling law itself may represent a higher level of universality.

Emergent abilities and inverse scaling law.

We believe that emergent abilities (Wei et al., 2022) are closely related to the choice of evaluation metrics (Schaeffer et al., 2023a). Nevertheless, the pre-training loss and the language-modeling evaluation perplexity can still be accurately predicted, and they remain effective indicators of model performance.

Appendix B Pre-training Data Mix Ratio

Please see Table 5.

Table 5: Pre-training data ratio.
Dataset Sampling prop(%) Total tokens(B)
Arkiv 6.04 28.31
Books 5.22 24.46
Falcon RefinedWeb 20.81 97.49
Falcon RefinedWeb(wiki-like) 49.78 233.21
OpenWebText2 3.11 14.59
StackExchange 3.81 17.84
Github 10.18 47.70
Wikipedia 1.03 4.82
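The sampling proportions in Table 5 appear to be each dataset's share of the total token count. The short check below recomputes the shares from the token column; this reading of the table is an inference, not something stated in the text, and the dataset names are copied as they appear in the table.

```python
# (tokens in billions, listed sampling proportion in %) copied from Table 5.
table5 = {
    "Arkiv": (28.31, 6.04),
    "Books": (24.46, 5.22),
    "Falcon RefinedWeb": (97.49, 20.81),
    "Falcon RefinedWeb(wiki-like)": (233.21, 49.78),
    "OpenWebText2": (14.59, 3.11),
    "StackExchange": (17.84, 3.81),
    "Github": (47.70, 10.18),
    "Wikipedia": (4.82, 1.03),
}

total_tokens_b = sum(tokens for tokens, _ in table5.values())
# Share of the total token count, in percent, per dataset.
shares = {name: 100 * tokens / total_tokens_b for name, (tokens, _) in table5.items()}

print(f"total: {total_tokens_b:.2f}B tokens")
for name, (_, listed) in table5.items():
    print(f"{name}: computed {shares[name]:.2f}%, listed {listed}%")
```

The recomputed shares agree with the listed proportions to within about 0.01 percentage points, with a total of roughly 468B tokens.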

Appendix C The Hyperparameter settings for all experiments

The specific parameters of the experiment are as follows.

(1) The parameters of the model are: vocab_size = 50304; block_size = 1024; n_layer = [12, 32, 64]; head_size = 64; dropout = 0.0; output_mult = 1.0; zero_query = True; zero_emb = True; hp_tune_actual_width = [128, 256, 384, 512, 640, 768, 896, 1024, 2048, 4096, 8192].

(2) The parameters of the data are: input_length = 512; mlm_probability = 0.15; mean_noise_span_length = 3.0; num_workers = 2.

(3) The parameters of the optimizer are: name = adamwscale; batch_size = [16, 512]; total_steps = [7000, 10000]; warmup_steps = 5000; lr_scheduler = cosine; weight_decay = 0.0; grad_clip = 1.0; grad_acc = 1; final_cosine = 1e-5; base_lr = [5e-4, 1e-3, 5e-3, 1e-2, 3e-2, 5e-2, 7e-2, 1e-1].

Appendix D Grid search results with 256 base width

Table 6: Grid search on base width = 256. The specific parameters of the experiment are: n_layer = 12, batch_size = 16, hp_tune_actual_width = 256, total_steps = 7000, initialization std = 0.05, output multiplier = 5.0, base_lr = [5e-4, 1e-3, 5e-3, 1e-2, 3e-2, 5e-2, 7e-2, 1e-1].

| lr | 5e-4 | 1e-3 | 5e-3 | 1e-2 | 3e-2 | 5e-2 | 7e-2 | 1e-1 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 12-layer BERT loss | 7.37 | 7.27 | 5.01 | 4.39 | 3.9 | 4.17 | 5.24 | 6.97 |
| 12-layer GPT loss | 7.3 | 7.03 | 5.97 | 5.57 | 3.74 | 5.86 | 7.22 | 7.25 |
| 12-layer T5 loss | 6.85 | 6.33 | 5.37 | 5.13 | 4.71 | 5.14 | 5.28 | 6.45 |

Table 7: Grid search on base width = 256. The specific parameters of the experiment are: n_layer = 64, batch_size = 512, hp_tune_actual_width = 256, total_steps = 10000, initialization std = 0.01, output multiplier = 1.0, base_lr = [1e-4, 5e-4, 7e-4, 1e-3, 3e-3, 5e-3, 7e-3, 1e-2].

| lr | 1e-4 | 5e-4 | 7e-4 | 1e-3 | 3e-3 | 5e-3 | 7e-3 | 1e-2 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 64-layer GPT loss | 4.35 | 3.73 | 3.69 | 3.64 | 8.37 | 13.3 | 9.66 | 8.12 |
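
Reading the best learning rate off Tables 6 and 7 amounts to taking the argmin of each row. A small sketch of that lookup follows; the helper `best_lr` and the variable names are made up for the example, and the numbers are copied from the two tables.

```python
# Learning-rate grids and losses copied from Tables 6 and 7 (base width 256).
lrs_12 = [5e-4, 1e-3, 5e-3, 1e-2, 3e-2, 5e-2, 7e-2, 1e-1]
grid_12 = {
    "12-layer BERT": [7.37, 7.27, 5.01, 4.39, 3.9, 4.17, 5.24, 6.97],
    "12-layer GPT": [7.3, 7.03, 5.97, 5.57, 3.74, 5.86, 7.22, 7.25],
    "12-layer T5": [6.85, 6.33, 5.37, 5.13, 4.71, 5.14, 5.28, 6.45],
}
lrs_64 = [1e-4, 5e-4, 7e-4, 1e-3, 3e-3, 5e-3, 7e-3, 1e-2]
grid_64 = {
    "64-layer GPT": [4.35, 3.73, 3.69, 3.64, 8.37, 13.3, 9.66, 8.12],
}


def best_lr(lrs, losses):
    """Return (learning rate, loss) at the position of the lowest loss."""
    i = min(range(len(losses)), key=losses.__getitem__)
    return lrs[i], losses[i]


best = {}
for lrs, grid in ((lrs_12, grid_12), (lrs_64, grid_64)):
    for name, losses in grid.items():
        best[name] = best_lr(lrs, losses)
        print(f"{name}: best lr = {best[name][0]:g}, loss = {best[name][1]}")
```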

Appendix E Specific Loss Values

E.1 12-layer Models on C4

Table 8: Training loss of 12-layer models at 7k steps. The specific parameters of the experiment are: n_layer = 12, batch_size = [16, 512], hp_tune_actual_width = [128, 256, 384, 512, 640, 768, 896, 1024], base_lr = [1e-3, 3e-2].

| width | 128 | 256 | 384 | 512 | 640 | 768 | 896 | 1024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT w/o μP | 4.25 | 3.71 | 3.59 | 3.52 | 3.47 | 3.42 | 3.37 | 3.40 |
| BERT with μP | 4.45 | 3.74 | 3.63 | 3.56 | 3.49 | 3.47 | 3.44 | 3.43 |
| GPT w/o μP | 4.73 | 4.48 | 4.50 | 4.42 | 4.36 | 4.33 | 4.29 | 4.31 |
| GPT with μP | 4.52 | 4.25 | 4.16 | 4.10 | 4.04 | 4.01 | 4.00 | 3.98 |
| T5 w/o μP | 5.14 | 5.06 | 4.82 | 4.81 | 4.66 | 6.50 | 5.71 | 6.03 |
| T5 with μP | 5.18 | 4.71 | 4.57 | 4.61 | 4.60 | 4.60 | 4.52 | 4.49 |
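
One rough way to read Table 8 is to count, for each row, how many times the loss rises when moving to the next larger width. The sketch below does only that, using the values from the table; the helper `count_rises` and the `muP` spelling in the dictionary keys are choices made for the example, and the count is an informal summary rather than a metric used in the paper.

```python
# Training losses copied from Table 8, ordered by width.
widths = [128, 256, 384, 512, 640, 768, 896, 1024]
table8 = {
    "BERT w/o muP": [4.25, 3.71, 3.59, 3.52, 3.47, 3.42, 3.37, 3.40],
    "BERT with muP": [4.45, 3.74, 3.63, 3.56, 3.49, 3.47, 3.44, 3.43],
    "GPT w/o muP": [4.73, 4.48, 4.50, 4.42, 4.36, 4.33, 4.29, 4.31],
    "GPT with muP": [4.52, 4.25, 4.16, 4.10, 4.04, 4.01, 4.00, 3.98],
    "T5 w/o muP": [5.14, 5.06, 4.82, 4.81, 4.66, 6.50, 5.71, 6.03],
    "T5 with muP": [5.18, 4.71, 4.57, 4.61, 4.60, 4.60, 4.52, 4.49],
}


def count_rises(losses):
    """Number of adjacent width steps where the loss strictly increases."""
    return sum(1 for prev, cur in zip(losses, losses[1:]) if cur > prev)


rises = {name: count_rises(losses) for name, losses in table8.items()}
for name, n in rises.items():
    print(f"{name}: loss rises on {n} of {len(widths) - 1} width steps")
```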

E.2 12-layer Models on MC4

Table 9: Training loss of 12-layer models at 20k steps. The specific parameters of the experiment are: n_layer = 12, batch_size = [16, 512], hp_tune_actual_width = [128, 256, 384, 512, 640, 768, 896, 1024], base_lr = [1e-3, 5e-2].

| width | 128 | 256 | 384 | 512 | 640 | 768 | 896 | 1024 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| BERT | 2.79 | 1.36 | 1.24 | 1.17 | 1.14 | 1.10 | 1.08 | 1.06 |
| T5 | 2.15 | 2.05 | 1.76 | 1.62 | 1.54 | 1.51 | 1.45 | 1.41 |
| GPT | 1.50 | 1.38 | 1.34 | 1.32 | 1.32 | 1.31 | 1.30 | 1.29 |
| Llama | 1.58 | 1.48 | 1.46 | 1.44 | 1.40 | 1.40 | 1.39 | 1.38 |

E.3 32-layer GPT on nanoLM Benchmark Data

Table 10: Training loss of the 32-layer model at 7k steps. The specific parameters of the experiment are: n_layer = 32, batch_size = 512, hp_tune_actual_width = [256, 384, 512, 640, 768, 896, 1024, 2048, 4096, 8192], total_steps = 7000, base_lr = 5e-2.

| width | 256 | 384 | 512 | 640 | 768 | 896 | 1024 | 2048 | 8192 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT with μP | 3.92 | 3.76 | 3.65 | 3.59 | 3.54 | 3.49 | 3.47 | 3.45 | 3.41 |

E.4 64-layer GPT on nanoLM Benchmark Data

Table 11: Training loss of the 64-layer model at 10k steps. The specific parameters of the experiment are: n_layer = 64, batch_size = 512, hp_tune_actual_width = [384, 512, 640, 768, 896, 1024, 2048, 8192], total_steps = 10000, base_lr = 1e-3.

| width | 256 | 384 | 512 | 640 | 768 | 896 | 1024 | 2048 | 8192 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| GPT with μP | 3.656 | 3.389 | 3.298 | 3.215 | 3.198 | 3.087 | 3.080 | 2.958 | 2.883 |