nanoLM: an Affordable LLM Pre-training Benchmark via Accurate Loss Prediction across Scales

Yiqun Yao Siqi Fan Xiusheng Huang Xuezhi Fang Xiang Li Ziyi Ni Xin Jiang Xuying Meng Peng Han Shuo Shang Kang Liu Aixin Sun Yequan Wang

Abstract

As language models scale up, it becomes increasingly expensive to verify research ideas because conclusions on small models do not trivially transfer to large ones. A possible solution is to establish a generic system that accurately predicts certain metrics for large models without training them. Existing scaling laws require hyperparameter search on the largest models, limiting their predictive capability. In this paper, we present an approach (namely $\mu$ Scaling) to predict the pre-training loss, based on our observation that Maximal Update Parametrization ( $\mu$ P) enables accurate fitting of scaling laws close to common loss basins in hyperparameter space. With $\mu$ Scaling, different model designs can be compared on large scales by training only their smaller counterparts. Further, we introduce nanoLM: an affordable LLM pre-training benchmark that facilitates this new research paradigm. With around $14\%$ of the one-time pre-training cost, we can accurately forecast the loss for models up to 52B.

Our goal with nanoLM is to empower researchers with limited resources to reach meaningful conclusions on large models. We also aspire for our benchmark to serve as a bridge between the academic community and the industry.

Machine Learning, ICML

1 Introduction

Table 1: Current LLMs. We summarize five popular Transformer models. GPT3 and MT-NLG have a model size exceeding 100B parameters, with training data around 300B tokens. In contrast, other models are smaller in size but trained on data surpassing a trillion tokens.

Model	Params	Tokens
GPT-3	175 B	300 B
MT-NLG	530 B	270 B
Chinchilla	70 B	1.4 T
Llama 2	7/13/70 B	2 T
Falcon	7/40/180 B	1.5/1 T

Large Language Models (LLMs) pre-trained on Web-scale data have demonstrated impressive performance on various downstream tasks under a variety of evaluation protocols such as zero-shot, few-shot, and fine-tuning.

Modern LLMs are based on the Transformer architecture (Vaswani et al., 2017), and can be trained with unsupervised objectives such as causal language modeling (Brown et al., 2020) and masked language modeling, among others (Wang et al., 2022). Since research on scaling laws (Kaplan et al., 2020; Hoffmann et al., 2022) revealed the potential of improving model performance by increasing the total computation, the community has been scaling up both model sizes and training data (Brown et al., 2020; Smith et al., 2022; Touvron et al., 2023b; Almazrouei et al., 2023), as briefly summarized in Table 1.

However, this trend makes it increasingly difficult for average researchers to verify research ideas or search for hyperparameters (HPs): on the one hand, conclusions reached with small models do not trivially transfer to larger ones; on the other hand, training large models multiple times is costly. For instance, as reported for Llama 2 (Touvron et al., 2023b), training the 7B, 13B, and 70B models on roughly 2 trillion tokens takes 184k, 368k, and 1.7M GPU hours on A100-80GB, respectively. Such an entry barrier prevents the community from swiftly making improvements that are reliable on large scales, and results in an enormous waste of computational resources and intellectual effort. Thus, it is necessary to establish a pipeline that compares different LLM structures, algorithms, and hyperparameters with limited computational resources (i.e., small models), while making sure the results are instructive for any model scale. This is the goal of this paper.

As a possible solution to this issue, the technical report of GPT-4 (OpenAI, 2023) showed that some behaviors of large models can be predicted before the training starts (with unpublished methods). In this paper, we start by proposing a method that yields accurate loss prediction, namely $\mu$ Scaling (a compound of $\mu$ P (Yang et al., 2022) and Scaling Laws (Kaplan et al., 2020)), with experimental results supporting its correctness. Based on $\mu$ Scaling, we establish a new paradigm for meaningful research on large models without actually training them. Finally, we propose an affordable benchmark for LLM pre-training studies, namely nanoLM, which facilitates this research paradigm for the community.

Contributions.

We summarize our contributions as follows:

• We propose $\mu$ Scaling, a loss prediction method based on $\mu$ P and modified scaling laws. For hyperparameters (HPs) in the common loss basins, the training loss can be accurately predicted by a power-law function w.r.t. model size, where, in contrast to existing methods, the model size includes the embedding parameters. This method requires searching for the optimal HPs only once to predict the loss at an arbitrary model scale.
• We unlock a new LLM study paradigm that can directly compare the loss of different model designs on large scales without direct training. We facilitate this paradigm by proposing nanoLM, an affordable benchmark. nanoLM is compatible with mainstream Transformer architectures, including decoder-only structures (e.g., GPT, Llama), encoder-only structures (e.g., BERT), and encoder-decoder structures (e.g., T5), and supports data parallelism strategies (Section 3.2). For benchmark evaluation, we publicly release a pre-training dataset with 100B/400B/1T/2T tokens, chosen from existing sources and categorized into various specialized domains (Section 3.2).
• Effectiveness: Empirically, we successfully use our method to forecast the loss for 12-layer GPT, Llama, BERT, and T5 models on the C4 and MC4 datasets. For larger models, we experiment on GPT models with 32 and 64 layers, culminating in sizes of 26B and 52B, respectively. Results indicate that the actual loss remains predictable (Section 4.2).
• Efficiency: By generating a series of small proxy models with sizes ranging from 38M to 3.4B, predicting the loss with $\mu$ Scaling incurs only $13.1\%$ and $14.2\%$ of the one-time pre-training cost for the 26B and 52B models, respectively. This demonstrates that nanoLM can help researchers make affordable and meaningful comparisons between different model designs and serve as a new benchmark for LLM study (Section 4.3).

To foster reproducibility, we will open-source all our code and data of nanoLM benchmark. Part of our code is attached in “Supplementary Material”.

2 Preliminaries

We first introduce our task setting (Section 2.1), and then discuss the pros and cons of current hyperparameter estimation methods (Section 2.2) and scaling laws (Section 2.3), as well as other related work (Section 2.4).

2.1 Task Setting

Our task is to do research with small models and safely generalize the conclusions to large-scale models. This is not a trivial problem. Suppose a small model $M_{1}$ can be trained with an optimal HP $H_{1}$ to achieve a loss $L_{1}$ , while research suggests that another model design, $M_{2}$ , achieves a better loss $L_{2}\!<\!L_{1}$ with hyperparameter $H_{2}$ . Then, for larger versions of both models ( $M_{1}^{\prime}$ and $M_{2}^{\prime}$ ), we typically have $H_{1}^{\prime}\!\neq\!H_{1}$ and $H_{2}^{\prime}\!\neq\!H_{2}$ . Thus, both $H_{1}^{\prime}$ and $H_{2}^{\prime}$ must be searched again by training the large models. Moreover, there is no guarantee that $L_{2}^{\prime}\!<\!L_{1}^{\prime}$ still holds. We use these notations throughout our paper.

2.2 Estimating HPs for Large Models

Maximal Update Parametrization ( $\mu$ P, (Yang et al., 2022)) is a transfer function for certain classes of HPs (i.e., $\mu$ Transferable HPs, including learning rate, initialization variance, and multipliers) w.r.t. model widths (e.g., the hidden size of Transformers). Theories (Golikov & Yang, 2022; Littwin & Yang, 2023; Yang & Hu, 2021) suggest that for two models whose only difference is their widths $w$ and $w^{\prime}$ , the optimal $\mu$ Transferable HPs $H$ and $H^{\prime}$ satisfy $H^{\prime}\!=\!\mu P(H,r)$ , where $r\!=\!w^{\prime}/w$ . In other words, under this transfer function, the loss landscapes (w.r.t. HPs) of models of different widths are aligned. See Figure 1 for a demonstration based on our reproduction. Please refer to the original papers for theoretical derivations.

For the $\mu$ Transferable HPs, $\mu$ P enables grid-searching on small models and directly transferring the result to large ones. However, it only indicates which HPs are better; it cannot predict the loss values themselves under these HPs. Meanwhile, many studies actually care about different model structures (e.g., GPT vs. Llama) and non- $\mu$ Transferable HPs (e.g., layer numbers). Since $\mu$ P does not cover these cases, there are very few reliable approaches to answer “how many layers should I use for a 100B model” unless we could somehow know the final training loss for each layer number. These cases are what we mean by model designs in Section 2.1 and elsewhere.

2.3 Estimating Loss for Large Models

The term “scaling laws” in modern deep learning stands for the relations between the performance of models (usually the test loss or some other performance metric) and computational scales (such as model size, dataset size, or training compute) (Hestness et al., 2017; Kaplan et al., 2020; Villalobos, 2023; Rosenfeld et al., 2020). These relations give researchers insight into the trade-offs between model capabilities and computational budgets. However, they must be established through HP tuning across all model scales. Often, the optimal HP choices are unknown before the training of large models finishes. Thus, existing scaling laws have limited power for “loss prediction”.

2.4 Other Related Work

Some existing work explores HP transfer learning among different tasks or datasets (Perrone et al., 2018; Stoll et al., 2020; Salinas et al., 2020). In contrast, we focus on implementing loss prediction via HP transfer across different model widths, under the same task or data.

Our nanoLM benchmark is also related to current evaluation datasets for LLMs (Srivastava et al., 2022; Hendrycks et al., 2020). A critical difference is that nanoLM does not aim at directly evaluating pre-trained model checkpoints, but rather encourages the users to pre-train new models from scratch, and use the pre-training loss as metric for comparison. Thus, nanoLM contains pre-training code with implementations of our proposed $\mu$ Scaling on different models, in addition to data processing functions. Pre-training on nanoLM is made affordable via loss prediction, which avoids running the largest models. This is orthogonal to the optimization of parallelism mechanics and hardware usage (Shoeybi et al., 2019; Dao, 2023; Rajbhandari et al., 2020).

Table 2: $\mu$ P function for a model $M^{\prime}$ that is $r$ times the width of $M$ . If a parameter tensor has 2 dimensions that go to infinity when the model width goes to infinity, it is “matrix-like” (e.g., a fully-connected hidden layer); if the number is 1 or 0, it belongs to the “others” class. Note that embedding layers are “others”. “Output” means the layer that maps an infinite dimension to a finite dimension, which is the word decoding layer (lm_head) in Transformers. A multiplier is a constant multiplied by a parameter tensor, which has a similar function to softmax temperature.

Hyperparameter (weight)	Base model $M$	Model $M^{\prime}$ ( $r\times$ wider)
AdamW learning rate (matrix-like)	$l$	$l/r$
AdamW learning rate (others)	$l$	$l$
Initialization variance (matrix-like)	$\sigma$	$\sigma/r$
Initialization variance (others)	$\sigma$	$\sigma$
Multiplier (output)	$\tau$	$\tau/r$
Multiplier (others)	$\tau$	$\tau$
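
The rules in Table 2 can be written down as a small function. The following is a minimal sketch, not the nanoLM implementation: the `MuHPs` container, its field names, and the numeric values are assumptions made for illustration, and only the "divide by $r$" versus "keep unchanged" pattern is taken from the table.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MuHPs:
    """Hypothetical container for the six rows of Table 2."""
    lr_matrix: float        # AdamW learning rate, matrix-like tensors
    lr_other: float         # AdamW learning rate, all other tensors (incl. embeddings)
    init_var_matrix: float  # initialization variance, matrix-like tensors
    init_var_other: float   # initialization variance, all other tensors
    mult_output: float      # multiplier of the output (lm_head) layer
    mult_other: float       # multiplier of all other tensors


def mu_transfer(base: MuHPs, r: float) -> MuHPs:
    """Map the HPs of a base model to a model that is r times wider (Table 2)."""
    if r <= 0:
        raise ValueError("width ratio r must be positive")
    # Only the matrix-like learning rate, the matrix-like init variance and the
    # output multiplier are divided by r; every other entry is left unchanged.
    return replace(
        base,
        lr_matrix=base.lr_matrix / r,
        init_var_matrix=base.init_var_matrix / r,
        mult_output=base.mult_output / r,
    )


# Illustrative values only (not taken from the experiments).
base = MuHPs(lr_matrix=0.03, lr_other=0.03,
             init_var_matrix=0.0025, init_var_other=0.0025,
             mult_output=5.0, mult_other=1.0)
wide = mu_transfer(base, r=1024 / 256)  # base width 256 -> width 1024
print(wide)
```

Keeping the "others" entries untouched is what lets a single grid search on the base width serve every wider model.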

3 nanoLM with $\mu$ Scaling

In this section, we introduce our generic benchmark for LLM study without full-scale training, namely nanoLM. Figure 2 provides an overview of our benchmark. Section 3.1 contains details of our $\mu$ Scaling method and the resulting new research paradigm, and Section 3.2 introduces the components of our benchmark.

3.1 Unlocking a New LLM Research Paradigm

Pre-training Loss as an Indicator of Model Performance.

LLM training involves cross-entropy objectives such as predicting subsequent or masked tokens. Since the training data is typically processed for less than 1 epoch, this loss also strongly correlates with the perplexity on the validation set. Besides, recent studies (Touvron et al., 2023b; Schaeffer et al., 2023b) suggest a link between pre-training loss and downstream task performance, although this is challenging to assess comprehensively. In short, for the same data, pre-training loss is a reasonable indicator of model capabilities.

$\mu$ Scaling for Loss Prediction.

We explore the problem of predicting the loss of a target (large) model with a series of small proxy models. Based on the discussions in Sections 2.2 and 2.3, we observe that in this problem setting, $\mu$ P and scaling laws are complementary: first, scaling laws offer a formal function to directly calculate the loss values from model scales, but the optimal HPs of each model must be searched separately; second, $\mu$ P provides unified HPs for multiple models varying only in width, but says nothing about the exact loss values beyond “the wider, the better”. Thus, we develop a loss prediction approach, namely $\mu$ Scaling, as follows (Figure 2, bottom):

First, we generate a series of proxy models $\{M(w_{0}),M(w_{1}),M(w_{2}),...,M(w_{n})\}$ with $M_{B}\!=\!M(w_{0})$ being the base model and $M_{T}\!=\!M(w_{n})$ being the target large model. They differ only in widths $\{w_{0},w_{1},...,w_{n}\}$ .

Second, we grid-search for the optimal $\mu$ Transferable HPs $H$ = (learning rate, init variance, multipliers) on $M_{B}$ . $\mu$ P concludes that $H(i)\!=\!\mu P(H(0),w_{i}/w_{0})$ are also the optimal HPs for each $M(w_{i}),i\!\in\![0,n]$ . We illustrate the $\mu$ P function we use in Table 2, which corresponds to Table 8 in (Yang et al., 2022).

Third, we select some small $w_{i}$ -s, train the models $M(w_{i})$ with $H(i)$ , and obtain the losses $L(i)$ . Based on these points, we fit a power-law function $L(C)=aC^{b}+c$ w.r.t. the number of parameters $C$ , and expect that this function accurately predicts the loss $L_{T}\!=\!L(n)$ that would be achieved by training $M_{T}\!=\!M(w_{n})$ with HP $H_{T}\!=\!H(n)$ .

The correctness of the third step is critical and has not been studied before. Our experiments (Section 4) show that it holds for HPs within the common loss basins in the HP space, given our modification to scaling laws (below). Note that the formulation of a scaling law is empirical and should be validated by experiments (Kaplan et al., 2020; Hoffmann et al., 2022; OpenAI, 2023).

Embedding counts as model size. An interesting difference between $\mu$ Scaling and mainstream scaling laws is that $\mu$ Scaling fits better if the embedding parameters are counted in the model size, while methods like (Kaplan et al., 2020) benefit from excluding them. We explain this in Section 4.4.

New Paradigm: Research without Re-search.

Based on $\mu$ Scaling, the $\mu$ Transferable HPs are actually “tunnels” towards directly comparing different model designs on large scales via loss prediction. For each model, we predict its loss via $\mu$ Scaling by shrinking its width into a series of small ones and do not need to actually train the large model with full width. In this process, only one HP search is needed for each model design. We formalize this new paradigm in Algorithm 1, and highlight its feature as “research without re-search”. This makes LLM studies more affordable, while still reaching meaningful conclusions for model scales beyond 50B.

Algorithm 1 New Paradigm: Research without Re-search

$\mathcal{H}$ : all combinations of $\mu$ Transferable HPs; $\mathcal{W}$ : all possible widths; $\mathcal{P}$ : all possible model designs, which are the objectives of research.
$p^{*}\in\mathcal{P}$ : the best model found; $h^{*}\in\mathcal{H}$ : the optimal $\mu$ Transferable HPs; $w_{T}\in\mathcal{W}$ : the target width to predict; $\mathcal{D}$ : training data.

1: for $p\in\mathcal{P}$ do
2: Select the base and target widths $w_{0},w_{T}\in\mathcal{W}$ ;
3: $h_{0}^{*}\leftarrow grid\_search(w_{0},p,\mathcal{H})$ ;
4: for some small $w_{i}\in\mathcal{W}$ do
5: $h_{i}^{*}\leftarrow\mu P(h_{0}^{*},w_{i}/w_{0})$ ;
6: $L_{i}\leftarrow train(w_{i},p,h_{i}^{*},\mathcal{D})$ ;
7: $C_{i}\leftarrow count\_params(w_{i})$ ;
8: end for
9: Fit $L=aC^{b}+c$ with all ( $L_{i}$ , $C_{i}$ )s;
10: $C_{T}\leftarrow count\_params(w_{T})$ ;
11: $L_{T}(p)\leftarrow aC_{T}^{b}+c$ ; # Loss Prediction
12: $h_{T}^{*}\leftarrow\mu P(h_{0}^{*},w_{T}/w_{0})$ ;
13: if $L_{T}(p)<L(p^{*})$ then
14: $p^{*}\leftarrow p$ ;
15: end if
16: end for
17: Conclusion: model $p^{*}$ is better according to the predicted loss on ( $w_{T}$ , $h_{T}^{*}$ ).
18: Optional: $train(w_{T},p^{*},h_{T}^{*},\mathcal{D})$
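
To make the control flow of Algorithm 1 concrete, here is a toy, self-contained sketch. Everything numeric in it is an assumption: `train` is a synthetic stand-in that returns losses from a hidden power law instead of training a model, `count_params` is a rough hypothetical parameter count, the HP set is reduced to a single learning rate, and the fit scans $c$ and regresses $\log(L-c)$ on $\log C$ rather than calling the scipy routine used in our experiments. The two hypothetical designs are chosen so that the one that looks better on every proxy width is worse at the target width.

```python
import math

VOCAB, LAYERS, BASE_WIDTH = 50257, 12, 128  # hypothetical sizes

# Synthetic "ground truth" (a, b, c) per design; a real run would train models.
HIDDEN_LAW = {"design_A": (2.0, -0.5, 3.0), "design_B": (4.0, -0.5, 2.8)}
BEST_BASE_LR = 3e-2  # optimum of the toy HP landscape at the base width


def count_params(width):
    """Rough parameter count in millions, embeddings included."""
    return (12 * LAYERS * width ** 2 + VOCAB * width) / 1e6


def mu_p(lr0, ratio):
    """muP rule for a matrix-like AdamW learning rate: divide by the width ratio."""
    return lr0 / ratio


def train(width, design, lr):
    """Stand-in for a training run: power law in size plus a penalty for a bad lr."""
    a, b, c = HIDDEN_LAW[design]
    base_equivalent_lr = lr * width / BASE_WIDTH
    penalty = 0.05 * math.log10(base_equivalent_lr / BEST_BASE_LR) ** 2
    return a * count_params(width) ** b + c + penalty


def fit_power_law(sizes, losses, step=0.01):
    """Fit L = a*C^b + c by scanning c and regressing log(L - c) on log(C)."""
    xs = [math.log(s) for s in sizes]
    x_mean = sum(xs) / len(xs)
    best = None
    for k in range(int(min(losses) / step)):
        c = k * step
        ys = [math.log(l - c) for l in losses]
        y_mean = sum(ys) / len(ys)
        b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
             / sum((x - x_mean) ** 2 for x in xs))
        a = math.exp(y_mean - b * x_mean)
        sse = sum((a * s ** b + c - l) ** 2 for s, l in zip(sizes, losses))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    return best[1:]


def research_without_research(designs, proxy_widths, target_width, lr_grid):
    predictions = {}
    for design in designs:                                                # step 1
        lr0 = min(lr_grid, key=lambda lr: train(BASE_WIDTH, design, lr))  # step 3
        sizes, losses = [], []
        for w in proxy_widths:                                            # steps 4-8
            lr_w = mu_p(lr0, w / BASE_WIDTH)
            sizes.append(count_params(w))
            losses.append(train(w, design, lr_w))
        a, b, c = fit_power_law(sizes, losses)                            # step 9
        predictions[design] = a * count_params(target_width) ** b + c     # steps 10-11
    best = min(predictions, key=predictions.get)                          # steps 13-15
    return best, predictions


best, preds = research_without_research(
    ["design_A", "design_B"], [128, 256, 384, 512], 2048, [1e-3, 1e-2, 3e-2, 1e-1])
print(best, preds)
```

In this toy setup only one learning-rate search per design is run, at the base width, and no model at the target width is ever evaluated by `train`.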

3.2 The nanoLM benchmark

We introduce nanoLM, an implementation of our research paradigm on different Transformer-based architectures, as well as a curated dataset serving as a benchmark for training and loss comparison.

3.2.1 Supported Architectures.

nanoLM is based on PyTorch (https://pytorch.org/) and supports three popular architectures: decoder-only structures (e.g., GPT (Brown et al., 2020) and Llama (Touvron et al., 2023a)), encoder-only structures (e.g., BERT (Devlin et al., 2019)), and encoder-decoder structures (e.g., T5 (Raffel et al., 2020)). Furthermore, since large parameter counts and long sequences can exhaust GPU memory, nanoLM integrates Fully Sharded Data Parallel (FSDP, (Zhao et al., 2023)), which shards the optimizer states, gradients, and parameters across multiple GPUs.

3.2.2 Pre-training Data

To help researchers use nanoLM for comparative analysis across different model designs, we carefully construct a curated pre-training dataset from those of existing large-scale models (i.e., Llama, Falcon, GPT-3). It covers diverse domains to improve the generalization capabilities of the resultant models.

Data Statistics.

Our pre-training dataset, as detailed in Appendix Table 5, comprises a rich blend of various open-source materials that span a wide range of domains. Largely, we repurpose data sources previously utilized for training other LLMs, adhering strictly to the criteria that the data must be publicly accessible and conducive to open-sourcing. Our training data contains 4 tracks with 100B, 400B, 1T, or 2T tokens, representing different settings of data scales. All versions are sampled according to the proportions in Table 5.

Categorized Domains.

We add data from a wide range of domains to increase diversity. We categorize the domains as follows.

WebText: We include Falcon RefinedWeb (Penedo et al., 2023), a substantial English web dataset derived from CommonCrawl, featuring rigorous filtering and extensive deduplication. We also include OpenWebText2 (Gao et al., 2020), a sizeable dataset of filtered text documents, collected from URLs in Reddit submissions.

Professional Knowledge: We select Books3 (Gao et al., 2020), which consists of a mix of fiction and nonfiction books.

World Knowledge: We add English Wikipedia to our training dataset, which is a standard source of high-quality text for language modeling.

Code: We include the public GitHub dataset and hope to improve downstream performance on code-related tasks.

Academic: For scientific knowledge, we include arXiv for the training dataset, which consists of preprint research papers (Lewkowycz et al., 2022). These papers are mainly in the fields of math, computer science, and physics.

Question Answering: We include a dump of Stack Exchange, a website of high-quality questions and answers that covers diverse domains ranging from computer science to chemistry.

4 Experiments

First, we outline our experimental setup in Section 4.1. Then, we present empirical loss prediction results in Section 4.2. Last, we delve into cost analysis in Section 4.3. For additional analysis, please refer to Section 4.4 and Appendix A.

Table 3: Model parameters and loss across various widths and architectures. Note: loss values in parentheses are predicted values.

(a) 12-Layer models: parameter count (in millions) and loss value.

Width		128	256	384	512	640	768	896	1024
GPT	Size / M	8.82	22.36	40.61	63.59	91.28	123.69	160.82	202.67
GPT	Loss@20k	4.45	4.20	4.05	3.94	3.90	3.87	3.85	3.84 (3.810)
Llama	Size / M	9.59	25.47	47.64	76.10	110.85	151.90	199.24	252.86
Llama	Loss@20k	4.49	4.31	4.24	4.20	4.18	4.17	4.11	4.10 (4.112)
BERT	Size / M	22.52	51.28	86.33	127.67	175.31	229.24	289.45	355.96
BERT	Loss@20k	3.99	3.15	2.95	2.88	2.83	2.81	2.78	2.77 (2.782)
T5	Size / M	26.46	67.03	121.75	190.63	273.66	370.85	482.20	607.70
T5	Loss@20k	4.75	4.66	4.60	4.58	4.52	4.47	4.42	4.40 (4.415)

(b) 32-layer and 64-layer models: parameter count (in billions) and loss value.

Width		256	384	512	640	768	896	1024	2048	8192
32-layer GPT	Size / B	0.038	0.076	0.126	0.189	0.265	0.353	0.454	1.714	26.185
32-layer GPT	Loss@7k	3.92	3.76	3.65	3.59	3.54	3.49	3.47	3.45	3.41 (3.381)
64-layer GPT	Size / B	0.077	0.153	0.254	0.381	0.532	0.709	0.911	3.432	52.385
64-layer GPT	Loss@10k	3.656	3.389	3.298	3.215	3.198	3.087	3.080	2.958	2.883 (2.861)

4.1 Setup

Model and Training Details.

Our main goal is to justify the third step in $\mu$ Scaling (Section 3.1), as well as the usability of the nanoLM benchmark. To validate the loss prediction capability with different data scales, models, and computational resources, we experiment with the following three separate settings:

Single GPU. First, we test loss prediction on Transformer architectures (i.e., GPT, BERT, T5) with 12 layers using the C4 (Raffel et al., 2020) dataset under a single-GPU setup. We use a base width of 256 to conduct a grid search for the $\mu$ Transferable HPs, primarily targeting the learning rate, with values ranging from $5e-4$ to $1e-1$ . (The $\mu$ Transferable HPs include learning rate, initialization standard deviation, and multipliers. Due to limited computational resources, we focus on the learning rate; researchers can still explore the other $\mu$ Transferable HPs on nanoLM to achieve even better results.) The number of parameters for the model series ranges from 8M to 700M with different widths, as detailed in Table 3(a). The target width for prediction is 1024. (Since this group of experiments targets a single GPU, both BERT and T5 ran out of memory when expanded to a width of 2048; we leave these cases for future work.) The batch size is 16 and the sequence length is 2048. According to (Yang et al., 2022), $\mu$ P transfer tends to stabilize once training surpasses 5k steps. We therefore fit the loss at 7k and 20k steps and study how this choice influences the predictive performance.

Multi-GPU with FSDP. Second, to experiment with larger models and more data, we employ Fully Sharded Data Parallel (FSDP) (Zhao et al., 2023) to address the resulting training challenges. We conduct loss prediction with 32-layer GPTs on the nanoLM pre-training data (Section 3.2). We use a base width of 256 to conduct a grid search for the learning rate, ranging from $5e-4$ to $1e-1$ . The number of parameters ranges from 38M to 26B (Table 3(b)). The batch size is 512 and the sequence length is 512. We use proxy models with widths up to 2048 to predict the loss of the target model with width 8192.

Multiple Machines with Megatron. Lastly, we further validate the feasibility of $\mu$ P and $\mu$ Scaling with extremely large widths. We conduct the experiments for the 64-layer GPTs on Megatron (Narayanan et al., 2021). It provides efficient tensor, pipeline, and sequence-based model parallelism for pre-training LLMs. We utilize a base width of 256 to conduct a grid search for the learning rate, ranging from $1e-4$ to $1e-2$ . The number of parameters ranges from 77M to 52B (Table 3(b)). The batch size was established at 512, and the sequence length was set to 2048. We exit training and fit the loss at 10k steps, with 10.49B tokens consumed (10 $\%$ of the 100B track of nanoLM pre-training data, Section 3.2).
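
The token budgets of the three settings follow directly from steps, batch size, and sequence length. The short sketch below assumes that the batch size counts sequences (an assumption that reproduces the 10.49B figure quoted for the 64-layer run); the step counts for the first two settings are the fitting steps listed in Table 3, and the dictionary labels are ours.

```python
def tokens_consumed(steps, batch_size, seq_len):
    """Tokens seen after `steps` updates, assuming batch_size counts sequences."""
    return steps * batch_size * seq_len


# (steps, batch size, sequence length) for the three experimental settings.
settings = {
    "12-layer, single GPU (20k steps)": (20_000, 16, 2048),
    "32-layer GPT, FSDP (7k steps)": (7_000, 512, 512),
    "64-layer GPT, Megatron (10k steps)": (10_000, 512, 2048),
}

tokens = {name: tokens_consumed(*cfg) for name, cfg in settings.items()}
for name, n in tokens.items():
    print(f"{name}: {n / 1e9:.2f}B tokens")
```

For the 64-layer setting this gives 10,485,760,000 tokens, i.e. about 10% of the 100B track.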

All of our experiments are run on A100 GPUs. For each series of models used for loss prediction, the batches are fed into the models in the same order. More hyperparameters can be found in Appendix C.

$\mu$ P and $\mu$ Scaling Settings.

All of our models are trained from scratch. Additionally, we follow the recommendation of (Yang et al., 2022) to initialize the output word embeddings and query parameters as all-zero to avoid the Gaussian Process (such a Gaussian Process may cause misalignment of landscapes between small and large models). We maintain a consistent head dimension of 64 for each attention head and scale up the number of heads with model width. Dropout and weight decay are set to 0. We employ the AdamW optimizer (Loshchilov & Hutter, 2019) with its default configuration. The $\{a,b,c\}$ coefficients in the power law $L=aC^{b}+c$ , as well as their standard deviations, are computed with scipy.optimize.curve_fit() (https://docs.scipy.org/doc/scipy/reference/generated/scipy.optimize.curve_fit.html).
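
As an illustration of this fitting step, the sketch below fits $L=aC^{b}+c$ with `scipy.optimize.curve_fit` and extrapolates to a larger size. The data points are synthetic and noise-free, generated from known coefficients rather than taken from our runs, and the initial guess `p0`, the bounds, and the 5B-parameter target are assumptions chosen for the example.

```python
import numpy as np
from scipy.optimize import curve_fit


def power_law(C, a, b, c):
    """Loss as a function of model size C: L = a * C^b + c."""
    return a * np.power(C, b) + c


# Synthetic proxy points from known coefficients (not data from the paper).
true_a, true_b, true_c = 2.0, -0.5, 3.5
sizes = np.array([10.0, 20.0, 40.0, 80.0, 160.0, 320.0, 640.0])  # millions of parameters
losses = power_law(sizes, true_a, true_b, true_c)

# Bounds keep a >= 0, b <= 0 and c >= 0 so the fit stays a decaying power law.
popt, pcov = curve_fit(
    power_law, sizes, losses,
    p0=(1.0, -0.3, 3.0),
    bounds=([0.0, -5.0, 0.0], [np.inf, 0.0, np.inf]),
)
perr = np.sqrt(np.diag(pcov))          # standard deviations of a, b, c
predicted = power_law(5000.0, *popt)   # extrapolate to a 5B-parameter target

print("a, b, c =", popt)
print("std devs =", perr)
print("predicted loss at C = 5000M:", predicted)
```

On real, noisy losses the standard deviations in `perr` would be non-zero, and they indicate how far the extrapolation can be trusted.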

4.2 Main results: Fits $\&$ Extrapolation of $\mu$ Scaling

We use a base width of 256 to grid-search for the optimal $\mu$ Transferable HPs. For 12-layer models, we found the optimal (learning rate, initialization standard deviation, output multiplier) to be $(3e-2,0.05,5.0)$ . For 32-layer and 64-layer GPTs, the optimal HPs are $(3e-2,0.04,4.0)$ and $(1e-3,0.01,1.0)$ , respectively. These HPs are $\mu$ Transferred to other widths according to Table 2. Some specific loss values can be found in Table 3, while the corresponding fitted curves are illustrated in Figure 3. For more numerical loss values, please refer to Appendix E. The grid-search results can be found in Appendix D.

12-layer Models on C4/MC4 with a Single GPU.

For GPT, Llama, BERT, and T5, proxy models with widths $\leq$ 896 are used to fit the power-law curves and predict the loss for widths $>$ 896. (We found a positive correlation between the number of data points and the accuracy of the fitted loss curve; we use seven points due to limited computational resources.) For the 12-layer GPT and Llama, the fitted power laws are $L=2.07*C^{-0.52}+3.85$ and $L=1.34*C^{-0.49}+4.02$ , respectively. For the 12-layer BERT, the result is $L=61.18*C^{-1.31}+3.43$ ; for the 12-layer T5, it is $L=0.25*C^{-0.47}+2.82$ . As Figure 3 shows, the losses of the narrow proxy models suffice to predict the loss of wider models. This holds for both 7k and 20k training steps and for all 4 Transformer variants, supporting the effectiveness of $\mu$ Scaling. It extends the $\mu$ P conclusion (Yang et al., 2022) from “the wider, the better” to “the wider, the better, following a power law”. In contrast, for the yellow points, whose HPs are not transferred by $\mu$ P but follow standard parametrization, the scaling law is not well fitted. These models also underperform the same-width networks using $\mu$ P. Additionally, Figure 3 (d) demonstrates that the conclusion still holds on diverse multilingual datasets, such as MC4 (Raffel et al., 2020).

32-layer GPTs with FSDP.

To validate $\mu$ Scaling for larger-scale models, we expand the GPT to 32 layers. We train with the data parallel strategy (FSDP) on the benchmark pre-training data, as supported by nanoLM (Section 3.2). Based on the proxy models with widths ranging from 256 to 2048, we predict the loss for a model with a width of 8192. The results are presented in Figure 4, with the power law being $L=0.077*C^{-0.61}+3.37$ . Notably, $\mu$ Scaling successfully predicts the loss of a 26B model at 7k steps based on the losses of models ranging from 38M to 1.7B. This demonstrates that nanoLM is an effective paradigm for studying LLMs at the billion-parameter scale.

64-layer GPTs with Megatron.

We further validate nanoLM with extremely large models, exemplified by 64-layer GPTs implemented with Megatron (Shoeybi et al., 2019). We train proxy models with widths ranging from 256 to 2048 using the nanoLM pre-training data and forecast the loss for a model with a width of 8192. Results are presented in Figure 4, with the power law being $L=0.249*C^{-0.47}+2.82$ . The results show that we can successfully predict the loss of a 52B model at 10k steps based on proxy models ranging from 77M to 3.4B, further affirming the accuracy of $\mu$ Scaling and the usability of nanoLM.
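
The gap between prediction and measurement at the target widths can be read off Table 3 with a few lines of arithmetic. The sketch below assumes that, in each target-width cell, the value outside the parentheses is the measured loss and the value inside is the predicted one; the dictionary labels are ours.

```python
# (actual, predicted) loss pairs at the target width, read from Table 3.
targets = {
    "12-layer GPT": (3.84, 3.810),
    "12-layer Llama": (4.10, 4.112),
    "12-layer BERT": (2.77, 2.782),
    "12-layer T5": (4.40, 4.415),
    "32-layer GPT": (3.41, 3.381),
    "64-layer GPT": (2.883, 2.861),
}


def prediction_errors(pairs):
    """Return {name: (absolute error, relative error)} for (actual, predicted) pairs."""
    return {name: (abs(a - p), abs(a - p) / a) for name, (a, p) in pairs.items()}


errors = prediction_errors(targets)
for name, (abs_err, rel_err) in errors.items():
    print(f"{name:>14}: |error| = {abs_err:.3f} ({100 * rel_err:.2f}% of actual)")
```

Under that reading, every gap is at most about 0.03 in loss, which is below 1% of the measured value in all six cases.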

4.3 Efficiency

Cost Ratio.

We calculate the number of floating point operations (FLOPs) of a model following (Narayanan et al., 2021). Given a Transformer model with $l$ layers, width (a.k.a. hidden size) $w$ , sequence length $s$ , vocabulary size $V$ , and training batch size $B$ , the total number of floating-point operations is $M(l,w,s,V,B)=96Bslw^{2}\left(1+\frac{s}{6w}+\frac{V}{16lw}\right)$ . Thus, the FLOPs ratio of the whole nanoLM process to the training of the full target model can be approximated as:

$$\frac{M(l,w_{1},s,V,B)\cdot t+\sum_{i=2}^{n}M(l,w_{i},s,V,B)}{M(l,w_{t},s,V,B)}, \tag{1}$$

where $w_{1}$ is the grid-search width and $t$ denotes the corresponding number of grid-search trials on the $\mu$ Transferable HPs (listed in Table 2). $n$ is the number of distinct proxy models used for fitting the loss function, and $w_{t}$ is the target model width. In our experiments, the 32-layer GPT comes with a sequence length of 512 and a batch size of 512. The vocabulary contains 100,256 tokens. Thus, the cost ratio is approximately 0.131. A similar calculation gives a cost ratio of 0.142 for our 64-layer GPT. In short, with just $13.1\%$ and $14.2\%$ of the one-time pre-training cost, we can predict the loss for the 26B and 52B models, respectively. Note that without $\mu$ Scaling, grid search is needed on the large models, which would enlarge the advantage of our method several times over.
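
Equation (1) is straightforward to evaluate in code. The sketch below implements the FLOPs formula and the ratio for the 32-layer setting, assuming every run uses the same number of training steps so that the step count cancels. The number of grid-search trials is not stated in this section, so `trials=5` is a hypothetical value, and the printed ratios are illustrative rather than a reproduction of the reported 13.1%.

```python
def train_flops(l, w, s, V, B):
    """FLOPs of a Transformer with l layers, width w, sequence length s,
    vocabulary size V and batch size B, following the formula above."""
    return 96 * B * s * l * w ** 2 * (1 + s / (6 * w) + V / (16 * l * w))


def cost_ratio(l, grid_width, proxy_widths, target_width, s, V, B, trials):
    """Equation (1): (grid-search cost + proxy training cost) / target training cost."""
    grid = trials * train_flops(l, grid_width, s, V, B)
    proxies = sum(train_flops(l, w, s, V, B) for w in proxy_widths)
    return (grid + proxies) / train_flops(l, target_width, s, V, B)


# 32-layer GPT setting: grid search at width 256, proxies up to 2048, target 8192.
common = dict(l=32, grid_width=256,
              proxy_widths=[384, 512, 640, 768, 896, 1024, 2048],
              target_width=8192, s=512, V=100_256, B=512)

ratio_proxies_only = cost_ratio(**common, trials=0)
ratio_with_search = cost_ratio(**common, trials=5)  # hypothetical trial count
print(f"proxies only: {ratio_proxies_only:.3f}, with grid search: {ratio_with_search:.3f}")
```

With these inputs the proxy runs alone come to roughly 13% of the target cost; each grid-search trial at width 256 adds only a small fraction on top, because the cost grows roughly quadratically with width.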

4.4 Analytical Studies

Error Comparison with Related Work.

We compare the fitting error of nanoLM with related work that also provides loss numbers, including Cerebras-GPT (Dey et al., 2023) (in their Table 8) and Pythia (Biderman et al., 2023). As shown in Table 4, nanoLM achieves slightly lower fitting error than related work, even with larger models (26B and 52B). Moreover, the covariances of the fitted coefficients $\{a,b,c\}$ are significantly lower than those of related work. This indicates that nanoLM is more accurate and more confident in its scaling laws, supporting its reliability.

Impact of Embedding Weights.

We present an ablation study in Figure 4(a) and 4(b), indicating that our scaling law fits worse if embedding parameters are not counted in the model size, in contrast to (Kaplan et al., 2020; Henighan et al., 2020). This is potentially because $\mu$ P concludes that the learning rate of embedding layers should not be scaled down with width, while most existing methods grid-search for a unified learning rate for all layers at each width, which makes embeddings learn more slowly on large models and lets the matrix-like parameters dominate the training dynamics. This analysis is conducted with a 12-layer GPT-2 trained on OpenWebText (Gokaslan & Cohen, 2019) for 20k steps.

For other discussions, please see Appendix A.

5 Conclusion and Future Work

We present nanoLM, a cost-efficient benchmark for Large Language Model (LLM) studies, designed to predict losses accurately across various scales without direct training. This benchmark allows for comprehensive comparisons of different model architectures and algorithms, works well with FSDP, and contains curated pre-training datasets ranging from 100B to 2T tokens. Our loss prediction method, $\mu$ Scaling, is tested through comprehensive experiments, which demonstrate its effectiveness across different scales, enabling researchers with limited resources to conduct meaningful research for large models. Moving forward, we aim to extend nanoLM to more frameworks and tasks (e.g., vision), reducing resource waste from non-transferable findings, and enhancing collaboration between academia and industry.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here.

References

Almazrouei et al. (2023) Almazrouei, E., Alobeidli, H., Alshamsi, A., Cappelli, A., Cojocaru, R., Debbah, M., Goffinet, E., Heslow, D., Launay, J., Malartic, Q., Noune, B., Pannier, B., and Penedo, G. Falcon-40B: an open large language model with state-of-the-art performance. 2023.
Biderman et al. (2023) Biderman, S., Schoelkopf, H., Anthony, Q. G., Bradley, H., O’Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., Skowron, A., Sutawika, L., and van der Wal, O. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA, volume 202 of Proceedings of Machine Learning Research, pp. 2397–2430. PMLR, 2023. URL https://proceedings.mlr.press/v202/biderman23a.html.
Brown et al. (2020) Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D. M., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A., Sutskever, I., and Amodei, D. Language models are few-shot learners. In Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual, 2020. URL https://proceedings.neurips.cc/paper/2020/hash/1457c0d6bfcb4967418bfb8ac142f64a-Abstract.html.
Dao (2023) Dao, T. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023.
Devlin et al. (2019) Devlin, J., Chang, M., Lee, K., and Toutanova, K. BERT: pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, NAACL-HLT 2019, Minneapolis, MN, USA, June 2-7, 2019, Volume 1 (Long and Short Papers), pp. 4171–4186. Association for Computational Linguistics, 2019. doi: 10.18653/v1/n19-1423. URL https://doi.org/10.18653/v1/n19-1423.
Dey et al. (2023) Dey, N., Gosal, G., Chen, Z., Khachane, H., Marshall, W., Pathria, R., Tom, M., and Hestness, J. Cerebras-gpt: Open compute-optimal language models trained on the cerebras wafer-scale cluster. CoRR, abs/2304.03208, 2023. doi: 10.48550/ARXIV.2304.03208. URL https://doi.org/10.48550/arXiv.2304.03208.
Gao et al. (2020) Gao, L., Biderman, S., Black, S., Golding, L., Hoppe, T., Foster, C., Phang, J., He, H., Thite, A., Nabeshima, N., Presser, S., and Leahy, C. The Pile: An 800gb dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020.
Gokaslan & Cohen (2019) Gokaslan, A. and Cohen, V. Openwebtext corpus. 2019.
Golikov & Yang (2022) Golikov, E. and Yang, G. Non-gaussian tensor programs. In Oh, A. H., Agarwal, A., Belgrave, D., and Cho, K. (eds.), Advances in Neural Information Processing Systems, 2022. URL https://openreview.net/forum?id=AcHUIG2wA8-.
Hendrycks et al. (2020) Hendrycks, D., Burns, C., Basart, S., Zou, A., Mazeika, M., Song, D., and Steinhardt, J. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020.
Henighan et al. (2020) Henighan, T., Kaplan, J., Katz, M., Chen, M., Hesse, C., Jackson, J., Jun, H., Brown, T. B., Dhariwal, P., Gray, S., et al. Scaling laws for autoregressive generative modeling. arXiv preprint arXiv:2010.14701, 2020.
Hestness et al. (2017) Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. CoRR, abs/1712.00409, 2017. URL http://arxiv.org/abs/1712.00409.
Hoffmann et al. (2022) Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., Rae, J. W., Vinyals, O., and Sifre, L. Training compute-optimal large language models. CoRR, abs/2203.15556, 2022. doi: 10.48550/arXiv.2203.15556. URL https://doi.org/10.48550/arXiv.2203.15556.
Kaplan et al. (2020) Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. CoRR, abs/2001.08361, 2020. URL https://arxiv.org/abs/2001.08361.
Lewkowycz et al. (2022) Lewkowycz, A., Andreassen, A., Dohan, D., Dyer, E., Michalewski, H., Ramasesh, V. V., Slone, A., Anil, C., Schlag, I., Gutman-Solo, T., Wu, Y., Neyshabur, B., Gur-Ari, G., and Misra, V. Solving quantitative reasoning problems with language models. In NeurIPS, 2022. URL http://papers.nips.cc/paper_files/paper/2022/hash/18abbeef8cfe9203fdf9053c9c4fe191-Abstract-Conference.html.
Littwin & Yang (2023) Littwin, E. and Yang, G. Adaptive optimization in the $\infty$-width limit. In The Eleventh International Conference on Learning Representations, 2023. URL https://openreview.net/forum?id=zgVDqw9ZUES.
Loshchilov & Hutter (2019) Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In 7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019. URL https://openreview.net/forum?id=Bkg6RiCqY7.
Narayanan et al. (2021) Narayanan, D., Shoeybi, M., Casper, J., LeGresley, P., Patwary, M., Korthikanti, V., Vainbrand, D., Kashinkunti, P., Bernauer, J., Catanzaro, B., Phanishayee, A., and Zaharia, M. Efficient large-scale language model training on GPU clusters using megatron-lm. In International Conference for High Performance Computing, Networking, Storage and Analysis, SC 2021, St. Louis, Missouri, USA, November 14-19, 2021, pp. 58. ACM, 2021. doi: 10.1145/3458817.3476209. URL https://doi.org/10.1145/3458817.3476209.
OpenAI (2023) OpenAI. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
Penedo et al. (2023) Penedo, G., Malartic, Q., Hesslow, D., Cojocaru, R., Cappelli, A., Alobeidli, H., Pannier, B., Almazrouei, E., and Launay, J. The refinedweb dataset for falcon LLM: outperforming curated corpora with web data, and web data only. CoRR, abs/2306.01116, 2023. doi: 10.48550/arXiv.2306.01116. URL https://doi.org/10.48550/arXiv.2306.01116.
Perrone et al. (2018) Perrone, V., Jenatton, R., Seeger, M. W., and Archambeau, C. Scalable hyperparameter transfer learning. In Advances in Neural Information Processing Systems 31: Annual Conference on Neural Information Processing Systems 2018, NeurIPS 2018, December 3-8, 2018, Montréal, Canada, pp. 6846–6856, 2018. URL https://proceedings.neurips.cc/paper/2018/hash/14c879f3f5d8ed93a09f6090d77c2cc3-Abstract.html.
Raffel et al. (2020) Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. J. Mach. Learn. Res., 21:140:1–140:67, 2020. URL http://jmlr.org/papers/v21/20-074.html.
Rajbhandari et al. (2020) Rajbhandari, S., Rasley, J., Ruwase, O., and He, Y. Zero: Memory optimizations toward training trillion parameter models. In SC20: International Conference for High Performance Computing, Networking, Storage and Analysis, pp. 1–16. IEEE, 2020.
Rosenfeld et al. (2020) Rosenfeld, J. S., Rosenfeld, A., Belinkov, Y., and Shavit, N. A constructive prediction of the generalization error across scales. In 8th International Conference on Learning Representations, ICLR 2020, Addis Ababa, Ethiopia, April 26-30, 2020. OpenReview.net, 2020. URL https://openreview.net/forum?id=ryenvpEKDr.
Salinas et al. (2020) Salinas, D., Shen, H., and Perrone, V. A quantile-based approach for hyperparameter transfer learning. In Proceedings of the 37th International Conference on Machine Learning, ICML 2020, 13-18 July 2020, Virtual Event, volume 119 of Proceedings of Machine Learning Research, pp. 8438–8448. PMLR, 2020. URL http://proceedings.mlr.press/v119/salinas20a.html.
Schaeffer et al. (2023a) Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023a.
Schaeffer et al. (2023b) Schaeffer, R., Miranda, B., and Koyejo, S. Are emergent abilities of large language models a mirage? arXiv preprint arXiv:2304.15004, 2023b.
Shoeybi et al. (2019) Shoeybi, M., Patwary, M., Puri, R., LeGresley, P., Casper, J., and Catanzaro, B. Megatron-lm: Training multi-billion parameter language models using model parallelism. CoRR, abs/1909.08053, 2019. URL http://arxiv.org/abs/1909.08053.
Smith et al. (2022) Smith, S., Patwary, M., Norick, B., LeGresley, P., Rajbhandari, S., Casper, J., Liu, Z., Prabhumoye, S., Zerveas, G., Korthikanti, V., Zheng, E., Child, R., Aminabadi, R. Y., Bernauer, J., Song, X., Shoeybi, M., He, Y., Houston, M., Tiwary, S., and Catanzaro, B. Using deepspeed and megatron to train megatron-turing NLG 530b, A large-scale generative language model. CoRR, abs/2201.11990, 2022. URL https://arxiv.org/abs/2201.11990.
Srivastava et al. (2022) Srivastava, A., Rastogi, A., Rao, A., Shoeb, A. A. M., Abid, A., Fisch, A., Brown, A. R., Santoro, A., Gupta, A., Garriga-Alonso, A., et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022.
Stoll et al. (2020) Stoll, D., Franke, J. K. H., Wagner, D., Selg, S., and Hutter, F. Hyperparameter transfer across developer adjustments. CoRR, abs/2010.13117, 2020. URL https://arxiv.org/abs/2010.13117.
Touvron et al. (2023a) Touvron, H., Lavril, T., Izacard, G., Martinet, X., Lachaux, M., Lacroix, T., Rozière, B., Goyal, N., Hambro, E., Azhar, F., Rodriguez, A., Joulin, A., Grave, E., and Lample, G. Llama: Open and efficient foundation language models. CoRR, abs/2302.13971, 2023a. doi: 10.48550/arXiv.2302.13971. URL https://doi.org/10.48550/arXiv.2302.13971.
Touvron et al. (2023b) Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., Bikel, D., Blecher, L., Canton-Ferrer, C., Chen, M., Cucurull, G., Esiobu, D., Fernandes, J., Fu, J., Fu, W., Fuller, B., Gao, C., Goswami, V., Goyal, N., Hartshorn, A., Hosseini, S., Hou, R., Inan, H., Kardas, M., Kerkez, V., Khabsa, M., Kloumann, I., Korenev, A., Koura, P. S., Lachaux, M., Lavril, T., Lee, J., Liskovich, D., Lu, Y., Mao, Y., Martinet, X., Mihaylov, T., Mishra, P., Molybog, I., Nie, Y., Poulton, A., Reizenstein, J., Rungta, R., Saladi, K., Schelten, A., Silva, R., Smith, E. M., Subramanian, R., Tan, X. E., Tang, B., Taylor, R., Williams, A., Kuan, J. X., Xu, P., Yan, Z., Zarov, I., Zhang, Y., Fan, A., Kambadur, M., Narang, S., Rodriguez, A., Stojnic, R., Edunov, S., and Scialom, T. Llama 2: Open foundation and fine-tuned chat models. CoRR, abs/2307.09288, 2023b. doi: 10.48550/arXiv.2307.09288. URL https://doi.org/10.48550/arXiv.2307.09288.
Vaswani et al. (2017) Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30: Annual Conference on Neural Information Processing Systems 2017, December 4-9, 2017, Long Beach, CA, USA, pp. 5998–6008, 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html.
Villalobos (2023) Villalobos, P. Scaling laws literature review, 2023. URL https://epochai.org/blog/scaling-laws-literature-review. Accessed: 2023-9-4.
Wang et al. (2022) Wang, T., Roberts, A., Hesslow, D., Scao, T. L., Chung, H. W., Beltagy, I., Launay, J., and Raffel, C. What language model architecture and pretraining objective works best for zero-shot generalization? In International Conference on Machine Learning, ICML 2022, 17-23 July 2022, Baltimore, Maryland, USA, volume 162 of Proceedings of Machine Learning Research, pp. 22964–22984. PMLR, 2022. URL https://proceedings.mlr.press/v162/wang22u.html.
Wei et al. (2022) Wei, J., Tay, Y., Bommasani, R., Raffel, C., Zoph, B., Borgeaud, S., Yogatama, D., Bosma, M., Zhou, D., Metzler, D., Chi, E. H., Hashimoto, T., Vinyals, O., Liang, P., Dean, J., and Fedus, W. Emergent abilities of large language models. CoRR, abs/2206.07682, 2022. doi: 10.48550/arXiv.2206.07682. URL https://doi.org/10.48550/arXiv.2206.07682.
Yang & Hu (2021) Yang, G. and Hu, E. J. Tensor programs iv: Feature learning in infinite-width neural networks. In International Conference on Machine Learning, pp. 11727–11737. PMLR, 2021.
Yang et al. (2022) Yang, G., Hu, E. J., Babuschkin, I., Sidor, S., Liu, X., Farhi, D., Ryder, N., Pachocki, J., Chen, W., and Gao, J. Tensor programs V: tuning large neural networks via zero-shot hyperparameter transfer. CoRR, abs/2203.03466, 2022. doi: 10.48550/arXiv.2203.03466. URL https://doi.org/10.48550/arXiv.2203.03466.
Zhao et al. (2023) Zhao, Y., Gu, A., Varma, R., Luo, L., Huang, C., Xu, M., Wright, L., Shojanazeri, H., Ott, M., Shleifer, S., Desmaison, A., Balioglu, C., Nguyen, B., Chauhan, G., Hao, Y., and Li, S. Pytorch FSDP: experiences on scaling fully sharded data parallel. CoRR, abs/2304.11277, 2023. doi: 10.48550/arXiv.2304.11277. URL https://doi.org/10.48550/arXiv.2304.11277.

Appendix A More analysis

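Table 4: Comparison of fitted results. “Coeff Values $\&$ Cov” denotes the coefficients $\{a, b, c\}$ and their covariances for the power-law function $y=aC^{b}+c$ . For each model, the first row lists model sizes (in billions) together with the fitted coefficients, and the second row lists the corresponding losses (error in parentheses) together with the covariances.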

Model	Size in billions $\&$ Loss(Error)									Coeff Values $\&$ Cov
Cerebras-GPT	0.111	0.256	0.59	1.30	2.700	6.700	13.00	-	-	6.76e1	-8.45e-2	7.25e1
Cerebras-GPT	2.608	2.349	2.181	1.997	1.834	1.704(0.034)	1.572(0.025)	-	-	4.84e1	2.10e-2	3.44e-1
Cerebras-GPT+ $\mu$ P	0.111	0.256	0.59	1.30	2.700	-	-	-	-	4.73e1	-7.37e-2	5.12e1
Cerebras-GPT+ $\mu$ P	2.588	2.359	2.155	1.984	1.846(0.004)	-	-	-	-	1.88e1	1.28e-2	3.05e-1
Pythia	0.070	0.160	0.410	1.000	1.400	2.800	6.900	12.00	-	9.67e6	-0.34	1.42
Pythia	2.549	2.204	1.989	1.858	1.889	1.724	1.644(0.049)	1.601(0.019)	-	3.89e7	8.89e-2	1.50e-1
nanoLM	0.077	0.153	0.254	0.381	0.532	0.709	0.911	3.432	5.24e1	0.25	-0.47	2.82
nanoLM	3.656	3.389	3.298	3.215	3.198	3.087	3.080	2.958(0.018)	2.883(0.022)	7.33e-2	8.50e-2	7.66e-2
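As a sanity check on how the nanoLM rows of Table 4 are read, the following sketch evaluates the reported power law at each listed model size. It assumes that $C$ is the model size in billions of parameters, as the table's size row suggests, and it uses the coefficients as printed (rounded to two decimals), so the resulting errors need not match the parenthesized values exactly. The function name `predict_loss` is illustrative.

```python
# Coefficients {a, b, c} and data points as printed for nanoLM in Table 4.
A, B, C0 = 0.25, -0.47, 2.82
sizes_b = [0.077, 0.153, 0.254, 0.381, 0.532, 0.709, 0.911, 3.432, 52.4]
losses = [3.656, 3.389, 3.298, 3.215, 3.198, 3.087, 3.080, 2.958, 2.883]


def predict_loss(size_b, a=A, b=B, c=C0):
    """Evaluate y = a * C**b + c, with C taken to be model size in billions (assumed)."""
    return a * size_b ** b + c


for size, observed in zip(sizes_b, losses):
    predicted = predict_loss(size)
    print(f"{size:>7.3f}B  observed={observed:.3f}  "
          f"predicted={predicted:.3f}  abs_err={abs(predicted - observed):.3f}")
```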

Scaling law fails outside the loss basins.

Both vanilla scaling laws and $\mu$ Scaling require at least one hyperparameter search. We define a loss basin as an area in the HP space with locally minimal loss, where $\partial L/\partial H\approx 0$ . Does the scaling law still hold if we randomly pick an $H$ outside the loss basin and $\mu$ Transfer it to all widths? According to our results in Figure 4(a), although “the wider, the better” roughly holds, as guaranteed by the $\mu$ P theory, the numerical scaling law does not trivially generalize outside the basin (blue and green lines). This observation persists when models are trained more sufficiently, for 20k steps. Thus, we suggest searching for the best HPs first in any case.
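On a discrete grid, the condition $\partial L/\partial H\approx 0$ can be approximated by looking for grid points whose loss is nearly as low as the minimum. The sketch below applies this idea along the learning-rate axis only, using the 64-layer GPT grid search reported in Table 7. The tolerance `tol` and the function name `loss_basin` are hypothetical choices for illustration; the text does not specify a numerical threshold for the basin.

```python
# Learning rates and losses of the 64-layer GPT grid search (Table 7).
lrs = [1e-4, 5e-4, 7e-4, 1e-3, 3e-3, 5e-3, 7e-3, 1e-2]
losses = [4.35, 3.73, 3.69, 3.64, 8.37, 13.3, 9.66, 8.12]


def loss_basin(lrs, losses, tol=0.1):
    """Return (best_lr, basin), where basin holds grid points within `tol` of the minimal loss.

    `tol` is an assumed flatness threshold, standing in for dL/dH ~ 0 on a coarse grid.
    """
    best_loss = min(losses)
    best_lr = lrs[losses.index(best_loss)]
    basin = [lr for lr, loss in zip(lrs, losses) if loss - best_loss <= tol]
    return best_lr, basin


best_lr, basin = loss_basin(lrs, losses)
print("best lr:", best_lr)
print("approximate basin:", basin)
```

A learning rate outside the returned list, such as 3e-3 here, would be an example of an $H$ picked outside the basin in the sense discussed above.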

Take the average for extremely small proxy models.

We find that for extremely small proxy models (e.g., GPTs with 256 to 512 width and 6 layers), the fitting results are more vulnerable to slight misalignment of the loss landscapes in $\mu$ Transfer. This issue can be addressed by sampling more HP points for each small width and using the average loss for fitting. Specifically, when we grid-searched for the best HPs for 6-layer models with batch size 32 and width 256, we found $(5e-4,0.02,3.5)$ to be inside the loss basin. As shown in Figure 5(a) (red line), nanoLM works perfectly for this single point. We then explored several other points around it and found that the scaling laws deviate more than for 12-layer models (Figure 5(a), other lines). We offset this deviation by fitting scaling laws to the average results across all these HPs near the loss basin. This works perfectly, as shown in Figure 5(b), and is practical when applying nanoLM, because we observe in Figure 5(a) that larger widths (e.g., 2048, 3072) have significantly lower variance in loss w.r.t. different HPs and do not need multiple runs for averaging.
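The averaging step can be sketched as follows. The loss values below are hypothetical placeholders, not results from the paper; they only mimic the qualitative pattern described above, where small widths vary more across HP points than large widths. The helper names are likewise illustrative.

```python
from statistics import mean, pstdev

# Hypothetical losses (NOT from the paper): width -> losses at several HP points near the basin.
runs = {
    256: [4.30, 4.20, 4.25, 4.37],
    512: [4.05, 4.00, 4.10],
    2048: [3.62],  # a single run, since large widths vary little across HPs
}


def average_losses(runs):
    """Average the loss over HP points for each width; the averages are then used for fitting."""
    return {width: mean(values) for width, values in sorted(runs.items())}


def loss_spread(runs):
    """Population standard deviation per width, as a rough indicator of HP sensitivity."""
    return {width: pstdev(values) for width, values in sorted(runs.items())}


averaged = average_losses(runs)
spread = loss_spread(runs)
for width in averaged:
    print(f"width={width}: mean loss={averaged[width]:.3f}, spread={spread[width]:.3f}")
```

The per-width averages then replace the single-run losses as the $y$ values when fitting the power law.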

General conditions for scaling laws.

Previous scaling laws directly search for HPs at each scale, and the resulting optimal HPs do not satisfy the $\mu$ P function. It is widely known that these scaling laws still succeed over a wide range of model sizes. This indicates that $\mu$ P is a sufficient but not necessary condition for scaling laws, and that the scaling law itself may represent a higher level of universality.

Emergent abilities and inverse scaling law.

We believe that emergent abilities (Wei et al., 2022) are closely related to the choice of evaluation metrics (Schaeffer et al., 2023a). Nevertheless, the pre-training loss and evaluation perplexity of a language model can still be accurately predicted and serve as effective indicators of model performance.

Appendix B Pre-training Data Mix Ratio

Please see Table 5.

Table 5: Pre-training data ratio.

Dataset	Sampling prop(%)	Total tokens(B)
ArXiv	6.04	28.31
Books	5.22	24.46
Falcon RefinedWeb	20.81	97.49
Falcon RefinedWeb(wiki-like)	49.78	233.21
OpenWebText2	3.11	14.59
StackExchange	3.81	17.84
Github	10.18	47.70
Wikipedia	1.03	4.82
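The numbers in Table 5 suggest that each source is sampled in proportion to its token count; this is an observation from the table rather than a statement in the text. The sketch below checks that reading and shows one way such proportions could drive source selection. The helper `sample_sources` and the fixed seed are illustrative assumptions, not the benchmark's actual data loader.

```python
import random

# Total tokens (in billions) and sampling proportions (%) as listed in Table 5.
tokens_b = {
    "ArXiv": 28.31, "Books": 24.46, "Falcon RefinedWeb": 97.49,
    "Falcon RefinedWeb (wiki-like)": 233.21, "OpenWebText2": 14.59,
    "StackExchange": 17.84, "Github": 47.70, "Wikipedia": 4.82,
}
reported_pct = {
    "ArXiv": 6.04, "Books": 5.22, "Falcon RefinedWeb": 20.81,
    "Falcon RefinedWeb (wiki-like)": 49.78, "OpenWebText2": 3.11,
    "StackExchange": 3.81, "Github": 10.18, "Wikipedia": 1.03,
}

total_tokens = sum(tokens_b.values())
# Proportion implied by token counts alone, for comparison with the reported column.
implied_pct = {name: 100 * count / total_tokens for name, count in tokens_b.items()}
max_gap = max(abs(implied_pct[name] - reported_pct[name]) for name in tokens_b)
print(f"total tokens: {total_tokens:.2f}B, largest gap: {max_gap:.3f} percentage points")


def sample_sources(k, seed=0):
    """Draw k source names with probability proportional to the reported percentages."""
    rng = random.Random(seed)
    names = list(reported_pct)
    return rng.choices(names, weights=[reported_pct[n] for n in names], k=k)


print(sample_sources(5))
```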

Appendix C The Hyperparameter settings for all experiments

The specific parameters of the experiments are as follows.

(1) The parameters of the model are: vocab_size = 50304; block_size = 1024; n_layer = [12, 32, 64]; head_size = 64; dropout = 0.0; output_mult = 1.0; zero_query = True; zero_emb = True; hp_tune_actual_width = [128, 256, 384, 512, 640, 768, 896, 1024, 2048, 4096, 8192].

(2) The parameters of the data are: input_length = 512; mlm_probability = 0.15; mean_noise_span_length = 3.0; num_workers = 2.

(3) The parameters of the optimizer are: name = adamwscale; batch_size = [16, 512]; total_steps = [7000, 10000]; warmup_steps = 5000; lr_scheduler = cosine; weight_decay = 0.0; grad_clip = 1.0; grad_acc = 1; final_cosine = 1e-5; base_lr = [5e-4, 1e-3, 5e-3, 1e-2, 3e-2, 5e-2, 7e-2, 1e-1].
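For readers who want these settings in machine-readable form, they can be transcribed as a nested dictionary. The grouping into `model`, `data`, and `optimizer` mirrors the three groups above; the dictionary layout itself is an illustrative choice and not the benchmark's actual configuration format. List values denote the set of options used across experiments.

```python
# Settings transcribed from Appendix C; the nested-dict layout is an assumption for illustration.
config = {
    "model": {
        "vocab_size": 50304,
        "block_size": 1024,
        "n_layer": [12, 32, 64],
        "head_size": 64,
        "dropout": 0.0,
        "output_mult": 1.0,
        "zero_query": True,
        "zero_emb": True,
        "hp_tune_actual_width": [128, 256, 384, 512, 640, 768, 896, 1024, 2048, 4096, 8192],
    },
    "data": {
        "input_length": 512,
        "mlm_probability": 0.15,
        "mean_noise_span_length": 3.0,
        "num_workers": 2,
    },
    "optimizer": {
        "name": "adamwscale",
        "batch_size": [16, 512],
        "total_steps": [7000, 10000],
        "warmup_steps": 5000,
        "lr_scheduler": "cosine",
        "weight_decay": 0.0,
        "grad_clip": 1.0,
        "grad_acc": 1,
        "final_cosine": 1e-5,
        "base_lr": [5e-4, 1e-3, 5e-3, 1e-2, 3e-2, 5e-2, 7e-2, 1e-1],
    },
}

# Example: number of attention heads implied by head_size at each width.
heads = {w: w // config["model"]["head_size"] for w in config["model"]["hp_tune_actual_width"]}
print(heads)
```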

Appendix D Grid search results with 256 base width

Table 6: grid search on base width = 256. The specific parameters of the experiment are: n_layer = 12, batch_size = 16, hp_tune_actual_width = 256, total_steps = 7000, initialization std = 0.05, output multiplier = 5.0, base_lr = [5e-4, 1e-3, 5e-3, 1e-2, 3e-2, 5e-2, 7e-2, 1e-1].

lr	5e-4	1e-3	5e-3	1e-2	3e-2	5e-2	7e-2	1e-1
12-layer BERT loss	7.37	7.27	5.01	4.39	3.9	4.17	5.24	6.97
12-layer GPT loss	7.3	7.03	5.97	5.57	3.74	5.86	7.22	7.25
12-layer T5 loss	6.85	6.33	5.37	5.13	4.71	5.14	5.28	6.45

Table 7: grid search on base width = 256. The specific parameters of the experiment are: n_layer = 64, batch_size = 512, hp_tune_actual_width = 256, total_steps = 10000, initialization std = 0.01, output multiplier = 1.0, base_lr = [1e-4, 5e-4, 7e-4, 1e-3, 3e-3, 5e-3, 7e-3, 1e-2].

lr	1e-4	5e-4	7e-4	1e-3	3e-3	5e-3	7e-3	1e-2
64-layer GPT loss	4.35	3.73	3.69	3.64	8.37	13.3	9.66	8.12

Appendix E Specific Loss Values

E.1 12-layer Models on C4

Table 8: training loss on 12-layer@7k steps. The specific parameters of the experiment are: n_layer = 12, batch_size = [16, 512], hp_tune_actual_width = [128, 256, 384, 512, 640, 768, 896, 1024], base_lr = [1e-3, 3e-2].

width	128	256	384	512	640	768	896	1024
BERT w/o $\mu$ P	4.25	3.71	3.59	3.52	3.47	3.42	3.37	3.40
BERT with $\mu$ P	4.45	3.74	3.63	3.56	3.49	3.47	3.44	3.43
GPT w/o $\mu$ P	4.73	4.48	4.50	4.42	4.36	4.33	4.29	4.31
GPT with $\mu$ P	4.52	4.25	4.16	4.10	4.04	4.01	4.00	3.98
T5 w/o $\mu$ P	5.14	5.06	4.82	4.81	4.66	6.50	5.71	6.03
T5 with $\mu$ P	5.18	4.71	4.57	4.61	4.60	4.60	4.52	4.49

E.2 12-layer Models on MC4

Table 9: training loss on 12-layer@20k steps. The specific parameters of the experiment are: n_layer = 12, batch_size = [16, 512], hp_tune_actual_width = [128, 256, 384, 512, 640, 768, 896, 1024], base_lr = [1e-3, 5e-2].

width	128	256	384	512	640	768	896	1024
BERT	2.79	1.36	1.24	1.17	1.14	1.10	1.08	1.06
T5	2.15	2.05	1.76	1.62	1.54	1.51	1.45	1.41
GPT	1.50	1.38	1.34	1.32	1.32	1.31	1.30	1.29
Llama	1.58	1.48	1.46	1.44	1.40	1.40	1.39	1.38

E.3 32-layer GPT on nanoLM Benchmark Data

Table 10: training loss on 32-layer@7k steps. The specific parameters of the experiment are: n_layer = 32, batch_size = 512, hp_tune_actual_width = [256, 384, 512, 640, 768, 896, 1024, 2048, 4096, 8192], total_steps = 7000, base_lr = 5e-2.

width	256	384	512	640	768	896	1024	2048	8192
GPT with $\mu$ P	3.92	3.76	3.65	3.59	3.54	3.49	3.47	3.45	3.41

E.4 64-layer GPT on nanoLM Benchmark Data

Table 11: training loss on 64-layer@10k steps. The specific parameters of the experiment are: n_layer = 64, batch_size = 512, hp_tune_actual_width = [ 384, 512, 640, 768, 896, 1024, 2048, 8192], total_steps = 10000, base_lr = 1e-3.

width	256	384	512	640	768	896	1024	2048	8192
GPT with $\mu$ P	3.656	3.389	3.298	3.215	3.198	3.087	3.080	2.958	2.883
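The following sketch shows one simple way to fit $y=aC^{b}+c$ to the 64-layer losses above and extrapolate to the two largest models. It is not the authors' fitting procedure: it scans the exponent $b$ over a grid and, for each $b$, solves for $a$ and $c$ by ordinary least squares, since the model is linear in $x^{b}$ once $b$ is fixed. The model sizes in billions are taken from the nanoLM row of Table 4, and using the first seven points for fitting is an assumption based on where that table reports errors. The function name `fit_power_law` and the grid range are illustrative.

```python
def fit_power_law(x, y, b_grid=None):
    """Fit y = a * x**b + c by scanning b and solving (a, c) in closed form for each b."""
    if b_grid is None:
        b_grid = [-i / 1000 for i in range(1, 2001)]  # b from -0.001 down to -2.0 (assumed range)
    n = len(x)
    y_mean = sum(y) / n
    best = None
    for b in b_grid:
        u = [xi ** b for xi in x]  # with b fixed, the model is linear in u
        u_mean = sum(u) / n
        var_u = sum((ui - u_mean) ** 2 for ui in u)
        if var_u == 0:
            continue
        a = sum((ui - u_mean) * (yi - y_mean) for ui, yi in zip(u, y)) / var_u
        c = y_mean - a * u_mean
        sse = sum((a * ui + c - yi) ** 2 for ui, yi in zip(u, y))
        if best is None or sse < best[0]:
            best = (sse, a, b, c)
    _, a, b, c = best
    return a, b, c


# Model sizes in billions (Table 4, nanoLM row) and losses (Table 11).
sizes_b = [0.077, 0.153, 0.254, 0.381, 0.532, 0.709, 0.911, 3.432, 52.4]
losses = [3.656, 3.389, 3.298, 3.215, 3.198, 3.087, 3.080, 2.958, 2.883]

# Fit on the seven smallest models, then extrapolate to the two largest.
a_fit, b_fit, c_fit = fit_power_law(sizes_b[:7], losses[:7])
print(f"a={a_fit:.3f}, b={b_fit:.3f}, c={c_fit:.3f}")
for size, observed in zip(sizes_b[7:], losses[7:]):
    predicted = a_fit * size ** b_fit + c_fit
    print(f"{size}B: observed={observed:.3f}, predicted={predicted:.3f}")
```

Because the grid scan and the unweighted least-squares objective differ from whatever optimizer produced Table 4, the fitted coefficients and extrapolated losses from this sketch may differ from the reported ones.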