A summary of large language models (LLMs) at a large scale (roughly 10B parameters and beyond).
Summary
As shown in the following table, I summarize the most popular large-scale LLMs. It is clear that LLMs have grown larger and larger in recent years, ranging from 2.6 billion to hundreds of billions of parameters. Although the training methods differ among these models, they all adopt the Transformer as the standard backbone because of its efficient parallel computation during training. Since training large-scale models requires massive unsupervised text corpora, research on scaling up PTMs has focused on high-resource languages such as English and Chinese.
Model | #Params | #Training Tokens | Masked LM | Causal LM | Prefix LM | Seq2Seq LM | Pre-Training Data |
---|---|---|---|---|---|---|---|
T5 | 11B | - | ✘ | ✘ | ✘ | ✔ | C4 Corpus (~750GB) |
mT5 | 13B | - | ✘ | ✘ | ✘ | ✔ | mC4 Corpus (6.3T tokens) |
Switch Transformers | 1571B | - | ✘ | ✘ | ✘ | ✔ | C4 Corpus (~750GB) |
CPM-2 | 11B | - | ✘ | ✘ | ✘ | ✔ | WuDao Corpus (2.3TB Chinese + 300GB English) |
CPM-2-MoE | 198B | - | ✘ | ✘ | ✘ | ✔ | WuDao Corpus (2.3TB Chinese + 300GB English) |
Turing-NLG | 17B | - | ✘ | ✔ | ✘ | ✘ | English data |
GPT-3 | 175B | 300B | ✘ | ✔ | ✘ | ✘ | cleaned CommonCrawl, WebText |
CPM | 2.6B | - | ✘ | ✔ | ✘ | ✘ | Chinese corpus (100GB) |
HyperCLOVA | 204B | - | ✘ | ✔ | ✘ | ✘ | Korean data |
PanGu-$\alpha$ | 200B | - | ✘ | ✔ | ✘ | ✘ | Chinese data (1.1TB, 250B tokens) |
DeBERTa1.5B | 1.5B | - | ✔ | ✘ | ✘ | ✘ | English corpus |
ERNIE 3.0 | 10B | - | ✔ | ✔ | ✘ | ✘ | Chinese data (4TB); English |
Yuan 1.0 | 245B | - | ✘ | ✔ | ✔ | ✘ | Chinese data (5TB) |
Megatron-Turing NLG | 530B | 270B | ✘ | ✔ | ✘ | ✘ | The Pile, CommonCrawl, RealNews, CC-Stories |
OPT | 175B | 300B | ✘ | ✔ | ✘ | ✘ | BookCorpus, Stories, CCNews (RoBERTa corpus), The Pile, PushShift.io Reddit |
Gopher | 280B | 300B | ✘ | ✔ | ✘ | ✘ | MassiveText (10.5TB) including webpages, books, news articles, and code |
Jurassic-1 | 178B | 300B | ✘ | ✔ | ✘ | ✘ | GPT-3 data |
Chinchilla | 70B | 1.4T | ✘ | ✔ | ✘ | ✘ | Same as Gopher. |
Sparrow | 70B | - | ✘ | ✔ | ✘ | ✘ | - |
LaMDA | 137B | 168B | ✘ | ✔ | ✘ | ✘ | 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words |
PaLM | 540B | 780B | ✘ | ✔ | ✘ | ✘ | 780B tokens; dataset derived from the LaMDA and GLaM data, plus code |
BLOOM | 176B | 366B | ✘ | ✔ | ✘ | ✘ | ROOTS corpus (498 Hugging Face datasets); 46 natural languages, 13 programming languages |
GLM-130B | 130B | 400B | ✘ | ✔ | ✘ | ✘ | English: 1.2TB The Pile; Chinese: 1TB WuDao corpora plus 250GB crawled from online forums, encyclopedias, and QA sites |
ChatGLM-6B | 6B | 1T | ✘ | ✔ | ✘ | ✘ | Chinese-English bilingual data. |
LLaMA | 65B | 1.4T | ✘ | ✔ | ✘ | ✘ | 1.4T tokens: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange |
Alpaca | 7B | - | ✘ | ✔ | ✘ | ✘ | 52K instruction-following examples |
Vicuna | 13B | - | ✘ | ✔ | ✘ | ✘ | Fine-tuned on ~70K user-shared ChatGPT conversations |
ChatRWKV (100% RNN) | 14B | - | ✘ | ✘ | ✘ | ✘ | - |
Galactica | 120B | 450B | ✘ | ✔ | ✘ | ✘ | 106 billion tokens from papers, reference material, encyclopedias and other scientific sources |
Codex | 12B | 100B | ✘ | ✔ | ✘ | ✘ | 159GB Python code. |
AlphaCode | 41B/9B | 967B/1250B | ✘ | ✔ | ✘ | ✘ | 715GB code from GitHub |
Flamingo | 80B | - | ✘ | ✔ | ✘ | ✘ | M3W (43M webpages); image/video-text pairs: ALIGN, LTIP, VTP |
BEiT-3 | 1.9B | - | ✔ | ✘ | ✘ | ✘ | 21M image-text pairs, 14M images, 160GB documents |
Kosmos-1 | 1.6B | 360B | ✘ | ✔ | ✘ | ✘ | (1) Text: The Pile and Common Crawl (excluding GitHub, arXiv, Stack Exchange, PubMed Central) plus CC-Stories and RealNews; (2) image-caption pairs: LAION-2B, LAION-400M, COYO-700M, Conceptual Captions; (3) interleaved image-text data from Common Crawl |
GPT-4 | - | - | ✘ | ✔ | ✘ | ✘ | Publicly available data and data licensed from third-party providers |
Llama 1 | 65B | 1.4T | ✘ | ✔ | ✘ | ✘ | English CC (67%), C4 (15%), GitHub (4.5%), Wikipedia (4.5%), Gutenberg and Books3 (4.5%), ArXiv (2.5%), Stack Exchange (2%) |
Llama 2 | 70B | 2T | ✘ | ✔ | ✘ | ✘ | A new mix of data from publicly available sources |
Llama 3 | 405B | 15.6T | ✘ | ✔ | ✘ | ✘ | - |
Gemma | 7B | 6T | ✘ | ✔ | ✘ | ✘ | Primarily English data from web documents, mathematics, and code |
Gemma 2 | 27B | 13T | ✘ | ✔ | ✘ | ✘ | Web documents, code, and science articles. |
Gemini | - | - | ✘ | ✔ | ✘ | ✘ | - |
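
To relate the #Params and #Training Tokens columns, here is a minimal back-of-the-envelope sketch in Python (not part of the original summary) that compares a few rows above against the roughly 20-tokens-per-parameter compute-optimal heuristic from the Chinchilla work; the 20:1 ratio is only an approximate rule of thumb, not an exact reproduction of the scaling-law fit.

```python
# Back-of-the-envelope sketch: token-to-parameter ratios for a few table rows,
# compared against the ~20 tokens/parameter Chinchilla heuristic.

MODELS = {
    # model name: (parameters, training tokens), copied from the summary table
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-65B":  (65e9, 1.4e12),
    "Llama 3":    (405e9, 15.6e12),
}

CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate rule of thumb

for name, (params, tokens) in MODELS.items():
    ratio = tokens / params
    optimal = CHINCHILLA_TOKENS_PER_PARAM * params
    print(f"{name:>12}: {ratio:5.1f} tokens/param; "
          f"~{optimal / 1e12:.1f}T tokens would be roughly compute-optimal")
```

Under this heuristic, earlier models such as GPT-3 and Gopher were trained on far fewer tokens than compute-optimal, whereas Chinchilla, LLaMA, and later Llama releases moved toward (or well past) the 20:1 ratio, which matches the trend visible in the table.
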
According to the design of their pre-training architectures, large-scale LLMs can be broadly classified into three classes: encoder-only, decoder-only, and encoder-decoder. The majority of large LLMs adopt the decoder-only or encoder-decoder architecture, whereas few adopt the encoder-only design. This is because encoder-only architectures, such as BERT and DeBERTa, stack Transformer encoders to attend to bidirectional context in language, and this bidirectional nature prevents them from being applied directly to NLG tasks. In contrast, decoder-only models are naturally suited to NLG and can perform NLU tasks via prompt-based methods; examples include the GPT series and Turing-NLG.
- Encoder-only, i.e., pre-training stacked Transformer encoders. Example: DeBERTa 1.5B [10].
- Decoder-only. This line of large PTMs pre-trains Transformer decoders with auto-regressive (causal) masks that prevent the current token from attending to future tokens (see the mask sketch after this list). Examples: Turing-NLG [5], GPT-3 [6], CPM [7], HyperCLOVA [8], PanGu-$\alpha$ [9], Yuan 1.0 [12], Megatron-Turing NLG [13].
- Encoder-decoder, including (1) conventional sequence-to-sequence encoder-decoders, such as T5 [1], mT5 [2], and CPM-2 [3]; and (2) unified encoder-decoder architectures, such as ERNIE 3.0 [11].
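
To make the auto-regressive masking above (and the Masked LM / Causal LM / Prefix LM columns of the table) concrete, the following minimal NumPy sketch builds the three attention-mask patterns; it only illustrates the mask shapes and is not taken from any particular model's implementation.

```python
import numpy as np

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Encoder-only / masked-LM setting: every token attends to every token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-only / causal-LM setting: token i attends only to tokens 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Prefix-LM setting: full bidirectional attention over the prefix,
    causal attention over the remaining (generated) positions."""
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True  # every position may attend to the whole prefix
    return mask

if __name__ == "__main__":
    n, p = 6, 3
    for name, mask in [("bidirectional (encoder-only)", bidirectional_mask(n)),
                       ("causal (decoder-only)", causal_mask(n)),
                       (f"prefix LM (prefix_len={p})", prefix_lm_mask(n, p))]:
        print(f"{name}:\n{mask.astype(int)}\n")
```

In the printed causal mask, row i has ones only up to column i, which is exactly the constraint that prevents a token from attending to future positions; the prefix-LM mask relaxes this within the prefix, which is how models with the Prefix LM objective (e.g., Yuan 1.0 in the table) attend bidirectionally over the input while generating the continuation causally.
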
For attribution in academic contexts, please cite this work as:

@misc{chai2021scaling-ptms-summary,
  author = {Chai, Yekun},
  title = {{Scaling Up Large Language Models: A Summary}},
  year = {2021},
  howpublished = {\url{https://cyk1337.github.io/notes/2021/10/09/PTMs/Scaling-Up-LLMs/}},
}
References
- 1. (T5) Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." JMLR (2020).
- 2. (mT5) Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
- 3. Zhang, Zhengyan, et al. "CPM-2: Large-scale cost-effective pre-trained language models." arXiv preprint arXiv:2106.10715 (2021).
- 4. Fedus, William, Barret Zoph, and Noam Shazeer. "Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity." arXiv preprint arXiv:2101.03961 (2021).
- 5. Turing-NLG: A 17-billion-parameter language model by Microsoft. February 13, 2020.
- 6. Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
- 7. Zhang, Zhengyan, et al. "CPM: A large-scale generative Chinese pre-trained language model." AI Open 2 (2021): 93-99.
- 8. Kim, Boseop, et al. "What changes can large-scale language models bring? Intensive study on HyperCLOVA: Billions-scale Korean generative pretrained transformers." arXiv preprint arXiv:2109.04650 (2021).
- 9. Zeng, Wei, et al. "PanGu-$\alpha$: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation." arXiv preprint arXiv:2104.12369 (2021).
- 10. He, Pengcheng, et al. "DeBERTa: Decoding-enhanced BERT with disentangled attention." arXiv preprint arXiv:2006.03654 (2020).
- 11. Sun, Yu, et al. "ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation." arXiv preprint arXiv:2107.02137 (2021).
- 12. Wu, Shaohua, et al. "Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning." arXiv preprint arXiv:2110.04725 (2021).
- 13. Paresh Kharya and Ali Alvi. "Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, the world's largest and most powerful generative language model." October 11, 2021.