A summary of large language models (LLMs) at a large scale (roughly 10B parameters and beyond).
Summary
As shown in the following table, I summarize the most popular large-scale LLMs. It is clear that LLMs have grown larger and larger in recent years, ranging from 2.6 billion to hundreds of billions of parameters. Although the training methods differ among these models, they all adopt the Transformer as the standard backbone because of its efficient parallel computation during training. Since training large-scale models requires massive unsupervised text corpora, research on scaling up PTMs has focused on high-resource languages such as English and Chinese.
Model | #Params | #Training Tokens | Masked LM | Causal LM | Prefix LM | Seq2Seq LM | Pre-Training Data |
---|---|---|---|---|---|---|---|
T5 | 11B | - | ✘ | ✘ | ✘ | ✔ | C4 Corpus (~750GB) |
mT5 | 13B | - | ✘ | ✘ | ✘ | ✔ | mC4 Corpus (6.3T tokens) |
Switch Transformers | 1571B | - | ✘ | ✘ | ✘ | ✔ | C4 Corpus (~750GB) |
CPM-2 | 11B | - | ✘ | ✘ | ✘ | ✔ | WuDao Corpus (2.3TB Chinese + 300GB English) |
CPM-2-MoE | 198B | - | ✘ | ✘ | ✘ | ✔ | WuDao Corpus (2.3TB Chinese + 300GB English) |
Turing-NLG | 17B | - | ✘ | ✔ | ✘ | ✘ | English data |
GPT-3 | 175B | 300B | ✘ | ✔ | ✘ | ✘ | cleaned CommonCrawl, WebText |
CPM | 2.6B | - | ✘ | ✔ | ✘ | ✘ | Chinese corpus (100GB) |
HyperCLOVA | 204B | - | ✘ | ✔ | ✘ | ✘ | Korean data |
PanGu-$\alpha$ | 200B | - | ✘ | ✔ | ✘ | ✘ | Chinese data (1.1TB, 250B tokens) |
DeBERTa1.5B | 1.5B | - | ✔ | ✘ | ✘ | ✘ | English corpus |
ERNIE 3.0 | 10B | - | ✔ | ✔ | ✘ | ✘ | Chinese data (4TB); English |
Yuan 1.0 | 245B | - | ✘ | ✔ | ✔ | ✘ | Chinese data (5TB) |
Megatron-Turing NLG | 530B | 270B | ✘ | ✔ | ✘ | ✘ | The Pile, CommonCrawl, RealNews, CC-Stories |
OPT | 175B | 300B | ✘ | ✔ | ✘ | ✘ | BookCorpus, Stories, CCNews (RoBERTa corpus), The Pile, PushShift.io Reddit |
Gopher | 280B | 300B | ✘ | ✔ | ✘ | ✘ | MassiveText (10.5TB) including webpages, books, news articles, and code |
Jurassic-1 | 178B | 300B | ✘ | ✔ | ✘ | ✘ | GPT-3 data |
Chinchilla | 70B | 1.4T | ✘ | ✔ | ✘ | ✘ | Same as Gopher. |
Sparrow | 70B | - | ✘ | ✔ | ✘ | ✘ | - |
LaMDA | 137B | 168B | ✘ | ✔ | ✘ | ✘ | 2.97B documents, 1.12B dialogs, and 13.39B dialog utterances, for a total of 1.56T words |
PaLM | 540B | 780B | ✘ | ✔ | ✘ | ✘ | 780B tokens; dataset derived from the LaMDA and GLaM data, plus code |
BLOOM | 176B | 366B | ✘ | ✔ | ✘ | ✘ | ROOTS corpus (498 Hugging Face datasets); 46 natural languages, 13 programming languages |
GLM-130B | 130B | 400B | ✘ | ✔ | ✘ | ✘ | English: 1.2TB The Pile; Chinese: 1TB WuDao corpora plus 250GB crawled from online forums, encyclopedias, and QA sites |
ChatGLM-6B | 6B | 1T | ✘ | ✔ | ✘ | ✘ | Chinese-English bilingual data. |
LLaMA | 65B | 1.4T | ✘ | ✔ | ✘ | ✘ | 1.4T tokens: CommonCrawl, C4, GitHub, Wikipedia, Books, ArXiv, StackExchange |
Alpaca | 7B | - | ✘ | ✔ | ✘ | ✘ | 52K instruction-following examples |
Vicuna | 13B | - | ✘ | ✔ | ✘ | ✘ | Fine-tuned on ~70K user-shared ChatGPT conversations |
ChatRWKV (100% RNN) | 14B | - | ✘ | ✘ | ✘ | ✘ | - |
Galactica | 120B | 450B | ✘ | ✔ | ✘ | ✘ | 106 billion tokens from papers, reference material, encyclopedias and other scientific sources |
Codex | 12B | 100B | ✘ | ✔ | ✘ | ✘ | 159GB Python code. |
AlphaCode | 41B/9B | 967B/1250B | ✘ | ✔ | ✘ | ✘ | 715GB code from GitHub |
Flamingo | 80B | - | ✘ | ✔ | ✘ | ✘ | M3W (43M webpages); image/video-text pairs: ALIGN, LTIP, VTP |
BEiT-3 | 1.9B | - | ✔ | ✘ | ✘ | ✘ | 21M image-text pairs, 14M images, 160GB documents |
Kosmos-1 | 1.6B | 360B | ✘ | ✔ | ✘ | ✘ | (1) Text: The Pile and Common Crawl (excluding GitHub, arXiv, Stack Exchange, PubMed Central) plus CC-Stories and RealNews; (2) image-caption pairs: LAION-2B, LAION-400M, COYO-700M, Conceptual Captions; (3) interleaved image-text data from Common Crawl |
GPT-4 | - | - | ✘ | ✔ | ✘ | ✘ | Publicly available data and data licensed from third-party providers |
Llama 1 | 65B | 1.4T | ✘ | ✔ | ✘ | ✘ | English CC (67%), C4 (15%), GitHub (4.5%), Wikipedia (4.5%), Gutenberg and Books3 (4.5%), ArXiv (2.5%), Stack Exchange (2%) |
Llama 2 | 70B | 2T | ✘ | ✔ | ✘ | ✘ | A new mix of data from publicly available sources |
Llama 3 | 405B | 15.6T | ✘ | ✔ | ✘ | ✘ | - |
Gemma | 7B | 6T | ✘ | ✔ | ✘ | ✘ | Primarily English data from web documents, mathematics, and code |
Gemma 2 | 27B | 13T | ✘ | ✔ | ✘ | ✘ | Web documents, code, and science articles. |
Gemini | - | - | ✘ | ✔ | ✘ | ✘ | - |
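
To relate the #Params and #Training Tokens columns, here is a minimal back-of-the-envelope sketch in Python (not part of the original summary) that compares a few rows above against the roughly 20-tokens-per-parameter compute-optimal heuristic from the Chinchilla work; the 20:1 ratio is only an approximate rule of thumb, not an exact reproduction of the scaling-law fit.

```python
# Back-of-the-envelope sketch: token-to-parameter ratios for a few table rows,
# compared against the ~20 tokens/parameter Chinchilla heuristic.

MODELS = {
    # model name: (parameters, training tokens), copied from the summary table
    "GPT-3":      (175e9, 300e9),
    "Gopher":     (280e9, 300e9),
    "Chinchilla": (70e9, 1.4e12),
    "LLaMA-65B":  (65e9, 1.4e12),
    "Llama 3":    (405e9, 15.6e12),
}

CHINCHILLA_TOKENS_PER_PARAM = 20  # approximate rule of thumb

for name, (params, tokens) in MODELS.items():
    ratio = tokens / params
    optimal = CHINCHILLA_TOKENS_PER_PARAM * params
    print(f"{name:>12}: {ratio:5.1f} tokens/param; "
          f"~{optimal / 1e12:.1f}T tokens would be roughly compute-optimal")
```

Under this heuristic, earlier models such as GPT-3 and Gopher were trained on far fewer tokens than compute-optimal, whereas Chinchilla, LLaMA, and later Llama releases moved toward (or well past) the 20:1 ratio, which matches the trend visible in the table.
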
According to the design of their pre-training architectures, large-scale LLMs can be broadly classified into three classes: encoder-only, decoder-only, and encoder-decoder. The majority of large LLMs adopt the decoder-only or encoder-decoder architecture, whereas few adopt the encoder-only design. This is because encoder-only architectures, such as BERT and DeBERTa, stack Transformer encoders to attend to bidirectional context in language, and this bidirectional nature prevents them from being applied directly to NLG tasks. In contrast, decoder-only models are naturally suited to NLG and can perform NLU tasks via prompt-based methods; examples include the GPT series and Turing-NLG.
- Encoder-only, i.e., pre-training stacked Transformer encoders. Example: DeBERTa 1.5B [10].
- Decoder-only. This line of large PTMs pre-trains Transformer decoders with auto-regressive (causal) masks that prevent the current token from attending to future tokens (see the mask sketch after this list). Examples: Turing-NLG [5], GPT-3 [6], CPM [7], HyperCLOVA [8], PanGu-$\alpha$ [9], Yuan 1.0 [12], Megatron-Turing NLG [13].
- Encoder-decoder, including (1) conventional sequence-to-sequence encoder-decoders, such as T5 [1], mT5 [2], and CPM-2 [3]; and (2) unified encoder-decoder architectures, such as ERNIE 3.0 [11].
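
To make the auto-regressive masking above (and the Masked LM / Causal LM / Prefix LM columns of the table) concrete, the following minimal NumPy sketch builds the three attention-mask patterns; it only illustrates the mask shapes and is not taken from any particular model's implementation.

```python
import numpy as np

def bidirectional_mask(seq_len: int) -> np.ndarray:
    """Encoder-only / masked-LM setting: every token attends to every token."""
    return np.ones((seq_len, seq_len), dtype=bool)

def causal_mask(seq_len: int) -> np.ndarray:
    """Decoder-only / causal-LM setting: token i attends only to tokens 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(seq_len: int, prefix_len: int) -> np.ndarray:
    """Prefix-LM setting: full bidirectional attention over the prefix,
    causal attention over the remaining (generated) positions."""
    mask = causal_mask(seq_len)
    mask[:, :prefix_len] = True  # every position may attend to the whole prefix
    return mask

if __name__ == "__main__":
    n, p = 6, 3
    for name, mask in [("bidirectional (encoder-only)", bidirectional_mask(n)),
                       ("causal (decoder-only)", causal_mask(n)),
                       (f"prefix LM (prefix_len={p})", prefix_lm_mask(n, p))]:
        print(f"{name}:\n{mask.astype(int)}\n")
```

In the printed causal mask, row i has ones only up to column i, which is exactly the constraint that prevents a token from attending to future positions; the prefix-LM mask relaxes this within the prefix, which is how models with the Prefix LM objective (e.g., Yuan 1.0 in the table) attend bidirectionally over the input while generating the continuation causally.
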
For attribution in academic contexts, please cite this work as:

@misc{chai2021scaling-ptms-summary,
  author = {Chai, Yekun},
  title = {{Scaling Up Large Language Models: A Summary}},
  year = {2021},
  howpublished = {\url{https://cyk1337.github.io/notes/2021/10/09/PTMs/Scaling-Up-LLMs/}},
}
References
- 1. (T5) Raffel, Colin, et al. "Exploring the limits of transfer learning with a unified text-to-text transformer." JMLR (2020).
- 2. (mT5) Xue, Linting, et al. "mT5: A massively multilingual pre-trained text-to-text transformer." arXiv preprint arXiv:2010.11934 (2020).
- 3. Zhang, Zhengyan, et al. "CPM-2: Large-scale cost-effective pre-trained language models." arXiv preprint arXiv:2106.10715 (2021).
- 4. Fedus, William, Barret Zoph, and Noam Shazeer. "Switch Transformers: Scaling to trillion parameter models with simple and efficient sparsity." arXiv preprint arXiv:2101.03961 (2021).
- 5. Turing-NLG: A 17-billion-parameter language model by Microsoft. February 13, 2020.
- 6. Brown, Tom B., et al. "Language models are few-shot learners." arXiv preprint arXiv:2005.14165 (2020).
- 7. Zhang, Zhengyan, et al. "CPM: A large-scale generative Chinese pre-trained language model." AI Open 2 (2021): 93-99.
- 8. Kim, Boseop, et al. "What changes can large-scale language models bring? Intensive study on HyperCLOVA: Billions-scale Korean generative pretrained transformers." arXiv preprint arXiv:2109.04650 (2021).
- 9. Zeng, Wei, et al. "PanGu-$\alpha$: Large-scale autoregressive pretrained Chinese language models with auto-parallel computation." arXiv preprint arXiv:2104.12369 (2021).
- 10. He, Pengcheng, et al. "DeBERTa: Decoding-enhanced BERT with disentangled attention." arXiv preprint arXiv:2006.03654 (2020).
- 11. Sun, Yu, et al. "ERNIE 3.0: Large-scale knowledge enhanced pre-training for language understanding and generation." arXiv preprint arXiv:2107.02137 (2021).
- 12. Wu, Shaohua, et al. "Yuan 1.0: Large-scale pre-trained language model in zero-shot and few-shot learning." arXiv preprint arXiv:2110.04725 (2021).
- 13. Paresh Kharya and Ali Alvi. "Using DeepSpeed and Megatron to train Megatron-Turing NLG 530B, the world's largest and most powerful generative language model." October 11, 2021.