| Date | Title |
| --- | --- |
| May 10, 2024 | Memory-Efficient Attention: MHA vs. MQA vs. GQA vs. MLA |
| Jan 26, 2023 | Positional Encoding in Transformers: From Sinusoidal to RoPE |
| Apr 17, 2022 | Efficient Distributed Training: From DP to ZeRO and FlashAttention |
| Jan 10, 2022 | Masking Strategies for Pre-trained Language Models: From MLM to T5 |
| Nov 29, 2021 | Subword Tokenization in NLP: BPE, WordPiece, and Unigram |
| Dec 14, 2019 | BERTology: From XLNet to ELECTRA |
| Feb 28, 2019 | Normalization in Neural Networks: BN, LN, RMSNorm, and Beyond |
| Jan 22, 2019 | Attention Mechanisms and the Transformer Architecture |