- Masking Strategies for Pre-trained Language Models: From MLM to T5
  A comprehensive survey of masking and denoising strategies in pre-training—BERT's MLM (static vs. dynamic), whole-word masking, SpanBERT's geometric span masking, MASS, BART's corruption zoo, and T5's sentinel-based span corruption.
- Subword Tokenization in NLP: BPE, WordPiece, and Unigram
  A comprehensive guide to subword tokenization—BPE, byte-level BPE, WordPiece, and Unigram LM—with algorithms, implementations (GPT-2, tiktoken, BERT, SentencePiece), and empirical comparisons.
- BERTology: From XLNet to ELECTRA
  A technical deep-dive into BERT-era pre-training innovations—XLNet's permutation LM and two-stream attention, RoBERTa's training recipe, SpanBERT's span boundary objective, ALBERT's parameter efficiency, and ELECTRA's replaced token detection.
- Normalization in Neural Networks: BN, LN, RMSNorm, and Beyond
  From Batch Normalization to RMSNorm—a systematic comparison of normalization techniques (BN, LN, IN, GN, SN) with formulations, implementations, and practical guidance for modern deep learning.
- Attention Mechanisms and the Transformer Architecture
  From seq2seq bottlenecks to self-attention—a systematic walkthrough of additive attention, scaled dot-product attention, multi-head attention, the Transformer encoder-decoder, positional encoding, and relative position representations, with complete PyTorch implementations.