A review of multimodal tokenization approaches based on vector quantization[1].
Memory-Efficient Attention: MHA vs. MQA vs. GQA vs. MLA
Efficient attention mechanisms are crucial for scaling transformers in large-scale applications. Here we explore four attention variants, Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA), analyzing their trade-offs in memory, speed, and expressivity, and how they improve transformer scalability.
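For intuition, here is a minimal NumPy sketch of the GQA case; MQA and MHA fall out as the special cases of one shared KV head and one KV head per query head, while MLA (which compresses keys/values into a latent vector) is not shown. The function name, shapes, and toy sizes are illustrative assumptions, not code from the post.

```python
# Minimal sketch of grouped-query attention (GQA) in NumPy.
# Hypothetical shapes/names. MQA is the special case n_kv_heads = 1,
# MHA is the case n_kv_heads = n_heads.
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    # q: (seq, n_heads, d_head); k, v: (seq, n_kv_heads, d_head)
    group = n_heads // n_kv_heads
    # Share each KV head across `group` query heads: the memory saving is
    # that the KV cache stores n_kv_heads heads instead of n_heads.
    k = np.repeat(k, group, axis=1)              # (seq, n_heads, d_head)
    v = np.repeat(v, group, axis=1)
    d = q.shape[-1]
    scores = np.einsum('qhd,khd->hqk', q, k) / np.sqrt(d)
    weights = softmax(scores, axis=-1)
    return np.einsum('hqk,khd->qhd', weights, v)  # (seq, n_heads, d_head)

# Toy usage: 8 query heads sharing 2 KV heads.
seq, n_heads, n_kv_heads, d_head = 4, 8, 2, 16
rng = np.random.default_rng(0)
q = rng.standard_normal((seq, n_heads, d_head))
k = rng.standard_normal((seq, n_kv_heads, d_head))
v = rng.standard_normal((seq, n_kv_heads, d_head))
out = grouped_query_attention(q, k, v, n_heads, n_kv_heads)  # (4, 8, 16)
```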
Inductive Positions in Transformers
Diffusion Models: A Mathematical Note from Scratch
A diffusion probabilistic model is a parameterized Markov chain trained to reverse a predefined forward process, closely related to both likelihood-based optimization and score matching. The forward diffusion process is a stochastic process constructed to gradually corrupt the original data into random noise.
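For concreteness, the standard DDPM-style forward process can be written as below; the notation ($\beta_t$, $\bar{\alpha}_t$) is the conventional one and is assumed here rather than taken from the note.

$$
q(x_t \mid x_{t-1}) = \mathcal{N}\big(x_t;\ \sqrt{1-\beta_t}\,x_{t-1},\ \beta_t \mathbf{I}\big), \qquad
q(x_t \mid x_0) = \mathcal{N}\big(x_t;\ \sqrt{\bar{\alpha}_t}\,x_0,\ (1-\bar{\alpha}_t)\,\mathbf{I}\big),
$$

where $\alpha_t = 1-\beta_t$ and $\bar{\alpha}_t = \prod_{s=1}^{t}\alpha_s$, so as $t$ grows the data is driven toward an isotropic Gaussian.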
Large Language Models for Programming Languages
A note on pre-trained language models (PLMs) for code.
Efficient Large-Scale Distributed Training
A note on distributed training methods for large neural models.
Mask Denoising Strategy for Pre-trained Language Models
Masked modeling plays a crucial role in pre-training language models. This note provides a short summary.
Subword Tokenization in Natural Language Processing
A summary of subword tokenization in natural language processing.
Scaling Up Large Language Models: A Summary
A summary of large-scale language models (LLMs) beyond 10B parameters.
Review: Backpropagation step by step
A quick note on MLP implementation using numpy.
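As a taste of the topic, below is a minimal sketch of one forward/backward/update step for a one-hidden-layer MLP in NumPy; the shapes, learning rate, and variable names are illustrative assumptions, not the note's actual code.

```python
# Minimal sketch: one backprop step for a 1-hidden-layer MLP in NumPy.
# Hypothetical sizes and names; not taken from the note.
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal((32, 4))            # batch of inputs
y = rng.standard_normal((32, 1))            # regression targets
W1, b1 = rng.standard_normal((4, 8)) * 0.1, np.zeros(8)
W2, b2 = rng.standard_normal((8, 1)) * 0.1, np.zeros(1)
lr = 1e-2

# Forward pass
h_pre = x @ W1 + b1
h = np.maximum(h_pre, 0.0)                  # ReLU
y_hat = h @ W2 + b2
loss = np.mean((y_hat - y) ** 2)            # MSE loss

# Backward pass (chain rule, layer by layer)
grad_y_hat = 2.0 * (y_hat - y) / len(x)     # dL/dy_hat
grad_W2 = h.T @ grad_y_hat
grad_b2 = grad_y_hat.sum(axis=0)
grad_h = grad_y_hat @ W2.T
grad_h_pre = grad_h * (h_pre > 0)           # ReLU derivative
grad_W1 = x.T @ grad_h_pre
grad_b1 = grad_h_pre.sum(axis=0)

# SGD update
W1 -= lr * grad_W1; b1 -= lr * grad_b1
W2 -= lr * grad_W2; b2 -= lr * grad_b2
```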