The Gradient

Language is not just words.


Efficient attention mechanisms are crucial for scaling transformers in large-scale applications. Here we explore four attention variants, Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA), analyzing their trade-offs in memory, speed, and expressivity, and how they enhance transformer scalability. 🚀

Read more »
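As a minimal sketch of the core idea behind MQA/GQA (my own toy illustration, not the post's implementation, and assuming the usual head-sharing formulation): several query heads attend against a shared set of key/value heads, shrinking the KV cache. MHA is the special case n_kv_heads == n_heads, and MQA is n_kv_heads == 1.

```python
# Toy grouped-query attention: query heads share K/V heads to cut KV-cache size.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    """q: (batch, n_heads, seq, d); k, v: (batch, n_kv_heads, seq, d)."""
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads
    # Broadcast each K/V head to all query heads in its group.
    k = k.repeat_interleave(group_size, dim=1)   # (batch, n_heads, seq, d)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: 8 query heads sharing 2 K/V heads (a 4x smaller KV cache).
b, s, d = 2, 16, 64
q = torch.randn(b, 8, s, d)
k = torch.randn(b, 2, s, d)
v = torch.randn(b, 2, s, d)
out = grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=2)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```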

We summarize positional encoding (PE) approaches used in transformers.

Summary

PE          Relative   Trainable   Each Layer   Extrapolation
Sinusoidal  ✘          ✘           ✘            ✘
T5 bias     ✔          ✔           ✔            ✔
RoPE        ✔          ✔           ✔            ✘
ALiBi       ✔          ✘           ✔            ✔
KERPLE      ✔          ✔           ✔            ✔
Sandwich    ✔          ✘           ✔            ✔
xPos        ✔          ✘           ✔            ✔
Read more »
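For reference, here is a minimal sketch (my own illustration, not code from the post) of the first row of the table, the classic sinusoidal encoding from "Attention Is All You Need": absolute rather than relative, not trainable, and added once at the input rather than at each layer.

```python
# Sinusoidal positional encoding: interleaved sin/cos over geometric frequencies.
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dims
    angle = pos / torch.pow(10000.0, i / d_model)                   # (seq, d/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```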

A diffusion probabilistic model is a parameterized Markov chain trained to reverse a predefined forward process; it is closely related to both likelihood-based optimization and score matching. The forward diffusion process is a stochastic process constructed to gradually corrupt the original data into random noise.

Read more »
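As a minimal sketch of that forward corruption (a toy illustration under the standard DDPM-style Gaussian assumption, not the post's code): with a linear beta schedule, x_t can be sampled from x_0 in one shot via q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), and for large t the sample is nearly pure Gaussian noise.

```python
# Closed-form forward diffusion sample under a linear beta schedule.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # alpha_bar_t = prod_s (1 - beta_s)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) directly, without iterating over steps."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise

x0 = torch.randn(4, 3, 32, 32)        # a toy batch standing in for images
x_T = q_sample(x0, t=T - 1)
print(x_T.std())                      # close to 1: nearly pure Gaussian noise
```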