The Gradient

Language is not just words.


Efficient attention mechanisms are crucial for scaling transformers in large-scale applications. Here we explore four attention variants, Multi-Head Attention (MHA), Multi-Query Attention (MQA), Grouped-Query Attention (GQA), and Multi-Head Latent Attention (MLA), analyzing their trade-offs in memory, speed, and expressivity, and how they enhance transformer scalability. 🚀

Read more »
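As a minimal sketch of the core idea behind MQA/GQA (my own toy illustration, not the post's implementation, and assuming the usual head-sharing formulation): several query heads attend against a shared set of key/value heads, shrinking the KV cache. MHA is the special case n_kv_heads == n_heads, and MQA is n_kv_heads == 1.

```python
# Toy grouped-query attention: query heads share K/V heads to cut KV-cache size.
import torch
import torch.nn.functional as F

def grouped_query_attention(q, k, v, n_heads, n_kv_heads):
    """q: (batch, n_heads, seq, d); k, v: (batch, n_kv_heads, seq, d)."""
    assert n_heads % n_kv_heads == 0
    group_size = n_heads // n_kv_heads
    # Broadcast each K/V head to all query heads in its group.
    k = k.repeat_interleave(group_size, dim=1)   # (batch, n_heads, seq, d)
    v = v.repeat_interleave(group_size, dim=1)
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    return F.softmax(scores, dim=-1) @ v

# Toy shapes: 8 query heads sharing 2 K/V heads (a 4x smaller KV cache).
b, s, d = 2, 16, 64
q = torch.randn(b, 8, s, d)
k = torch.randn(b, 2, s, d)
v = torch.randn(b, 2, s, d)
out = grouped_query_attention(q, k, v, n_heads=8, n_kv_heads=2)
print(out.shape)  # torch.Size([2, 8, 16, 64])
```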

We summarize positional encoding (PE) approaches used in transformers.

Summary

PE          Relative   Trainable   Each Layer   Extrapolation
Sinusoidal  ✘          ✘           ✘            ✘
T5 bias     ✔          ✔           ✔            ✔
RoPE        ✔          ✔           ✔            ✘
ALiBi       ✔          ✘           ✔            ✔
KERPLE      ✔          ✔           ✔            ✔
Sandwich    ✔          ✘           ✔            ✔
xPos        ✔          ✘           ✔            ✔
Read more »
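For reference, here is a minimal sketch (my own illustration, not code from the post) of the first row of the table, the classic sinusoidal encoding from "Attention Is All You Need": absolute rather than relative, not trainable, and added once at the input rather than at each layer.

```python
# Sinusoidal positional encoding: interleaved sin/cos over geometric frequencies.
import torch

def sinusoidal_pe(seq_len: int, d_model: int) -> torch.Tensor:
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)   # (seq, 1)
    i = torch.arange(0, d_model, 2, dtype=torch.float32)            # even dims
    angle = pos / torch.pow(10000.0, i / d_model)                   # (seq, d/2)
    pe = torch.zeros(seq_len, d_model)
    pe[:, 0::2] = torch.sin(angle)
    pe[:, 1::2] = torch.cos(angle)
    return pe

pe = sinusoidal_pe(seq_len=128, d_model=512)
print(pe.shape)  # torch.Size([128, 512])
```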

A diffusion probabilistic model is a parameterized Markov chain trained to reverse a predefined forward process; it is closely related to both likelihood-based optimization and score matching. The forward diffusion process is a stochastic process constructed to gradually corrupt the original data into random noise.

Read more »
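As a minimal sketch of that forward corruption (a toy illustration under the standard DDPM-style Gaussian assumption, not the post's code): with a linear beta schedule, x_t can be sampled from x_0 in one shot via q(x_t | x_0) = N(sqrt(alpha_bar_t) * x_0, (1 - alpha_bar_t) * I), and for large t the sample is nearly pure Gaussian noise.

```python
# Closed-form forward diffusion sample under a linear beta schedule.
import torch

T = 1000
betas = torch.linspace(1e-4, 0.02, T)            # noise schedule beta_t
alphas_bar = torch.cumprod(1.0 - betas, dim=0)   # alpha_bar_t = prod_s (1 - beta_s)

def q_sample(x0: torch.Tensor, t: int) -> torch.Tensor:
    """Sample x_t ~ q(x_t | x_0) directly, without iterating over steps."""
    noise = torch.randn_like(x0)
    return alphas_bar[t].sqrt() * x0 + (1.0 - alphas_bar[t]).sqrt() * noise

x0 = torch.randn(4, 3, 32, 32)        # a toy batch standing in for images
x_T = q_sample(x0, t=T - 1)
print(x_T.std())                      # close to 1: nearly pure Gaussian noise
```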