Evidence (1)
critically examine R1-Zero-like training by analyzing its two core components: base models and RL
Consensus / non-consensus proposition mining
824 uncited viewpoints hidden (only viewpoints with a paper source, or belonging to a Camp debate anchor, are retained)
many disparate use-cases are grouped together under the umbrella term of long-context, defined simply by the total length...
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only
a critical challenge when applying RL to these agentic tasks arises from delayed rewards
Scaling inference compute enhances reasoning... RL has emerged as a crucial method... yet the conditions under which long CoTs emerge remain...
transformers can use meaningless filler tokens to perform hidden computation and still obtain reasoning gains
We study inference scaling laws and compute-optimal inference, focusing on the trade-offs
The natural questions are: retrieval-augmentation versus long context window, which one is better for downstream tasks?
developed various autonomous LLM agents to perform end-to-end software development tasks... Agentless: Demystifying LLM-based Software Engineering Agents
To train more reliable models, we can turn either to outcome supervision... or process supervision.
most variables... can employ low-precision data formats without compromising model accuracy, while requiring no changes to hyper-parameters
We hypothesize that long context modeling, in particular the ability to utilize information at arbitrary input locations, is a capability that is mostly already acquired...
prior work links induction heads to ICL... this can only account for ICL when the answer is included within the context
Deliberation in latent space via differentiable cache augmentation
could we detect it and remove it using current state-of-the-art safety training techniques
Selecting according to human notions of data quality... the opposite can often happen.
we train Transformer models which can make output predictions at different stages of the network
reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds
Large Language Models... can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.
Different from previous weight pruning methods... Not All Experts are Equal
increasing the internal representation is just as useful as increasing the number of self-attention layers
Chinchilla argues for ~20:1 tokens per parameter
This paper explores the impact of extending input lengths on the capabilities of LLMs... performance consistency across different input lengths is not well understood.
naively training a model on all available data may not be optimal (or feasible), as the quality of available text data can vary
we investigate state-of-the-art techniques and architectures in order to assess their effectiveness
Introduces trainable thinking tokens as hidden-compute budget for LM.
The central difficulty is gate training: the gate decision must propagate through many layers before it influences the language modeling loss
[Camp A: hand-tuned 4D (Megatron / MegaScale style)] Mesh, schedule and kernels are all decided explicitly by engineers; auto-parallel and FSDP serve only as components. The only reproducible route at 100B+.
[Camp B: auto-parallel (Alpa / GSPMD / pjit)] Compile parallelism decisions: write sharding annotations and let cost-model-driven ILP/RL search the mesh. Reaches 90–95% of hand-tuned below 100B.
[Camp C: FSDP-only is enough] Minimal code: ZeRO-3 plus offload handles any scale; TP/PP can be avoided.
[Camp D: 3D (DP+TP+PP) is enough, SP/EP optional] Keep the classic 3D stack, add long-context and MoE as later patches.
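For concreteness, a minimal sketch of the Camp C position ("FSDP-only"): wrap a model in PyTorch FSDP with ZeRO-3-style full sharding plus CPU offload. It assumes a torchrun/NCCL multi-GPU launch; the stand-in model, dimensions, and hyperparameters are illustrative, not a production recipe.

```python
# Minimal FSDP-only sketch (Camp C). Assumes launch via:
#   torchrun --nproc_per_node=<gpus> this_script.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP, ShardingStrategy, CPUOffload

def main():
    dist.init_process_group("nccl")
    torch.cuda.set_device(dist.get_rank() % torch.cuda.device_count())
    # Stand-in model; a real run would wrap a full LM, usually with an auto-wrap policy per block.
    model = torch.nn.TransformerEncoder(
        torch.nn.TransformerEncoderLayer(d_model=1024, nhead=16, batch_first=True),
        num_layers=8,
    ).cuda()
    model = FSDP(
        model,
        sharding_strategy=ShardingStrategy.FULL_SHARD,  # ZeRO-3-style param/grad/optimizer sharding
        cpu_offload=CPUOffload(offload_params=True),    # the "plus offload" part of the claim
    )
    opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
    x = torch.randn(2, 128, 1024, device="cuda")
    loss = model(x).float().pow(2).mean()               # dummy loss just to exercise the wrapper
    loss.backward()
    opt.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```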
[Camp A: more code is always better, generalists should push past 40%] Since code helps reasoning and Orbay et al. [CodeScalingLaws2023] see no saturation, keep adding—40% or even 60% should also be better for generalists.
[Camp B: code's benefit is purely regularisation / lower effective LR] Gadre et al. [Gadre2023Overtraining] show code has lower downstream variance, suggesting its role is just 'lower effective LR'—any stable data source can substitute.
[Camp C: keep code below 10% to protect NL] Code competes with NL for capacity; generalists should compress it to 5–10%; NL is the product foundation and must not be sacrificed.
[Camp D: code ability must be trained from scratch; continual training is risky] Continual code-heavy training causes catastrophic NL forgetting; the only clean generalist→specialist path is from-scratch.
[Camp A: Positional extrapolation is enough] Get RoPE base / NTK scaling / per-dim interpolation right and you can extend any short-context model; data and packing are secondary. Representatives: PI [Chen2023PI], YaRN [Peng2023YaRN], LongRoPE.
[Camp B: The data recipe is the main variable] Effective context is set by long-doc upsampling ratio, domain balance, and continual-pretrain tokens; positional encoding is tuning, not the main battle. Representatives: Fu et al.
[Camp C: Packing engineering is the under-exploited lever] Given the same data pool and positional scheme, related-doc clustered packing + fewer truncations + learned separators unlock another tier of long-context capability.
[Camp D: Switch architectures (SSM / linear) to bypass positional encoding] Quadratic attention is the root issue; Mamba [Gu2023Mamba], RWKV [Peng2023RWKV], Jamba [Lieber2024Jamba], LongNet [Ding2023LongNet] scale directly to million-length sequences without relying on positional extrapolation.
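A minimal sketch of the positional-extrapolation knobs Camp A relies on, assuming a standard RoPE parameterization: position interpolation squeezes positions back into the trained range, while NTK-aware/ABF-style scaling raises the rotary base instead. Function names and the 4K→32K numbers are illustrative.

```python
# RoPE extension knobs (illustrative sketch).
import torch

def rope_inv_freq(head_dim: int, base: float = 10_000.0) -> torch.Tensor:
    """Per-dimension inverse frequencies: 1 / base^(2i/d)."""
    return 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))

def position_interpolation(positions: torch.Tensor, scale: float) -> torch.Tensor:
    """PI-style: squeeze positions so a longer window maps back into the trained range."""
    return positions / scale

def ntk_scaled_base(base: float, scale: float, head_dim: int) -> float:
    """NTK-aware / ABF-style: raise the base instead of squeezing positions,
    so high-frequency dims are barely touched while low-frequency dims stretch."""
    return base * scale ** (head_dim / (head_dim - 2))

d, old_ctx, new_ctx = 128, 4096, 32768
scale = new_ctx / old_ctx                       # 8x extension
inv_freq_pi  = rope_inv_freq(d)                 # used together with positions / scale
inv_freq_ntk = rope_inv_freq(d, ntk_scaled_base(10_000.0, scale, d))
print(inv_freq_pi[:3], inv_freq_ntk[:3])
```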
[Camp A: kernels and algorithms must be co-designed] Architecture is a projection of kernel physics: GQA/MLA, per-block FP8, grouped GEMM, head_dim ∈ {64,128,192,256} all fall out of the roofline, tensor-core tiles, and memory hierarchy.
[Camp B: PyTorch level is enough] torch.compile + FlexAttention + standard transformer primitives already absorb most kernel wins; algorithm authors don't need CUTLASS.
[Camp C: hardware keeps improving, algorithms don't need to adapt] Blackwell / Rubin deliver 2× bandwidth + 2× FLOPs per gen; scale-up alone keeps dense MHA + BF16 viable, and architectural complication is self-inflicted.
[Camp D: non-NVIDIA hardware will catch up] TPU v5p / MI300X / Trainium2 will diversify frontier-pretrain hardware through 2026, and kernel ecosystems will fragment accordingly.
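Back-of-envelope roofline arithmetic behind the Camp A stance above: per decoded token, attention mostly streams the KV cache, so bytes rather than FLOPs dominate, and GQA/MQA change the balance. The shapes and byte counts below are illustrative assumptions, not measurements.

```python
# Per-token, per-layer decode attention: KV bytes read vs. FLOPs (illustrative).
def decode_attention(seq_len, n_kv_heads, head_dim, n_q_heads, bytes_per_elt=2):
    kv_bytes = 2 * seq_len * n_kv_heads * head_dim * bytes_per_elt   # stream K and V once
    flops    = 2 * 2 * seq_len * n_q_heads * head_dim                # QK^T and PV matmuls
    return kv_bytes, flops, flops / kv_bytes                         # arithmetic intensity (FLOP/byte)

for name, kv_heads in [("MHA", 32), ("GQA-8", 8), ("MQA", 1)]:
    b, f, ai = decode_attention(seq_len=32_768, n_kv_heads=kv_heads, head_dim=128, n_q_heads=32)
    print(f"{name:6s} KV bytes/token = {b/1e6:7.1f} MB  FLOPs/token = {f/1e9:5.2f} GFLOP  intensity = {ai:5.1f}")
# MHA lands near 1 FLOP/byte, far below what GPUs need to be compute-bound,
# which is the sense in which GQA/MQA "fall out of the roofline".
```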
[Camp A: Quality classifier plus ablation ladder is enough] Exemplified by DCLM [DCLM2024], FineWeb-Edu [FineWeb2024], and RegMix [RegMix2024]: bulk filters plus per-decision ablations cover 95% of frontier decisions; influence and causal tooling are secondary.
[Camp B: Influence functions are the main path] Exemplified by Anthropic [AnthropicInfluence2023], TRAK [TRAK2023], and Simfluence [Simfluence2023]: only per-example attribution can truly answer data value; classifiers and ablations are merely coarse screens.
[Camp C: Full causal inference is the future] Exemplified by Causal-LL [CausalLL2024] and Skill-it [SkillIt2023]: data value is fundamentally causal; only IV and mediator methods remove confounding, while other approaches are polluted by distribution shift.
[Camp D: Skip measurement, rely on intuition and scale] An implicit stance in some startups and open-source communities: make data calls via senior-researcher intuition plus large-scale trial-and-error, skipping internal tooling to save engineering cost.
[Camp A: the FA series is the endpoint of attention engineering] The main answers for attention kernels are already in FA1/2/3 [Dao2022FA1][Dao2023FA2][Shah2024FA3]; later work is just porting to new silicon, and variants / serving are derivative.
[Camp B: Triton / FlexAttention is the real revolution] The real inflection is the kernel-author barrier dropping from 'PhD × 3 months' to 'engineer × 1 week' [Tillet2019Triton][Dong2024Flex]; FA3's CUDA is unmaintainable for the median engineer.
[Camp C: attention itself should be replaced (SSM / linear RNN)] Optimizing O(L²) attention in the long-context era is mis-targeted; RetNet [Sun2023RetNet] and Mamba [Waleffe2024Mamba] already match short-context LM loss with O(L) inference.
[Camp D: FA is the embodiment of NVIDIA lock-in] FA3's [Shah2024FA3] core depends on Hopper-exclusive instructions (TMA, warp specialization, FP8 tensor cores) [Luo2024HopperBench], with no portable path to AMD or domestic silicon.
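A minimal sketch of the Camp B claim that attention variants become a few lines of score_mod/mask_mod rather than a custom kernel, assuming PyTorch ≥ 2.5 with torch.nn.attention.flex_attention and a CUDA GPU. The ALiBi slope and shapes are illustrative.

```python
# ALiBi bias + causal mask expressed as FlexAttention callbacks (illustrative; needs a CUDA GPU).
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, L, D = 1, 8, 1024, 64
q, k, v = (torch.randn(B, H, L, D, device="cuda", dtype=torch.float16) for _ in range(3))

def alibi(score, b, h, q_idx, kv_idx):
    # Per-head slope; (kv_idx - q_idx) <= 0 under the causal mask, so the bias penalizes distance.
    slope = torch.exp2(-(h + 1).to(torch.float32))
    return score + slope * (kv_idx - q_idx)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B, H, L, L)
out = flex_attention(q, k, v, score_mod=alibi, block_mask=block_mask)
print(out.shape)  # (1, 8, 1024, 64)
```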
[Camp A: MoE is the inevitable replacement for dense] At matched training FLOPs MoE offers 1.5–2× quality-per-FLOP advantage; fine-grained + shared + aux-loss-free has resolved most engineering footguns, and MoE is poised to be the default backbone.
[Camp B: dense paths yield better ROI in the end] MoE's total-param footprint, cross-node all-to-all, and post-train fragility offset its quality-per-active-param edge in many real deployments; fully-open dense families win on reproducibility.
[Camp C: expert-choice / aux-loss-free is the future] Making load balance a structural constraint rather than a soft penalty is the right direction; expert-choice [Zhou2022ExpertChoice] opened it, and aux-loss-free bias EMA [Wang2024AuxFree] carried it forward.
[Camp D: MoE matters only for pretrain; post-train should revert to dense] Sparse router gradients + expert specialization hurt SFT/RLHF stability; task-specific expert pruning [Chen2022TaskPrune][Koishekenov2022NLLBPrune] and dense-to-dynamic-k conversion are the pragmatic paths.
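A minimal sketch of the routing mechanics the camps above argue over: token-choice top-k gating plus an aux-loss-free per-expert bias nudged against overload, in the spirit of the bias-EMA approach Camp C cites. Dimensions and the bias update rate are illustrative.

```python
# Top-k token-choice routing with an aux-loss-free balancing bias (illustrative sketch).
import torch

def route(x, w_gate, expert_bias, k=2, bias_lr=1e-3):
    scores = torch.sigmoid(x @ w_gate)                        # (tokens, n_experts) affinities
    _, topk_idx = (scores + expert_bias).topk(k, dim=-1)      # bias steers selection only
    gates = torch.zeros_like(scores).scatter(1, topk_idx, scores.gather(1, topk_idx))
    gates = gates / gates.sum(dim=-1, keepdim=True)           # renormalize over chosen experts
    load = torch.zeros(scores.shape[1]).scatter_add_(0, topk_idx.flatten(),
                                                     torch.ones(topk_idx.numel()))
    # Aux-loss-free balancing: push bias down for overloaded experts, up for starved ones.
    expert_bias -= bias_lr * torch.sign(load - load.mean())
    return gates, topk_idx, expert_bias

tokens, d_model, n_experts = 4096, 512, 16
x = torch.randn(tokens, d_model)
w_gate = torch.randn(d_model, n_experts) / d_model**0.5
bias = torch.zeros(n_experts)
gates, idx, bias = route(x, w_gate, bias, k=2)
print(idx.shape, bias[:4])
```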
[Camp A: Formal mixture search (DoReMi / RegMix / Data Mixing Laws)] Domain weights are searchable variables; pay the proxy or regression overhead to obtain a quantitatively optimal w*, which transfers to large scale reliably.
[Camp B: Heuristic + curriculum (Llama 3 / MiniCPM route)] The mixture is a trajectory, not a vector; expert priors plus 2–3 ablations suffice once the problem is viewed as curriculum scheduling—stage boundaries and the annealing mix matter most.
[Camp C: Online adaptive mixing] Treat domains as bandit arms and adjust weights online from per-step loss; skip proxy training entirely.
[Camp D: Ratio doesn't matter, quality does] Once filters are strong enough (DCLM-Baseline, FineWeb-Edu, textbook-grade synthesis), mixture weights become a secondary knob and the optimization budget should flow to content quality.
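A minimal sketch of the Camp C position (online adaptive mixing): treat domain weights as a sampling distribution updated from per-domain losses during training instead of committing to one offline w*. The multiplicative-weights rule and constants are illustrative, not a specific paper's method.

```python
# Online domain reweighting from per-domain losses (illustrative sketch).
import numpy as np

def update_weights(w, domain_losses, eta=0.1):
    # Upweight domains whose loss is still high relative to the mean ("more to learn"),
    # multiplicative-weights style, then renormalize to a valid sampling distribution.
    advantage = domain_losses - domain_losses.mean()
    w = w * np.exp(eta * advantage)
    return w / w.sum()

domains = ["web", "code", "math", "books"]
w = np.full(len(domains), 1 / len(domains))
for step_losses in ([2.9, 1.8, 2.2, 2.5], [2.7, 1.6, 2.1, 2.4]):   # fake per-domain losses
    w = update_weights(w, np.array(step_losses))
print(dict(zip(domains, np.round(w, 3))))
```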
[Camp A: Explicit PE is necessary; RoPE interpolation is the main path] Long-context capability must be backed by some learned or parameterized PE; ALiBi and the RoPE family cover all practically viable options.
[Camp B: SP and DP are orthogonal and can be optimized independently] SP slices along L and DP slices along batch; decouple and push each to its limit separately.
[Camp C: KV compression = GQA/MQA is enough] GQA (shared KV heads) is a reasonable endpoint for KV-cache compression; going further hurts downstream.
[Camp D: Perplexity is still a valid long-context metric] Train on sufficiently long sequences and ppl improvements translate into long-context capability.
[Camp A: pretrain-time ABF is the clean path] Since low-freq dims were never trained, train them once in pretraining. base=500000 plus a 6-stage curriculum needs zero inference-time frequency remapping and wins naturally on RULER.
[Camp B: YaRN is the de-facto retrofit tool] At ≤128K, YaRN + 400 FT steps sits within 3 pp of ABF on RULER and needs no re-pretraining — the best ROI for most existing models.
[Camp C: ≥1M requires LongRoPE] Smooth NTK/YaRN show measurable per-dim mismatch at 1M; only the non-uniform scales found by evolutionary search survive million-token RULER.
[Camp D: bypass the whole PI/NTK/YaRN lineage] Position encoding itself is the bug source. Use LM-Infinite-style Λ-masks for zero-shot extrapolation, or switch to non-attention architectures like RetNet/LongNet and sidestep RoPE extension entirely.
[Camp A: µP is the absolute default] µP is the only mathematically rigorous transfer method; Complete-P and u-µP have resolved compatibility with architectural components and low precision, so all pretraining should switch to µP.
[Camp B: Empirical formulas + a few sweeps suffice] No need to modify low-level parameterization; simply use empirical formulas from Cerebras or DeepSeek, combined with a few proxy runs, to accurately predict target-scale LR and Batch Size.
[Camp C: End-to-end Bayesian search is the endgame] Analytical solutions always have blind spots; we should use cost-aware Bayesian methods like CARBS to search all HPs directly on the Pareto frontier.
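A simplified sketch of the width-transfer idea behind Camp A: under µP-style scaling with Adam, matrix-like hidden weights get their learning rate shrunk by base_width/width while vector-like parameters keep the base LR. The grouping heuristic and numbers are illustrative; a real µP setup also adjusts initializers and output multipliers.

```python
# Simplified muP-style LR grouping across width (illustrative sketch).
import torch

def mup_param_groups(model, base_lr, base_width, width):
    hidden, vector_like = [], []
    for name, p in model.named_parameters():
        # Heuristic: 2D weights in the trunk are "matrix-like"; embeddings, norms, biases are not.
        (hidden if p.ndim == 2 and "embed" not in name else vector_like).append(p)
    return [
        {"params": hidden,      "lr": base_lr * base_width / width},  # shrink LR with width
        {"params": vector_like, "lr": base_lr},                       # keep base LR
    ]

model = torch.nn.Sequential(torch.nn.Linear(1024, 4096), torch.nn.GELU(), torch.nn.Linear(4096, 1024))
opt = torch.optim.AdamW(mup_param_groups(model, base_lr=1e-2, base_width=256, width=1024))
```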
[Camp A: AdamW is never retired] The ≥70B dense production default is AdamW [Loshchilov2017AdamW]; every new optimizer's marginal gain halves under matched HP-search budget; AdamW's real moat is the muP LR-transfer ecosystem [Lingle2024muP]
[Camp B: Muon is the next default] Muon [Jordan2024Muon] reduces Shampoo's preconditioner to its cheapest usable form via Newton-Schulz orthogonalization, cutting the NanoGPT speedrun from 5 min to 3.3 min (−34%); ≤30B hidden 2D weights should default to Muon.
[Camp C: Shampoo / SOAP is the proper endgame] Shampoo [Gupta2018Shampoo]'s Kronecker preconditioner approximates Gauss-Newton [Morwani2024Shampoo]; SOAP [Vyas2024SOAP] eliminates its grafting pathology, matches AdamW wall-clock, and cuts the steps needed to reach a target loss.
[Camp D: optimizers don't matter, data does] Agarwal et al. [Agarwal2020LRConfound] and Dahl et al. [Dahl2023AlgoPerf] show: once HP-search budget and LR schedules are controlled, adaptive-vs-SGD and most cross-optimizer gaps largely collapse.
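A minimal sketch of the core step behind Camp B's Muon claim: orthogonalize the 2D momentum matrix with a few Newton-Schulz iterations before taking the step. The classic cubic iteration is used here for clarity; the actual Muon implementation uses a tuned quintic polynomial, and the LR/momentum constants are illustrative.

```python
# Muon-like step: momentum, then Newton-Schulz orthogonalization (illustrative sketch).
import torch

def newton_schulz_orthogonalize(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    X = G / (G.norm() + 1e-7)                 # Frobenius-normalize so the spectral norm is <= 1
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X       # converges toward the nearest semi-orthogonal matrix
    return X

def muon_like_step(weight, grad, momentum_buf, lr=0.02, beta=0.95):
    momentum_buf.mul_(beta).add_(grad)        # plain momentum on the raw gradient
    update = newton_schulz_orthogonalize(momentum_buf)
    weight.add_(update, alpha=-lr)            # update has near-unit singular values
    return weight, momentum_buf

W = torch.randn(1024, 1024) / 32
g = torch.randn_like(W) * 1e-2
buf = torch.zeros_like(W)
W, buf = muon_like_step(W, g, buf)
```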
[Camp A: short-to-long + per-doc mask (mainstream)] Short-window main pretrain with a long-context mid-train tail, packing under per-doc causal mask; representatives LLaMA-3 [Llama32024], Qwen2.5 [Qwen25Tech].
[Camp B: uniformly mixed-length training] Interleave short and long seqs across the whole pretrain to avoid the distribution drift of a tail mid-train.
[Camp C: naive concat + cross-doc visible] Keep GPT-2 [GPT2] / early LLaMA [LLaMA2023] style EOS-only packing without per-doc mask, citing engineering simplicity.
[Camp D: FIM for everything] Generalise OpenAI FIM's [FIM2022] 'free lunch' to NL pretrain, applying 50% FIM to every doc.
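A minimal sketch of the per-document causal mask that separates Camps A and C above: documents are packed into one sequence, but each token may only attend to earlier tokens of its own document. The toy document lengths are illustrative.

```python
# Block-diagonal causal mask for packed documents (illustrative sketch).
import torch

def packed_causal_mask(doc_lens):
    L = sum(doc_lens)
    doc_id = torch.repeat_interleave(torch.arange(len(doc_lens)), torch.tensor(doc_lens))
    pos = torch.arange(L)
    causal = pos[:, None] >= pos[None, :]            # standard lower-triangular mask
    same_doc = doc_id[:, None] == doc_id[None, :]    # block-diagonal per-document mask
    return causal & same_doc                         # True = attention allowed

mask = packed_causal_mask([3, 2, 4])                 # three docs packed into a 9-token sequence
print(mask.int())
```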
[Camp A: PPL remains the most reliable primary variable] This camp argues that when training protocol is controlled, loss/PPL power laws are the most stable, cheapest, and best suited for extrapolating large runs. Downstream tasks are noisier signals.
[Camp B: PPL is only an intermediate state inside task scaling] This camp accepts that PPL contains signal, but argues that the quantity worth fitting is task performance itself. PPL may serve as an intermediate variable or domain-match signal, not the final target.
[Camp C: Stop searching for one scalar; use multi-panel diagnostics] This camp argues that the issue is not choosing the wrong scalar, but that many decisions are not compressible into one number in the first place. One should jointly track training loss alongside panels for tasks, long context, tool use, and alignment.
[Camp D: The problem with PPL is ontological, not merely predictive] This camp argues that there is no stable one-to-one mapping between PPL and downstream capability because they measure different objects: average token coding efficiency versus task-level capability.
[Camp A: Dedup as aggressively as possible] Treat repetition as pure noise: apply maximal exact/near dedup on web corpora, preferring false positives over leaving duplicates; the rationale is that repetition drives memorization, contamination, and wasted compute.
[Camp B: Uniform repetition ≤4 epochs is (almost) free] Treat epochs as the main lever under data constraints: run 2–4 epochs on a cleaned high-quality pool, with returns close to adding the same amount of fresh tokens; only beyond that do returns fall off.
[Camp C: Semantic dedup is the real battleground] Exact dedup only removes obvious duplicates; the real budget sink is semantic near-duplication and within-topic redundancy. Prioritize embedding-based semantic dedup and diversification to target within-topic redundancy.
[Camp D: Zero repetition for sensitive/eval/copyright data] Enforce zero-repeat (or strict 0–1 exposure) for benchmarks, copyrighted text, PII, and high-risk identifiable sources, and implement governance as data-layer configuration (opt-out handling, allowlists, exposure caps).
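A minimal sketch of the near-dedup primitive the camps above argue about: character-shingle MinHash signatures and a Jaccard estimate. Real pipelines run MinHash-LSH, suffix-array exact dedup, or embedding-based semantic dedup at much larger scale; the shingle size and permutation count here are illustrative.

```python
# Toy MinHash near-duplicate detection (illustrative sketch).
import hashlib

def minhash_signature(text, num_perm=64, shingle=5):
    shingles = {text[i:i + shingle] for i in range(max(1, len(text) - shingle + 1))}
    sig = []
    for seed in range(num_perm):
        # Simulate num_perm hash permutations by salting a single hash with the seed.
        sig.append(min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16) for s in shingles))
    return sig

def estimated_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = "the quick brown fox jumps over the lazy dog"
b = "the quick brown fox jumped over the lazy dog"
print(estimated_jaccard(minhash_signature(a), minhash_signature(b)))  # close to 1.0 for near-dups
```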
[Camp A: Kaplan — bigger model, fewer tokens] At fixed compute, allocate budget primarily to parameters N; tokens need only be 'sufficient'. GPT-3 (175B / 300B tokens) and Gopher-280B embody this line.
[Camp B: Chinchilla — balance N and D under compute] A compute-optimal ratio of tokens/param ≈ 20 is a universal slope; models trained under it sit on the frontier. The LLaMA family and LLaMA 2 represent the open-source wing.
[Camp C: Data-mixture pragmatists — data is the first axis] At matched (N, D) budgets, data filtering and mixture differences can swamp what scaling laws predict. DCLM, phi-1, RefinedWeb, and DsDm all point the same way: data quality is an independent first-order axis.
[Camp D: Against emergence-as-magic] Most 'emergent abilities' reported by Wei et al. come from metric nonlinearities (exact match, 0-1 accuracy). Under continuous metrics, capability rises smoothly with scale — no internal phase transition is needed.
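Back-of-envelope arithmetic contrasting Camps A and B above, using the common approximation C ≈ 6·N·D and the Chinchilla-style ratio of roughly 20 tokens per parameter; the figures are rough planning numbers, not measured costs.

```python
# Kaplan-style vs Chinchilla-style budget allocation (illustrative arithmetic).
def flops(n_params, n_tokens):
    return 6 * n_params * n_tokens          # common training-FLOPs approximation

chinchilla_style = flops(70e9, 1.4e12)      # 70B params at ~20 tokens/param
kaplan_style     = flops(175e9, 300e9)      # GPT-3-style: much larger N, far fewer tokens/param
print(f"Chinchilla-style 70B / 1.4T tokens: {chinchilla_style:.2e} FLOPs "
      f"(tokens/param = {1.4e12/70e9:.0f})")
print(f"Kaplan-style   175B / 300B tokens: {kaplan_style:.2e} FLOPs "
      f"(tokens/param = {300e9/175e9:.1f})")
```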
[Camp A: Pure SSMs will eventually replace Transformers entirely] Believes that by scaling hidden state dimensions ($d_{\text{state}}$) and improving selective gating mechanisms, pure SSMs can overcome compression bottlenecks and achieve end-to-end parity with Transformers.
[Camp B: Linear Attention and SSMs are fundamentally the same] Argues that linear attention variants like RWKV and RetNet, along with Mamba, mathematically reduce to linear RNNs with different decay strategies; the differences are purely engineering.
[Camp C: Subquadratic models must be pretrained from scratch] Believes that the hidden state dynamics of SSMs fundamentally differ from Transformer attention distributions, requiring pretraining from scratch for the model to learn correct state dynamics.
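A minimal sketch of the equivalence Camp B asserts: linear attention computed as a recurrent state update S_t = γ·S_{t−1} + k_t v_tᵀ with o_t = S_tᵀ q_t, where a RetNet-style decay γ (γ = 1 recovers plain unnormalized linear attention) gives constant memory in sequence length. Dimensions are illustrative.

```python
# Linear attention written as a linear RNN over a matrix-valued state (illustrative sketch).
import torch

def linear_attention_recurrent(q, k, v, gamma=0.99):
    L, d = q.shape
    S = torch.zeros(d, v.shape[1])                 # constant-size state instead of a KV cache
    outs = []
    for t in range(L):
        S = gamma * S + torch.outer(k[t], v[t])    # decayed state update
        outs.append(S.T @ q[t])                    # readout for the current query
    return torch.stack(outs)

L, d = 16, 8
q, k, v = torch.randn(L, d), torch.randn(L, d), torch.randn(L, d)
out = linear_attention_recurrent(q, k, v)
print(out.shape)   # (16, 8): same output shape as attention, O(1) memory in sequence length
```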
[Camp A: HumanEval is enough] This view treats function-level execution benchmarks as sufficient representatives of coding ability; if HumanEval or MBPP rises, programming ability is assumed to rise with it [Chen2021Codex] [Austin2021MBPP].
[Camp B: SWE-bench Verified is the only truth] This view argues that real GitHub issue resolution is closest to software engineering, so other code benchmarks are weak proxies; final ranking should therefore follow Verified scores [SWEbenchVerified2024].
[Camp C: trajectory-level PPL is the most real pretrain-stage metric] This view tends to assume that if a model models patches, edit trajectories, or the code distribution well enough in likelihood terms, downstream SWE will come along, so pretrain metrics should center on trajectory likelihood.
[Camp D: agent UX metrics matter more than pass@1] This view argues that what users ultimately feel is interaction cost, retry gains, test stability, and trajectory quality, so pass@1 is only a small part of the picture [SelfDebug2023].
[Camp A: synthetic data-first (the Phi line)] The core claim is that high-quality synthetic and textbook-style data has much higher token density than noisy web text, so quality can partially substitute for scale, especially for small models.
[Camp B: web-heavy backbone + synthetic mid-train (the LLaMA line)] The core claim is that real web data still provides broad coverage, while synthetic data performs targeted shaping later in training. This preserves long-tail coverage while specializing the later stages of training.
[Camp C: pure web-scale, avoid synthesis as much as possible] This camp argues that the real-world distribution is already complex enough, and any synthetic data injects teacher bias and mode contraction; stronger filtering and larger crawls are the better levers.
[Camp D: unlimited synthetic scaling] This camp assumes that once the teacher is strong enough, synthetic tokens can scale almost without bound and real data will eventually be needed only as a tiny seed.
[Camp A: Scaffolding and Test-Time Compute are Everything] Through multi-agent frameworks, task graphs, and repeated sampling, inference-time compute can compensate for base model deficits to resolve complex SWE issues.
[Camp B: RL and Verifiers are the True Drivers] Large-scale RL and verifier training in executable environments are the keys to endowing models with real-world software engineering reasoning.
[Camp C: Just Mix More Code] As long as larger, permissively licensed code corpora (like The Stack) are used, the model's coding and engineering capabilities will naturally emerge.
[Camp A: Pure SSM is the ultimate solution for long context] Argues Mamba-2's SSD theory solves hardware efficiency, and pure SSMs' O(1) inference memory is sufficient to replace Transformers.
[Camp B: Hybrid is the pragmatic production path] Advocates mixing Attention and SSM at 1:3 to 1:7 ratios, using sparse Attention for recall and SSMs for high throughput.
[Camp C: Transformer + long-context extensions suffice] Argues that extending RoPE via YaRN or StreamingLLM makes Transformers sufficient for long contexts without needing SSMs.
[Camp D: RWKV is the correct RNN revival path] Sticks to 'Transformer skeleton + RNN forward', achieving linear complexity via WKV operators and matrix-valued states.
[Camp A: tokenizers are frozen preprocessing; coverage is enough] Reuse mainstream BPE/WordPiece defaults and spend optimization budget on data mixture, training recipes, and post-training; only ensure script coverage and a small set of special tokens.
[Camp B: bigger vocab is always better; push to 256K+] Treat larger vocab as near-free compression: shorter sequences reduce attention cost and gains should be monotonic; extra embedding/softmax parameters are assumed negligible.
[Camp C: tokenizer-free is the endgame; abandon BPE] Model bytes/chars/patches directly to eliminate tokenization bias and OOV, gaining robustness and multilingual fairness; let architecture absorb the complexity.
[Camp D: tokenizer is a pretrain product spec—make BPE right] Version tokenizers and regression-test them: 64K–128K vocab, single-digit numerals, standalone whitespace/newline tokens; run Magikarp and short continued pretraining for under-trained tokens.
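Back-of-envelope arithmetic for the vocab-size trade-off that separates Camps B and D above: a larger vocabulary buys shorter sequences but pays in embedding/unembedding parameters. The model width, corpus size, and assumed compression gain are illustrative.

```python
# Vocabulary growth: parameter cost vs. token savings (illustrative arithmetic).
def extra_embedding_params(vocab_old, vocab_new, d_model, tied=False):
    per_token = d_model if tied else 2 * d_model          # input embedding (+ output head if untied)
    return (vocab_new - vocab_old) * per_token

def token_savings(corpus_tokens_old, compression_gain):
    return corpus_tokens_old * compression_gain           # fewer tokens for the same text

d_model = 4096
print(f"32K -> 256K vocab, untied: +{extra_embedding_params(32_000, 256_000, d_model)/1e9:.2f}B params")
print(f"Assuming the bigger vocab compresses ~12% better on a 10T-token corpus: "
      f"{token_savings(10e12, 0.12):.1e} fewer tokens to process")
```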
[Camp A: the architecture is mostly done; just keep scaling] This reading argues that loss and capability are driven mainly by parameters, data, compute, and training recipe, while architectural details are mostly second-order; Kaplan et al. is the standard citation.
[Camp A: architecture details are mostly constants; keep a clean baseline] Loss and capability are driven mainly by parameters, data, compute, and training recipe; most architecture tweaks yield constant-factor gains while adding migration and maintenance costs.
[Camp B: the next backbone should move to SSM / RetNet / Mamba] This camp argues that Transformer state cost and long-context scaling are near their limit, and that retention or state-space backbones should replace them; Sun et al.'s RetNet is a representative.
[Camp C: the second scaling path should be default—grow first] This camp argues that an existing pretrained base is an asset, and that deepening or block insertion is often cheaper than retraining. Wang et al. [Wang2023Grow] and Yao et al. are representatives.
[Camp D: QK-Norm / sandwich norm are optional details] This camp treats stability as mainly a matter of learning rate, initialization, clipping, and data cleaning, with QK-Norm or sandwich norm viewed as recipe noise rather than default-backbone requirements.
[Camp A: Formal search first] This camp treats domain weights as first-order optimization variables that should be searched systematically, much like learning rate or batch size, rather than chosen by recipe intuition. Representative methods include DoReMi and RegMix.
[Camp B: Heuristics plus curriculum are more robust] This camp argues that public frontier recipes already provide a strong enough prior: web-heavy early, capability-heavy late, with 3–5× upweighting of code and math, plus 2–3 rounds of ablations.
[Camp C: Online adaptation beats one-shot offline search] This camp argues that domain losses evolve during training, so searching for one fixed offline w* and applying it throughout is unnatural; a better approach is to adapt weights online.
[Camp D: Ratio is secondary; quality is first-order] This camp emphasizes that strong filtering often yields larger gains than fine-grained ratio tuning; especially when raw web data is noisy, fixing quality is more cost-effective than tuning ratios.
[Camp D: Ratios are second-order; quality is first-order] When the pool is dominated by low-quality web, quality filtering/selection often yields larger gains than fine-grained mixture tuning; prioritize filters, dedup, and quality tiers before fine ratio tuning.
[Camp A: synthetic-first can be the main route] This camp argues that if synthetic tokens are clean and textbook-like enough, small models and even general models can learn denser knowledge from fewer tokens; real web data mainly serves as seed and coverage material.
[Camp B: a web-heavy backbone plus synthetic mid-train is the mainstream route] This camp uses large-scale real web/code data for the backbone, then applies mid-train to pull the model toward specialty distributions. Synthetic data acts as a densifier and refiner.
[Camp C: avoid synthetic as much as possible; stronger filtering suffices] This camp argues that most synthetic gains really come from “cleaner data distributions,” which can be achieved through ranking, proxies, and large-scale filtering without taking on teacher bias.
[Camp C: avoid synthetic as much as possible; stronger filtering suffices] This camp argues that many gains attributed to synthetic data actually come from cleaner, less redundant data distributions, and these gains can be obtained from real web pools with stronger filtering alone.
[Camp D: synthetic can scale almost without bound; collapse is avoidable] This camp emphasizes that verifiers, strong teachers, and better sampling are now enough to suppress degradation, so synthetic ratios can keep rising, and code/math experience can extend to general domains.
[Camp A: Formal search first (laws/regression/robust optimization)] Treat mixtures as predictable response surfaces or fit-able mixing laws: fit on small-scale experiments then extrapolate; or define preferences via robust objectives (worst-case domain loss).
[Camp B: Heuristics + curriculum are more robust (few ablations)] Prioritize controllable engineering: fine buckets, quality tiers, simple staged schedules, then calibrate with 2–3 ablations. Held et al. [Held2025Utility] also suggest that complex search adds little over simple heuristics.
[Camp C: Online adaptation beats one-shot offline search] Treat mixing as non-stationary: training dynamics and data inflow change optimal weights, so weights should adapt online from loss/signals rather than re-running offline search for every change.
[Camp A: Formal search first (laws/regression/robust optimization)] Treat mixtures as optimizable variables, using scaling laws, regressed response surfaces, or robust objectives to select weights systematically. The goal is to infer large-scale behavior from small proxy runs.
[Camp B: Heuristics + curricula are more robust (a few ablations)] Treat mixture as engineering control: clean the web base, bucketize finely, then tune interpretable knobs with a small number of ablations; express ratios as curricula. Public recipes serve as the prior.
[Camp C: Online adaptation beats one-shot offline search] Treat mixture as a train-time policy: dynamically adjust domain weights based on loss/generalization signals to avoid committing to a wrong one-shot ratio.
[Camp D: Ratios are second-order; quality/selection is first-order] Argues the biggest gains on real web pools come from filtering and selection: remove low-quality noise so training enters an effective regime. FineWeb operationalizes this reproducibly.
[Camp D: Ratios are second-order; quality/selection is first-order] In real web pools, the largest gains often come from filtering and selection: remove low-quality noise first to enter an effective training regime. DataComp-LM supports this regime.
[Camp A: synthetic-first can be the main route] This camp argues that high-quality synthetic or curated tokens have higher information density than ordinary web tokens, so in data-constrained settings, small models, or highly structured domains, synthesis can carry the main load.
[Camp B: a web-heavy backbone plus synthetic mid-train is the mainstream route] This camp places synthetic data after the backbone, using it to pull the distribution rather than replace real-world coverage. Llama 3, Phi-3, Code Llama, Llemma, and long-context recipes all follow this pattern.
[Camp D: synthetic can scale almost without bound; collapse is avoidable] This camp usually relies on two arguments: one from expert iteration, proof search, and verifier-backed recursive improvement, claiming that sufficiently hard feedback allows repeated self-improvement without collapse.
[Camp A: pretrain-time ABF + curriculum is the cleaner long-context path] Treat long context as a pretraining recipe: set RoPE base to match the target window scale (e.g., ~500000 for 128K) and use a short-to-long curriculum to co-adapt the data distribution and positional encoding.
[Camp A: pretraining-time ABF + curriculum is the clean long-context path] Treat long context as distribution shift: set RoPE base to the target-window scale, then use a short-to-long curriculum so training actually contains long-range supervision and matching positional coverage.
[Camp B: YaRN is the de-facto standard for 32K–128K retrofitting] Under no-repretrain constraints, YaRN packages “low-freq-only modification + stabilized long-range attention distribution” into a reusable implementation: the per-dim ramp avoids disturbing high-frequency dims.
[Camp C: ≥512K requires per-dim search (LongRoPE-like)] At 512K–2M, global smooth formulas show structural mismatch; you need to explicitly learn per-dim (and sometimes per-head) scale patterns, otherwise you get either local degradation or failure at the far end of the window.
[Camp D: bypass the PI/NTK/YaRN lineage (ALiBi/RetNet/Mamba/memory)] Avoid RoPE extrapolation patches: either change positional bias (ALiBi), change the sequence modeling architecture (RetNet/Mamba), or add external memory/compression to move long-range information out of the attention window.
[Camp A: tokenizers are frozen preprocessing; coverage is enough] Most gains come from scale, data quality, and training recipe; as long as the tokenizer avoids obvious OOV/garbage issues, it is not worth iterating during the main training cycle.
[Camp B: bigger vocab is always better; push to 256K+ by default] A larger vocab shortens sequences, reduces attention cost, and makes long context cheaper; if 128K works, we should keep pushing to 256K+ and treat compression as the primary objective.
[Camp C: tokenizer-free is the endgame; abandon BPE as soon as possible] BPE’s sampling bias, OOV, and cross-lingual unfairness are structural; byte/char modeling removes these at the root. We should migrate quickly to tokenizer-free models and rely on the architecture to absorb the longer sequences.
[Camp C: tokenizer-free is the endgame; abandon BPE ASAP] BPE has structural issues (boundary bias, cross-language unfairness, OOV/rule debt); byte/char-level modeling removes these at the root and reduces ecosystem fragmentation from tokenizer variants.
[Camp C: tokenizer-free is the endgame; abandon BPE ASAP] BPE has structural issues (sampling bias, cross-language unfairness, OOV/rule debt); modeling bytes/chars/patches removes these at the root, and sequence cost should be handled by the architecture.
[Camp C: tokenizer-free is the endgame; abandon BPE ASAP] BPE has structural issues: sampling bias, non-unique encodings, and cross-lingual unfairness. Byte/patch/pixel modeling removes OOV and segmentation bias at the root. The longer-sequence cost is the architecture's problem to absorb.
[Camp D: tokenizer is a pretrain product spec—make BPE right ] Tokenizers shape training-signal allocation, compositional generalization, and deployment debt; they should be versioned and regression-tested like the data recipe. First, stand
[Camp A: tokenizer is frozen preprocessing; coverage is enough] Treat tokenization as an implementation detail: as long as text is encodable with low OOV, focus on scaling (params/data/compute) and training recipe; tokenizer changes should not consume main training budget.
[Camp A: tokenizer is frozen preprocessing; coverage is enough] As long as the corpus is encodable with low OOV, quality is driven by scale, data quality, and training recipe; tokenizer changes should not consume main training budget beyond basic coverage checks.
[Camp B: bigger vocab is always better; default to 256K+] Larger vocab improves compression and shortens sequences with near-zero extra FLOPs at inference, especially helping multilingual and code; therefore default to larger vocab and treat compression as the primary objective.
[Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Byte-/character-level modeling avoids BPE’s language unfairness, OOV, and hand-crafted rule debt; if BPB is near parity, prioritize moving to tokenizer-free architectures [Xue2021ByT5].
[Camp E: counter-trend—shrink/prune vocab to buy alignment & RL stability] When the dominant bottleneck is alignment-stage stability/consistency (RLHF/DPO), reduce tail tokens and split heavy-domain long merges to lower non-unique encodings and train/inference mismatch.
[Camp A: the tokenizer is frozen preprocessing; coverage is enough] This camp treats the tokenizer as affecting encodability but not much else. As long as there is no OOV problem, most performance differences are assumed to come from the model, data, and recipe.
[Camp B: bigger vocab is always better; default to 256K+] This camp emphasizes shorter sequences, lower loss, and better multilingual/code behavior from larger vocabularies, and therefore treats vocabulary expansion as an almost monotonic scaling knob.
[Camp C: tokenizer-free is the endgame; BPE should be abandoned] This camp argues that BPE has structural flaws—boundary bias, language unfairness, and non-unique encodings—and should eventually be replaced by byte/char/patch models rather than patched.
[Camp E: shrink or prune the vocabulary to buy alignment and RL stability] This camp argues that pretraining may like larger vocabularies, but post-training, especially RL, may not. Tail tokens, rare long tokens, and non-unique encodings can amplify policy divergence.
[Camp A: Tokenizer is frozen preprocessing; coverage is enough] The tokenizer only needs to cover more than 99.9% of common substrings in training data, remaining characters can be represented with <unk> or bytes, and tokenizer choice has negligible effect on final quality.
[Camp B: Bigger vocab is always better; default to 256K+] Larger vocabularies shorten sequence length, reduce pretraining compute overhead, and improve model performance, so vocabulary size should be expanded to 256K+ as much as possible.
[Camp C: Tokenizer-free is the endgame; BPE should be abandoned] Inductive bias defects caused by BPE (non-unique encoding, unreasonable segmentation) cannot be fully solved by tokenizer optimization; character-level or byte-level tokenization is the fundamental fix.
[Camp D: Shrink or prune the vocabulary to buy alignment and RL stability] Low-frequency tail tokens amplify RL policy divergence; pruning 10%–20% of the tail vocabulary improves RLHF and alignment stability while reducing inference overhead.
[Camp A: tokenizer is frozen preprocessing; coverage is enough] As long as the vocabulary covers major characters and common subwords, tokenizer effects average out at scale; effort is better spent on data, steps, and architecture, and tokenizer changes should stay off the critical path.
[Camp B: bigger vocab is always better; default to 256K+] Larger vocab yields shorter sequences and lower loss, so default to very large vocabularies (256K+), treating tokenization as a compression problem; tail issues are negligible or will wash out at scale.
[Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Subword tokenizers inject human inductive biases and cross-language unfairness and admit ambiguous encodings; modeling directly over bytes/characters removes these issues and learns segmentation implicitly.
[Camp E: shrink/prune the vocabulary to buy alignment and RL stability] During post-training (especially RL), tail tokens amplify policy divergence and instability; proactively pruning the tail or transferring to a smaller vocabulary yields more stable post-training.
[Camp A: kernels and algorithms must be co-designed (decide by bytes/FLOPs)] Treat decode KV bandwidth, attention IO dataflow, and low-precision scaling contracts as architecture inputs; structural changes (GQA/MLA), kernel forms (FlashAttention), and numeric formats are chosen together.
[Camp B: PyTorch/graph-compiler level is enough; handwritten kernels are unnecessary] As long as the model is expressed as a high-level operator graph, the system will automatically select/generate efficient kernels across hardware; teams should focus on model and data.
[Camp C: hardware will get faster; algorithms need not adapt] Hardware provides better low precision and bandwidth, so training can be stable with minimal changes; focusing on model innovation is more cost-effective than sweating kernel/numerics details.
[Camp D: non-NVIDIA hardware will catch up; CUDA ecosystem will stop being a moat] Low-precision formats and kernel stacks will mature on other platforms, so teams should avoid binding algorithms to CUDA-specific features and instead pursue portable operator ecosystems.
[Camp A: hand-tuned 4D (Megatron / MegaScale style)] Treat parallel dimensions as a topology mapping problem: TP/EP on NVLink islands, PP across IB, DP/FSDP across pods; use explicit shapes and scheduling (e.g., zero-bubble) to keep MFU in
[Camp A: hand-tuned 4D with topology-aware mapping (Megatron/MegaScale)] Encode primitive frequency into mesh and topology: keep TP/EP within NVLink islands, use PP over IB, DP/FSDP across pods; deliver reproducibility via MFU plus topology details.
[Camp B: auto-parallel (Alpa / GSPMD / pjit)] Delegate sharding to compilers/runtimes via search or cost models to generate TP/PP/DP plans, lowering the engineering bar of hand-tuned 4D and approaching hand-tuned throughput at moderate scale.
[Camp C: FSDP-only is enough (low intrusion first)] Use mostly DP/FSDP (ZeRO/FSDP family) to address memory and scaling, keeping complexity in the framework layer and minimizing intrusion into model code and operator implementations.
[Camp C: FSDP-only / ZeRO-only (low-intrusion first)] Use DP/FSDP (or ZeRO family) as far as possible to address memory and scaling, keeping complexity in the framework; rely on overlap, structure awareness, and runtime optimizations to reach competitive throughput.
[Camp D: 3D (DP+TP+PP) is enough; SP/CP optional] Stick to the mature 3D recipe: TP for intra-op parallelism, PP for depth sharding, DP for throughput; enable long-context extra dimensions (SP/CP) only in extreme cases.
[Camp D: classic 3D (DP+TP+PP) is enough; SP/CP are optional] Stick to the mature 3D recipe: TP for intra-operator parallelism, PP for depth partitioning, DP/FSDP for throughput; long context should rely more on faster attention kernels or
[Camp A: generalists should push code past >40%; more is always better] Argues code is the primary fuel for general capability: with enough code, reasoning and tool use should increase monotonically; NL regression either won’t happen or can be cheaply repaired.
[Camp B: code helps mostly via optimization/regularization (lower effective LR)] Claims code’s main value is a more compressible, lower-noise gradient signal: under the same compute it trains more stably and overfits less, so reasoning gains are a byproduct.
[Camp C: keep code <10% to protect NL; generalists should not chase code-heavy mixes] Treats code as a strong distributional bias: once its share rises, it displaces world knowledge and natural language coverage, degrading dialogue, writing, and commonsense. Hence keep code below 10%.
[Camp C: keep code <10% to protect NL in generalists] Treats code as a strong distributional bias that displaces natural language coverage, causing style drift, narrower knowledge, or weaker dialogue; thus generalists should minimize code and keep it near 5–10%.
[Camp D: code ability must be trained from scratch; continual training is risky] Argues continual training will unpredictably damage existing language capabilities or alignment properties, especially when adding a large new distribution (code). Therefore one should train code specialists from scratch.
[Camp A: Positional extrapolation is enough to achieve long context] Effective long context can be achieved purely through positional-encoding optimizations such as RoPE extrapolation, without additional training on long data, greatly reducing the cost of extension.
[Camp B: The data recipe is the main variable] The core of long-context expansion is optimizing the length distribution, domain distribution and burstiness of training data, with positional encoding only as an auxiliary optimization item.
[Camp C: Packing engineering is the under-exploited lever] Optimization of sequence concatenation, truncation control, and packing strategies can greatly improve effective context without increasing training compute and data costs.
[Camp D: Switch architectures (SSM / linear) to bypass positional encoding] The quadratic attention complexity of Transformers is the core bottleneck for long-context expansion; switching to linear architectures such as Mamba achieves more efficient long-context scaling.
[Camp A: Formal search first (laws/regression/robust optimiza] Advocates fitting mixture→loss/downstream performance laws from small runs and extrapolating to target scale to reduce large sweeps; treats ratios as optimization variables and
[Camp B: Heuristics + curricula are more robust (a few ablati] Argues for getting data engineering right first (filtering, dedup, bucketization), then using a small number of controls and staged curricula to allocate scarce domains; views c
[Camp C: Online/adaptive mixing beats one-shot offline search] Argues for dynamically reweighting during training based on task/gradients/importance sampling to avoid committing to a wrong fixed ratio; especially relevant for specialist goals or distribution shift.
[Camp D: Ratios are second-order; quality/selection is first-] Argues for defining “usable data” first: filtering/selection/detox/denoise often yields more direct gains; when low-quality web dominates, fine ratio tuning is drowned by noise.
[Camp A: quality classifiers + ablation ladders are sufficient] Treat data-value assessment as “repeatable engineering experiments”: bulk filtering via classifiers/perplexity/dedup, then ladders/mixture sweeps to validate the net gain of each data decision.
[Camp B: influence/attribution is the main path (infer data r] Argues that “data value” is fundamentally example-level contribution ranking; start with influence/attribution to find critical examples/clusters, then edit/reweight around them
[Camp B: example-level influence/attribution is the main path] “Data value” is fundamentally an example-level contribution ranking: use influence/attribution to find key examples and clusters, then do targeted curation or reweighting to red
[Camp C: full causal inference is the future (solve confounding)] Argues confounding between domains/features and downstream capabilities is too strong; correlational methods (classifiers, mixture sweeps, influence) are easily misled by common causes.
[Camp C: full causal identification is the future (IV/mediators)] Confounding between domain/features and downstream capabilities is too strong; classifiers, mixture sweeps, and influence are all contaminated by distribution shift. Data selection should rest on causal identification.
[Camp D: skip measurement, rely on intuition and scale (scaling-law planning)] Argues data-evaluation tooling is expensive and unstable (especially across scales). Prefer scaling-law planning: maximize tokens and coverage; use repetition or broader mixtures to hedge.
[Camp D: skip measurement, rely on intuition and scale (coverage first)] Measurement toolchains are expensive and unstable (especially across scale). Follow scaling laws: maximize tokens and coverage, use repetition when needed to hedge filtering bias.
[Camp A: FA is largely the endpoint of attention engineering] On Hopper, FA3 [Shah2024FA3] already pushes dense attention close to compute-bound; further hand-tuned kernels often yield only 10–20% marginal gains while increasing maintenance burden.
[Camp B: the main line is Triton/FlexAttention—move optimization to the compiler level] The long-term cost is not peak kernel throughput but variant explosion and maintenance: ALiBi/SWA/document masks/soft prompt masks/MoA keep appearing. FlexAttention [Dong2024Flex] absorbs this variant explosion at the compiler level.
[Camp C: attention itself should be replaced (SSM / sparse / linear)] Dense attention’s quadratic cost and KV cache keep long-context inference expensive; therefore structural replacements (sparse attention like Longformer [Beltagy2020Longformer], SSMs, and linear attention) deserve the investment.
[Camp D: FA3 embodies NVIDIA lock-in; avoid binding critical paths to it] FA3 [Shah2024FA3] heavily depends on Hopper-specific asynchrony and instruction features; Luo et al. [Luo2024HopperDissect] indicates these are not “minor details”. Betting critical paths on it is a portability risk.
[Camp A: explicit positional encoding is required; RoPE extrapolation is the main path] Argues positional extrapolation is the core: use PI/YaRN to reach 128K, then LongRoPE to stabilize 1M+; evaluation and data recipes are secondary and do not change the main path.
[Camp B: SP and DP are orthogonal; scaling to million tokens is a systems problem] Claims long-sequence training is primarily a systems problem: FlashAttention-2 improves single-GPU efficiency, Ring/Ulysses split the sequence dimension, and the rest is standard data parallelism.
[Camp C: NIAH/perplexity is sufficient; task benchmarks are too noisy] Argues long-context training should rely on perplexity and NIAH: reproducible, cheap, fast iteration; task benchmarks like LongBench/RepoQA are prompt- and dataset-biased, and thus secondary.
[Camp C: perplexity/NIAH are sufficient; task benchmarks are too noisy] Use perplexity and NIAH as primary metrics: cheap, reproducible, fast iteration; task benchmarks like LongBench/RepoBench are sensitive to prompting, dataset bias, and leakage, and thus secondary.
[Camp D: alternative architectures (sparse/SSM/linear attention)] Claims full attention’s O(L^2) is unsustainable; we should move to subquadratic architectures (sparse attention, compressed memory, SSMs) for “native unlimited context.”
[Camp D: dense O(L^2) attention is unsustainable; alternatives are needed] Move to sparse attention, compressed memory, external memory, or attention-free/SSM-like architectures for “naturally infinite context,” rather than pushing Transformer dense windows further.
[Camp A: MoE becomes the default backbone (dense remains for ] Argues conditional compute increases total capacity at fixed training FLOPs; as templated recipes and reproduction platforms mature, stability and engineering barriers drop, making MoE the default backbone.
[Camp B: dense wins on full-lifecycle ROI (especially under upcycling)] Argues MoE advantages are amplified by scratch-training assumptions; in practice, teams reuse existing dense weights and post-training assets. Upcycling scaling laws show ceilings for upcycled models.
[Camp C: learned routing/balancing is overrated (random/frozen routers are close)] Claims many MoE gains come from capacity effects (more total params with sparse activation) rather than sophisticated routing; in some settings, random/frozen routers can approach learned ones.
[Camp C: learned routing/balancing is overrated (random/frozen routers are close)] Claims the benefit of learning routers is limited: many gains come from larger total parameters and better systems; in some settings frozen random routers can be close to learned ones.
[Camp C: learned routing/balancing is overrated; random/frozen routers are close] Many MoE gains come from total-parameter capacity under sparse activation rather than sophisticated routing; in some settings random/frozen routers approach learned routers on validation loss.
[Camp D: MoE is mainly for pretraining; post-training should go dense] Argues MoE’s value is concentrated in pretraining efficiency; post-training (SFT/DPO/RLHF) prioritizes stable optimization and controllability, and sparse routing adds noise and instability.
[Camp D: MoE is mainly for pretraining; post-training should go dense] SFT/DPO/RLHF prioritize stable optimization and controllable throughput; sparse routing adds noise and systems complexity, so use dense training + sparse inference, or migrate post-training onto a densified model.
[Camp B: YaRN is the de-facto standard for 32K–128K retrofitting] Without changing pretraining, the most reliable extension is “move low-frequency, keep high-frequency,” plus temperature compensation for long-range attention distributions. YaRN packages exactly this.
[Camp C: ≥512K needs LongRoPE-style per-dim search; global formulas break down] At 512K–2M, errors are no longer dominated by “phase overflow” but by per-dim mismatch: different frequency dimensions need different extrapolation behavior, and a single ramp/scale cannot fit them all.
[Camp C: ≥512K needs LongRoPE-style per-dim search/learning; global formulas break down] At 512K–2M, the dominant error is per-dim mismatch rather than phase overflow: different frequency bands need different extrapolation, and a single ramp/scaling causes both under- and over-extrapolation.
[Camp D: bypass RoPE (Mamba / length-extrapolatable Transform] RoPE extrapolation and quadratic attention make costs hard to control; instead use linear/sparse/state-space models or external memory to compress/carry history, enabling train-
[Camp A: PPL remains the most reliable primary variable (at l] Argues pretraining loss/PPL should remain central for scaling and engineering decisions: it is cheap, low-noise, and extrapolatable; in data-constrained regimes or fast iteratio
[Camp B: PPL is only stage-1 inside a task-scaling pipeline] Argues “continue training / model selection” should be modeled as per-task scaling: PPL monitors training, but final decisions should be driven by task-curve extrapolation with uncertainty estimates.
[Camp C: stop searching for a scalar; define quality via standardized panels] Argues “quality” is inherently multi-axis: general capability, long context, tool use, alignment behavior, embedding/retrieval, etc., should not be collapsed into one number. Use standardized multi-axis panels instead.
[Camp D: the issue is ontological—next-token loss is not the right objective] Argues uniform next-token loss does not correspond to the value of “useful tokens”: tokens contribute unequally to capability, and averaging dilutes key learning signals, so a single scalar misleads.
[Camp A: µP (upgraded to Complete-P) should be the default; formulas are a fallback] Argues parameterization should be treated as a transferable protocol: enforce coord check as a hard acceptance gate, prioritize zero-shot transfer of LR/init across width (then depth).
[Camp B: empirical formulas + small sweeps are enough; µP is unnecessary] Treats HP transfer as a statistical fitting problem: under a fixed SP recipe, use empirical formulas or joint scaling laws to produce starting points for LR/batch/training duration.
[Camp B: empirical formulas + small sweeps are enough; µP is unnecessary] Under a fixed SP recipe, empirical formulas and joint scaling laws can directly provide starting points for LR, batch, and token:param, and 1–2 small corrective sweeps can get close to optimal.
[Camp C: end-to-end Bayesian/automatic optimization will replace manual transfer] Argues to reduce manual HPs and transfer assumptions: use automation (automatic GD, warm-start BO, even optimizer search) to optimize directly at target scale, avoiding extrapolation error.
[Camp D: transfer is dominated by non-transferable HPs; wd/β₂] Argues to re-evaluate the gains of µP/formulas: in real training, wd, β₂, and batch can dominate the optimal LR via stability boundaries and regularization mechanisms, so “trans
[Camp A: Dedup as much as possible (treat repetition as noise)] Advocates aggressive dedup (from exact to semantic near-dup), viewing repetition as wasted compute plus leakage/contamination risk; prioritize removing repetition in web corpora.
[Camp B: Uniform repetition ≤4 epochs is almost free (treat repetition as a lever)] Argues that when high-quality tokens are limited, simply run uniform multi-epoch repetition, expecting near-fresh-token gains up to ~4 epochs; in engineering, prioritize fully uniform exposure.
[Camp C: Semantic dedup is the real battleground (exact dedup is not enough)] Claims that semantic near-dup (paraphrases/reposts/template rewrites) is the dominant redundancy source at web scale; exact/MinHash leaves many tokens that repeat information but differ in surface form.
[Camp C: Semantic dedup is the main battleground (exact/MinHash is not enough)] At web scale, the dominant redundancy comes from rewrites, repackaging, and semantic near-duplicates within topic clusters; exact/MinHash leaves many “same information, different surface” duplicates.
[Camp D: Zero repetition for sensitive/eval/copyright data (treat as a separate tier)] Advocates near-zero exposure for benchmarks, PII, and copyrighted content: keep them out of the main training pool and out of multi-epoch; rely on strict filtering/dedup to ensure this.
[Camp A: AdamW won’t be retired (highest default priority)] Treat the optimizer as production infrastructure: prioritize reusable recipes, predictable failure modes, and robustness under fixed A/B budgets. Even if lower-loss methods exist, the operational risk of switching outweighs the marginal gain.
[Camp B: Muon is the next default (but only as a hybrid)] Constrain “second-order gains” to the parameter subset that benefits most: hidden 2D weights. By keeping embeddings/norms/heads on AdamW, Muon isolates instability and tuning risk while keeping most of the gain.
[Camp B: Muon is the next default (but only as a hybrid)] Localizing second-order structure to hidden 2D weights is the deployable compromise: gains focus on the most matrix-like parameters while embeddings/norms/heads stay on AdamW to avoid regressions.
[Camp C: Shampoo/SOAP is the canonical endgame (second-order preconditioning)] Argues diagonal adaptivity (AdamW) is fundamentally limited on ill-conditioned layers; Kronecker-structured second-order preconditioning is closer to the “right geometry” [Gupta2018Shampoo].
[Camp C: Shampoo/SOAP is the canonical endgame (second-order preconditioning)] Diagonal adaptivity is inherently limited on ill-conditioned, strongly coupled layers; Kronecker-structured preconditioning is closer to the right geometry. SOAP compresses second-order structure into an Adam-like update in the preconditioner's eigenbasis.
[Camp D: optimizers matter less; many gains are evaluation artifacts] Claims many optimizer-paper gains vanish under fair comparisons: schedules, dependence on stopping step T, and HP-search budgets are the real drivers. Standardize evaluation protocols first.
[Camp A: short-to-long + per-doc masking (default engineering choice)] Treat document boundaries as a hard objective constraint: packing targets throughput but must use per-doc causal masking; use short-to-long to push long-window compute to the tail of training.
[Camp A: per-doc masking + short-to-long (default engineering choice)] Treat document boundaries as a hard objective constraint by default: pursue high utilization via packing without cross-doc visibility; use short-to-long with a late-stage long-context tail.
[Camp B: uniformly mixed-length training (anti-curriculum, always mixed)] Advocates mixing sequence lengths throughout training (rather than adding long windows only at the end), arguing curricula can overfit to length distributions and that long-context ability should be trained throughout.
[Camp C: naive concat + cross-doc visible (cross-boundary by default)] Argues pretraining should directly cover cross-document context: concatenation trains the model on distributions closer to prompting/long-form generation, and per-doc masking may discard useful cross-document signal.
[Camp D: FIM for everything (infilling as a universal default)] Claims FIM/denoising objectives add little cost to left-to-right modeling while covering more task forms (editing, infilling, instruction following), so they should be enabled by default.
[Camp A: Kaplan-style—portable exponents, compute-optimal favors parameters] Argues joint power laws provide extrapolatable exponents for loss vs N/D/C, implying that under fixed compute one should prioritize increasing parameter count; in this framing, training large models on relatively few tokens is compute-efficient.
[Camp B: Chinchilla-style—balance N and D under fixed compute] Advocates IsoFLOP and extrapolation to place compute-optimal near tokens/param≈20, making “more tokens, smaller model” the default recipe; engineering replications like LLaMA adopt this line.
[Camp C: Data-mixture pragmatists—data is the first axis; get the data recipe right] Treats filtering/dedup/quality estimation/mixture design as first-class optimization variables independent of N and tokens; at fixed budgets, data recipe differences can dominate.
[Camp C: Data-mixture pragmatists — get the data recipe right] Filtering, dedup, freshness, and mixtures are first-class variables independent of N and D; under fixed budgets, data recipes can rival or exceed the gains from a parameter-scale jump.
[Camp D: Against emergence-as-magic—many “emergent” effects come from metrics] Prioritizes explaining emergence via non-linear evaluation metrics (thresholding by 0-1 scores) and pipeline noise rather than sudden internal capability; recommends de-thresholding metrics before claiming emergence.
[Camp A: HumanEval/MBPP is sufficient to represent coding ability] Argues HumanEval [Chen2021Codex] (and multilingual variants) should be the core metric: it is cheap, stable, reproducible, and tightly tied to “can write a correct function”; thus it should anchor code-ability evaluation.
[Camp B: SWE-bench Verified is the only trustworthy ground truth] Claims real GitHub issue resolving is the target distribution, and Verified [SWEbenchVerified2024] reduces noise via human validation, so other benchmarks should be secondary; models should be ranked by Verified.
[Camp B: SWE-bench Verified is the only trustworthy ground truth] Only real GitHub issues with patch+tests resemble engineering tasks; Verified reduces noise via human validation, so other benchmarks (function problems, execution-semantics tasks) are weak proxies.
[Camp B: SWE-bench Verified is the only trustworthy ground truth] Real GitHub issue resolving is the target distribution; Verified reduces noise via human validation, so models should be ranked by Verified, with other benchmarks only as auxiliary.
[Camp C: trajectory-level PPL (e.g., patch-PPL) is the most real pretrain-stage metric] Argues pretrain should prioritize optimizing/evaluating the likelihood of editing trajectories, because real SWE is a sequence of patches and edits; thus patch-PPL is closer to the target ability than function-level pass rates.
[Camp D: deployment UX metrics reflect SWE-agent value better] Claims user experience and cost structure (retries, token usage, tool failures, trajectory readability) determine real value; even with the same Verified score, differences in tokens, retries, and cost still separate models.
[Camp A: Pure SSMs will eventually replace Transformers] Claims recurrent state is sufficient for long-range dependencies in language modeling; with better implementations and scale, attention’s O(L^2) structure becomes legacy overhead. Typical representatives: Mamba and its successors.
[Camp B: Linear attention and SSMs are fundamentally the same] Emphasizes “everything is recurrence”: linear attention can be written as RNN state updates, and SSMs can be viewed as structured implementations of certain attentions; the differences are engineering choices.
[Camp C: Subquadratic models must be pretrained from scratch; distillation is not enough] Argues backbone changes break learned representations and optimization trajectories; distillation aligns only short-range behavior, while long-context and algorithmic capabilities do not transfer.
[Camp D: The engineering optimum is hybrid; minimize attention] Treats attention as a “precise addressing module” and SSM/recurrence as a “linear routing module,” inserting attention sparsely to retain most quality while substantially reducing compute and memory cost.
[Camp A: algorithms and kernels must be co-designed (bytes/FL] Treat roofline and numeric contracts as design inputs: classify memory/compute/latency bounds first, then choose structure and dataflow; treat low precision and fused-kernel num
[Camp A: algorithms and kernels must be co-designed (bytes/FL] Treat roofline and numeric contracts as architecture inputs: classify memory/compute/latency bound first, then choose structure (MQA/GQA/MLA), dataflow (FlashAttention), and low
[Camp B: PyTorch/graph-compiler level is sufficient; handwrit] Prioritize productivity and maintainability: use FSDP, compiler IR (MLIR), and fusion/autotuning to cover most needs, avoiding critical paths being gated by a small number of ke
[Camp C: hardware will get faster; algorithms need not adapt ] BF16 adoption historically required limited algorithmic intrusion: with master weights and appropriate loss scaling, many models train successfully [Kalamkar2019BF16Study]. So o
[Camp D: non-NVIDIA hardware will catch up; CUDA will stop be] Numerics and kernel autotuning are not CUDA-exclusive: HiFloat8 shows alternative ecosystems can propose new 8-bit contracts [Luo2024HiFloat8]; OpenCL/CUDA dual-stack and autotu
[Camp B: auto-parallel (Alpa / GSPMD / pjit line)] Delegate parallel plans to compiler/search: use cost models to decide sharding, PP partitioning, and placement on computation graphs, reducing manual tuning and code intrusion, and adapting
[Camp B: auto-parallel (Alpa / GSPMD / compiler-search)] Compile parallel plans: with minimal sharding annotations, a cost model + search decides sharding, PP partitioning, and placement, adapting plans to topology changes; the goal is to a
[Camp C: FSDP-only is enough (low intrusion first)] Prefer FSDP/ZeRO to shard params, grads, and optimizer states, avoiding high-intrusion dimensions like TP/PP/CP; rely on overlap, structure-aware execution, and runtime optimizations to re
[Camp D: 3D (DP+TP+PP) is enough; SP/CP are optional optimiza] Keep system complexity within 3D: use TP+PP for model scale and DP/FSDP for throughput; SP/CP are only needed for extreme long context, and most cases can be handled by better a
[Camp A: positional extrapolation is enough; long context is ] If positional encoding / positional bias is designed well, capability extrapolates to longer inputs without changing training data; long-context degradation mainly comes from Ro
[Camp B: the data recipe is the main variable; long-doc upsam] Effective context comes from how often training contains events where far-away tokens are necessary to reduce loss; prioritize long-doc upsampling, length curriculum in continue
[Camp B: data recipe is the main variable; long-doc ratio and] Effective context comes from how often training contains events where far tokens are necessary to reduce loss; prioritize long-doc upsampling, length curriculum, continual-pretr
[Camp B: the data recipe is the main variable; long-document ] This camp argues that effective context comes from how often training contains events where reducing loss requires using distant tokens; therefore long-document upsampling, leng
[Camp C: packing/concatenation is under-exploited; sequence c] The key is not just longer single documents, but frequent exposure to cross-document reference, repetition, and alignment during training; related-doc retrieval+clustering conca
[Camp C: packing / concatenation is underestimated; sequence ] This camp argues that the key is not merely making single documents longer, but making training frequently contain cross-document reference, repetition, and alignment; related-d
[Camp D: switch architectures (sparse / memory / recurrent) t] Transformer attention and KV-cache mechanics constrain long-range read/write budget allocation, preventing effective context from scaling linearly with window size; use sparse a
[Camp D: switch architectures to bypass Transformer long-rang] Transformer attention’s quadratic cost and KV-cache shape constrain long-range read/write budgeting; use sparse attention, recurrent memory, or linear/hybrid architectures to re
[Camp A: generalists should push code past >40%; more is alwa] Treats code as a strong prior for general reasoning: more code keeps improving structured reasoning and tool-use, while NL losses can be recovered by larger models or later alig
[Camp B: code helps mostly via optimization/regularization (l] Views code as “cleaner, more compressible” tokens: it reduces gradient noise and stabilizes optimization, indirectly improving downstream; thus code-fraction debates should focu
[Camp D: code ability must be trained from scratch; continual] Argues continual is inherently unstable under strong distribution shifts like code: it either forgets NL or drifts unpredictably; thus one should train coders from scratch with
[Camp A: FlashAttention is largely the endpoint of attention ] This camp argues that the exact-attention kernel line has largely converged: use FA2/FA3 for training [Dao2023FA2][Shah2024FA3], FlashInfer for serving [Ye2024FlashInfer][Ye2025
[Camp A: FA1/FA2/FA3 already solved the core; what remains is] The dense exact-attention kernel line has converged: pick FA2/FA3 by hardware for training [Dao2023FA2][Shah2024FA3], and use an engine like FlashInfer for serving [Ye2024FlashI
[Camp B: The main line should move from hand-written CUDA to ] This camp argues that the future lies not in writing more specialized CUDA but in handing attention semantics to the compiler. FlexAttention [Dong2024Flex] provides the unified
[Camp C: Attention itself should be replaced by SSM, linear, ] This camp argues that further attention-kernel optimization is patching an old paradigm, and that the real move is to replace attention with lower-complexity structures for long
[Camp D: FA3 embodies NVIDIA generation-specific lock-in, so ] This camp worries that FA3 [Shah2024FA3] depends too heavily on Hopper TMA and warp specialization, raising migration cost and shortening maintenance lifetime, especially for mu
[Camp D: FA3 implies generation lock-in; critical paths shoul] FA3 [Shah2024FA3] depends heavily on Hopper’s TMA and warp specialization, increasing cross-architecture migration cost and shortening the maintenance window; in multi-cloud or
[Camp A: classifiers + ablation ladders are sufficient (bet o] Treat data engineering as a regression-testable experimental system: use proxies/classifiers for bulk filtering, send candidate changes into ladders, and accept/reject via per-c
[Camp A: classifiers + ablation ladders are sufficient (contr] Treat data engineering as a regression-testable experiment system: use bulk filtering to make candidates trainable, then accept decisions via ladders under fixed budgets. Influe
[Camp B: influence/attribution is the main path (infer recipe] Believes example-level attribution can directly answer “which data drives which capabilities”, reducing the need for broad sweeps/ablations: use influence/attribution to find ke
[Camp C: full causal identification is the future (solve conf] Argues the core difficulty is confounding: document features, domain, quality, length, repetition, etc. are entangled, so correlational methods pick features that look effective
[Camp D: skip measurement, rely on intuition and scale (scali] Believes marginal gains from fine-grained data valuation are unstable; the most reliable lever is scaling params/data/steps while maintaining coverage. Under data scarcity, repe
[Camp A: Formal search first (laws/regression/robust optimiza] Treat mixtures as a modelable response surface or a robust optimization problem: fit mixture→loss/utility from few points and extrapolate to larger scales; or directly optimize
[Camp B: Heuristics + curricula are more robust; a few ablati] Treat mixtures as engineering recipes: ensure data quality and coverage first, then use a small number of reversible phase-wise reweightings to patch capabilities. Public recipe
[Camp C: Online/adaptive mixing beats one-shot offline search] Put weight learning into the training loop: dynamically reweight using proxy or in-training signals to reach the same loss faster or improve worst-bucket behavior. DoReMi learns
[Camp D: Ratios are second-order; quality/selection is first-] Argues the main gains come from filtering/selection: high-quality web-only can outperform “web + curated mixtures,” making ratios a detail. RefinedWeb and FineWeb emphasize filt
[Camp A: native long-context (ABF + curriculum) is the cleane] Treat long context as a training problem of “frequency budget + distribution alignment”: set base to the target length and use short-to-long curricula so long-range dependencies
[Camp B: YaRN, not PI, should be the default for 32K–128K ret] PI’s global interpolation compresses high-frequency dimensions and reduces local relative-offset resolution; YaRN’s per-dim ramp concentrates interpolation on low-frequency dims
[Camp B: for 32K–128K retrofits, default to YaRN, not PI] PI’s global interpolation compresses high-frequency dimensions and reduces local relative-position resolution; YaRN’s per-dim ramp pushes interpolation mostly into low-frequency dime
[Camp C: ≥512K needs LongRoPE-style per-dim search; global fo] At 512K–2M, error patterns diverge across frequency bands, so a single global scaling cannot satisfy both low-frequency phase coverage and high-frequency local resolution; LongR
[Camp D: bypassing RoPE/attention (SSM/external memory/retrie] Treat “long context” as a systems problem: use linear-time models (Mamba), recurrent/external memory (RMT, Infini-attention), or retrieval augmentation to avoid O(n^2) attention
[Camp A: explicit positional design is required, and RoPE ext] This camp treats length generalization as primarily a positional-modeling problem, so engineering effort should focus on RoPE scaling, interpolation, and related stabilization t
[Camp B: scaling to million tokens is mostly a parallelism pr] This camp treats long context mainly as a training-systems problem, arguing that once kernels, model parallelism, and sequence parallelism are in place, window scaling is mostly
[Camp C: NIAH/perplexity are sufficient; task benchmarks are ] This camp prefers simple, repeatable proxies and argues that needle tests and perplexity are sufficient for long-context ability, while complex task benchmarks are too affected
[Camp D: sparse, memory, or linear-attention alternatives wil] This camp argues that the quadratic cost of full attention is ultimately unsustainable, so the field should move to sparse attention, memory mechanisms, or alternative architect
[Camp A: MoE becomes the default backbone (dense kept for sma] Argues MoE should be the mainstream scaling path: at similar training compute, use larger total parameters for stronger capability; the key is templated balancing, congestion c
[Camp B: dense wins on full-lifecycle ROI (especially under u] Argues MoE gains must be ledgered by phase: dense→MoE conversion saturates early, and stabilization plus systems overhead is non-trivial; thus a steadier strategy is to keep tra
[Camp D: MoE is mainly for pretraining; post-training should ] Argues MoE instability and systems complexity should be confined to pretraining: post-training (SFT/RLHF/alignment) benefits more from stable throughput and controllable optimiz
[Camp A: Complete-P should be the default starting point, and] This camp argues that once the network family is specified correctly and module scaling rules are completed, base hyperparameters should transfer stably across width and partly
[Camp B: empirical formulas plus small sweeps are enough; µP ] This camp emphasizes the reality of fixed recipes: if proxy runs match the target-scale distribution, joint scaling laws can provide good initial values for LR, batch, and token
[Camp C: end-to-end BO / automatic optimization will replace ] This camp argues that since real training stacks have complex hyperparameter couplings, it is better to let automatic optimizers or BO learn update rules or search strategies on
[Camp D: transfer error is dominated by non-transferable hype] This camp argues that many cases of 'LR not transferring' are actually caused by failing to model wd, β₂, norm control, and related variables jointly. Under AdamW in particular,
[Camp A: Dedup as much as possible; repetition is mostly nois] This line treats repetition in web corpora as low-value tokens. The first objective is to remove exact, near-exact, and long-substring duplication to improve token efficiency wh
[Camp B: Uniform repetition up to about 4 epochs is close to ] This line is based on data-constrained training results: when unique tokens are insufficient, the first few uniform passes can still deliver returns close to fresh tokens, so it
[Camp C: The real battleground is semantic dedup; exact dedup] This line argues that at web scale, the larger redundancy source is not verbatim duplication but semantic near-duplicates, templated rewrites, and same-cluster content, so embed
[Camp D: For sensitive/eval/copyrighted data, zero repetition] This line treats repetition as a compliance and leakage risk rather than a token-efficiency issue. Benchmark contamination, PII leakage, and copyrighted-text memorization are be
[Camp A: PPL remains the most reliable primary variable (at l] With a stable training recipe, validation loss/PPL scaling is stable enough to drive budgeting, model selection, and even most pre-release comparisons; downstream evaluation is
[Camp B: PPL is stage-1 only; stage-2 must use per-task scali] Use PPL in-loop for process and budgeting; product decisions (train more/overtrain/compress/ship) should be driven by per-task curves and extrapolation error.
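As a concrete illustration of the stage-2 workflow this camp describes, the sketch below fits a saturating power law to per-task ladder results and extrapolates one budget step ahead; the functional form, the ladder numbers, and the grid of exponents are illustrative assumptions, not a prescribed recipe.

```python
import numpy as np

# Hypothetical ladder results: training compute (FLOPs) vs. task accuracy (toy values).
compute = np.array([1e20, 3e20, 1e21, 3e21, 1e22])
accuracy = np.array([0.32, 0.38, 0.45, 0.51, 0.56])

def fit_saturating_power_law(C, acc, alphas=np.linspace(0.05, 0.5, 46)):
    """Fit acc ~ a - b * C^(-alpha) by grid-searching alpha and solving a, b linearly."""
    best = None
    for alpha in alphas:
        X = np.stack([np.ones_like(C), -C ** (-alpha)], axis=1)  # columns correspond to a, b
        coef, *_ = np.linalg.lstsq(X, acc, rcond=None)
        sse = np.sum((X @ coef - acc) ** 2)
        if best is None or sse < best[0]:
            best = (sse, alpha, coef)
    return best

sse, alpha, (a, b) = fit_saturating_power_law(compute, accuracy)
# Extrapolate to a 3x larger budget and inspect the marginal gain before committing compute.
c_next = 3e22
print(f"alpha={alpha:.2f}, predicted acc at 3e22 FLOPs: {a - b * c_next ** (-alpha):.3f}")
```

The decision variable is then the predicted marginal gain at the next budget, weighed against its extrapolation error, rather than upstream PPL alone.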
[Camp C: stop searching for a scalar; define quality via stan] LM quality is multi-dimensional: capability, robustness, safety, bias, privacy cannot be summarized by one scalar; evaluation should be institutionalized as standardized scenari
[Camp D: the issue is ontological—next-token loss is not the ] Many product goals (dialog safety, instruction following, avoiding repetition, functional correctness) are not natural outcomes of MLE; change the objective (preference optimiza
[Camp B: always-mixed lengths (anti-curriculum)] The length distribution should cover the target window from the start to avoid baking short-length statistics as the default; mixed sampling improves length robustness and reduces distributio
[Camp C: cross-doc visibility by default (beyond-boundaries c] Align training distribution with inference-time context stitching: concatenation trains the model to condition across paragraphs/documents during pretraining, better matching in
[Camp D: FIM/denoising-style objectives as default (infilling] Pure causal LM is not the only reasonable default: mixed objectives (including infilling/span corruption) cover more task shapes and reduce the “only continuation” bias; FIM sho
[Camp A: Pure SSM will be the endgame for long context] Argues stronger SSM parameterizations and kernels will fully replace attention: constant-memory inference and high training throughput via scan/matmul optimizations; the main long-cont
[Camp B: Hybrids are the production default (tunable 1:3 to 1] Treat attention as a scarce resource: keep a few attention layers for discrete retrieval/routing and use Mamba/SSM elsewhere for linear throughput; tune ratios by context length
[Camp C: No architecture change—Transformer + long-context ex] Reduce cost via sparse/windowed/approximate attention and positional encoding; attention’s addressable interaction is the core capability and should not be weakened.
[Camp D: RWKV/linear RNN is the correct RNN revival path] Recurrence offers the cleanest deployment advantage: constant-memory and native streaming; training can be parallelized to avoid classic RNN bottlenecks. With better gating/parameter
[Camp A: AdamW won’t be retired (highest default priority)] Default choice should minimize total cost: fewer failure modes, transferable tuning, and mature tooling matter more than pointwise wins. In ≥70B production, the cost of 1–2 extra s
[Camp A: AdamW won’t be retired (highest default priority)] Default choice should minimize total cost: reusable recipes, predictable failure modes, and transferable tuning matter more than pointwise optimality. For ≥70B, get μP/LR transfer
[Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Tensor-level preconditioning is closer to the right geometry; SOAP reduces Shampoo’s stability and tuning burden to an acceptable level, leaving mainly implementation and scalin
[Camp D: optimizers matter less; many gains are evaluation ar] Many “faster/better” claims come from mismatched schedule families, unequal tuning budgets, or inconsistent evaluation metrics. Once protocols are aligned, optimizer gaps shrink
[Camp D: optimizers matter less; many gains are evaluation ar] With fair tuning (fixed budgets) and aligned schedules, many optimizer gaps vanish or shrink sharply; effort is better spent on standardized protocols, robust schedules, and reu
[Camp A: Pure SSM/RWKV will eventually replace Transformers] Treat attention as legacy overhead: with stronger SSM parameterizations, gating, and scaling, pure recurrence will match mainstream LM metrics while winning deployment via linear-
[Camp B: Linear attention and SSMs are the same class and mut] Emphasize operator-level equivalence: attention can be rewritten as structured recurrence, linear attention as an RNN; therefore choosing attention vs SSM is mostly an implement
[Camp C: Subquadratic models must be pretrained from scratch;] Argue distillation introduces uncontrolled bias: the student backbone has different inductive biases, transferred behaviors may be brittle approximations; therefore pretrain fro
[Camp D: The engineering optimum is hybrid; minimize attentio] Treat attention as an exact-addressing module and recurrence/SSMs as compression/routing modules; backstop copy/retrieval with a few attention layers while linearizing most laye
[Camp A: Kaplan-style — portable exponents, fixed-compute sho] The core claim is that joint power laws are stable enough that compute-optimal recipes can be extrapolated from modest sweeps, and that under fixed compute, budget should tilt t
[Camp B: Chinchilla-style — balance N and D under fixed compu] The core claim is that many large models were simply not trained long enough; under fixed compute, smaller models with more tokens are more efficient, with an empirical center a
[Camp C: Data-mixture pragmatists — data is the first axis, g] The core claim is that gains from data filtering, deduplication, freshness, and mixture can be as large as a model-scale upgrade; budget allocation should therefore run a data P
[Camp D: Against emergence-as-magic — many “emergent” effects] The core claim is that capability growth is not always a mysterious phase change; many apparently sudden effects come from discrete metrics, thresholded tasks, and evaluation no
[Camp A: HumanEval/MBPP is sufficient to represent coding abi] Function-style pass@k is simple, reproducible, and cheap; HumanEval/MBPP can be the primary metric for model iteration, while additional benchmarks mostly add noise and evaluati
[Camp C: patch-PPL/code BPB in pretraining is the best predic] Likelihood metrics are cheap and stable for large sweeps; if patch-PPL drops, the model is better at producing “realistic fixes”, so downstream issue resolving should improve to
[Camp D: deployment UX metrics reflect SWE-agent value better] Users care about solving within budget: retries, execution success rate, token cost, and whether trajectories are readable/controllable; pass@1 or a single Verified score does n
[Camp A: scaffolding and test-time compute are everything] Treats SWE as an inference-system design problem: retrieval, tool use, hierarchical debugging, multi-agent collaboration, and more test-time compute can cover most real engineering
[Camp B: RL and verifiers are the true drivers] Treats SWE as verifiable sequential decision making: with tests/rules/static analyzers as verifiers, RL or preference optimization can push pass rates higher; pretraining only needs basic synt
[Camp C: just mix more code (code is all you need)] Approximates SWE capability as code language modeling: with enough code tokens and broad language coverage, the model will naturally learn repair and engineering; process data and environm
[Camp C: just mix more code (code is all you need)] Code capability follows the scaling law. As long as enough code tokens are trained, software-engineering capabilities will emerge naturally, and there is no need to additionally
[Camp D: data shape first (repo/patch/process/execution first] Treats SWE capability as distribution matching: during pretrain/mid-training, make repo-level co-occurrence, patch insertion, commit/PR processes, and tests/CI/traces frequent t
[Camp D: data shape first (repo/patch/process/execution first] The distribution of SWE tasks should be matched during the pretrain/mid-training stage: repo-level co-occurrence, patch insertion, development process data, executable feedback,
[Camp A: synthetic-first can be a primary route (especially u] When high-quality real data is constrained, synthetic (textbooks, synthetic exercises, seed→evol) can substitute for scarce data and drive capability growth; the key is making s
[Camp B: web-heavy backbone + (real/synthetic) mid-train is t] Train general capability with a web-heavy backbone, then use mid-train to pull toward code/math/reasoning/long-context; synthetic serves shaping and coverage completion rather t
[Camp B: web-heavy backbone + (real/synthetic) mid-train (a m] Learn broad coverage and tails with a web-heavy backbone, then use mid-train to pull toward target-domain distributions; synthetic is mainly for shaping and coverage fill, not r
[Camp C: avoid synthetic as much as possible; stronger filter] Synthetic introduces teacher bias and style contraction; instead of generating, make web data cleaner, more deduplicated, and better domain-covered. RefinedWeb, SemDeDup, and CC
[Camp D: synthetic scales almost without bound; collapse is m] If generation quality is high enough, synthetic can keep scaling and substitute for real-data bottlenecks; collapse is not a primary constraint.
[Camp A: tokenizer is frozen preprocessing; coverage is enoug] Treat tokenization as a one-shot engineering choice: if OOV is low and average token length is reasonable, avoid changing it; focus on per-token PPL and downstream tasks, and le
[Camp B: bigger vocab is always better; default to 256K+] Argues larger vocabs shorten sequences and reduce cross-token dependencies, improving both quality and throughput; therefore default to 256K+ and rely on embedding/softmax engineerin
[Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Advocates byte/character modeling to avoid BPE ambiguity, tail tokens, and number/date fragmentation; accept higher systems cost via larger models and longer training.
[Camp E: shrink/prune the vocabulary to buy alignment and RL ] Argues smaller vocabs reduce rare tokens and tail-token risk, and make RL/alignment credit assignment more stable; accept longer sequences and higher inference cost.
[Camp A: architecture is mostly done; keep scaling from scrat] Most gains come from parameters, data, and compute; architecture details are mostly constant factors, and investing in new attention variants usually returns less than increasin
[Camp B: the next backbone should move to SSM / RetNet / Mamb] Attention’s quadratic behavior and KV-cache are structural bottlenecks; switching to retention/SSM/recurrence rewrites parallelism and inference cost at the root.
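To make the KV-cache side of that claim concrete, a back-of-the-envelope calculation; the 70B-class shape and the MHA-versus-GQA head counts are illustrative assumptions, not a specific model's configuration.

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch, bytes_per_elem=2):
    """KV cache size: 2 (K and V) * layers * kv_heads * head_dim * seq_len * batch * bytes."""
    return 2 * layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem

# Illustrative 70B-class shapes: MHA (64 KV heads) vs. GQA (8 KV heads), BF16 cache.
for name, kvh in [("MHA", 64), ("GQA-8", 8)]:
    gib = kv_cache_bytes(layers=80, kv_heads=kvh, head_dim=128, seq_len=128_000, batch=1) / 2**30
    print(f"{name}: {gib:.0f} GiB of KV cache for one 128K-token sequence")
```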
[Camp C: a second scaling path should be default—grow first, ] With a stable base, growing depth/blocks and continued pretraining can inherit representations and optimization state, reducing repeated from-scratch cost; especially suitable f
[Camp D: QK-Norm / sandwich norm are optional details] Instability can be handled by learning rate, batch size, and optimizer tuning; extra norm components add complexity and potential distribution shifts, so they should not be defaulted.
[Camp D: stability is mostly LR/optimizer/data; QK-Norm/sandw] Instability can mostly be handled via learning rate, warmup, initialization, clipping, and data cleaning; extra norm components add complexity and potential distribution shift,
[Camp A: PE/extrapolation is enough; long context is mainly a] If RoPE extrapolation (PI/YaRN/related scaling) is done correctly, short-context models can be extended to long context; extra long-data training and packing are secondary, cost
[Camp A: PE / extrapolation is enough; long context is mainly] Representative work argues that if RoPE base, interpolation, scaling, or position bias is designed correctly, short-context models can extrapolate to long context; extra long-da
[Camp C: packing/concatenation is underused; sequence constru] With the same data pool and PE, related-doc clustered packing, low-truncation packing, and explicit separators can further lift long-context ability; the mechanism is increasing
[Camp A: tokenizer is frozen preprocessing; coverage is enoug] As long as there is no obvious OOV/garbling, tokenization mostly affects I/O and token count; the main training budget should go to model scale, data quality, and recipe, and to
[Camp A: tokenizer is frozen preprocessing; coverage is enoug] Spend budget on parameters/data/training recipe; as long as there is no OOV/garbling, tokenization should not be a primary variable in the main training cycle. In practice, team
[Camp B: bigger vocab is near-monotonic; default to 256K+] Larger vocabs increase compression and shorten sequences, saving attention; embedding/softmax overhead is relatively small, so one should keep expanding from 128K to 256K+ and treat
[Camp B: bigger vocab is near-monotonic; default to 256K+] Larger vocab shortens sequences, saving attention and making long context cheaper; embedding/softmax overhead is relatively small, so the default should move from 128K to 256K+.
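A rough accounting of the tradeoff this camp and Camp E are arguing over; the model shapes, the tied-embedding assumption, and the 15% compression gain are illustrative, not measured values.

```python
def embedding_fraction(vocab, d_model, total_params, tied=True):
    """Share of parameters spent on the embedding (and separate softmax, if untied) matrices."""
    return vocab * d_model * (1 if tied else 2) / total_params

# Illustrative shapes: the overhead is noticeable at 7B but small at 70B.
for total, d in ((7e9, 4096), (70e9, 8192)):
    for vocab in (32_000, 128_000, 256_000):
        print(f"{total/1e9:.0f}B, vocab={vocab:>7}: {embedding_fraction(vocab, d, total):5.1%}")

# If the larger vocab compresses text ~15% better (assumed), attention cost on a fixed
# document falls roughly with the square of the shorter sequence.
print("attention cost ratio ≈", round((1 - 0.15) ** 2, 2))
```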
[Camp B: bigger vocab is near-monotonic; default to 256K+] Treat larger vocab as “cheap compression”: shorter sequences reduce attention cost and make long-context cheaper; embedding/softmax overhead is relatively small. It is especially be
[Camp E: shrink/prune the vocabulary to buy alignment and RL ] Pretraining may prefer large vocabs, but post-training (RLHF/DPO) cares more about policy consistency and controllability; proactively prune tails and split heavy-domain long me
[Camp A: hand-tuned 4D with topology-aware mapping (Megatron/] Treat TP/EP/PP/DP(FSDP)/CP as hierarchical placement of communication primitives: high-frequency (TP/EP) pinned to NVLink, PP over IB, DP/FSDP across pods; under long context CP
[Camp B: PyTorch/graph-compiler level is sufficient; handwrit] Use high-level attention programming models and IRs (e.g., MLIR) to generate optimized kernels, reducing dependence on a small set of CUDA experts; algorithm teams focus on mode
[Camp C: hardware will get faster; dense MHA + BF16 need not ] Per-generation bandwidth and FLOPs gains will naturally absorb most bottlenecks; keeping standard transformer primitives and BF16 training reduces engineering cost, while struct
[Camp D: non-NVIDIA hardware will catch up; CUDA will stop be] Hardware diversification (TPU/ROCm/custom accelerators) makes portable operator expressions and compiler routes more important; teams should avoid binding algorithms to CUDA-spe
[Camp B: the mainline should move from hand-written CUDA to T] Long-run cost comes from variant explosion and cross-hardware deployment; the maintenance window of hand-written CUDA keeps shrinking. FlexAttention [Dong2024Flex] lifts variant
[Camp C: attention should be replaced by SSM/linear/sparse st] Dense attention’s O(L^2) complexity and KV-cache cost worsen with context length, so linear/sparse/recurrent structures (e.g., Longformer [Beltagy2020Longformer], Reformer [Kita
[Camp A: Formal search first (regression/laws/robust optimiza] Domain weights are first-order variables and should be searched systematically like learning rates: fit response surfaces on proxy ladders (RegMix) or extrapolate via mixing law
[Camp B: Heuristics + curricula are more robust; 2–3 ablation] In real engineering, the scarce resource is stable accounting and iteration speed, not algorithm novelty. Public recipes (LLaMA, Gopher, Llama 3, OLMo) provide strong priors: we
[Camp C: Online/adaptive mixing beats one-shot offline ratios] Mixtures are non-stationary: training phases change marginal returns per bucket, so weights should adapt online. DoReMi learns weights via Group DRO to improve worst domains and
[Camp A: MoE becomes the default backbone for frontier pretra] Under fixed training FLOPs, MoE expands total parameters to increase knowledge capacity; with mature fine-grained+shared and aux-loss-free balancing templates, failure rates bec
[Camp B: dense wins on full-lifecycle ROI; upcycling makes Mo] Real organizations more often reuse existing dense weights and post-training assets; upcycling scaling laws show ceilings and token-rich dependence, and systems overhead plus st
[Camp A: native long context (base-aligned + curriculum/ABF) ] The ceiling is set by RoPE phase coverage, and usability is set by training distribution; therefore base should be aligned to the target window during pretraining/continual pret
[Camp D: bypass RoPE/attention (SSM/external memory/retrieval] Quadratic attention cost and RoPE extrapolation risk make training and deployment hard to control; a more reasonable approach is to use linear/sparse/recurrent/memory/compressio
[Camp A: positional extrapolation is the main line; evaluatio] Extend to 128K with PI/YaRN, then stabilize 1M+ with LongRoPE; once extrapolation is stable, the model will naturally learn to use long context, while evaluation and data recipe
[Camp B: long-sequence training is mainly a systems-paralleli] FlashAttention-like kernels improve per-GPU efficiency and sequence parallelism shards the length dimension; push DP/TP/PP/SP to their limits and the rest is mostly resource sca
[Camp A: Complete-P as the default; formulas are a stopgap] Get the model family and scaling closure right first: use Complete-P’s module-wise rules to patch modern components and enforce coord check / RMS overlap as a hard acceptance gate;
[Camp C: end-to-end automation (BO/auto-optimizers) will repl] Instead of maintaining transfer rules, hand the objective to automation: warm-start BO transfers prior trial distributions, CARBS-like methods search directly on the cost×loss P
[Camp D: transfer error is dominated by non-transferable HPs,] Many “LR does not transfer” observations are misattributions: under AdamW, wd is an independent axis, and β₂ couples with batch/noise to shift stability boundaries and optima; w
[Camp B: Muon is the next default (but only as a hybrid)] Near-second-order benefits concentrate in hidden 2D weights; hybrid routing localizes both gains and risks: Muon on hidden weights, AdamW elsewhere. The primary goal is wall-clock an
[Camp A: Dedup-first and aggressive (exact → near → semantic)] Repetition is mostly wasted tokens and a risk amplifier; for web pools, delete aggressively first, then discuss epochs/recipes. Semantic near-duplicates and within-topic redunda
[Camp B: Epochs-first (fill compute under data constraints)] When unique high-quality tokens are insufficient, uniform multi-epoch training is the most direct way to utilize compute; the first few passes yield gains close to adding fresh to
[Camp D: Zero exposure (or 0–1 exposure) for risky data overr] Benchmarks, PII, and copyrighted content should not enter the main pretraining pool; default to 0–1 exposure with auditable controls (provenance, opt-out, filtering evidence). R
[Camp A: PPL remains the primary variable (at least for train] Treats validation loss/PPL as the most reliable scalable signal: it extrapolates compute budgets, guides parameter:token tradeoffs, and tends to move with capability on many tas
[Camp B: PPL is stage-1 only; stage-2 must use per-task scali] Writes keep-training/scale-up decisions as per-task extrapolation: fit scaling laws for each key task and decide based on marginal gains and extrapolation error rather than upst
[Camp C: stop searching for a scalar; define quality via stan] Decomposes model quality into auditable dimensions (capability, robustness, safety, fairness, efficiency, scenario fit) and uses standardized panels/reporting to reduce single-n
[Camp D: the issue is ontological—next-token loss is not the ] Treats a “useful assistant” as preference and constraint satisfaction: helpfulness/harmlessness/instruction-following come from preference optimization and supervision selection
[Camp A: Kaplan-style — portable exponents; under fixed compu] Joint power laws are stable enough to extrapolate from small sweeps to larger scales; under fixed compute, prioritize increasing N while tokens only need to be “sufficient.” Thi
[Camp B: Chinchilla-style — tokens/param≈20 as the default re] Under fixed compute, smaller models with more tokens are more efficient; tokens/param≈20 is a reusable empirical center that avoids systematic waste from undertraining. Open evi
[Camp D: Against emergence-as-magic — many “emergent” effects] Many apparent capability jumps come from thresholded 0-1 metrics, benchmark granularity, and evaluation noise; with continuous metrics (task perplexity, log-prob) and more robus
[Camp A: per-doc masking + short-to-long (default engineering] Advocates using per-doc causal mask with FA varlen interface for packing, short-to-long curriculum for length training, enabling FIM by default only for code models, and using c
[Camp B: always-mixed lengths (anti-curriculum)] Advocates mixing short and long sequences according to the target length distribution throughout pretraining, avoiding distribution shift in a final-stage mid-train and letting models be exposed to long context ear
[Camp C: cross-doc visibility by default (beyond-boundaries c] Advocates allowing models to attend to other documents in the same pack by default during pretraining, aligning with cross-document inference scenarios such as prompt concatenat
[Camp D: FIM/denoising-style objectives as default (infilling] Advocates that mixed objectives such as FIM/span corruption have very low cost for left-to-right modeling, can cover more tasks such as infilling, editing, and instruction follo
[Camp E: shrink/prune the vocabulary to buy alignment and RL ] Alignment stages (RLHF/DPO/PPO) are sensitive to tail tokens, numerical instability, train–inference mismatch, and attack surface; sacrificing some compression to reduce tail an
[Camp E: shrink/prune the vocabulary to buy alignment and RL ] Alignment stages (RLHF/DPO/PPO) care about numerical stability and policy consistency; low-frequency tail tokens and non-unique encodings can amplify train/infer mismatch and at
[Camp A: HumanEval/MBPP is sufficient—cheap, stable, reproduc] Function-synthesis pass@k is strongly correlated with coding ability and should be the primary metric; more complex benchmarks add noise and evaluation cost, reducing iteration
[Camp C: pretrain BPB/patch-PPL best predicts SWE; downstream] Likelihood metrics are cheap and stable for large sweeps; real SWE is a distribution of patches and edit trajectories, so lower patch-PPL implies closer-to-real repair distribut
[Camp D: deployment UX/cost metrics reflect value better than] Users feel interaction cost and stability: retry gains, test execution success, token spend, and trajectory readability/controllability; therefore tokens-per-issue, retry@k, and
[Camp A: synthetic-first (a primary route under data constrai] High-quality synthetic (textbooks, exercises, explanations) has higher information density per token; for small models and structured domains (code/math) it can partially substi
[Camp C: minimize synthetic; stronger filtering + more real d] Synthetic introduces teacher bias and style narrowing and may trigger recursive degradation; instead of generating, expand real crawling, dedup, strong filtering, and pruning to
[Camp D: synthetic scales almost without bound; collapse is m] With a strong enough teacher and good sampling/feedback, synthetic ratios can keep rising; combined with “accumulate real breaks collapse,” real data can be reduced to a small s
[Camp B: Transformer state cost is near its limit; move to re] Attention’s quadratic behavior and KV-cache are structural bottlenecks; even with GQA, SWA, or cache compression, this is patching. A more coherent route is to move to retention
[Camp C: default a second scaling path—grow first, then decid] Pretrained bases are assets: deepening/inserting blocks/sparse upcycling can inherit representations and parts of optimization state, reaching near-target scale with fewer token
[Camp A: Inference-time scaffolding and test-time compute are] With sufficiently capable multi-agent frameworks, task decomposition, and tool-calling workflows, most SWE problems can be solved without modifying the base m
[Camp B: RL and verifiers are the true drivers] As long as there are enough tests, static analysis, and execution feedback as reward signals, RL or preference optimization can push the model's SWE capability to a very high level, and pretra
[Camp D: changing the architecture or system boundary is more] This camp argues that the attention and KV-cache form of Transformers determines the long-range read/write budget, so effective context is unlikely to scale linearly with window
[Camp A: keep Classic NTP + scale; hallucination is mostly so] Power-law scaling under standard NTP is robust: scale model/data/compute and improve average data quality via filtering/dedup/mixtures; explicit cross-doc context recovery is op
[Camp B: HDP / retrieval-aware pretraining—make retrieval and] Hallucination is prior-filling after missing variables Z are marginalized; fix it by writing Z back into context C via links/retrieval/tools and masking evidence out of the supe
[Camp C: Method-2 rewriting / reverse prompt-plan—structured ] Many “missing Z” issues are really low learnability due to noisy, inconsistent corpora; normalize via rewriting and recover prompts/plans via back-translation so latent task str
[Camp D: trajectory distillation / self-reflection first—CoT ] Collect CoT/self-reflection/self-refinement traces from stronger models to distill reasoning and agent behavior into smaller/cheaper models; faster and closer to product iterati
[Camp A: Long context only requires engineering length scalin] Claims that extending the context window through engineering means alone, such as positional-encoding optimization and memory optimization, can meet long-context requirements; structured augmentatio
[Camp B: Hyper-Doc pretraining is a collection of scattered m] Hyper-Doc construction methods vary greatly across domains; forced unification would limit method innovation, and independent exploration per domain is more efficient.
[Camp C: Inference-time RAG can completely replace training-t] Inference-time RAG can inject up-to-date context without modifying pretrained models, at lower cost and with greater flexibility.
[Camp A: Downstream abilities are fundamentally threshold-eme] This reading, represented by Wei et al. [Wei2022Emergent], treats many tasks as near-floor at small scale and abruptly rising only after a threshold, making pre-main-run extrapo
[Camp B: The compute axis is already sufficient; there is no ] This reading argues that with well-designed small-scale experiments, compute-to-task extrapolation already supports most budget decisions, and adding loss only increases evaluat
[Camp C: Observational scaling over public models is sufficie] The route represented by Ruan et al. [Ruan2024Observational] argues that public models already span a rich capability space, so low-dimensional manifold regression can predict n
[Camp D: A single power law covers most tasks; broken laws ar] This reading emphasizes that a single power law is sample-efficient, low-parameter, and easy to communicate, so it should remain the default model; piecewise or broken fits risk
Existing data selection methods suffer from slow and computationally... We propose efficient online data mixing for language model pre-training.
[Camp A: MLA will become a general replacement for GQA] Supporters point out that MLA shrinks KV cache to a small fraction of conventional attention under long context, and both V2 [DeepSeekAI2024V2] and V3 [DeepSeekAI2024V3] use it as the
[Camp B: many-expert plus shared experts is the stable endgam] Supporters emphasize that DeepSeekMoE [Dai2024DeepSeekMoE], V2 [DeepSeekAI2024V2], and V3 [DeepSeekAI2024V3] all evolve toward finer experts with a small shared path, suggesting
[Camp C: data quality mainly comes from curated mixtures, not] Supporters argue that the steady gains from DeepSeek LLM [DeepSeekLLM2024], DeepSeek-Coder [DeepSeekCoder2024], V2 [DeepSeekAI2024V2], and V3 [DeepSeekAI2024V3] show that hand-d
[Camp D: the main path for reasoning has shifted from SFT/RLH] Supporters use DeepSeekMath [DeepSeekMath2024] and R1 [DeepSeekR12025] as evidence: after removing the critic with GRPO, pure RL or RL-first pipelines can directly induce math a
LMs assign significant attention to the first token, even if it is not semantically important, which is known as attention sink
[Camp A: NIAH can still serve as the primary long-context met] This camp assumes that if a model can reliably recover a needle from arbitrary positions, long-context capability is largely established; more complex benchmarks mainly add task
[Camp B: Long context is mostly a retrieval problem, and RAG ] This camp argues that most long-context tasks are fundamentally sparse evidence lookup, so the marginal benefit of longer windows is limited and retrieval augmentation is cheape
[Camp C: Lost-in-the-middle is mainly a positional-encoding p] This camp attributes the main cause of U-shape to RoPE extrapolation or PE design, and expects positional degradation to ease substantially once PE is changed or interpolated.
[Camp D: Long-context capability is distributed across the wh] This camp prefers to view long-context capability as a consequence of overall representation quality, treating head-level specialization as an analytic convenience rather than a
[Camp A: Global filtering and deduplication are already stron] RefinedWeb [RefinedWeb2023], Dolma [Dolma2024], and Lee et al. [Dedup2022] show that strong filtering, deduplication, documentation, and stable pipelines already support high-qu
[Camp B: Low-value data should be pruned directly; repair is ] The core claim of Less is More [LessIsMore2024] is that removing low-value tokens/samples before training is often cheaper than repairing them one by one, and better aligned wit
[Camp C: A model-generated loop can directly take over qualit] Bai et al. [ConstitutionalAI2022] and Lambert et al. [SelfRewarding2024] show that models can generate, critique, score, and recycle training signals, suggesting that data quali
[Camp D: High-density synthetic data is a more direct path th] Textbooks Are All You Need [Textbooks2023] and TinyStories [TinyStories2023] support a different route: instead of repairing massive web corpora, directly construct dense, low-n
[Camp A: Looping can largely substitute for adding parameters] Supporters point to Huginn [Geiping2025Huginn], MoR [Bae2025MoR], and some small-model depth-design evidence such as MobileLLM [Liu2024MobileLLM], arguing that repeatedly execut
[Camp B: Fixed loop counts are enough; adaptive routing only ] This camp emphasizes that fixed-r training is more stable, easier to parallelize, and easier to scale. Huginn [Geiping2025Huginn] and earlier UT [Dehghani2018UniversalTransforme
[Camp C: The real loop should live in latent state, not in th] Supporters cite Coconut [Hao2024Coconut], CoCoMix [Tack2025CoCoMix], Compressed CoT [Cheng2024CompressedCoT], latent-cache deliberation [Liu2024LatentCache], and LCM [Barrault20
[Camp D: Looping gains come from recurrent inductive bias, no] This camp cites Fan et al. [Fan2024Length], Giannou et al. [Giannou2023LoopedICL], and Sparse UT [Tan2023SparseUT], arguing that looping matters because it better matches iterat
direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach
Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood
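A minimal sketch of the token-level versus sequence-level contrast in that quote, assuming the sequence-level ratio is the length-normalized likelihood ratio usually attributed to GSPO; treat the exact normalization as an assumption rather than a faithful reimplementation.

```python
import torch

def token_level_ratios(logp_new, logp_old):
    """PPO/GRPO-style: one importance ratio per token."""
    return torch.exp(logp_new - logp_old)

def sequence_level_ratio(logp_new, logp_old):
    """GSPO-style (as commonly described): length-normalized ratio of sequence likelihoods."""
    T = logp_new.numel()
    return torch.exp((logp_new.sum() - logp_old.sum()) / T)

# Per-token log-probs of one sampled response under the new and old policies (toy values).
logp_new = torch.tensor([-1.2, -0.7, -2.1, -0.3])
logp_old = torch.tensor([-1.0, -0.9, -1.8, -0.4])
print(token_level_ratios(logp_new, logp_old))    # four ratios, one per token
print(sequence_level_ratio(logp_new, logp_old))  # a single scalar for the whole sequence
```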
ReAct: Synergizing Reasoning and Acting in Language Models... reasoning and acting have primarily been studied as separate topics.
[Camp A: Outcome-only RLVR is enough] On verifiable tasks, final-answer rewards plus group-based policy optimization form a simple training loop; DeepSeekMath [DeepSeekMath2024], OpenAI o1 [OpenAIo1Card2024], and DeepSeek-R1 [DeepSeekR12025
[Camp B: Process and step rewards] The main issue with sparse rewards is not only scarcity, but inability to identify the wrong step. Implicit Rewards [ImplicitRewards2025], AgentPRM [AgentPRM2025], PRL [PRL2026], and Self-Guided PRL [SelfG
[Camp C: Attribution and causal credit] When only a few steps determine success or failure, uniformly spreading reward wastes updates. SCAR [SCAR2025], SPA-RL [SPARL2025], CAPO [CAPO2025], Attribution-based CA [AttributionCA2025], InT [InT2
[Camp D: Agentic RL versus non-RL agent baselines] Search-R1 [SearchR12025], ReSearch [ReSearch2025], ToRL [ToRL2025], ReTool [ReTool2025], ToolRL [ToolRL2025], WebRL [WebRL2024], and WebAgent-R1 [WebAgentR12025] show that RL can explore to
Our model leverages grouped-query attention (GQA) for faster inference.
The extrapolation capability of Large Language Models based on Rotary Position Embedding is currently a topic of considerable interest.
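Since several camps above turn on the PI-versus-YaRN distinction, here is a simplified sketch of the mechanism: PI rescales every position uniformly, while a YaRN-style ramp interpolates only the low-frequency dimensions. The ramp thresholds follow commonly cited defaults and YaRN's attention-temperature term is omitted, so treat this as an assumption-laden illustration, not a reference implementation.

```python
import numpy as np

def rope_inv_freq(head_dim, base=10000.0):
    return base ** (-np.arange(0, head_dim, 2) / head_dim)

def pi_positions(pos, scale):
    """Position Interpolation: compress *all* positions (every frequency) by the same factor."""
    return pos / scale

def yarn_like_inv_freq(head_dim, orig_ctx, scale, base=10000.0, low=1.0, high=32.0):
    """YaRN-style per-dim ramp (simplified): leave high-frequency dims untouched, fully
    interpolate low-frequency dims, and blend in between, based on how many full rotations
    each dim completes within the original context window."""
    inv_freq = rope_inv_freq(head_dim, base)
    rotations = orig_ctx * inv_freq / (2 * np.pi)               # rotations over the original context
    keep = np.clip((rotations - low) / (high - low), 0.0, 1.0)  # 0 = interpolate, 1 = keep as-is
    return inv_freq * (keep + (1.0 - keep) / scale)

print(pi_positions(np.array([100_000]), scale=4.0))           # every dim sees position 25000
print(yarn_like_inv_freq(128, orig_ctx=4096, scale=4.0)[:4])  # high-frequency dims stay unchanged
```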
Incentivize the Search Capability of LLMs without Searching
learn how many computational steps to take between receiving an input and emitting an output
Tree search at inference improves problem solving by deliberate exploration beyond left-to-right decoding.
We investigate how the pre-training loss, supervised data amount, and augmented data amount influence reasoning performances.
struggle to effectively utilize information from the middle part of the context... propose middle-focused positional encoding
The ordering of mesh dimensions is dictated by communication-primitive frequency, not by model size: TP/EP on the 8-GPU NVLink island, PP across IB, DP/FSDP across pods [Shoeybi2019Megatron, Narayanan2021PTD, Jiang2024MegaScale].
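A toy rank layout showing what ordering the mesh by communication frequency means in practice; the cluster shape (2 pods x 4 nodes x 8 GPUs) is an illustrative assumption.

```python
import numpy as np

# 64 ranks. The innermost (fastest-varying) axis gets the highest-frequency communication:
# TP on the 8-GPU NVLink island, PP across IB within a pod, DP/FSDP across pods.
dp, pp, tp = 2, 4, 8
mesh = np.arange(dp * pp * tp).reshape(dp, pp, tp)  # axis order: (dp, pp, tp)

print(mesh[0, 0, :])   # one TP group: ranks 0..7, all inside one NVLink island
print(mesh[0, :, 0])   # one PP group: ranks 0, 8, 16, 24, striding across nodes over IB
print(mesh[:, 0, 0])   # one DP group: ranks 0 and 32, striding across pods
```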
Zero-bubble 1F1B [Qi2023ZeroBubble] has replaced interleaved 1F1B [Narayanan2021PTD] as the PP default since 2024; not adopting ZB typically costs 2–5 pp MFU.
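The textbook bubble fraction for synchronous 1F1B makes the cost of not adopting a zero-bubble schedule easy to estimate; the stage and microbatch counts below are illustrative.

```python
def one_f_one_b_bubble(p, m):
    """Idle (bubble) fraction of a synchronous 1F1B pipeline with p stages and m microbatches."""
    return (p - 1) / (m + p - 1)

# Illustrative: 8 pipeline stages. Zero-bubble schedules fill most of this idle time by
# splitting the backward pass into input-gradient and weight-gradient parts.
for m in (8, 32, 128):
    print(f"p=8, m={m:>3}: classic 1F1B bubble ≈ {one_f_one_b_bubble(8, m):.1%}")
```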
For L ≥ 32K, CP is a mandatory fourth dimension; Ring [Liu2023RingAttn] and Ulysses [Jacobs2023Ulysses] are selected or mixed by USP's topology awareness [Fang2024USP] and should not be conflated with SP.
Auto-parallel [Zheng2022Alpa, Xu2021GSPMD] reaches 90–95% of hand-tuned below 100B, but no 100B+ public reproduction exists on the GPU stack; the TPU stack (GSPMD/pjit) is the sole counter-example.
FSDP-only [Rajbhandari2021ZeROInfinity, Wang2023ZeROpp] caps near dense-70B; beyond that, TP/PP must be introduced explicitly or HBM↔NVMe bandwidth collapses MFU.
The 2026 sane MFU bands are 40–55% for dense [Jiang2024MegaScale] and 25–45% for MoE [DeepSeek2024V3]; significantly below those bands usually signals a mesh/schedule issue rather than an algorithmic one.
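For reference, the usual way such MFU numbers are computed, using the ~6N FLOPs-per-token approximation for dense training; the cluster size, per-GPU peak, and throughput below are illustrative assumptions, not measurements.

```python
def mfu(params, tokens_per_second, peak_flops_per_gpu, num_gpus):
    """Model FLOPs Utilization with the ~6*N FLOPs-per-token estimate for dense training."""
    achieved = 6 * params * tokens_per_second
    return achieved / (peak_flops_per_gpu * num_gpus)

# Illustrative: a 70B dense model on 1024 GPUs with ~989 TFLOP/s BF16 peak each (H100-class),
# sustaining 1.1e6 tokens/s cluster-wide.
print(f"MFU ≈ {mfu(70e9, 1.1e6, 989e12, 1024):.1%}")
```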
[counter to Camp A: hand-tuned 4D (Megatron / MegaScale style)] High engineering cost, slow response to heterogeneous new models, and mesh parameters require substantial manual sweeping.
[counter to Camp B: auto-parallel (Alpa / GSPMD / pjit)] No public GPU reproduction at 100B+; cost models still handle FP8 and heterogeneous MoE only coarsely [Ko2024DFModel].
[counter to Camp C: FSDP-only is enough] Past 70B, HBM↔NVMe bandwidth becomes the bottleneck; MoE all-to-all has no native abstraction. MegaScale [Jiang2024MegaScale] and DeepSeek-V3 [DeepSeek2024V3] show that explicit TP/EP is required at
[counter to Camp D: 3D (DP+TP+PP) is enough, SP/EP optional] For L ≥ 32K attention dominates memory and the 3D stack OOMs [Korthikanti2022SP, Liu2023RingAttn]; under MoE, EP is not optional [GShard2020, DeepSeek2024V3].
Generalist default is 20% code, range 15–25%; below 15%, reasoning is under-trained, and only above 30% does NL enter the risk zone ([DeepSeekCoder2024][CodeLlama2023][Aghajanyan2023SciMix]).
The specialist range is 50–90%, with a 6–9 pp MMLU drop versus a matched generalist base; do not pretend NL won't regress; recover it via post-training ([DeepSeekCoder2024]).
Code boosts the structured / multi-step slice of reasoning, not commonsense as a whole; the controlled experiment of Petty et al. [Petty2024CodeReasoning] shows +2–5 pp on math and no significant effect on commonsense.
Continual is not automatically catastrophic: Code Llama [CodeLlama2023] loses <1 pp MMLU after 500B continual code tokens, but Ma et al. [Ma2023AtWhichLayer] show that annealing-only code transfers far less than mixing throughout.
Repo-level packing [Shi2023InContextPretraining] + FIM 0.5–0.9 [Bavarian2022FIM] + structural tokens [Li2023StarCoder] are 2026 defaults; skipping any one loses 5–10 pp on cross-file completion.
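A minimal sketch of the FIM transform being referenced, in the common PSM (prefix-suffix-middle) layout; the sentinel strings stand in for dedicated special tokens and the split points are chosen uniformly, both assumptions rather than a specific system's recipe.

```python
import random

def apply_fim(doc: str, fim_rate: float = 0.9, rng: random.Random = random.Random(0)) -> str:
    """With probability fim_rate, rewrite a document into PSM (prefix-suffix-middle) order
    using placeholder sentinel strings; otherwise keep plain left-to-right text."""
    if rng.random() > fim_rate:
        return doc
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"<fim_prefix>{prefix}<fim_suffix>{suffix}<fim_middle>{middle}"

print(apply_fim("def add(a, b):\n    return a + b\n"))
```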
[counter to Camp A: more code is always better, generalists should push ] The phase plot of Aghajanyan et al. [Aghajanyan2023SciMix] puts >40% in the interference regime; DeepSeek-Coder [DeepSeekCoder2024]'s 6–9 pp MMLU gap at 87% is a hard
[counter to Camp B: code's benefit is purely regularisation / lower effe] Olsson et al. [Olsson2022InductionHeads] offer circuit-level evidence: code's repeat-token structure is specific induction-head fuel, not substitutable by generic low
[counter to Camp C: keep code below 10% to protect NL] Three independent ratio ablations [DeepSeekCoder2024][CodeLlama2023][CodeScalingLaws2023] show reasoning is under-trained below 15%; Llama 3 [Llama3Herd] actually raises code share in l
[counter to Camp D: code ability must be trained from scratch; continual] Code Llama [CodeLlama2023]'s 500B continual loses <1 pp MMLU—a direct counter-example; Ma et al. [Ma2023AtWhichLayer] further show that with NL replay and full-run co
'Nominal 128K' models that only change positional encoding collapse to ~32K effective length on RULER [Hsieh2024RULER], consistent with Fu et al. [Fu2024DataEngineering] and Xiong et al. [Xiong2023EffectiveLongCtx] continual-pretrain ablati
ICL emergence, under Chan et al. [Chan2022DataDist]'s controlled synthetic setups, is a function of burstiness + skewed Zipfian, not parameter count; this explains why Shi et al. [Shi2023InContextPretraining]'s related-document packing adds
The induction-head [Olsson2022InductionHeads] phase transition aligns with the ICL phase transition during training; packing order directly controls whether the circuit transfers across documents.
Long-context post-training (Bai et al. [Bai2024LongAlign]) must replicate pretrain packing structure; otherwise 32K+ capability degrades during SFT.
Retrieval-only long-context evals (NIAH and variants) overstate capability by 20–30 pp; Karpinska et al. [Karpinska2024NoCha] and Goldman et al. [Goldman2024LongCtxTaxonomy] provide honest difficulty calibration.
[counter to Camp A: Positional extrapolation is enough] Hsieh et al. [Hsieh2024RULER] show effective context collapses around 32K for this lineage; Karpinska et al. [Karpinska2024NoCha] show Gemini-1.5-1M scores ~50% on book-length reasonin
[counter to Camp B: The data recipe is the main variable] Shi et al. [Shi2023InContextPretraining] and Staniszewski et al. [Staniszewski2023StructuredPacking] note: same data, different packing structures, yield differs by an order of magni
[counter to Camp C: Packing engineering is the under-exploited lever] Camp A advocates note packing is done internally and under-reported. Camp E evals (RULER [Hsieh2024RULER], NoCha [Karpinska2024NoCha]) show gaps remain after packing — no
[counter to Camp D: Switch architectures (SSM / linear) to bypass positi] On compositional evals (RULER [Hsieh2024RULER], NoCha [Karpinska2024NoCha]), same-size SSMs and hybrids still trail pure Transformers; induction heads are not a natur
Over 70% of key 2022–2026 pretrain architecture decisions are driven by kernel constraints rather than independent algorithmic choice: KV-head counts in GQA/MLA ([Ainslie2023GQA], [DeepSeek2024V2]) come directly from the decode memory-bound
FP8 pretraining above 100B drifts without per-block scaling — [Fishman2024FP8Scale] exposes systematic per-tensor failure at 2T tokens, and [Mishra2025MXFP8Recipes] supplies the reproducible MXFP8 fix.
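A simplified illustration of why per-block scaling matters: with per-tensor scaling a single outlier sets the scale for everything, while per-block scaling confines the damage to one block. The 32-element block and e4m3 range follow common MXFP8-style write-ups; actual rounding to FP8 is omitted, so this is a sketch of the scaling step only.

```python
import numpy as np

FP8_E4M3_MAX = 448.0  # largest finite magnitude representable in e4m3

def quantize_per_tensor(x):
    scale = np.abs(x).max() / FP8_E4M3_MAX
    return x / scale, scale

def quantize_per_block(x, block=32):
    """Each block of 32 values gets its own scale from its local amax, so one outlier
    no longer crushes the dynamic range of the whole tensor (casting to e4m3 omitted)."""
    x = x.reshape(-1, block)
    scales = np.abs(x).max(axis=1, keepdims=True) / FP8_E4M3_MAX
    scales = np.where(scales == 0, 1.0, scales)
    return x / scales, scales

x = np.random.randn(1024).astype(np.float32)
x[7] = 2000.0  # one large outlier in the first block
q_tensor, _ = quantize_per_tensor(x)
q_block, _ = quantize_per_block(x)
# Mean magnitude of non-outlier values in the quantization domain: per-tensor scaling
# squashes them toward underflow, per-block scaling keeps them well inside the e4m3 range.
print(np.abs(q_tensor[32:]).mean(), np.abs(q_block).reshape(-1)[32:].mean())
```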
Kernel numerics leak into convergence: [Golden2024FAStability] correlates FlashAttention's block-softmax with large-scale loss spikes, breaking the illusion that 'kernels only affect speed, not quality'.
The correct opening move for any kernel diagnosis is the roofline ([Williams2008Roofline]), not fusion: fusing a memory-bound kernel is nearly useless; classify first, optimize second.
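The "classify first" move is a one-line calculation; the H100-class peak FLOPs and HBM bandwidth below are illustrative constants.

```python
def roofline_bound(flops, bytes_moved, peak_flops=989e12, peak_bw=3.35e12):
    """Classify a kernel by arithmetic intensity vs. the machine ridge point
    (H100-class BF16 peak and HBM bandwidth used as illustrative constants)."""
    intensity = flops / bytes_moved   # FLOPs per byte of HBM traffic
    ridge = peak_flops / peak_bw      # ≈ 295 FLOPs/byte for these constants
    return ("compute-bound" if intensity > ridge else "memory-bound"), intensity

# Decode-time attention over a long KV cache: ~2 FLOPs per cached byte read -> memory-bound.
print(roofline_bound(flops=2e9, bytes_moved=1e9))
# A large BF16 GEMM reuses operands heavily -> compute-bound.
print(roofline_bound(flops=1e12, bytes_moved=1e9))
```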
Fine-grained MoE (DeepSeek-V2/V3, Mixtral, Qwen-MoE) is the downstream of grouped GEMM maturing ([Gale2022MegaBlocks]) combined with Hopper async ([Luo2024HopperDissect]), not an algorithmic inspiration that later found a kernel.
[counter to Camp A: kernels and algorithms must be co-designed] Too costly for non-frontier teams; smaller teams can capture 80% without writing kernels themselves.
[counter to Camp B: PyTorch level is enough] [Fishman2024FP8Scale] shows trillion-token scale requires kernel-level numerics; [Golden2024FAStability] shows kernel choice leaks into convergence; MLA ([DeepSeek2024V2]) demands a custom attent
[counter to Camp C: hardware keeps improving, algorithms don't need to a] Each hardware generation's new capabilities force algorithmic migration: Hopper async birthed FA3, Blackwell MXFP8 birthed [Mishra2025MXFP8Recipes], FP4 tensor cores
[counter to Camp D: non-NVIDIA hardware will catch up] The CUTLASS 3 / CuTe-DSL ([CUTLASS3]) and FA3 ([Shah2024FA3]) software moat is still widening; MXFP8 hardware support today is Blackwell-only, with non-NV vendors still chasing per-bloc
Proxy-to-frontier rank correlation in data ablation ladders is task-dependent: on DCLM [DCLM2024], Spearman ≈ 0.78 for MMLU but ≈ 0.41 for HumanEval across 412M↔7B, so code/reasoning calls cannot be made below 1B.
On DCLM-style ladders, proxy-to-target rank correlation is strongly capability-dependent: Spearman≈0.78 for MMLU (412M↔7B) but only ≈0.41 for HumanEval; therefore code/reasoning data decisions should not be finalized at ≤1B scale.[DCLM2024]
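The rank-correlation check itself is cheap; below is a toy version of the proxy-versus-target comparison. The recipe scores are invented for illustration and are not DCLM numbers.

```python
from scipy.stats import spearmanr

# Hypothetical ablation ladder: the same six data recipes scored at proxy scale (412M)
# and at target scale (7B). The question is whether the proxy *ranking* survives.
proxy_mmlu  = [0.31, 0.29, 0.34, 0.33, 0.27, 0.36]
target_mmlu = [0.58, 0.55, 0.63, 0.61, 0.52, 0.66]
proxy_code  = [0.12, 0.15, 0.11, 0.14, 0.10, 0.13]
target_code = [0.34, 0.30, 0.28, 0.37, 0.25, 0.40]

rho_mmlu, _ = spearmanr(proxy_mmlu, target_mmlu)
rho_code, _ = spearmanr(proxy_code, target_code)
print("MMLU rank corr:", rho_mmlu)   # high: proxy ranking transfers
print("code rank corr:", rho_code)   # lower: code decisions need larger proxies
```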
Influential data drifts with scale: in [AnthropicInfluence2023], the top-influence samples for the same completion overlap <10% between 810M and 52B models, invalidating the tacit assumption that influential data is model-agnostic.
Classifier family matters more than threshold tuning: DCLM [DCLM2024] shows 4–6 pp MMLU gaps between fastText, DSIR [DSIR2023], and perplexity filters, while single-classifier threshold sweeps usually move <1 pp. Pick the classifier before
Synthesis-with-verifier gains concentrate on the verifier's covered distribution: Phi [Textbooks2023] hits 50%+ on HumanEval but degrades on code distributions outside verifier coverage; synthesis is not a universal data-value lever.
[counter to Camp A: Quality classifier plus ablation ladder is enough] DCLM [DCLM2024] itself reports only 0.41 Spearman between 412M and 7B on code; Physics-of-LMs [PhysicsLMs2024] shows per-document duplication non-linearities that classi
[counter to Camp B: Influence functions are the main path] [AnthropicInfluence2023] itself reports <10% top-influence overlap across scales, making influence unreliable for production-level filtering; TRAK [TRAK2023] lacks independent >10B
[counter to Camp C: Full causal inference is the future] IV requires the instrument to affect outcome only through treatment, but backdoor paths between domain and downstream are hard to rule out on real web data; Skill-it [SkillIt2023]'s e
[counter to Camp D: Skip measurement, rely on intuition and scale] DCLM [DCLM2024]'s thousands of ablations show experienced teams' intuition lags regression predictions by 3–5 pp MMLU on mixture decisions; FineWeb-Edu [FineWeb2024]'s open
FA3 [Shah2024FA3] hits ~75% BF16 peak and ~1.2 PFLOPs/s FP8 on H100, more than 2× FA2-BF16 [Dao2023FA2] — attention has crossed from memory-bound to compute-bound.
On H100, FA3 [Shah2024FA3] pushes attention close to compute-bound, reporting ~75% peak in BF16 and ~1.2 PFLOPs/s in FP8; therefore, “hand-writing even more aggressive kernels” typically looks like 10–20% marginal gains rather than order-of-magnitude wins.
FlexAttention [Dong2024Flex] reaches 85–95% of hand-written FA2 throughput on ALiBi / SWA / soft-mask variants with ~10 lines of Python instead of ~800 lines of CUDA; the 2026 default entry point for variants is FlexAttention, not CUDA.
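A minimal sketch of the FlexAttention workflow referenced above, assuming PyTorch ≥ 2.5 with a CUDA build; the ALiBi slope schedule, shapes, and dtype are illustrative placeholders, not a tuned production kernel:

```python
# Expressing an attention variant (ALiBi + causal) as a few lines of Python
# via FlexAttention, instead of a hand-written CUDA kernel.
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

B, H, L, D = 2, 8, 1024, 64
q = torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16)
k = torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16)
v = torch.randn(B, H, L, D, device="cuda", dtype=torch.bfloat16)

def alibi_score_mod(score, b, h, q_idx, kv_idx):
    # Linear distance penalty per head; heads with larger index decay faster.
    # Slope schedule 2^(-8*(h+1)/H) is the common ALiBi choice (illustrative).
    slope = 2.0 ** (-8.0 * (h + 1) / H)
    return score - slope * (q_idx - kv_idx)

def causal(b, h, q_idx, kv_idx):
    return q_idx >= kv_idx

block_mask = create_block_mask(causal, B=None, H=None, Q_LEN=L, KV_LEN=L)
out = flex_attention(q, k, v, score_mod=alibi_score_mod, block_mask=block_mask)
```

In practice one would wrap the call in torch.compile to get the fused kernel the 85–95% throughput figures refer to; the eager path above is only for checking correctness.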
Training-shape kernels hit <20% SM occupancy in decode; FlashDecoding [Hong2023FlashDec] restores 70%+ by chunking KV along seq into 128/256-token blocks, and FlashInfer [Ye2024FlashInfer] is now vLLM's default backend.
FP8 FA3 shows <0.1% per-step loss deviation at 10B scale, but [Golden2024FAStable] shows some rescale orderings amplify error; production FP8 training must run a 2–5B-step BF16 control.
The 'replace attention' camp (RetNet [Sun2023RetNet], Mamba [Waleffe2024Mamba]) has matched short-context LM loss but still trails by 2–5 pp on retrieval / ICL; the 2026 reliable win is a hybrid with ~1/6 attention layers, not pure SSM.
[counter to Camp A: the FA series is the endpoint of attention engineering] Ignores the tooling dividend of FlexAttention [Dong2024Flex] and FlashInfer [Ye2024FlashInfer]: fewer kernel authors, many more variant authors, and the center of gravity shifting from kernel authorship to variant design.
[counter to Camp B: Triton / FlexAttention is the real revolution] FlexAttention still trails FA3's FP8 + warp-specialization by 5–15% [Shah2024FA3]; at frontier training's marginal batch that 10% translates into real compute budget.
[counter to Camp C: attention itself should be replaced (SSM / linear RNN)] [Waleffe2024Mamba]'s own 8B study shows 2–5 pp deficits on retrieval / ICL; the stable endpoint is hybrid (~1/6 attention layers), not pure SSM.
[counter to Camp D: FA is the embodiment of NVIDIA lock-in] FA1/2 already have maintained ports on AMD MI300 and several domestic accelerators, and Triton's AMD backend matured in 2024; the lock-in concentrates on the FA3 FP8 generation, which is precisely where ports have yet to land.
At matched active parameters, DeepSeekMoE's fine-grained + 1 shared beats Mixtral-style coarse (8×7B, top-2) by 1.8–3.4 pp MMLU [Dai2024DeepSeekMoE][Jiang2024Mixtral] — the primary driver behind the 2024 template switch.
Aux-loss-free bias-EMA isn't an 'aux-loss equivalent' — it's structurally simpler: dead-expert convergence has a stochastic-approximation proof [Han2025AuxFreeTheory], and published logs show token-drop rate stays <0.5% [DeepSeek2024V3].
Expert-choice routing [Zhou2022ExpertChoice] still wins on encoder-only and prefill, but can't preserve causality in decoder-only autoregressive decoding — the hard constraint that kept it out of the 2024+ open-MoE default.
Dense→MoE upcycling only pays off when the dense checkpoint is already token-rich, with effective-token factor ≈0.4–0.6× [Liew2025Upcycling]; upcycling from under-trained dense frequently yields negative marginal return, giving OLMo 3 / OLM
Router z-loss [Zoph2022STMoE] is orthogonal to aux loss — it fixes logit overflow, not load collapse — which is why DeepSeek-V3 keeps z-loss=1e-3 after removing aux loss [DeepSeek2024V3].
Router z-loss primarily prevents router-logit blow-ups and numerical overflow, not load collapse; it is orthogonal to “remove aux loss / switch to bias EMA”, which explains why DeepSeek-V3 keeps z-loss around 1e-3 even after removing the aux loss.
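A minimal sketch of the two mechanisms as publicly described (sigmoid affinities, a per-expert bias that only steers top-k selection and is nudged toward balanced load, plus an ST-MoE-style z-loss on router logits); treat the update-rule details and constants as assumptions rather than a reference implementation:

```python
# Aux-loss-free-style load balancing + router z-loss, as a toy sketch.
import torch

n_experts, top_k, gamma = 8, 2, 1e-3          # gamma: bias update rate (illustrative)
expert_bias = torch.zeros(n_experts)          # updated outside the gradient graph

def route(router_logits: torch.Tensor):
    """router_logits: [tokens, n_experts] -> (expert ids, gate weights, z-loss)."""
    scores = router_logits.sigmoid()          # affinities used for gate values
    # Bias only affects *which* experts are selected, not the gate weights.
    _, expert_ids = (scores + expert_bias).topk(top_k, dim=-1)
    gates = scores.gather(-1, expert_ids)
    gates = gates / gates.sum(-1, keepdim=True)
    # ST-MoE-style z-loss: penalize large router logits to avoid overflow.
    z_loss = router_logits.logsumexp(dim=-1).pow(2).mean()
    return expert_ids, gates, z_loss

@torch.no_grad()
def update_bias(expert_ids: torch.Tensor):
    """Sign-based step: lower bias of overloaded experts, raise underloaded ones."""
    load = torch.bincount(expert_ids.flatten(), minlength=n_experts).float()
    expert_bias.add_(-gamma * torch.sign(load - load.mean()))
```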
[counter to Camp A: MoE is the inevitable replacement for dense] Ignores serving-side memory footprint and post-training stability; in the 7–13B active deployment band dense is still cheaper [OLMo2025Olmo3][Walsh2024OLMo2].
[counter to Camp B: dense paths yield better ROI in the end] No response to frontier-scale ≥70B active quality ceiling; DeepSeek-V3 already overtakes many 7–70B dense models at matched FLOPs [DeepSeek2024V3][DeepSeekAI2024V2].
[counter to Camp C: expert-choice / aux-loss-free is the future] Expert-choice breaks causality under autoregressive decoders [Zhou2022ExpertChoice]; aux-loss-free bias-EMA still needs manual monitoring in the first 2000 steps and isn't maintenance-free.
[counter to Camp D: MoE matters only for pretrain; post-train should revert to dense] DeepSeek-V3 retains 256 experts through post-training with frontier alignment results, showing MoE post-training isn't structurally impossible — just engineering-heavy.
RegMix [Liu2024RegMix] approaches DoReMi [Xie2023DoReMi] at ~30× lower compute; mid-sized teams should default to RegMix, reserving DoReMi's proxy overhead for >100B-token, >10-domain settings.
The public recipes of Llama 3 [MetaLlama32024] and MiniCPM [Hu2024MiniCPM] place the capability-heavy stage in the last 10–30% of compute; this schedule isn't aesthetic—it exploits the fact that upsampling hard domains pays off most when the bulk of general web learning is already in place.
The filter gains in DCLM [Li2024DCLM] and FineWeb-Edu [Penedo2024FineWeb] are on the same order as mixture optimization (3–6 pp on MMLU), meaning they compete for the same improvement budget rather than adding independently.
Upsample ceilings for scarce domains are pinned by the repetition law [Muennighoff2023Repeat]: ≤4 epochs are roughly equivalent to fresh tokens, returns collapse past 16. This places a hard numerical ceiling on how far math/code can be pushed by repetition alone.
Online mixing [Albalak2024ODM] matches offline DoReMi at fixed compute, but stability hinges on loss-signal normalization; under aggressive upstream quality filters it can degenerate into noise-driven reweighting.
[counter to Camp A: Formal mixture search (DoReMi / RegMix / Data Mixing Laws)] Transfer evidence past 10 domains and 30B scale is still thin; DoReMi's proxy overhead has to be paid repeatedly when the domain inventory changes.
[counter to Camp B: Heuristic + curriculum (Llama 3 / MiniCPM route)] Heuristics can't answer 'what if I upsample another 1 pp'; beyond 20 domains or without long-term institutional experience, heuristic error compounds quickly.
[counter to Camp C: Online adaptive mixing] Absolute per-domain losses aren't comparable; imperfect normalization collapses the policy onto a single domain. Open reproductions so far are limited to <10B scale.
[counter to Camp D: Ratio doesn't matter, quality does] Even after strong filtering Llama 3 [MetaLlama32024] still invests in curriculum; filters cannot substitute for explicit upsampling of multilingual or domain-scarce data.
128K capability is governed by long-document ratio, not token count: ≥25% long docs with 5B continual-pretraining tokens already saturates NIAH [Fu2024DE128K].
Perplexity rankings decouple from RULER/LongBench v2 [Gao2024EffectiveLongCtx, LongBenchV2]; selecting long-context models by ppl is no longer defensible.
[counter to Camp A: Explicit PE is necessary; RoPE interpolation is the mainstream route] Kazemnejad et al. [Kazemnejad2023NoPE] show NoPE beats ALiBi and RoPE on synthetic extrapolation, indicating the causal mask already injects enough positional signal.
[counter to Camp B: SP and DP are orthogonal and can be optimized independently] ByteScale [ByteScale2025] measures that on real variable-length corpora the orthogonal assumption causes severe load imbalance; unifying them as one partitioning dimension is the fix.
[counter to Camp C: KV compression = GQA/MQA is enough] DeepSeek-V2 [DeepSeekV2] stretches context 32K→128K at equal KV budget, and V3 [DeepSeekV3] sustains MLA at 671B/37B-active without regression—a hard rebuttal to GQA.
[counter to Camp D: Perplexity is still a valid long-context metric] Gao et al. [Gao2024EffectiveLongCtx] and LongBench v2 [LongBenchV2] jointly show ppl decouples from RULER and cross-span reasoning; a 5% ppl drop may even regress on 64K+ tasks.
RoPE extrapolation fails because of low-frequency OOD, not a global scale problem — which is why PI's high-frequency damage cannot be erased by a few more fine-tune steps [Chen2023PI][bloc972023NTK].
At ≤128K, YaRN + 400 fine-tune steps and ABF + continual pretrain differ by <3 pp on RULER; retrofit is the higher-ROI choice [Peng2023YaRN][Xiong2023LongLlama][Young2024Yi].
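A minimal sketch contrasting the two retrofit mechanisms behind these claims: PI compresses positions uniformly (damaging high-frequency resolution), while NTK-aware scaling raises the base so high-frequency dims stay nearly fixed; dim, base, and scale are illustrative values:

```python
# PI vs NTK-aware RoPE retrofitting, reduced to the frequency bookkeeping.
import torch

def rope_inv_freq(dim: int, base: float) -> torch.Tensor:
    # Standard RoPE inverse frequencies for even dimensions 0, 2, ..., dim-2.
    return 1.0 / (base ** (torch.arange(0, dim, 2).float() / dim))

dim, base, scale = 128, 10_000.0, 8.0            # e.g. extending 4K -> 32K

inv_freq = rope_inv_freq(dim, base)

# Position Interpolation: keep frequencies, compress positions by 1/scale.
def pi_angles(positions: torch.Tensor) -> torch.Tensor:
    return (positions / scale)[:, None] * inv_freq[None, :]

# NTK-aware: keep positions, enlarge the base so low-frequency dims stretch
# while high-frequency dims are barely touched.
ntk_base = base * scale ** (dim / (dim - 2))
ntk_inv_freq = rope_inv_freq(dim, ntk_base)

def ntk_angles(positions: torch.Tensor) -> torch.Tensor:
    return positions[:, None] * ntk_inv_freq[None, :]
```

YaRN can be read as interpolating between these two behaviors per dimension (plus an attention-temperature term), which is why it avoids PI's high-frequency damage at moderate extension factors.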
At ≥512K, per-dim non-uniform scales cannot be bypassed; smooth NTK clearly trails LongRoPE's evolutionary search at 1M [Ding2024LongRoPE][GeminiTeam2024].
PPL does not measure long-context capability; any long-context paper that skips RULER / LongBench / multi-hop variants deserves a lower credibility weight [Hsieh2024RULER][Bai2023LongBench][Li2025Haystack].
Non-RoPE architectures (ALiBi, LM-Infinite, RetNet) approach RoPE on ≤32K needle tasks but clearly trail on 128K+ multi-hop real tasks, and are not a 2026 production option [Press2021ALiBi][Han2023LMInfinite][Sun2023RetNet].
[counter to Camp A: pretrain-time ABF is the clean path] Expensive: requires large-scale continual pretraining, and most teams lack the token budget for the full 8K→128K ladder.
[counter to Camp B: YaRN is the de-facto retrofit tool] Above 512K the smooth ramp begins to show per-dim mismatch; RULER's long-tail tasks degrade visibly.
[counter to Camp C: ≥1M requires LongRoPE] Search is expensive, and below 512K its edge over YaRN is inconsistent.
[counter to Camp D: bypass the whole PI/NTK/YaRN lineage] Collectively trail on RULER multi-hop: masking loses middle-range info; RetNet's retention decays too fast. Zhong et al. [Zhong2024AttnRoPE] provide the mechanical explanation.
[counter to Camp A: µP is the absolute default] Lingle et al. point out that in complex practical deployments, basic µP still exhibits non-negligible drift; and for existing SP codebases, the refactoring cost far outweighs the benefits.
[counter to Camp B: Empirical formulas + a few sweeps suffice] Gemstones proves this fitting is extremely sensitive to aspect ratio and LR schedule; once the architecture is tweaked, existing empirical formulas break down.
[counter to Camp C: End-to-end Bayesian search is the endgame] For dimensions like LR and Init Scale that already have explicit analytical scaling rules (µP), spending compute on search is pure waste.
Muon [Jordan2024Muon] cut NanoGPT speedrun from AdamW's 5 min to 3.3 min (−34%); it is the only A/B candidate worth priority for ≤30B new runs
SOAP [Vyas2024SOAP] cut Shampoo's extra HPs from 4 to 1 and matches AdamW wall-clock — the 'second-order is too expensive' objection was retired in 2024
Apollo [Zhu2024Apollo] replaces per-param second moments with per-tensor scalars, collapsing optimizer state from 2P to ~0 with matched 7B/13B loss; the 2025 default for memory-tight teams
Controlling HP-search budget (AlgoPerf [Dahl2023AlgoPerf]; Agarwal et al. [Agarwal2020LRConfound]) collapses more than half of the adaptive-vs-SGD and cross-optimizer gap — any A/B without HP control is noise
AdamW [Loshchilov2017AdamW]'s moat is the muP [Lingle2024muP] ecosystem, not the algorithm itself; for Muon/SOAP to enter ≥70B production, the missing piece is second-order muP [Ishikawa2023SecondOrdermuP]
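A minimal sketch of the mechanism behind Muon as publicly described (per-matrix momentum whose update is approximately orthogonalized before being applied); the cubic Newton-Schulz below is a simplified stand-in for the tuned iteration in [Jordan2024Muon], and the learning rate and beta are illustrative:

```python
# Muon-like update: heavy-ball momentum followed by approximate orthogonalization.
import torch

@torch.no_grad()
def newton_schulz_orth(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Map G toward the nearest orthogonal(-ish) matrix of the same shape."""
    X = G / (G.norm() + 1e-7)                 # normalize so the iteration converges
    for _ in range(steps):
        X = 1.5 * X - 0.5 * X @ X.T @ X       # cubic Newton-Schulz toward U V^T
    return X

@torch.no_grad()
def muon_like_step(W: torch.Tensor, grad: torch.Tensor, momentum: torch.Tensor,
                   lr: float = 0.02, beta: float = 0.95) -> None:
    momentum.mul_(beta).add_(grad)            # per-matrix momentum buffer
    W.add_(newton_schulz_orth(momentum), alpha=-lr)
```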
[counter to Camp A: AdamW is never retired] Muon still yields −34% wall-clock on NanoGPT under matched HP budget [Jordan2024Muon]; SOAP still beats AdamW's loss at 360M–1.3B under matched budget [Vyas2024SOAP]. 'AdamW is irreplaceable' overstates the case.
[counter to Camp B: Muon is the next default] Public evidence for Muon at ≥30B is still missing (cluster coverage marked sparse); productionized second-order muP [Ishikawa2023SecondOrdermuP] is not yet delivered, and Muon's LR transfer from small proxies is not yet established.
[counter to Camp C: Shampoo / SOAP is the proper endgame] ≥7B public Muon-vs-SOAP head-to-heads remain open; productionized second-order muP [Ishikawa2023SecondOrdermuP] is still in the lab; the HP-transfer ecosystem overall lags AdamW by a wide margin.
[counter to Camp D: optimizers don't matter, data does] Muon still gives −34% wall-clock under matched budget [Jordan2024Muon]; SOAP still reaches lower loss under matched budget [Vyas2024SOAP]; the strong version 'data is enough' overfits the current evidence.
The 95/5 short-to-long curriculum saves 20–40% wallclock at equal compute vs mixed-length training; LLaMA-3 [Llama32024] and Qwen2.5 [Qwen25Tech] both default to it.
In the context-extension stage RoPE base should be moved from a default of 10K to 500K–1M with 100B+ mid-train tokens; YaRN [YaRN2023] supplies the attention-temperature correction.
The FIM 'free lunch' claim [FIM2022] has been independently reproduced only for code models; on NL models Code Llama [CodeLlama2023] observed a small but consistent regression, so it should be off by default.
The first-token loss of each packed doc should be dropped; EOS+BOS double markers plus split-then-pack are the default in published industrial recipes [NeMoPacking2024][Llama32024].
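A minimal sketch of the packing convention above: BOS/EOS-wrapped documents packed back-to-back into a fixed-length sequence, loss dropped on each document's first token, and per-document boundaries emitted for varlen attention. Token ids, the truncation rule, and the missing padding step are simplified placeholders:

```python
# Split-then-pack with per-document loss mask and cu_seqlens boundaries.
import torch

BOS, EOS, SEQ_LEN = 1, 2, 16

def pack(docs: list[list[int]]):
    tokens, loss_mask, cu_seqlens = [], [], [0]
    for doc in docs:
        piece = [BOS] + doc + [EOS]
        tokens += piece
        loss_mask += [0] + [1] * (len(piece) - 1)   # drop loss on each doc's first token
        cu_seqlens.append(len(tokens))              # doc boundaries for varlen attention
    # Real recipes split overlong docs and pad the tail; here we just truncate.
    tokens, loss_mask = tokens[:SEQ_LEN], loss_mask[:SEQ_LEN]
    cu_seqlens = [c for c in cu_seqlens if c <= SEQ_LEN]
    return torch.tensor(tokens), torch.tensor(loss_mask), torch.tensor(cu_seqlens)

toks, mask, cu = pack([[5, 6, 7], [8, 9], [10, 11, 12, 13]])
```

The cu_seqlens tensor is what per-doc-masked kernels (FA2/FA3 varlen style) consume, which is why the <3% throughput cost figure below refers to boundary bookkeeping rather than an explicit attention mask.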
[counter to Camp A: short-to-long + per-doc mask (mainstream)] Critics worry the tail 100B-token mid-train is insufficient for attention heads to fully adapt to long context.
[counter to Camp B: uniformly mixed-length training] Attention O(L²) forces short docs into a long window; at equal compute wallclock is 20–40% higher, and LLaMA-3 / Qwen2.5 long-context evals do not show a short-to-long disadvantage.
[counter to Camp C: naive concat + cross-doc visible] Krell et al. [Packing2021] quantify a 0.5–2% loss bias; FA2 [FlashAttention2Varlen] / FA3 [FlashAttention32024] cut per-doc mask cost to < 3% throughput.
[counter to Camp D: FIM for everything] Code Llama [CodeLlama2023] reports a small but consistent NL regression; StarCoder [StarCoder2023] restricts FIM to code.
Within the same tokenizer, objective, and model family, pretraining loss/PPL remains stable for scale extrapolation; once we cross tokenizers, languages, or post-training stages, its explanatory power drops materially [Kaplan2020ScalingLaws].
The same or nearly the same pretraining loss does not guarantee the same downstream capability; optimization path, implicit bias, compression method, and behavioral distribution shift can all rewrite task performance while leaving PPL nearly unchanged.
For multiple-choice, reasoning, and pass/fail tasks, continuous PPL improvements are often mapped into threshold-like jumps by discrete evaluation; using a single PPL scalar to predict task inflection points is therefore structurally unstable.
A more robust decision workflow is two-stage: use PPL/BPB for fitting and monitoring inside the training loop, then model task scaling directly and pair it with a standardized evaluation panel outside the loop; this is more actionable than betting on a single PPL scalar.
[counter to Camp A: PPL remains the most reliable primary variable] The problem is that this conclusion is stable mainly inside the training loop. Hong Liu et al. [HongLiu2022SameLossBetterDownstream], Khanal and Capone [KhanalCapone2024Com
[counter to Camp B: PPL is only an intermediate state inside task scalin] The difficulty is that task scaling laws are more fragile than loss scaling laws, being more sensitive to benchmark choice, evaluation protocol, and task discreteness
[counter to Camp C: Stop searching for one scalar; use multi-panel diagnostics] The cost is greater system complexity and harder organizational maintenance; without a clear aggregation rule, a multi-panel setup can degrade into 'too many metrics, no decisions'.
[counter to Camp D: The problem with PPL is ontological, not merely predictive] Taken to the extreme, this view can understate the engineering value of PPL. Many training decisions still need a cheap, continuous, low-noise monitor, and PPL/BPB currently fills that role better than any alternative.
“Uniform repetition ≤4 epochs is basically free” only holds when the exposure distribution stays close to uniform; repeating ~2% of a hot subset up to 100× yields measurable loss degradation and leaves interpretable repetition fingerprints in induction heads.
In web corpora, near-duplication can reach double-digit percentages (e.g., up to ~13.6%), and substring/MinHash dedup improves perplexity while reducing memorization and train-test leakage [Lee2021Dedup]; therefore scrape-level passive repetition should be removed before any deliberate epoching.
Semantic near-duplicates are a blind spot for MinHash: embedding-based semantic dedup/diversification (e.g., SemDeDup, D4) can match or exceed performance with fewer samples at fixed compute, effectively increasing the “effective token count” at fixed budget.
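A minimal MinHash sketch for the near-duplicate detection discussed above, using word shingles and signature agreement to estimate Jaccard similarity; production pipelines add LSH banding and much larger signatures, so every parameter here is illustrative:

```python
# MinHash signatures over word shingles for near-duplicate detection.
import hashlib

def minhash_signature(text: str, shingle: int = 5, num_hashes: int = 64) -> list[int]:
    words = text.lower().split()
    shingles = {" ".join(words[i:i + shingle])
                for i in range(max(1, len(words) - shingle + 1))}
    sig = []
    for seed in range(num_hashes):
        # One salted hash function per signature slot; keep the minimum value.
        sig.append(min(
            int.from_bytes(
                hashlib.blake2b(f"{seed}|{s}".encode(), digest_size=8).digest(), "big")
            for s in shingles))
    return sig

def est_jaccard(sig_a: list[int], sig_b: list[int]) -> float:
    # Fraction of agreeing slots approximates the Jaccard similarity.
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)
```

Semantic near-duplicates evade this entirely (different shingles, same meaning), which is the gap the embedding-based methods above are meant to close.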
Benchmark contamination is hard to eliminate via “broad filtering” alone; a safer engineering policy is to treat benchmarks/copyright/PII as zero-repeat (or 0–1 exposure) subsets with separate provenance and prefix/suffix dedup, rather than relying on broad corpus-wide filtering.
“Less data is more” (large-scale pruning) is not consistently reliable: at some scales aggressive pruning yields limited gains or reversals; a more controllable path is to dedup/semantic-dedup to remove redundancy, then use mixture/reweighting to control exposure.
[counter to Camp A: Dedup as aggressively as possible] Counterexamples come from data-constrained training: once you are data-limited on a finite high-quality pool, expanding unique tokens after dedup may be infeasible, and uniform multi-epoch repetition becomes the rational trade.
[counter to Camp B: Uniform repetition ≤4 epochs is (almost) free] The heuristic is highly sensitive to uniform exposure: Hernandez et al. [Hernandez2022RepeatedData] shows hot-subset over-exposure degrades and fingerprints induction heads; the ≤4-epoch rule does not cover skewed exposure.
[counter to Camp B: Uniform repetition up to about 4 epochs is close to free] The problem is that the word “uniform” is often dropped in practice. Hernandez et al. [Hernandez2022RepeatedData] show that hot-subset over-exposure causes extra degradation even when the average epoch count looks safe.
[counter to Camp C: Semantic dedup is the real battleground] Evidence is still one-sided: negative results where semantic dedup deletes useful diversity and creates capability gaps are scarce for text LMs, and interactions with mixture/reweighting remain under-explored.
[counter to Camp D: Zero repetition for sensitive/eval/copyright data] The practical constraint is that many risky items cannot be perfectly identified, and cross-corpus reuse makes “zero-repeat” hard to falsify operationally. The goal should be auditable, minimized exposure rather than a literally provable zero.
Kaplan 2020's large-model optimum was a side effect of a fixed LR schedule and short training horizons — Chinchilla [Hoffmann2022Chinchilla] corrected the ratio from ≈ 1.7 to ≈ 20 tokens/param.
Open-source independent re-fits (DeepSeek LLM [DeepSeek2024LLM]) place compute-optimal in a wide 5–100 tokens/param band — there is no copyable universal slope.
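A minimal sketch of the budget arithmetic behind the tokens/param debate in this block: fix a compute budget C ≈ 6·N·D and compare how different ratios split the same compute; this is pure bookkeeping, not a fitted scaling law:

```python
# Split a fixed compute budget C = 6*N*D under a chosen tokens/param ratio.
def split_budget(compute_flops: float, tokens_per_param: float):
    n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n_params, tokens_per_param * n_params

C = 1e24                                    # illustrative budget, roughly Chinchilla-scale
for ratio in (1.7, 20.0, 143.0):            # Kaplan-like, Chinchilla, LLaMA-7B-style over-training
    n, d = split_budget(C, ratio)
    print(f"ratio {ratio:>6}: N ~ {n/1e9:5.1f}B params, D ~ {d/1e12:5.2f}T tokens")
```

The spread in the printed rows is the whole point of the 5–100 band claim above: the same compute buys very different (N, D) pairs depending on which fitted slope you trust.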
Data mixture is an independent scaling axis: at matched N and D budgets, DCLM [Li2024DCLM] recipes open ≥ 7 pp downstream gaps, and phi-1 [Gunasekar2023Textbooks] beats 10× larger models on code with 1.3B parameters.
Per-task scores do not follow the loss power law — Gadre et al. [Gadre2024OverTraining] and Bhagia et al. [Bhagia2024TaskLadders] show a two-step regression (loss → task perplexity → accuracy) is required to reach ≈ 1.9% prediction error.
Most reported 'emergent abilities' disappear once Schaeffer et al. [Schaeffer2023Mirage] swap to continuous metrics — emergence is better read as a nonlinearity in the evaluation metric than as an internal phase transition.
≈ 4 epochs is the safe boundary for token repetition: Muennighoff et al. [Muennighoff2023DataConstrained] give the average curve, Hernandez et al. [Hernandez2022Repeated] give the mechanism — beyond it induction-head capacity is spent on memorization.
[counter to Camp A: Kaplan — bigger model, fewer tokens] Hoffmann et al. [Hoffmann2022Chinchilla] show, via three independent fits, tokens/param ≈ 20. Chinchilla-70B / 1.4T beats Gopher-280B and GPT-3-175B, empirically falsifying Kaplan's bigger-model-fewer-tokens prescription.
[counter to Camp B: Chinchilla — balance N and D under compute] DeepSeek LLM [DeepSeek2024LLM] finds a 5–100 band under independent data and batch schedules; LLaMA [Touvron2023LLaMA] itself over-trains 7B to ratio 143; Muennighoff et al. [Muennighoff2023DataConstrained] shift the optimum again once repetition enters.
[counter to Camp C: Data-mixture pragmatists — data is the first axis] phi-1's [Gunasekar2023Textbooks] 'textbook-quality' edge does not generalize to broad tasks; RefinedWeb's [Penedo2023RefinedWeb] web-only claim weakens in code/math specialized domains.
[counter to Camp D: Against emergence-as-magic] The GPT-4 technical report [OpenAI2023GPT4] shows stepwise jumps where small models score near zero and large ones become usable; Yuan et al. [Yuan2023Math] report threshold-like behavior on reasoning benchmarks.
[counter to Camp A: Pure SSMs will eventually replace Transformers entirely] Fixed-dimensional hidden states mathematically cannot losslessly compress long sequences containing numerous low-frequency entities, leading to hard physical limits on recall-heavy tasks.
[counter to Camp B: Linear Attention and SSMs are fundamentally the same] Mamba's core innovation lies in "input-dependent gating," allowing its decay matrix to change dynamically with the input, whereas traditional linear attention typically uses a fixed, input-independent decay.
[counter to Camp C: Subquadratic models must be pretrained from scratch] Pretraining from scratch is prohibitively expensive, and Transformers have already learned high-quality feature representations and induction heads, which can be distilled into subquadratic students.
HumanEval alone is insufficient for SWE comparison: it mainly measures single-file function synthesis, while repo context, infilling, private libraries, and issue fixing systematically reorder model rankings [Chen2021Codex] [Austin2021MBPP]
SWE-bench Verified is the anchor at the SFT/RL stage, but not the only truth: it reduces flaky-test noise yet still concentrates on Python repositories; once extended to 7 languages, relative rankings can change [SWEbench2023].
A contamination-controlled rolling benchmark is a required condition for SFT-ready comparison; without a time window, high code-benchmark scores may partly reflect contamination rather than generalization [LiveCodeBench2024] [EvalPlus2023]
If deployment evaluation looks only at pass@1, it understates system differences; retry@k, test-execution rate, unnecessary file reads, and tokens per issue are often closer to user-perceived quality [SelfDebug2023] [CodeT2022].
[counter to Camp A: HumanEval is enough] The pushback is now concentrated: EvalPlus [EvalPlus2023] shows that original tests are too loose; CoderEval [CoderEval2023], RepoBench [RepoBench2023], and BigCodeBench [BigCodeBench2024] show that rankings reorder once repo context and richer library usage enter.
[counter to Camp B: SWE-bench Verified is the only truth] The issue is not that Verified is poor, but that it is incomplete. Multi-SWE-Bench [MultiSWEBench2025] shows cross-language reordering; SWE-Gym [SWEGym2024] and Claude Code push beyond its task distribution.
[counter to Camp C: trajectory-level PPL is the most real pretrain-stage signal] The current literature does not provide sufficiently direct evidence that patch-PPL, code BPB, or CrossCodeEval systematically predict SWE-bench Verified; CRUXEval is the closest candidate proxy.
[counter to Camp D: agent UX metrics matter more than pass@1] Looking only at UX metrics is also insufficient, because without a success anchor a system may simply be more conservative, more expensive, or buying stability with more budget.
[counter to Camp A: synthetic data-first (the Phi line)] The problem is that this line is easy to overread as “synthetic can replace the real world.” The Phi papers do not establish that. Their public recipes still rely on real seeds, filtering, and verifiers.
[counter to Camp B: web-heavy backbone + synthetic mid-train (the LLaMA 3 route)] The cost is higher engineering complexity: a separate mid-train schedule, more data pipelines, and more verifier/filter components. Public reports also rarely disclose the synthetic share or its marginal contribution.
[counter to Camp C: pure web-scale, avoid synthesis as much as possible] The issue is that pure web-scale is inefficient for scarce capabilities. WRAP shows that the same content becomes denser after rephrasing, and DeepSeekMath shows that targeted domain mining recovers capability that raw web frequency alone does not.
[counter to Camp D: unlimited synthetic scaling] Current public evidence does not support “unlimited scaling.” MAD and Gerstgrasser et al. [CollapseInevitable2024] already clarify the risk of replacing real data; moreover, only domains where verification is cheap have shown sustained synthetic gains.
[counter to Camp A: Scaffolding and Test-Time Compute are Everything] Scaffolding is merely a multiplier; pretraining capability is the multiplicand. If the base model hasn't seen cross-file dependencies or commit formats during pretraining, scaffolding has little to multiply.
[counter to Camp B: RL and Verifiers are the True Drivers] RL success is highly dependent on the initial policy provided by pretraining. Experiments by Wei et al. [2502.18449] show that without a high-quality pretraining base, execution feedback alone is hard to exploit.
[counter to Camp C: Just Mix More Code] Simply increasing the token count of single-file code yields diminishing returns. Without changing the data organization format (e.g., introducing topological packing), the model cannot learn cross-file structure.
[counter to Camp A: Pure SSM is the ultimate solution for long context] Arora et al. [Arora2024LinearAttention] and Merrill et al. [Merrill2024IllusionState] prove fixed state capacity cannot losslessly compress long sequences, causing failures on MQAR-style recall tasks.
[counter to Camp B: Hybrid is the pragmatic production path] Hybrid architectures introduce heterogeneous operators, increasing the engineering complexity of KV cache management.
[counter to Camp C: Transformer + long-context extensions suffice] At 256K scale, even with GQA, Transformer's KV cache memory footprint and prefill latency remain prohibitive.
[counter to Camp D: RWKV is the correct RNN revival path] Chen et al. [Chen2024StuffedMamba] point out RWKV faces the same state interference issues due to fixed state capacity.
In fixed-compute controlled pretraining, tokenizer choice alone induces 0.6–5.1 pp downstream variance, comparable to common data-mixture changes [Ali2024TokenizerChoice].
Under fixed-compute controlled pretraining, changing only the tokenizer induces 0.6–5.1 pp downstream variance, and higher compression alone does not explain the ranking [Ali2024TokenizerChoice].
At fixed compute and the same 2.6B model size, swapping the tokenizer alone induces 0.6–5.1 pp downstream variance; treating the tokenizer as constant misattributes this variance to data or training recipe [Ali2024TokenizerChoice].
Scaling vocab from 32K to 128K yields ~0.02–0.04 nats lower training loss with near-zero inference bottleneck at large-model serving; gains concentrate in better bytes/token for non-English and code [Dubey2024Llama3].
Scaling vocab 32K→128K adds little inference overhead beyond embedding/softmax, yet at 15.6T training tokens it yields 0.02–0.04 nats lower loss and improves bytes/token for non-English and code [Dubey2024Llama3].
Scaling vocab to 128K in industrial training corresponds to 0.02–0.04 nats lower loss [Dubey2024Llama3] and shorter non-English/code sequences via lower bytes/token, shifting inference cost pressure more toward KV cache and attention compute.
Digit tokenization (single-digit vs merged BPE) causes stable 10–20 pp gaps on 3–5 digit arithmetic, with no sign of “closing with scale” from 1B→7B up to frontier models [Spiess2023Digits][Singh2024TokenizationCounts].
Digit-tokenization boundary choices create 10–20 pp gaps on 3–5 digit arithmetic, and the gap does not automatically converge from 1B→7B nor in frontier measurements [Spiess2023Digits][Singh2024TokenizationCounts].
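A minimal sketch of the mitigation usually paired with these findings: force digit runs to split into single-digit units at pre-tokenization time so carry-sensitive arithmetic always sees consistent tokens. The regex rule is an illustrative stand-in for what production pre-tokenizers implement internally:

```python
# Pre-tokenization rule that forces single-digit tokenization.
import re

def split_digits(text: str) -> str:
    # Insert a boundary between adjacent digits: "13023" -> "1 3 0 2 3",
    # so each digit becomes its own token in the downstream BPE step.
    return re.sub(r"(?<=\d)(?=\d)", " ", text)

print(split_digits("12345 + 678 = 13023"))
```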
Trained LMs contain 3k–10k+ detectable under-trained tokens that are more likely to emit garbage strings; Magikarp’s embedding-gap plus probes can localize these risks pre-deployment [LandBartolo2024Magikarp].
Cross-tokenizer evaluation should be byte-normalized (BPB/byte-normalized loss); per-token perplexity systematically favors more compressed tokenizers even when downstream is not better [Gao2020Pile][Schmidt2024TokenizationMoreThanCompression].
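A minimal sketch of the byte-normalized comparison argued for above: convert summed token NLL (in nats) into bits-per-byte so models with different tokenizers share the same denominator. The NLL and byte counts below are made-up illustrative inputs, not measurements:

```python
# Bits-per-byte: a tokenizer-independent comparison metric.
import math

def bits_per_byte(total_nll_nats: float, total_utf8_bytes: int) -> float:
    return (total_nll_nats / math.log(2)) / total_utf8_bytes

# Two hypothetical models scored on the same 1.2 MB UTF-8 evaluation text:
# one with an aggressive tokenizer (fewer tokens, higher per-token PPL),
# one with a finer tokenizer. BPB makes them directly comparable.
text_bytes = 1_200_000
print(bits_per_byte(total_nll_nats=7.1e5, total_utf8_bytes=text_bytes))
print(bits_per_byte(total_nll_nats=6.9e5, total_utf8_bytes=text_bytes))
```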
[counter to Camp A: tokenizers are frozen preprocessing; coverage is enough] Controlled experiments show tokenizers are not a second-order detail: 0.6–5.1 pp variance at fixed compute is not explained by “coverage is already fine” [Ali2024TokenizerChoice].
[counter to Camp A: tokenizers are frozen preprocessing; coverage is enough] Fixed-compute controlled experiments directly refute the engineering assumption that tokenizer effects are negligible: changing only the tokenizer yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice].
[counter to Camp B: bigger vocab is always better; push to 256K+] Public evidence largely covers up to 128K, not 256K+. Larger vocabs can amplify long-tail and under-trained risks [LandBartolo2024Magikarp], and compression alone does not predict the gains.
[counter to Camp C: tokenizer-free is the endgame; abandon BPE] BLT pushes quality to BPE parity but also reports a 2–3× inference cost [Pagnoni2024BLT]. For throughput-sensitive products, this tax is often harder than the opportunity cost
[counter to Camp D: tokenizer is a pretrain product spec—make BPE right] The main challenge is not theory but org/tooling: tokenizer sweeps, BPB reporting, and Magikarp scan/repair must enter the training pipeline; multilingual vocab allocation also needs an explicit owner.
For dense decoder-only LLMs below 70B, prefer GQA over MQA: it keeps most of the quality while reducing KV heads to grouped settings such as h/8, capturing much of MQA’s inference-memory gain without the usual quality drop [Ainslie2023GQA].
On the Norm/FFN side, defaults have mostly converged to pre-RMSNorm + SwiGLU; the main value of QK-Norm is suppressing attention-logit spikes in large-scale training, not delivering standalone downstream gains at every scale [RMSNorm2019].
When a stable base already exists and the target size is roughly below 30B, depth-up-scaling or block expansion is often more cost-effective than from-scratch scaling; SOLAR [Kim2023Solar] grows a 7B model to 10.7B with 200B tokens and outperforms same-class baselines.
[counter to Camp A: the architecture is mostly done; just keep scaling forward] The counterargument comes from deployment constraints rather than pure training loss. Ainslie et al. [Ainslie2023GQA], Jiang et al. [Mistral2023], and Google DeepMind all motivate attention-variant choices from serving cost.
[counter to Camp B: the next backbone should move to SSM / RetNet / Mamba] The issue is that public LM evidence is still not matched enough on budget, task, and serving constraints. By contrast, Gemma 3 [Gemma3Report], DeepSeek-V2 [DeepSeekV2], and peers keep extending the Transformer baseline instead.
[counter to Camp C: the second scaling path should be default—grow first, then scale] The counter-view, represented by Kaplan et al. [Kaplan2020ScalingLaws] and modern from-scratch reports [Llama3Report], is that if the goal is a new large model, new data and a from-scratch run remain the cleaner baseline.
[counter to Camp D: QK-Norm / sandwich norm are optional details] Wortsman et al. [Wortsman2023Instabilities] already provide more specific mechanistic evidence, and Gemma 3 [Gemma3Report] also includes QK-Norm in a public default recipe. Treating these as optional ignores that evidence.
Public frontier recipes have shifted from fixed mixtures to curricula: early training is web-heavy, while the tail upsamples code, math, and reasoning by roughly 3–5× [MetaLlama32024][MiniCPM2024].
Upsampling scarce domains has repetition ceilings: roughly ≤4 epochs is often nearly free, while >16 epochs usually calls for fresh or synthetic data rather than more repetition [Muennighoff2023Repeat][Textbooks2023][Cosmopedia2024].
[counter to Camp A: Formal search first] The counterargument is that formal search adds substantial cost, and the resulting weights often depend on dedup, filtering, and domain partitioning; if those prerequisites are unstable, fine-grained weights lose their meaning.
[counter to Camp B: Heuristics plus curriculum are more robust] The counterargument is that such recipes depend on specific models, token budgets, and data pipelines, and may distort under transfer; without explicit search, teams often do not know how much they are leaving on the table.
[counter to Camp C: Online adaptation beats one-shot offline search] The counterargument is that online methods inject a control problem into the main training loop, with noisier feedback and higher systems complexity; if the sampler, logging, or loss accounting drifts, the whole loop destabilizes.
[counter to Camp D: Ratio is secondary; quality is first-order] The counterargument is that quality mainly compresses noise inside the web domain; once web quality improves, the remaining optimization room still lies in the ratios of code, math, and multilingual data.
Mid-train has shifted from an optional trick to a standard stage in frontier pretraining, with public recipes often allocating roughly 10–30% of total training compute to post-backbone distribution pullback.[Phi3Report][DeepSeekMath2024]
[counter to Camp A: synthetic-first can be the main route] The counterpoint is about extrapolation limits. The successes of Phi-1 [Phi1Textbooks] and Phi-3 Technical Report [Phi3Report] are concentrated in small models and controlled distributions.
[counter to Camp B: a web-heavy backbone plus synthetic mid-train is the safer route] The counterargument is that this route is conservative and still depends on expensive real-data collection and filtering, leaving some of the cost advantage of synthesis unrealized.
[counter to Camp C: avoid synthetic as much as possible; stronger filtering suffices] The problem is that filtering can only select from an existing candidate pool; it cannot actively expand high-value distributions the way rephrasing or verifier loops can.
[counter to Camp C: avoid synthetic as much as possible; stronger filtering suffices] The limitation is that filtering can only reorder an existing distribution; it cannot reliably create scarce task shapes. In high-value but naturally low-frequency regions, filtering has nothing to select.
[counter to Camp D: synthetic can scale almost without bound; collapse is overstated] The public evidence does not support this unbounded extrapolation. MAD [MAD2023] shows that replace-real dynamics lose tail support first; CollapseInevitable [CollapseInevitable2024] reaches the same conclusion at the population level.
[counter to Camp D: synthetic can scale almost without bound; collapse is overstated] Counterexamples still matter. MAD [MAD2023] shows that once real data is replaced, tail modes collapse first; and the strongest verifier-backed successes are concentrated in verifiable domains such as code and math.
Curriculum mixtures are more robust than a fixed vector: public recipes from Llama 3 and MiniCPM both follow “web-heavy early + ~3–5× upweighting of code/math/reasoning late,” matching a variance-first then capability-shaping mechanism [MetaLlama32024][MiniCPM2024].
[counter to Camp A: Formal search first (laws/regression/robust optimization)] The main risk is transfer: optima drift with model scale and training stage, so one-shot offline search can fail at target scale; offline methods also assume stable, versioned data pipelines.
[counter to Camp B: Heuristics + curriculum are more robust (few ablations)] Counterpoint: for compute-optimal or worst-case constrained goals, pure heuristics can miss meaningful gains; lack of systematic search also makes it harder to justify decisions afterwards.
[counter to Camp C: Online adaptation beats one-shot offline search] Evidence is still sparse: few equal-compute head-to-head comparisons; online methods are highly sensitive to domain accounting, samplers, and loss noise, and can treat systematic noise as signal.
[counter to Camp D: Ratios are second-order; quality is first-order] Quality-first does not mean ratios are useless: on filtered pools, remaining degrees of freedom concentrate in scarce domains (code/math/multilingual) and stage control (curriculum).
Under a fixed total compute budget, the main value of RegMix/law-style offline methods is not “finding a global optimum” but turning 2–3 ablation rounds into a reusable local response surface; however, scale transfer is not free—optimal mixtures drift between proxy and target scales.
Expressing ratios as a curriculum (a time function) matches public large-scale recipes better than a static vector: phased web-heavy training plus ~3–5× late upweighting of code/math/reasoning effectively increases scarce gradient signal near the end of training.
[counter to Camp A: Formal search first (laws/regression/robust optimization)] The fragility is coordinate drift: change dedup/filtering/buckets and the regression target and weight semantics change. Scale transfer is also unstable—proxy optima do not carry to target scale for free.
[counter to Camp B: Heuristics + curricula are more robust (a few ablations)] The ceiling is interactions and budget planning: as domains and objectives grow (multilingual+code+long context), a few ablations may miss critical combinations. Heuristic know-how is also hard to transfer.
[counter to Camp C: Online adaptation beats one-shot offline search] The main risk is not the algorithm but engineering accounting and stability: low-latency statistics, interpretable domain signals, and non-divergence under distribution/bucket drift are the hard parts.
[counter to Camp D: Ratios are second-order; quality/selection is first-order] Demoting ratios to “second-order” can ignore scarce domains and capability targeting: filtering often removes volume first, making code/math/multilingual even scarcer.
[counter to Camp D: Ratios are second-order; quality/selection is first-order] The counterargument is that “cleaner” is not monotonic: over-filtering removes long-tail diversity useful for generalization; and for scarce domains (code/math/reasoning), explicit upsampling remains necessary.
[counter to Camp D: Ratios are second-order; quality/selection is first-order] Treating “cleaner” as a single objective ignores scarce domains and capability targeting: filtering reduces volume first, making code/math/multilingual domains scarcer still.
[counter to Camp A: synthetic-first can be the main route] The limitation is that most of this evidence comes from highly constrained settings such as small models, code, children's stories, and textbook-like corpora. Once the setting moves to broad-coverage generalist training, the evidence thins.
[counter to Camp B: a web-heavy backbone plus synthetic mid-train is the safer route] The counterargument is that this may only reflect incomplete synthetic pipelines in current industry practice, or conservative public reporting that does not reveal how far synthesis has actually been pushed.
RoPE extrapolation fails primarily in low-frequency dims: with too small a base, low-frequency phase coverage is insufficient at target lengths, making far positions inseparable; this yields a base↔learnable-effective-length bound [Xu2024Ro
PI’s uniform rescaling brings rotation angles back into the trained range, but it also compresses high-frequency dims, effectively reducing local positional resolution; NTK-aware makes the mechanism explicit: keep high-freq nearly fixed and stretch mainly the low-frequency dims.
For 32K–128K retrofitting, YaRN’s per-dim ramp plus attention temperature is a more robust default: it avoids high-frequency damage while correcting long-range logit entropy collapse, and typically converges within ~400 fine-tune steps [Pen
For 32K–128K retrofits, YaRN is the more reliable default: per-dim ramp avoids globally compressing high-frequency dimensions, and attention temperature prevents entropy collapse of long-range attention logits; with similar fine-tuning budgets it recovers more long-range capability than PI.
For 32K–128K retrofits, YaRN is the more reliable default: per-dim ramp interpolates mostly low-frequency dimensions while preserving high-frequency ones, and attention temperature stabilizes long-range logit entropy; PI’s global interpolation cannot preserve both at once.
At ≥512K, global scaling (smooth PI/NTK/YaRN formulas) shows per-dim mismatch; LongRoPE turns “choose one s” into “learn a scale vector” via per-dim evolutionary search, with a reproducible 2M recipe [Ding2024LongRoPE].
For new pretraining, ABF (base≈500000 + short-to-long curriculum) incurs less irreversible damage than “pretrain at base=10000 then retrofit”: production reports disclose a 6-stage length curriculum and long-context benchmark curves consistent with this claim.
Long-context evaluation must be task-centric: RULER/LongBench/L-Eval/LV-Eval separate advertised from effective context, while perplexity correlates poorly with retrieval/aggregation/multi-hop tracing [Hsieh2024RULER][Bai2023LongBench].
[counter to Camp A: pretrain-time ABF + curriculum is the cleaner long-context path] The counterargument: retrofitting is cheaper and more flexible, and YaRN/LongLoRA can extend 4K→128K within ~400 steps; full pretraining is unrealistic for many teams.
[counter to Camp B: YaRN is the de-facto standard for 32K–128K retrofitting] The counterargument: for zero-shot window increases ≤2×, NTK-aware scaling is simpler; PI is the simplest to implement and may carry lower engineering risk [bloc972023NTK].
[counter to Camp C: ≥512K requires per-dim search (LongRoPE-like)] The counterargument: instead of complex RoPE searches, use systems/architectural approaches to avoid million-length attention costs (Ring Attention, sparse attention, extern
[counter to Camp D: bypass the PI/NTK/YaRN lineage (ALiBi/RetNet/Mamba/masking)] The counterargument: these methods often conflate “can run longer” with “can use longer”. On RULER/LongBench-style tasks requiring precise recall and cross-span reasoning, they still trail RoPE-based baselines.
Across multiple released models, 3k–10k+ under-trained tokens can be detected; they are more likely to emit garbage at deployment and can be repaired by Magikarp scanning plus short continued pretraining [LandBartolo2024Magikarp].
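A minimal sketch in the spirit of the embedding-gap scan above (not the cited paper's exact procedure): rank tokens by distance from the mean embedding and flag the low-norm tail as candidates for under-trained-token probing; the quantile threshold and the probe step are illustrative:

```python
# Flag candidate under-trained tokens from an input-embedding matrix.
import torch

def flag_undertrained(embedding: torch.Tensor, quantile: float = 0.02) -> torch.Tensor:
    """embedding: [vocab, dim]. Returns token ids in the low-norm tail."""
    centered = embedding - embedding.mean(dim=0, keepdim=True)
    norms = centered.norm(dim=-1)
    cutoff = torch.quantile(norms, quantile)
    return (norms <= cutoff).nonzero(as_tuple=True)[0]

# Usage sketch: ids = flag_undertrained(model.get_input_embeddings().weight.detach());
# each candidate is then prompted into the model and its continuations checked
# for garbage before deciding on repair (short continued pretraining).
```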
Per-token perplexity is not comparable across tokenizers: shorter tokens mechanically lower PPL; robust comparisons require byte-normalized metrics (BPB/byte-level loss) [Gao2020Pile].
[counter to Camp B: bigger vocab is always better; push to 256K+ by default] Public evidence covers feasibility and gains up to 128K [Dubey2024Llama3], but lacks direct 256K+ comparisons under fixed compute with BPB normalization and measured serving cost.
[counter to Camp C: tokenizer-free is the endgame; abandon BPE as soon as possible] Tokenizer-free evidence on the modeling side is strengthening, but the default production path is still constrained by system cost: longer sequences shift throughput/latency pressure onto attention and the KV cache.
[counter to Camp D: tokenizer is a pretrain product spec—make BPE right] The main risk is pipeline complexity: tokenizer versioning, BPB regressions, and end-of-run repair add engineering overhead; and tokenizer-free architectural innovation may later absorb part of this investment.
Compression (bytes/token) correlates with performance but is insufficient: sizable differences can remain at similar compression, so evaluation must report BPB/byte-level loss plus task-specific failure modes, not just “shorter tokens”.
A 128K vocab yields 0.02–0.04 nats lower loss at industrial scale and reduces sequence length for non-English and code; the inference-side benefit is primarily lower KV-cache pressure rather than extra FLOPs [Dubey2024Llama3].
Local digit/date merge rules create reproducible 10–20 pp gaps (3–5 digit arithmetic, date reasoning) that persist in frontier evaluations; the claim that scale automatically washes out tokenizer bias lacks support [Singh2024TokenizationCounts].
Deployment risk is not only OOV but under-trained tokens: 3k–10k+ problematic tokens are detectable across models and repairable via scan + short continued pretraining; this should be a release gate for tokenizer/vocab changes [LandBartolo2024Magikarp].
Non-unique encodings map the same string to multiple token sequences, causing RL to treat identical trajectories as different sequences and harming reasoning; thus vocab shrinkage / splitting long tokens can be an alignment-stability option
[counter to Camp A: tokenizer is frozen preprocessing; coverage is enough] Fixed-compute tokenizer-only ablations directly contradict “negligible impact”: at the same 2.6B and budget, variance is 0.6–5.1 pp [Ali2024TokenizerChoice]. Compression alone does not predict the downstream ranking.
[counter to Camp A: the tokenizer is frozen preprocessing; coverage is enough] Ali et al. [Ali2024TokenizerChoice] directly show 0.6–5.1 pp differences under fixed compute, which rules out “negligible impact.” Petrov et al. [Petrov2023Unfairness] further document systematic cross-language unfairness introduced by tokenizers.
[counter to Camp A: Tokenizer is frozen preprocessing; coverage is enough] Ali et al. [Ali2024TokenizerChoice] found 0.6–5.1 pp downstream differences across 24 tokenizers with coverage above 99.9% under fixed 2.6B model and compute budgets, directly contradicting the premise.
[counter to Camp A: tokenizer is frozen preprocessing; coverage is enough] Ali et al. [Ali2024TokenizerChoice] show 0.6–5.1 pp variance under fixed compute, contradicting the “it averages out” premise; Vieira et al. [Vieira2024Characters] further show per-token metrics are not comparable across tokenizers.
[counter to Camp A: tokenizer is frozen preprocessing; coverage is enough] Fixed-compute evidence contradicts “negligible impact”: swapping only the tokenizer yields 0.6–5.1 pp deltas [Ali2024TokenizerChoice]. Also, per-token PPL is non-comparable across tokenizers.
[counter to Camp A: tokenizer is frozen preprocessing; coverage is enough] Fixed-compute evidence directly contradicts “negligible”: swapping only the tokenizer yields 0.6–5.1 pp variance [Ali2024TokenizerChoice]. Compression does not fully explain the downstream differences.
[counter to Camp B: bigger vocab is always better; default to 256K+] Two debts scale up with vocab size: (1) under-trained tokens (thousands to 10k+) causing deployment garbage risks [LandBartolo2024Magikarp]; (2) alignment/RL instability from tail tokens and non-unique encodings.
[counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] The main blocker is system tax: much longer sequences amplify KV-cache and attention costs, so throughput/latency must be aligned via deployment-level measurement; meanwhile BPE's toolchain maturity remains an advantage.
[counter to Camp E: counter-trend—shrink/prune vocab to buy alignment & RL stability] Public evidence is still mostly “mechanism/theory + observation”, lacking controlled experiments that apply vocab shrink/prune during alignment on the same model under the same data and budget.
Under fixed model size and training budget, swapping the tokenizer alone yields 0.6–5.1 pp downstream variance; treating it as constant leaves out a source of variation comparable to other major recipe choices [Ali2024TokenizerChoice].
At industrial scale, a 128K vocabulary can deliver about 0.02–0.04 nats lower loss, and the main systems gain comes from shorter sequences and lower KV-cache pressure rather than extra embedding-matmul cost [Dubey2024Llama3].
"Bigger vocab is always better" is false: local digit/date merges create reproducible 10–20 pp gaps, showing that vocabulary structure can dominate vocabulary size in reasoning failures [Singh2024TokenizationCounts][Bhatia2025DateFragments]
“Bigger vocab is better” is not monotonic: local digit/date merges can create 10–20 pp gaps on 3–5 digit carry-sensitive arithmetic and date reasoning [Singh2024TokenizationCounts][Bhatia2025DateFragments], making these tasks high-sensitivity regression tests for any vocab change.
Per-token perplexity cannot compare tokenizers; changing token length changes the denominator, so BPB or character-string likelihood is the minimum requirement, otherwise tokenizer changes get misread as model improvements [Gao2020Pile][Vieira2024Characters].
Vocabulary design is not a one-shot decision: deployment often reveals 3k–10k+ under-trained tokens, and RL adds extra stability debt through tail tokens and non-unique encodings [LandBartolo2024Magikarp][LiuEllis2026SayAnythingButThis].
[counter to Camp B: bigger vocab is always better; default to 256K+] Singh et al. [Singh2024TokenizationCounts], Bhatia et al. [Bhatia2025DateFragments], and Liu and Ellis [LiuEllis2026SayAnythingButThis] show that local merges, date fragmentation, and non-unique encodings create failures that scale does not wash out.
[counter to Camp C: tokenizer-free is the endgame; BPE should be abandoned] The counterargument is not that these routes are impossible, but that they are not yet a full replacement under the same compute and deployment constraints. Dubey et al. [Dubey2024Llama3] show the BPE path still delivering at frontier scale.
[counter to Camp E: shrink or prune the vocabulary to buy alignment and RL stability] The counterargument is that shrinking the vocabulary lengthens sequences, increases KV-cache pressure, and may give up multilingual and code compression gains [Dubey2024Llama3].
Under fixed compute budgets, downstream performance gaps across different tokenizers reach 0.6–5.1 pp, exceeding gains from most architecture adjustments at the same budget [Ali2024TokenizerChoice]
A 128K vocabulary reduces pretraining loss by 0.02–0.04 nats compared to a 32K vocabulary, while shortening non-English and code sequences by 15%–22% on average, reducing inference KV cache footprint and latency [Dubey2024Llama3]
Unreasonable BPE merges for numbers and dates reduce arithmetic and temporal reasoning performance by 10–20 pp, an effect consistent across 7B to 405B parameter scales [Singh2024TokenizationCounts][Bhatia2025DateFragments]
Mainstream open-source and closed-source LLMs universally have 3k–10k+ under-trained tail tokens, and over 90% of abnormal output issues can be repaired via 10B–50B tokens of continued pretraining [LandBartolo2024Magikarp]
[counter to Camp B: Bigger vocab is always better; default to 256K+] Compared to 128K vocabularies, 256K vocabularies have diminishing pretraining loss improvements below 0.01 nats, while embedding layer parameters double and the incremental KV-cache savings from further sequence shortening are marginal.
[counter to Camp C: Tokenizer-free is the endgame; BPE should be abandon] Current byte-level models have 0.12–0.18 nats higher pretraining loss, 8–12 pp lower downstream performance, and 3x higher inference latency than 128K BPE tokenizer m
[counter to Camp D: Shrink or prune the vocabulary to buy alignment and RL stability] Indiscriminate vocabulary pruning increases output error rates of multilingual and niche domain terms by 3–7 pp, and only improves RL stability by 2%–3%, a poor benefit-cost trade compared with targeted repair.
Under fixed compute, the tokenizer is not second-order: with the same 2.6B model and budget, swapping among 24 tokenizers yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice], so “coverage is enough” is not a safe default assumption.
Under fixed compute, tokenization is not a minor “compression tweak”: at 2.6B scale with the same budget, swapping only the tokenizer yields 0.6–5.1 pp downstream variance, not explained by coverage or average token length alone [Ali2024TokenizerChoice].
Under fixed compute, swapping only the tokenizer yields 0.6–5.1 pp downstream variance (2.6B model, 24 tokenizers), comparable to typical data-mixture changes; therefore ‘tokenization is negligible’ is an experimentally falsifiable default
Under fixed-compute controlled pretraining, with the same 2.6B model and budget, swapping only the tokenizer yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice]; treating tokenization as a “second-order detail” misses a regression source comparable to other major recipe choices.
Under fixed compute, tokenization is not second-order: with the same 2.6B model and budget, swapping only the tokenizer yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice], comparable to typical data-mixture changes.
Deployment-time vocab issues are governable engineering debt: multiple models contain 3k–10k+ under-trained tokens, detectable via scanning and repairable with short continued pretraining [LandBartolo2024Magikarp]; tokenizers therefore need release gates and post-deployment governance.
Tokenizers create post-training debt: multiple models contain 3k–10k+ under-trained tail tokens that can be detected via scanning and repaired with short continued pretraining; vocabularies need post-deployment governance [LandBartolo2024Magikarp].
Per-token PPL is not comparable across tokenizers: models define distributions over token strings while application semantics live over character strings; changing tokenizers changes the denominator and reachable string set, so BPB or character-level likelihood is the minimum comparable metric.
Per-token PPL is mathematically non-comparable across tokenizers: LMs define distributions over token strings, and tokenizers change the denominator and reachable sequence set; more stable aligned metrics are BPB, character-string likelihood, and downstream task scores.
Per-token PPL is not comparable across tokenizers because models define distributions over token strings while application semantics live on character strings [Vieira2024Characters]; BPB or exact byte-level probabilities align denominators
[counter to Camp B: bigger vocab is always better; default to 256K+] Evidence is strong up to 128K but does not yet provide monotonic fixed-compute curves for 256K+. Non-monotonic risks are reproducible for digits/dates: local merges can create 10–20 pp gaps [Singh2024TokenizationCounts].
[counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Tokenizer-free’s main cost is much longer sequences, shifting bottlenecks from vocab matrices to long-sequence modeling and throughput [Wang2024MambaByte]; in today’s Transformer-dominated serving stacks that tax is still decisive.
[counter to Camp E: shrink/prune the vocabulary to buy alignment and RL stability] Pruning has expressivity and compatibility costs: multilingual and code often rely on long-tail symbols and rare fragments; over-pruning can lengthen sequences or push errors into rare languages and code.
In decode, once KV-cache traffic dominates token time (typical for long context / large batch), reducing KV heads from H to 8/4/1 (GQA/MQA) or re-parameterizing KV as a low-rank latent (MLA) shrinks bytes/token roughly proportionally; these structural changes act directly on the memory-bound term.
When KV-cache traffic dominates decode time, reducing KV heads from H to 1/4/8 (MQA/GQA) or re-parameterizing KV as low-rank latents (MLA) approximately scales down bytes/token proportionally; these structural moves hit memory-bound bottlenecks that kernel-level fusion alone cannot reach.
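A minimal sketch of the bytes/token accounting behind the MQA/GQA/MLA comparison above; the model shapes are illustrative, and the MLA line assumes a DeepSeek-V2-style cached latent of roughly 512+64 dims per layer (treat that width as an assumption):

```python
# Decode-time KV-cache traffic per generated token, in bytes.
def kv_bytes_per_token(n_layers: int, kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    return n_layers * kv_heads * head_dim * 2 * dtype_bytes   # K and V

layers, heads, dim = 80, 64, 128                # illustrative 70B-class shape
print("MHA :", kv_bytes_per_token(layers, heads, dim))   # every head keeps KV
print("GQA8:", kv_bytes_per_token(layers, 8, dim))        # grouped KV heads (H/8)
print("MQA :", kv_bytes_per_token(layers, 1, dim))        # single shared KV head
print("MLA*:", layers * (512 + 64) * 2)                    # latent + RoPE dims, BF16 (assumed)
```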
FlashAttention rewrites attention cost from “HBM round-trips of an N×N intermediate” to “SMEM tiling + online softmax,” enabling longer usable context and higher throughput at the same precision; however, numeric deviations in fused attention kernels can leak into convergence and need regression tests.
FP8 stability behaves like a system problem of “scale update rules + outlier distribution”: drift that only appears at trillion-token scale implies small-scale ablations are insufficient; MXFP8 per-block scaling plus FP32 master weights/optimizer states is the current stability recipe.
Replacing “fusion-first” with “roofline-first” reduces wasted iterations: use arithmetic intensity to classify memory/compute/latency bound, then choose tiling, parallel decomposition, or structural rewrites; Hopper microbenchmark dissections supply the numbers this triage needs.
[counter to Camp A: kernels and algorithms must be co-designed (decide both together)] The counterargument is that frameworks/compilers can lower graphs into efficient kernels so algorithm teams need not go low-level; and some effective architecture changes need no new kernels at all.
[counter to Camp B: PyTorch/graph-compiler level is enough; handwritten kernels are unnecessary] Two facts push back: (1) the biggest throughput wins often come from specialized kernels that change dataflow/scheduling (FlashAttention-3), not generic lowering; (2) kernel-level numerics (FP8 scaling and rescale ordering) leak into convergence and cannot be fixed above the kernel.
[counter to Camp C: hardware will get faster; algorithms need not adapt] FP8 evidence points the other way: drift that appears only at trillion-token scale shows “short-horizon stability” does not extrapolate; MXFP8 recipes codify per-block scaling precisely because hardware alone did not solve it.
[counter to Camp D: non-NVIDIA hardware will catch up; CUDA ecosystem will be commoditized] Current evidence is more about “formats/specs emerging” than “end-to-end critical kernels matching throughput and stability.” FA3 bakes Hopper features into scheduling, so ports lag by a hardware generation.
Within an 8-GPU NVLink island, TP’s per-layer all-reduce is frequent enough to dominate end-to-end throughput; placing TP across IB is often worse than “using less PP”. A more robust default is TP≤8 on NVLink, PP across IB, DP/FSDP across the remaining scale-out dimension.
With large PP stage count P and memory-limited microbatches M, bubbles are not second-order: zero-bubble 1F1B splits backward into input-grad/weight-grad and drives theoretical bubble to 0 (ZB-H2), typically recovering ~2–5 MFU points under the same memory budget.
Zero-bubble 1F1B turns bubbles from a heuristic into a constructive condition: ZB-H2 splits backward (input-grad/weight-grad) to drive theoretical bubble to 0 [Qi2023ZeroBubble], often translating to ~2–5 MFU points recovered under the same microbatch memory budget.
The long-context breakpoint is closer to “attention share” than a fixed L: once attention exceeds ~30% of FLOPs/time, SP is usually the lowest-risk first step; above ~50%, CP (Ring/Ulysses/USP) becomes required, otherwise the 4BL²H·L_layer memory term becomes unmanageable.
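A minimal sketch of the two rules of thumb in this block: the 1F1B bubble fraction (P-1)/(M+P-1) that zero-bubble schedules drive toward 0, and a rough attention-FLOPs share used to decide when SP/CP stops being optional. Shapes and thresholds are illustrative heuristics, not a tuned planner:

```python
# Pipeline-bubble and attention-share back-of-envelope checks.
def pipeline_bubble(stages: int, microbatches: int) -> float:
    """Classic 1F1B bubble fraction; zero-bubble schedules push this toward 0."""
    return (stages - 1) / (microbatches + stages - 1)

def attention_flops_share(seq_len: int, d_model: int, ffn_mult: float = 4.0) -> float:
    """Fraction of per-layer FLOPs spent in the L^2 attention core (per batch row)."""
    attn_core = 4 * seq_len * seq_len * d_model             # QK^T and PV matmuls
    proj = 8 * seq_len * d_model * d_model                   # Q/K/V/O projections
    ffn = 4 * ffn_mult * seq_len * d_model * d_model         # up + down projections
    return attn_core / (attn_core + proj + ffn)

print(f"1F1B bubble, P=8, M=32: {pipeline_bubble(8, 32):.1%}")
for L in (8_192, 65_536, 262_144):
    share = attention_flops_share(L, d_model=8_192)
    verdict = "CP required" if share > 0.5 else "SP first" if share > 0.3 else "3D is fine"
    print(f"L={L:>7}: attention share ~ {share:.0%}  ->  {verdict}")
```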
In public evidence, reproducible MFU at >10K GPUs and 100B+ still mainly comes from hand-tuned 4D with explicit topology mapping (dense 175B MFU 55.2%) [Jiang2024MegaScale]. Auto-parallel has paper-backed closeness below 100B, but lacks public evidence at that scale.
“FSDP-only is enough” is more a choice about org/code shape than throughput optimality: modern systems can make FSDP more flexible and faster [Wang2026veScaleFSDP], but public materials still lack a quantitative dense-pretraining ceiling and a dominant-bottleneck analysis.
[counter to Camp A: hand-tuned 4D (Megatron / MegaScale style)] High organizational cost: requires strong control over topology, kernels, scheduling, and fault tolerance; model changes (MoE, long context) demand continuous manual adaptation
[counter to Camp B: auto-parallel (Alpa / GSPMD / pjit)] Lacks public matched-scale 100B+ comparisons and topology details; once failure domains, retries, and congestion control become constraints, automatic plans can be hard to explain and hard to debug.
[counter to Camp C: FSDP-only is enough (low intrusion first)] Public evidence still lacks a quantitative dense-pretraining ceiling and dominant bottleneck; when TP/PP/CP become necessary, FSDP-only often shifts bottlenecks to cross-node synchronization.
[counter to Camp C: FSDP-only is enough (low intrusion first)] Public evidence still lacks a dense-pretraining ceiling regime: when model state, activation state, and long-context attention all grow, TP/PP/CP often shift from optional to mandatory.
[counter to Camp D: 3D (DP+TP+PP) is enough; SP/CP optional] As L rises, attention’s 4BL²H·L_layer term turns “extreme cases” into the norm; relying on 3D alone often burns bandwidth at the wrong layer (e.g., shifting CP pressure onto PP/DP), pushing congestion onto the wrong interconnect.
Mixture phase plots provide a transferable threshold: below ~10–15% a data type tends to sit in a synergy regime (high marginal gains), while pushing toward ~40% more often enters an interference regime (capability displacement). Treating code (or any single domain) as a monotonic lever ignores this regime switch.
At comparable training loss, the code domain shows lower downstream variance than web (reported on the order of ~30%), consistent with an optimization account: lower token entropy → lower gradient noise → lower effective LR. Hence, part of code's benefit is an optimization effect rather than a knowledge effect.
Organizing code tokens into “usable context” is more reliable than merely increasing ratios: repo-level packing places cross-file dependencies into the same sequence and improves cross-file completion and long-context understanding; these gains come from data organization, not from simply raising the code ratio.
[counter to Camp A: generalists should push code past >40%; more is alwa] Public evidence does not support monotonicity. Phase plots predict that >40% is more likely to enter an interference regime and displace other distributions [Aghajany
[counter to Camp B: code helps mostly via optimization/regularization (l] The optimization account covers part of the story but does not fully explain differences in structured ICL signals: induction head formation depends on repeatable pat
[counter to Camp C: keep code <10% to protect NL; generalists should not exceed it] Public continual results weaken the strong form of “more code necessarily harms NL”: Code Llama shows <1 pp MMLU drop after code-heavy continuation [CodeLlama2023].
[counter to Camp D: code ability must be trained from scratch; continual is a dead end] The generalized claim “continual inevitably fails” conflicts with public evidence: Code Llama preserves NL metrics under code-heavy continuation [CodeLlama2023].
[counter to Camp A: Positional extrapolation is enough to achieve long context] Controlled ablations by Fu et al. [Fu2024DataEngineering] and Xiong et al. [Xiong2023EffectiveLongCtx] show that models with only positional modification, without data-distribution changes, do not reliably gain on long-context tasks.
[counter to Camp B: The data recipe is the main variable] Pure data optimization without matching positional encoding extension cannot break through the position length limit set during pretraining, and nominal context cannot exceed the trained position range.
[counter to Camp C: Packing engineering is the under-exploited lever] If the data itself has insufficient burstiness and the positional encoding is not adapted to long sequences, the gain ceiling of packing optimization is only 10–15 pp.
[counter to Camp D: Switch architectures (SSM / linear) to bypass positional limits] Currently, the long-range reasoning quality of SSM architectures has not exceeded Transformer models of the same scale with optimized data and packing, and so far offers mainly an efficiency advantage.
In regimes dominated by low-quality web, quality filtering/selection is usually more reliable than fine mixture tuning; but filtering is not monotonic—over-cleaning can hurt generalization and long-tail coverage [DataCompLM2024][FineWeb2024].
Public large-model recipes are shifting mixture ratios from a static vector to a curriculum: Llama 3 reports ~3–5× late-stage upweighting of code/math/reasoning, consistent with tail capability acquisition rather than a constant mixture; encoding ratios as a function of training time matches the observed dynamics [MetaLlama32024].
[counter to Camp A: Formal search first (laws/regression/robust optimization)] The counterargument is that transferability is often overestimated: conclusions are sensitive to dataset choice, bucket definitions, and dedup/filter versions; coarse buckets average out signals and version drift shifts regression targets, causing fitted optima to move between runs.
[counter to Camp B: Heuristics + curricula are more robust (a few ablations suffice)] The counterargument is that heuristics lack transferable budgeting: as bucket count grows and objectives shift from generalist to specialist, pure ablations’ sample cost grows combinatorially.
[counter to Camp C: Online/adaptive mixing beats one-shot offline search] The counterargument is that net gains are often absorbed by extra phases, system complexity, and unstable accounting: if bucket definitions, dedup/filter versions, and domain accounting drift mid-run, the claimed gains are hard to attribute.
[counter to Camp A: quality classifiers + ablation ladders are sufficient] The pushback: ladders can distort on key capabilities (low HumanEval correlation), and they only answer “is this decision better under this recipe”, not mechanisms or transferable rules.
[counter to Camp B: influence/attribution is the main path (infer data recipes)] The main issues are transferability and actionability: top-k influence drifts with scale (<10% overlap), making “freezing a high-value example library” infeasible.
[counter to Camp C: full causal inference is the future (solve confounding)] Identification assumptions are a hard cost: IV validity is often untestable, and an invalid IV can be directionally wrong. Causal methods also require sharper variable definitions than pretraining pipelines usually provide.
[counter to Camp D: skip measurement, rely on intuition and scale (scaling solves it)] The counterpoint: scale does not decide which tokens are worth paying for. Controlled results in DCLM and FineWeb-Edu show multi-pp downstream differences under fixed model and token budgets.
On H100, FA3 [Shah2024FA3] uses warp specialization + TMA to turn attention’s critical path from “waiting on HBM” into “keeping MMA fed”, reporting ~75% of H100 BF16 peak and ~1.2 PFLOPs/s in FP8. This places dense attention closer to compute-bound on Hopper.
With the same exact-attention math, FA2 [Dao2023FA2] achieves ~2× throughput over FA1 [Dao2022FA1] on A100 purely via work partitioning and parallel granularity. Hence, “same algorithm, different scheduling” is a first-order lever for attention throughput.
In decode serving (tiny batch, long context), training-time FA kernels underutilize SMs due to insufficient parallel dimensions; FlashDecoding++ [Hong2023FlashDec] restores parallelism by chunking along sequence length, indicating serving-side attention needs its own parallelization strategy rather than reusing training kernels.
When KV cache becomes the dominant bandwidth/capacity bottleneck in long-context inference, KV quantization/compression (e.g., KIVI [Liu2024KIVI]) can affect end-to-end throughput more than single-kernel speedups, shifting priority from kernel micro-optimization to KV-cache management.
FlexAttention [Dong2024Flex] elevates mask/score semantics into a compilable interface, letting many variants (ALiBi/SWA/soft masks) reach near fused-kernel performance without CUDA; yet works like FlashMask [Wang2024FlashMask] show that when mask structure itself changes memory access, specialized kernels still pay off.
[counter to Camp A: FA is largely the endpoint of attention engineering] Counterexamples come from (1) complex masks/variants that may need specialized implementations to avoid extra IO or to realize specific semantics, and (2) portability across hardware generations and vendors.
[counter to Camp B: the main line is Triton/FlexAttention—move optimization into compilers] Concerns are (1) generated kernels may not match hand-tuned ceilings for extreme cases (dynamic sparsity, cross-block state, special layouts), and (2) portability/debuggability across toolchains.
[counter to Camp C: attention itself should be replaced (SSM / sparse / linear)] The main pushback is quality and task coverage: on retrieval/ICL-heavy evaluations, replacements can trail and often need hybrids to close gaps. Also, many product bottlenecks sit outside attention FLOPs (KV cache, serving, data).
[counter to Camp D: FA3 embodies NVIDIA lock-in; avoid binding critical paths to it] The counter-argument is that lock-in cost is often overstated: NVIDIA remains the dominant platform in real training/serving fleets, and mature libraries (FA3/FlashInfer) deliver immediate payoff that abstinence forgoes.
Perplexity often decouples from long-context task gains in practice: using perplexity as the primary metric for 128K+ training misallocates effort toward “more tokens” rather than “using mid-context evidence on RULER/LongBench/RepoQA” [Gao2024EffectiveLongCtx].
At ~128K, data distribution dominates total tokens: upsampling long documents to ≥25% per domain can saturate 128K NIAH with ~5B tokens [Fu2024DE128K]. With a low long-doc ratio, adding tokens often improves short-span memorization rather than mid- and far-context use.
The practical RoPE extrapolation line evolved from linear interpolation to joint frequency/temperature calibration to non-uniform search: PI→YaRN→LongRoPE map to stability needs at ~32K, ~128K, and 2M+ [PI2023][YaRN2023][LongRoPE2024].
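A minimal sketch of the two basic levers behind the PI-style end of this line: compressing positions (position interpolation) versus changing the RoPE base (frequency change). This is an illustration of the mechanism only, not any paper’s exact recipe; function and variable names are assumptions.

```python
import numpy as np

# (a) PI-style linear interpolation: positions are divided by the extension factor so
#     angles stay inside the trained range. (b) base change: the frequencies themselves
#     are lowered. Both are shown side by side for one far position.

def rope_angles(positions, d_head: int, base: float = 10_000.0, pi_scale: float = 1.0):
    """Return RoPE rotation angles theta[pos, i] for the given positions."""
    inv_freq = 1.0 / (base ** (np.arange(0, d_head, 2) / d_head))   # per-dim frequencies
    pos = np.asarray(positions, dtype=np.float64) / pi_scale         # PI: compress positions
    return np.outer(pos, inv_freq)                                   # [num_pos, d_head/2]

trained, target = 4_096, 16_384
angles_pi   = rope_angles([target - 1], d_head=128, pi_scale=target / trained)
angles_base = rope_angles([target - 1], d_head=128, base=500_000.0)
print(angles_pi.max(), angles_base.max())
```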
[counter to Camp A: explicit positional encoding is required; RoPE extrapolation is the main lever] Counterevidence comes from “extrapolates but not used”: RULER shows needles overestimate capability [RULER2024]; Lost in the Middle shows mid-context underuse is structural rather than a window-size problem.
[counter to Camp B: SP and DP are orthogonal; scaling to million tokens is mostly parallelism] Counterevidence is “systems run but tasks don’t improve”: Fu et al. show that without sufficient long-doc ratio, continued training does not yield effective 128K gains [Fu2024DE128K].
[counter to Camp C: NIAH/perplexity is sufficient; task benchmarks are too noisy] RULER’s central result is that NIAH overestimates real usable context [RULER2024]. Gao et al. explicitly show perplexity decouples from long tasks and provide a more task-aligned evaluation protocol [Gao2024EffectiveLongCtx].
[counter to Camp D: alternative architectures (sparse/SSM/linear attention) will make this moot] Counterpoint is end-to-end engineering cost: even with better asymptotics, evaluation and recipes still determine whether context is used. RULER/Lost in the Middle point to utilization, not window size, as the limiting factor.
For load balancing, aux-loss-free bias EMA removes balancing signals from the main-loss gradients; the primary win is reduced implementation sensitivity (micro-batch stats, DP sync, detach choices), which in aux-loss setups can dominate outcomes.
On expert structure, fine-grained (≥64) + one shared expert acts like explicitly factoring out a “shared component” in a mixture model, reducing identifiability conflicts among routed experts; this aligns with the 1.8–3.4× range of gains reported for DeepSeekMoE’s fine-grained plus shared-expert design.
The routing turnover is less about “smarter routing” and more about removing congestion as the dominant stability variable: expert-choice internalizes capacity constraints into the routing mechanism, reducing token-drop-driven training disruptions.
The DeepSeek template became the de-facto default in 2025–2026 by standardizing operational metrics—bias norm, usage CV, token drop, dead experts—turning MoE from “tuning art” into an observable control system that reproduction platforms can adopt.
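A minimal sketch of the operational metrics named above (usage CV, token-drop rate, dead experts), computed from one batch of top-k routing assignments; tensor shapes, the capacity value, and the gating thresholds are illustrative assumptions.

```python
import torch

def moe_health(expert_ids: torch.Tensor, num_experts: int, capacity: int) -> dict:
    """expert_ids: [num_tokens, top_k] routed expert indices for one batch."""
    counts = torch.bincount(expert_ids.flatten(), minlength=num_experts).float()
    usage_cv = (counts.std() / counts.mean().clamp_min(1e-9)).item()   # load-imbalance signal
    dead_experts = int((counts == 0).sum())                             # experts with zero traffic
    # Tokens beyond an expert's capacity would be dropped by a capacity-limited dispatcher.
    dropped = (counts - capacity).clamp_min(0).sum()
    drop_rate = (dropped / expert_ids.numel()).item()
    return {"usage_cv": usage_cv, "dead_experts": dead_experts, "token_drop_rate": drop_rate}

# Example gate on one step (thresholds are placeholders).
stats = moe_health(torch.randint(0, 64, (8_192, 2)), num_experts=64, capacity=320)
print(stats)
```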
[counter to Camp A: MoE becomes the default backbone (dense remains for small models)] This camp often treats training FLOPs as the only ledger, but ROI also includes all-to-all, memory residency, inference cache hit rates, and post-training transfer costs.
[counter to Camp B: dense wins on full-lifecycle ROI (especially under upcycling)] “Training MoE from scratch” and “upcycling” are different problems: most evidence for this camp targets the reuse/upcycling path, while the DeepSeek template’s main regime is from-scratch training with templated monitoring, bias EMA, and systems realization.
[counter to Camp C: learned routing/balancing is overrated (random/frozen routers suffice)] These results can be regime-dependent. At LLM scale, routing/balancing value may show less as higher validation scores and more as avoiding failure modes (dead experts, token drops, traffic collapse).
[counter to Camp D: MoE is mainly for pretraining; post-training should densify] Evidence is currently more “reasonable engineering inference” than LLM-scale matched comparisons. DeepSeek-V3’s monitoring and bias EMA make sparse training more controllable, but matched sparse-vs-dense post-training comparisons are still missing.
In 32K–128K retrofitting, replacing global interpolation (PI) with a per-dim ramp (YaRN) turns “high-frequency compression causing local-pattern degradation” from a structural failure mode into a tunable hyperparameter; practically this maps to a small per-regime hyperparameter sweep rather than a structural redesign.
For new/continual pretraining, RoPE base is not an inference-time trick but an upper-bound hyperparameter limiting low-frequency phase coverage: too small a base makes far positions indistinguishable [Xu2024RoPEBaseBounds]. Production recipes therefore raise the base (e.g., toward 1M-scale values) before or during long-context continual pretraining.
At 512K–2M, errors of global scaling formulas are dominated by per-dim mismatch; LongRoPE learns non-uniform per-dim scales via evolutionary search and uses longer fine-tuning to reach 2M [Ding2024LongRoPE], making per-dim fitting more reliable than any global formula at these lengths.
Advertised windows and effective context often differ by 2–4× on task benchmarks: RULER’s 13 task types show many 128K-claimed models are effectively ~32K [Hsieh2024RULER]. Hence PPL/needle tests should not be primary metrics; multi-task, multi-length curves should gate acceptance instead.
[counter to Camp A: pretraining-time ABF + curriculum is the clean long-context route] The counterargument is that many teams lack continual-pretraining budget, and 32K–128K needs can be met with YaRN in a few hundred fine-tune steps; shifting the problem into pretraining is unnecessary for them.
[counter to Camp B: YaRN is the de-facto standard for 32K–128K retrofitting] The counterargument is that PI is simplest and requires minimal inference-stack changes, and can be good enough on some retrieval-heavy evaluations; YaRN needs inference-stack support for its scaling and temperature terms.
[counter to Camp C: ≥512K needs LongRoPE-style per-dim search; global formulas fail] The counterargument is that at million-token scale, bottlenecks are mostly systems (KV cache, bandwidth, approximate attention), and quality gains from per-dim search may be marginal relative to those costs.
[counter to Camp D: bypass RoPE (Mamba / length-extrapolatable Transformers)] The counterargument is that efficiency is not quality. Task benchmarks like RULER/LongBench show many models fail on recall-heavy and multi-hop tracing far before the advertised window is reached.
Within the training loop (same tokenizer/objective/model family), validation loss/PPL power-law fits are typically accurate enough for early stopping and compute allocation; using the same scalar to decide “ship readiness” introduces systematic risk.
Raw PPL is not a common unit across tokenizers or languages: vocabulary size and segmentation change token counts and the probability factorization, altering what the number means; cross-lingual reporting should at least add BPB or a language-aware normalization.
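A minimal sketch of the tokenizer-neutral conversion implied above: turn per-token cross-entropy into bits-per-byte so models with different tokenizers or languages are compared in one unit. Token and byte counts below are synthetic.

```python
import math

def bits_per_byte(ce_nats_per_token: float, num_tokens: int, num_utf8_bytes: int) -> float:
    """Convert token-level cross-entropy (nats/token) to bits per UTF-8 byte."""
    total_bits = ce_nats_per_token * num_tokens / math.log(2)   # nats -> bits
    return total_bits / num_utf8_bytes

# Identical per-token PPL, different tokenizers: BPB exposes the difference.
ce = math.log(8.0)  # PPL 8 in nats/token
print(bits_per_byte(ce, num_tokens=1_000_000, num_utf8_bytes=4_200_000))  # finer tokenizer
print(bits_per_byte(ce, num_tokens=700_000,   num_utf8_bytes=4_200_000))  # coarser tokenizer
```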
“PPL barely moved” is insufficient for compression/sparsification sign-off: compression can preserve average loss while changing output distributions and failure modes, reducing task scores; behavior-distribution distances (e.g., JS divergence) track this risk more closely.
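A minimal sketch of the behavior-distribution check mentioned above: Jensen–Shannon divergence between the base and compressed model’s output distributions on the same prompt, which can be large even when average loss is flat. The example distributions are synthetic.

```python
import numpy as np

def js_divergence(p: np.ndarray, q: np.ndarray, eps: float = 1e-12) -> float:
    """Symmetric, bounded divergence between two probability vectors."""
    p = p / p.sum(); q = q / q.sum()
    m = 0.5 * (p + q)
    kl = lambda a, b: float(np.sum(a * (np.log(a + eps) - np.log(b + eps))))
    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

base_probs   = np.array([0.50, 0.30, 0.15, 0.05])
pruned_probs = np.array([0.30, 0.45, 0.05, 0.20])   # similar entropy, different behavior
print(js_divergence(base_probs, pruned_probs))
```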
Alignment and instruction tuning shift “quality” from next-token prediction to instruction following and preference satisfaction, weakening base PPL as an explanatory variable for product quality; post-alignment models should be evaluated in task and preference space rather than by base PPL.
The practical replacement is not “find another scalar,” but a two-stage pipeline: stage 1 uses PPL/BPB for training monitoring and data/compute tuning; stage 2 uses per-task scaling laws (including overtraining and model ladders) for continued extrapolation and ship decisions.
[counter to Camp A: PPL remains the most reliable primary variable (at least in-loop)] Counterexamples concentrate in out-of-loop decisions: same loss can yield different downstream outcomes [HongLiu2022SameLossBetterDownstream]; compression/sparsification sign-off is a documented failure mode for PPL-only acceptance.
[counter to Camp B: PPL is only stage-1 inside a task-scaling pipeline] The weakness is that evidence is still somewhat “research-setting”: per-task scaling needs stable eval protocols and enough checkpoints/model ladders; for teams that change recipes frequently, the overhead can outweigh the benefit.
[counter to Camp C: stop searching for a scalar; define quality via standardized panels] The risk is panel-driven overfitting: if panels mismatch product distributions, optimization can drift; panels are also expensive and cannot replace high-frequency in-loop signals.
[counter to Camp D: the issue is ontological—next-token loss is not the right objective] Evidence is still sparse: even if token weighting is more principled, it must answer how it integrates with engineering controllability and task scaling. Without standardized recipes it remains a research direction rather than a default.
With modern Transformer components (e.g., QK-Norm, tied embeddings, residual scaling), original µP’s width-wise zero-shot LR transfer is often broken by module-scale mismatches; Complete-P’s practical value is to bring those modules back into the scaling-consistent regime so transfer holds again.
Extending transfer from “width-only” to “width+depth” requires a new scaling limit: depth-µP is not a heuristic reuse of 1/√width-like rules over layers, but follows from residual dynamics and a distinct infinite-depth feature-learning limit.
fp8 turns “is the same LR stable” into a numerical-range problem: u-µP/unit scaling transfers not only LR but also RMS constraints on activations/gradients/weights; otherwise bf16→fp8 transfer error can be dominated by overflow/underflow rates rather than by the parameterization.
For teams with stable SP recipes, force-migrating to µP often has negative ROI: empirical formulas/joint scaling laws provide near-zero-engineering starting points for LR and batch; but transferability depends on keeping model shape and recipe fixed.
Weight decay and β₂ are often dominant error sources outside µP’s transfer closure: under AdamW, wd is not equivalent to L2 and changing wd shifts stability boundaries and the optimal LR; a more pragmatic approach is to hand wd/β₂ to cost-aware targeted sweeps rather than assume they transfer.
[counter to Camp A: µP (upgraded to Complete-P) should be the default; formulas are a fallback] Opponents argue (1) migrating mature SP stacks to µP is costly, and formulas are already close enough under a fixed recipe [Dey2023CerebrasGPT][Bi2024DeepSeekLLM]; (2) Complete-P’s assumptions (module coverage, optimizer, precision) are rarely fully satisfied in continuously evolving stacks.
[counter to Camp B: empirical formulas + small sweeps are enough; µP is overkill] The counterargument is that formula stability depends on keeping shape and HP combinations fixed: Gemstones shows aspect ratio/HP changes can shift optima by ~3×; additionally, wd/β₂ sit outside the formulas’ transfer closure.
[counter to Camp C: end-to-end Bayesian/automatic optimization will replace both] Opponents argue that pretraining-scale constraints are trial budget and reproducibility: end-to-end search needs many trials or high-fidelity proxies; µP/formulas reduce that budget to a few calibration runs.
[counter to Camp D: transfer is dominated by non-transferable HPs; wd/β₂ must be modeled] The counterargument is that jointly modeling everything explodes engineering complexity and trial budget; a more controllable approach is to align structural variables via parameterization and handle wd/β₂ with small targeted sweeps.
For scraped web corpora, exact/near-exact dedup at the substring + MinHash level often lowers perplexity, train-test leakage, and extractable memorization at once; Lee et al. report near-duplicate rates reaching double-digit percentages in multiple corpora, and document-level exact hashing alone is not enough [Lee2021Dedup].
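A minimal sketch of document-level MinHash near-dup detection over character shingles, the kind of near-exact pass described above; `datasketch` is an assumed dependency, and the shingle size and threshold are illustrative (substring-level dedup would complement this, as the cited work argues).

```python
from datasketch import MinHash, MinHashLSH  # assumed dependency for illustration

def shingles(text: str, k: int = 13):
    return {text[i:i + k] for i in range(max(len(text) - k + 1, 1))}

def minhash(text: str, num_perm: int = 128) -> MinHash:
    m = MinHash(num_perm=num_perm)
    for s in shingles(text):
        m.update(s.encode("utf-8"))
    return m

def near_duplicates(docs: dict[str, str], threshold: float = 0.8) -> list[tuple[str, str]]:
    """Return pairs of doc ids whose estimated Jaccard similarity exceeds the threshold."""
    lsh = MinHashLSH(threshold=threshold, num_perm=128)
    sigs = {doc_id: minhash(text) for doc_id, text in docs.items()}
    pairs = []
    for doc_id, sig in sigs.items():
        for other in lsh.query(sig):     # query before insert: no self-matches
            pairs.append((other, doc_id))
        lsh.insert(doc_id, sig)
    return pairs
```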
Per-corpus dedup is insufficient to control repeated exposure: the same documents recur across public pretraining corpora, so cross-corpus fingerprinting and global audits are required to account for exposures [Elazar2023WhatsInMyBigData].
On a finite, bleached high-quality pool, uniform repetition is close to fresh-token marginal gains up to ~2–4 epochs, after which the “fresh-token equivalence” drops quickly; this heuristic does not apply to hot-subset over-exposure [Muennighoff2023DataConstrained].
The main risk is not epoch count but exposure skew: even a small hot subset repeated many times causes measurable degradation and leaves interpretable induction-head fingerprints [Hernandez2022RepeatedData].
For sensitive/copyright/eval data, exposure accounting is safer than “best-effort dedup”: default to 0–1 exposure with provenance and prefix/suffix dedup; repeated exposure increases extractable memorization risk, which grows with model size and context length.
[counter to Camp A: Dedup as much as possible (treat repetition as noise)] For a bleached finite high-quality pool, repetition is not equivalent to noise: under data constraints, uniform multi-epoch training can spend compute effectively and behaves closer to fresh data than to noise [Muennighoff2023DataConstrained].
[counter to Camp B: Uniform repetition ≤4 epochs is almost free (treat repetition as safe)] “≤4 epochs” is not a universal shield: once sampling creates hot subsets, non-uniform repetition causes measurable degradation and leaves induction-head fingerprints [Hernandez2022RepeatedData].
[counter to Camp C: Semantic dedup is the real battleground (exact dedup is solved)] Semantic dedup has higher false-positive costs: embedding choice and thresholds change the retained distribution, and temporal/domain drift can invalidate a fixed threshold.
[counter to Camp D: Zero repetition for sensitive/eval/copyright data (treat it as absolute)] Absolute zero is hard to prove operationally: cross-corpus reuse and template rewrites make “never seen” an unfalsifiable slogan [Elazar2023WhatsInMyBigData]. A more operational target is exposure accounting with provenance.
Under a fixed HP-search budget (e.g., ≤N trials within a fixed schedule family), optimizer gaps often shrink to roughly half of what unconstrained reports claim; otherwise A/Bs mostly compare tuning effort rather than algorithms [Dahl2023AlgoPerf].
AdamW stays a common default at ≥70B not because it wins every setting, but because μP learning-rate transfer compresses cross-width/cross-family LR search from “start over” to “light calibration” [Lingle2024muPTransfer], with more predictable behavior at scale.
Muon’s deployability comes from parameter partitioning: Newton–Schulz orthogonalized (approximately second-order) updates are applied only to hidden 2D weight matrices, while sensitive parameters (embeddings/norms/heads) stay on AdamW; this isolates near-second-order instability and tuning risk to a controllable subset.
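A minimal sketch of the partitioning idea described above, assuming a generic PyTorch model: hidden 2D weights are selected for orthogonalized updates while everything else stays on AdamW. The Newton–Schulz coefficients and the name-based selection rule are illustrative assumptions, not the exact Muon recipe.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a gradient matrix via a simple Newton-Schulz iteration."""
    x = g / (g.norm() + 1e-7)                       # bring singular values into a safe range
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.transpose(-2, -1) @ x
    return x

def split_param_groups(model: torch.nn.Module):
    """Partition parameters: hidden 2D weights vs. embeddings/norms/heads (kept on AdamW)."""
    hidden_2d, rest = [], []
    for name, p in model.named_parameters():
        if p.ndim == 2 and "embed" not in name and "head" not in name:
            hidden_2d.append(p)                     # candidates for orthogonalized updates
        else:
            rest.append(p)                          # embeddings, norms, output head
    return hidden_2d, rest
```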
SOAP reduces Shampoo’s extra HPs from 4 to 1 and reaches near AdamW wall-clock at 360M–1.3B [Vyas2024SOAP]; but evidence for “second-order LR transfer like μP” is still incomplete, and theory/parameterization constraints suggest it will not come for free.
Under VRAM constraints, low-state methods that don’t change the training loop are often more reliable than methods that change optimization geometry: Apollo replaces per-param second moments with per-tensor scalars, pushing optimizer state toward SGD-like memory without touching the loop.
[counter to Camp A: AdamW won’t be retired (highest default priority)] Critics argue many “AdamW advantages” come from schedule and tuning inertia rather than the algorithm; with equal budgets, second-order or near-second-order methods may match or beat it.
[counter to Camp B: Muon is the next default (but only as a hybrid)] Critics emphasize public evidence is skewed toward speedruns/small-scale, and small proxies can systematically miss large-scale instabilities; without controlled ≥7B/≥30B comparisons, promoting the hybrid to a default is premature.
[counter to Camp C: Shampoo/SOAP is the canonical endgame (second-order wins)] Critics note production barriers are not only compute, but transferable tuning. Theory and evidence suggest second-order methods may require specific parameterizations and schedules to transfer across scales.
[counter to Camp D: optimizers matter less; many gains are evaluation artifacts] Opponents argue that even under controlled budgets, second-order or near-second-order methods can still yield lower loss at similar wall-clock, especially as systems implementations (distributed Shampoo/SOAP, hybrid routing) mature under specific tensor and system constraints; attributing all reported gains to artifacts overreaches.
Once FA2 varlen attention takes document boundaries via cu_seqlens, the systems cost of per-doc causal masking shifts from explicit large mask tensors to lightweight API parameters, making near-100% packing ratio compatible with strict cross-document isolation.
Naive concat with cross-doc visibility systematically underestimates training loss because later tokens can attend to unrelated-document tokens; the primary value of per-doc masking is objective–evaluation alignment, not just “leak prevention”.
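A minimal sketch of the lightweight boundary representation mentioned above: building the cumulative-sequence-length tensor that varlen attention kernels consume in place of an explicit block-diagonal mask. The packing layout is an illustrative assumption.

```python
import torch

def cu_seqlens_from_doc_lens(doc_lens: list[int]) -> torch.Tensor:
    """[len_0, len_1, ...] -> [0, len_0, len_0+len_1, ...] as int32, as varlen APIs expect."""
    cu = [0]
    for n in doc_lens:
        cu.append(cu[-1] + n)
    return torch.tensor(cu, dtype=torch.int32)

doc_lens = [1_024, 3_072, 4_096]            # three documents packed into one 8K sequence
cu_seqlens = cu_seqlens_from_doc_lens(doc_lens)
print(cu_seqlens)                            # tensor([   0, 1024, 4096, 8192], dtype=torch.int32)
# A varlen attention call would take flattened q/k/v plus cu_seqlens and
# max_seqlen = max(doc_lens) instead of a materialized per-doc causal mask.
```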
The engineering meaning of short-to-long is to turn long-window training from a full-run cost into a tail-budget decision: LLaMA-3 reports ~95% compute at 8K with a final 128K stage [Llama32024], while Qwen2.5 provides a reproducible RoPE-base increase (e.g., toward 1M) for its final long-context stage.
There is direct evidence that “fewer truncations is more robust than aggressive concat-chunk”: preserving document integrity and avoiding token drops improves LM outcomes [Ding2024FewerTruncations], so the default for over-length docs should be split-then-pack rather than truncate-and-drop.
FIM has a stable code recipe (StarCoder at 50% rate) [StarCoder2023][FIM2022], but making FIM the default objective for NL lacks clean NL-only rate sweeps; UL2’s mixture-of-denoisers provides motivation but also fragments objective definitions and complicates comparisons.
[counter to Camp A: per-doc masking + short-to-long (default engineering recipe)] Opponents argue that real usage often crosses documents (prompting/retrieval concatenation), so hard boundary isolation may under-train cross-segment modeling; short-to-long curricula also risk length overfitting or insufficient long-context exposure, which full-cycle mixed-length training avoids.
[counter to Camp B: uniformly mixed-length training (anti-curriculum, always mixed)] Public evidence still lacks direct equal-compute, matched-throughput comparisons proving always-mixed is better; long-window training also has higher systems cost and lower throughput across the whole run.
[counter to Camp C: naive concat + cross-doc visible (cross-boundary by default)] Naive concat conflates “cross-doc tasks” with “throughput side effects”: loss underestimation and evaluation mismatch are expected [Krell2021Packing]. A more controllable route is explicit cross-doc construction (e.g., related-document packing) rather than visibility as a packing side effect.
[counter to Camp D: FIM for everything (infilling as a universal default)] Code-side evidence is strong, but NL lacks clean rate sweeps; mixture-of-denoisers also changes the objective definition, making perplexity/downstream comparisons more difficult.
Compute-optimal tokens/param is not a constant: public refits place it in a wide 5–100 band and show sensitivity to data mixture and batch-size schedules [DeepSeek2024LLM]. Copying Kaplan≈1.7 [Kaplan2020ScalingLaws] or Chinchilla≈20 [Hoffmann2022Chinchilla] without a local refit is therefore risky.
In data-constrained regimes, repetition has an empirical safe window: up to ≤4 epochs is approximately equivalent to fresh tokens, after which marginal gains decay at a fit-able rate; treating D as a one-dimensional total-token count therefore systematically overestimates “just train longer” [Muennighoff2023DataConstrained].
Loss scaling can extrapolate reliably into over-training, but individual downstream benchmark scores can vary widely during training until a loss threshold is reached [Gadre2024OverTraining]. Thus, “predict a specific task score from loss” needs task-level modeling rather than loss extrapolation alone.
Task-level scaling needs its own model: “model ladders + two-stage regression” achieves 1.9% average prediction error on 7 multiple-choice tasks [Bhagia2024TaskLadders], outperforming approaches that force all tasks onto a single loss power law.
Data mixture is an independent scaling axis: at fixed model and token budgets, changing filtering/mixture alone can yield ≥7 pp downstream gaps [Li2024DCLM], often larger than second-order gains from tweaking tokens/param (e.g., 20→30).
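A minimal sketch of the “refit locally” workflow behind the scaling-law claims above: fit a saturating power law to a small model ladder and extrapolate only within the same family. Numbers are synthetic and the grid-plus-log-linear fitting procedure is an illustrative choice, not any paper’s method.

```python
import numpy as np

# Fit L(N) = E + A * N^(-alpha) on a small model ladder: grid over the irreducible loss E,
# then a linear fit of log(L - E) against log(N). Ladder values are synthetic.

ladder_N    = np.array([1e8, 3e8, 1e9, 3e9])      # parameters of the ladder models
ladder_loss = np.array([3.10, 2.85, 2.62, 2.45])  # measured validation loss (synthetic)

def fit_power_law(N, L):
    best = None
    for E in np.linspace(0.5, min(L) - 1e-3, 200):        # candidate irreducible losses
        y = np.log(L - E)
        slope, intercept = np.polyfit(np.log(N), y, 1)     # log(L-E) = log A - alpha*log N
        resid = np.sum((y - (slope * np.log(N) + intercept)) ** 2)
        if best is None or resid < best[0]:
            best = (resid, E, np.exp(intercept), -slope)
    return best[1:]                                         # E, A, alpha

E, A, alpha = fit_power_law(ladder_N, ladder_loss)
predict = lambda N: E + A * N ** (-alpha)
print(f"E={E:.2f} alpha={alpha:.3f} predicted loss at 10B params ~ {predict(1e10):.2f}")
```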
[counter to Camp A: Kaplan-style—portable exponents, compute-optimal favors model size] Counterevidence comes from more thorough training and IsoFLOP-style controls: when training is closer to completion, data scarcity becomes the bottleneck earlier and the fitted exponents move toward a more balanced N:D allocation.
[counter to Camp B: Chinchilla-style—balance N and D under fixed compute] Two lines of counterevidence point to exponent instability: under data constraints, repeated tokens are only equivalent to fresh tokens up to ≤4 epochs, after which decay bends the fitted exponents.
[counter to Camp C: Data-mixture pragmatists—data is the first axis; get it right first] The pushback is that “data-first” does not automatically solve budget allocation: even with a good mixture, the compute-optimal N:D must still be re-fit under the specific mixture and schedule.
[counter to Camp D: Against emergence-as-magic—many “emergent” effects are metric artifacts] The main pushback is: even with more continuous metrics, some capabilities may still accelerate in specific data/training phases, and production often cares about 0–1 capability onsets rather than smooth averages.
For repo-level tasks, cross-file context is a structural variable rather than a detail: both RepoBench [RepoBench2023] and RepoCoder [RepoCoder2023] show single-file setups systematically overestimate capability, so reporting only HumanEval-style single-file scores systematically overstates repo-level ability.
Execution semantics and function synthesis are different bottlenecks: CRUXEval [CRUXEval2024] isolates program-state tracking via I/O prediction; thus BPB/patch-PPL gains without CRUXEval gains should be read as distribution fitting rather than execution understanding.
At the SFT-ready stage, reporting HumanEval alone turns false positives into “capability”: EvalPlus[EvalPlus2023] shows inadequate tests systematically inflate scores, so “HumanEval-only” should not be an acceptable comparison protocol.
Freshness is an evaluation parameter, not a footnote: LiveCodeBench [LiveCodeBench2024] operationalizes time windows; without disclosing training cutoff and eval window, cross-model differences are more likely contamination than method improvements.
Agent-loop cost structure can flip conclusions: self-debug/self-repair work shows final pass/fail hides differences in retries and feedback quality [SelfDebug2023][SelfRepair2023]; deployment evaluation must separately report retry@k, test-execution cost, and feedback quality.
[counter to Camp A: HumanEval/MBPP is sufficient to represent coding ability] RepoBench [RepoBench2023] and RepoCoder [RepoCoder2023] show single-file setups miss cross-file dependencies and project structure; EvalPlus [EvalPlus2023] shows weak tests inflate scores.
[counter to Camp B: SWE-bench Verified is the only trustworthy ground truth] Verified reduces noise but is Python-heavy: Multi-SWE-Bench [MultiSWEBench2025] shows language/ecosystem shifts can reorder rankings across 7 languages, so a single Verified score does not extrapolate; SWE-Rebench-style refreshes [SWERebench2025] add decontamination and freshness requirements.
[counter to Camp C: trajectory-level PPL (e.g., patch-PPL) is the most reliable signal] Public evidence mostly supports “editing modeling matches task form” [InCoder2022], but does not systematically establish correlation between patch-PPL and Verified; most evidence stops short of that link.
[counter to Camp D: deployment UX metrics reflect SWE-agent value better] The weakness is the lack of standardized, reproducible deployment-level benchmarks and public logging protocols: many results remain lab-specific and hard to align across organizations.
On evaluations requiring exact copying/addressing, the dominant failure mode of pure SSMs/attention-free models is not “insufficient context length” but “non-invertible state compression”: recall-centric metrics and ICL comparisons show a systematic gap versus attention at matched scale.
“Attention and SSMs can be unified” should be read as: causal attention contains a subclass representable by structured recurrences, not that all attention can be losslessly replaced. SSD formalizes the translatable subset and explains why the unification does not license wholesale replacement.
Hybrid gains come from role separation, not naive stacking: attention layers implement copy/retrieval algorithms compatible with induction-head mechanisms, while SSM layers handle long-range routing and compression; Jamba/Griffin designs and their ablations support this division of labor.
The claim “subquadratic models must be pretrained from scratch” is weak after 2024: distillation and conversion show feasible paths to transfer interaction patterns from pretrained Transformers into SSM/linear-attention backbones, recovering much of the teacher’s quality at a fraction of the pretraining cost.
[counter to Camp A: Pure SSMs will eventually replace Transformers] Counterevidence concentrates on “precise addressing” tasks: under controlled training, recall/ICL deficits of attention-free models are reproducible and conflict with mechanistic accounts of induction-head-style copying.
[counter to Camp B: Linear attention and SSMs are fundamentally the same] Same family does not mean equivalent: linear attention often maintains matrix states (or low-rank approximations), closer to attention for “multi-slot memory/addressable recall” than compressed scalar-state recurrences.
[counter to Camp C: Subquadratic models must be pretrained from scratch; conversion fails] Post-2024 evidence leans toward “controllable transfer”: explicit objectives and staged strategies for distilling attention structure into SSMs exist [Bick2024TransformersToSSM], making from-scratch pretraining no longer the only path.
[counter to Camp D: The engineering optimum is hybrid; minimize attention layers] The main pushback is twofold: (1) we lack systematic curves for the minimal number and placement of attention layers; (2) hybrids add implementation complexity and kernel/stack maintenance burden.
For decoding, the primary KPI is bytes/token rather than attention FLOPs: MQA shares K/V across heads to directly target KV-cache bandwidth [Shazeer2019MQA]; GQA/MLA continue the same objective, turning “architecture changes” into measurable bytes/token reductions.
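A minimal worked example of the bytes/token ledger above: KV-cache bytes per token for MHA vs. GQA vs. MQA at one illustrative (LLaMA-70B-like) shape; the configuration values are assumptions for arithmetic only.

```python
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int, bytes_per_elem: int = 2) -> int:
    # Two tensors (K and V) per layer, each [n_kv_heads, head_dim] per token, bf16 by default.
    return 2 * n_layers * n_kv_heads * head_dim * bytes_per_elem

cfg = dict(n_layers=80, head_dim=128)
mha = kv_bytes_per_token(n_kv_heads=64, **cfg)   # full multi-head K/V
gqa = kv_bytes_per_token(n_kv_heads=8,  **cfg)   # grouped-query, 8 KV heads
mqa = kv_bytes_per_token(n_kv_heads=1,  **cfg)   # single shared K/V head
print(f"MHA {mha/2**20:.2f} MiB/token, GQA {gqa/2**20:.3f} MiB/token, MQA {mqa/2**20:.4f} MiB/token")
# At 128K context, the per-sequence cache is roughly bytes_per_token * 131072.
```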
FlashAttention is not just “faster attention”; it removes the N×N intermediate from HBM: SMEM tiling + online softmax shifts the ceiling from purely memory-bound to a regime tunable via tiling and parallelism [Dao2022FlashAttention]; on Hopper, FA3 extends the same idea with TMA and warp specialization [Shah2024FA3].
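A minimal sketch of the online-softmax recurrence referenced above, processing one query row over key/value tiles so the full score matrix is never materialized; this is a numerics illustration only (no SMEM/tiling machinery), and the tile size is arbitrary.

```python
import torch

def online_softmax_attention(q, K, V, tile: int = 128):
    """Single-query attention computed tile-by-tile with a running max and denominator."""
    m = torch.tensor(float("-inf"))       # running max of scores
    denom = torch.tensor(0.0)             # running softmax denominator
    acc = torch.zeros(V.shape[-1])        # running weighted sum of values
    for start in range(0, K.shape[0], tile):
        k, v = K[start:start + tile], V[start:start + tile]
        s = k @ q / q.shape[0] ** 0.5                 # scores for this tile
        m_new = torch.maximum(m, s.max())
        scale = torch.exp(m - m_new)                  # rescale previously accumulated results
        p = torch.exp(s - m_new)
        denom = denom * scale + p.sum()
        acc = acc * scale + p @ v
        m = m_new
    return acc / denom

q = torch.randn(64); K = torch.randn(1024, 64); V = torch.randn(1024, 64)
ref = torch.softmax(K @ q / 8.0, dim=0) @ V
assert torch.allclose(online_softmax_attention(q, K, V), ref, atol=1e-4)
```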
“Same algorithm, different kernel is equivalent” is falsifiable in training: numerical deviations in fused attention can leak through optimizer state and trigger loss spikes [Golden2024FAStability], while FP8 drift may be invisible in short runs yet accumulate over long horizons.
Low-precision training behaves like a “numeric contract,” not a one-off trick: MXFP8/MXFP4 recipes codify coupling among per-block scaling, master weights, and optimizer updates, implying kernel scale/accumulation paths are part of the algorithm, not an implementation detail.
“Kernels decide everything” does not hold: long-context can be extended mainly via positional/training strategies (PI, YaRN, continual pretraining), reaching ~32k without kernel rewrites [Chen2023PI][Peng2023YaRN][Xiong2023LongContextScaling].
[counter to Camp A: algorithms and kernels must be co-designed (bytes/FLOP ledger)] Opponents argue that many extensions (long context, distributed training) can be done at framework/training-strategy level without dragging teams into CUDA details, and that abstraction layers keep absorbing more of the kernel work.
[counter to Camp B: PyTorch/graph-compiler level is sufficient; handwritten kernels are niche] The counterpoint is that attention and low-precision ceilings are often dictated by hardware features (SMEM/TMA/warp specialization, FP8/MXFP8 scale semantics), which graph compilers do not yet fully expose.
[counter to Camp C: hardware will get faster; algorithms need not adapt] The counter-evidence has two parts: (1) FP8/MXFP8 stability depends more on scale semantics and per-block mechanisms, so BF16 extrapolation breaks [Micikevicius2022FP8]; (2) hardware generations keep changing which algorithm/kernel co-designs are efficient.
[counter to Camp D: non-NVIDIA hardware will catch up; CUDA will stop being mandatory] The counterpoint is that frontier LLM critical paths often depend on specific hardware capabilities and mature libraries (e.g., FlashAttention-3 explicitly encodes Hopper capabilities), so waiting for parity forfeits real gains.
Placing TP/EP across nodes (IB) amplifies per-layer collective latency into a “layers × per-step” tax; accordingly, public >10K GPU recipes default to TP=8 within NVLink islands, and use PP as the first dimension to span IB [Shoeybi2019Megatron].
When PP is unavoidable, bubbles are not a minor tweak: zero-bubble constructions drive bubbles near 0 under synchronous semantics, often corresponding to ~2–5 recoverable MFU points in practice.[Qi2023ZeroBubble]
For long context, SP is often the first lever: it reduces activation memory by ~5×, cuts recompute overhead from ~36% to ~2%, and does not increase TP communication volume.[Korthikanti2022SP]
Once attention exceeds ~50% of step time or L≥32K, CP shifts from “faster attention” to a required mesh dimension: Ring distributes KV along a ring [Liu2023RingAttn]; Ulysses reshapes heads via all-to-all [Jacobs2023Ulysses]; USP selects between the two based on topology and head/sequence shape.
Public auto-parallel evidence reads more like “encode manual heuristics into a compiler” than “automatically find the ceiling plan”: Alpa’s ILP search and GSPMD’s compiler sharding reduce annotations, but without matched-scale 100B+ and >1K GPU comparisons, ceiling claims remain open.
[counter to Camp A: hand-tuned 4D with topology-aware mapping (Megatron/MegaScale)] Manual plans are experience-heavy and require retuning when architectures shift (MoE, long context, heterogeneous layers); and “reproducible” is not “optimal,” potentially leaving throughput on the table.
[counter to Camp B: auto-parallel (Alpa / GSPMD / pjit line)] The gap is ceiling evidence, not feasibility: public materials lack matched-scale 100B+ and >1K GPU comparisons (same model, same topology, reporting MFU and NVLink/IB/cross-pod traffic breakdowns).
[counter to Camp D: 3D (DP+TP+PP) is enough; SP/CP are optional optimizations] Public long-context systems already push SP/CP into commonly used dimensions: SP reduces memory without increasing TP comm volume [Korthikanti2022SP]; at L=32K–128K, CP becomes the lower-risk way to keep attention scaling [Liu2023RingAttn][Jacobs2023Ulysses].
On controlled compositional tasks, nominal 128K often behaves like ~32K effective context; NIAH-style retrieval hit rates systematically overestimate utilization, and [Hsieh2024RULER] provides a reproducible decomposition protocol.
“Lost-in-the-middle” is a stable degradation mode: evidence in the middle can be 20+ pp worse than at the ends, and [Liu2023LostInMiddle] shows effective context is not a monotonic function of window size.
Positional extrapolation alone (PI/ALiBi/YaRN) mainly makes longer runs possible but does not guarantee stable gains on non-retrieval long-context tasks; [Kazemnejad2023PEImpact] supports “PE matters but is not a single-point solution”.
On the same base model, continued pretraining with long-document upsampling plus domain-distribution preservation lifts long-text tasks (~3–7 pp) more directly than PE-only changes, and tracks effective context more closely; [Fu2024DataEngineering] and [Xiong2023EffectiveLongCtx] provide reproducible recipes.
Related-document concatenation turns cross-span reference/repetition into within-sequence statistics, strengthening ICL and cross-document reasoning; the mechanism is motivated by burstiness-driven ICL [Chan2022DataDist] and evidenced by [Shi2023InContextPretraining].
[counter to Camp A: positional extrapolation is enough; long context is a PE problem] Controlled evaluation and ablations suggest “PE makes longer runs possible, but doesn’t automatically teach long utilization”. [Hsieh2024RULER] shows effective-context shortfalls persist even when extrapolation nominally succeeds.
[counter to Camp B: the data recipe is the main variable; long-doc upsampling suffices] Data alone can hit operator/systems bottlenecks: without efficient attention kernels, long-sequence training cost limits iteration speed; [Dao2023FlashAttention2] shows kernel-level work is what keeps long-sequence iteration affordable.
[counter to Camp C: packing/concatenation is under-exploited; sequence construction first] The evidence chain is still sparse: beyond direct evidence like [Shi2023InContextPretraining], many packing details remain engineering lore without controlled ablations.
[counter to Camp D: switch architectures (sparse / memory / recurrent) to bypass length limits] Architecture bypasses often optimize for “processable length” but may not win on “compositional reasoning”; without controlled compositional alignment, they can become length-capacity wins that do not transfer to quality.
For generalists, 15–25% code behaves like a stable default: public continual results show that after adding ~500B code tokens, MMLU drops by <1 pp while coding improves substantially [CodeLlama2023]; treating >30% as an experiment zone requires explicit generality regression checks.
Mixture gains are not linear: synergy/interference phase plots predict steep gains at low fractions and more mutual suppression near >~40% [Aghajanyan2023SciMix], so “>40% is monotonically better for generalists” lacks both a mechanism and public evidence.
Code tokens are often “more stable” but also displace budget: at similar training loss, code-domain downstream variance is lower than web (reported ~30% scale difference) [Gadre2023Overtraining]; under fixed compute, any increase in code tokens displaces budget for other distributions.
Before debating ratios, fix token organization and objectives: repo-level packing injects real dependency structure into a single sequence [Shi2023InContextPretraining], and FIM/infilling aligns training with edit-style usage with small side effects.
[counter to Camp A: generalists should push code past >40%; more is always better] Public evidence reads more like “specialists are feasible” than “generalists are monotonically better.” High-code generality cost is quantified for from-scratch specialists, not for continual generalists.
[counter to Camp B: code helps mostly via optimization/regularization (low entropy)] Optimization effects explain “stability,” but not “why code rather than any low-entropy corpus.” NL-PL joint pretraining shows structured transfer across tasks [Feng2020CodeBERT].
[counter to Camp C: keep code <10% to protect NL in generalists] Public evidence does not support a hard “<10% is the only safe” threshold: Code Llama shows that even large code-token additions can keep the MMLU drop under 1 pp [CodeLlama2023].
[counter to Camp D: code ability must be trained from scratch; continual is a dead end] “Will eventually fail” needs public failure curves or reproducible experiments, but the more salient public datapoint is a counterexample: Code Llama’s continual recipe keeps NL metrics roughly intact [CodeLlama2023].
For new projects, FA1 [Dao2022FA1] is better kept as a historical baseline than a default implementation; for the same exact attention algorithm, FA2 [Dao2023FA2] shows that parallel granularity and work partitioning alone can produce ~2× throughput gains.
On H100/H200, BF16/FP16 training should default to FA3 [Shah2024FA3]; on A100/H20, FA2 [Dao2023FA2] remains the default because FA3’s main gains depend on Hopper TMA and warp specialization rather than universally portable tricks.
On the serving side, the main ROI has shifted from writing yet another faster kernel to decode parallelism and KV strategy; Flash-Decoding [Dao2023FlashDecoding] and FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] show that single-query decode needs its own parallelization and KV-management strategies.
FlexAttention [Dong2024Flex] should be the default entry point for attention variants; specialized implementations become justified only when mask/score semantics directly alter tiling, memory access, or distributed blockwise execution, which is when a hand-written kernel pays off.
FP8 attention training is not a toggle but an error-path management problem; without a BF16 control, rescaling-order checks, and outlier-token monitoring, the FP8 path in FA3 [Shah2024FA3] should not be treated as a production default [Golden2024FAStability].
[counter to Camp A: FlashAttention is largely the endpoint of attention engineering] The counterargument is that richer masks, distributed blockwise execution, and new hardware features will keep creating demand for specialized kernels, as illustrated by FlashMask-style implementations [Wang2024FlashMask].
[counter to Camp B: The main line should move from hand-written CUDA to compiled abstractions] The counterargument is that compiler-based abstraction does not cover every case; once mask/score semantics alter memory access, or distributed blockwise execution is required, specialized implementations remain necessary.
[counter to Camp C: Attention itself should be replaced by SSM, linear, or sparse forms] The counterargument is that retrieval and in-context learning remain strong points for attention in general LLM settings; the comparisons in Waleffe et al. [Waleffe2024] point toward hybrids rather than wholesale replacement.
[counter to Camp D: FA3 embodies NVIDIA generation-specific lock-in, so avoid it] The counterargument is that lock-in cannot be discussed in the abstract; it depends on hardware share and payoff. If H100/H200 already dominate training, avoiding FA3 forgoes concrete gains for a hypothetical portability benefit.
Token-level weighting (e.g., Rho-1) can beat example-level filtering on data efficiency, but it shifts complexity from the data gate to the training loop (stats, scheduling, packing interactions); without observability and regression tests, the added complexity can regress training silently.
[counter to Camp A: classifiers + ablation ladders are sufficient (bet on proxies)] Two risks are often underestimated: (1) proxy domain-of-validity and scale drift can let “old classifiers dominate new-model data gates”; (2) long-horizon training can drift outside the gate’s calibration regime.
[counter to Camp B: influence/attribution is the main path (infer recipes from gradients)] The strongest negative evidence also comes from influence: <10% cross-scale top-influence overlap implies the “key example set” is not stable; transferring small-model attributions to large models is unreliable.
[counter to Camp C: full causal identification is the future (solve confounding)] Identification assumptions are hard costs: IV exclusion restrictions, instrument strength, and modeling evaluator bias can fail and yield directional errors. In practice these assumptions are rarely verifiable at pretraining scale.
[counter to Camp D: skip measurement, rely on intuition and scale (scaling solves it)] Two engineering realities constrain pure scale-first: (1) token cost and data availability are harder constraints in 2026; (2) coverage is not effective coverage—noisy or redundant tokens inflate coverage without adding capability.
Offline mixture search under mid compute behaves more like calibration than global optimization: RegMix/mixing laws fit local response surfaces from few points, but the target drifts when buckets or dedup versions drift; Held et al. show several utility estimators underperform simple heuristics.
[counter to Camp A: Formal search first (laws/regression/robust optimization)] A counterpoint is “complex estimators can be unstable”: Held et al. show several utility estimators underperform simple heuristics; these methods are also sensitive to bucket definitions and dataset versions.
[counter to Camp B: Heuristics + curricula are more robust; a few ablations suffice] The blind spot is transferability and interpretability: with finer buckets, or when objectives shift from average metrics to worst-bucket or specific capabilities, purely heuristic ablations stop generalizing.
[counter to Camp C: Online/adaptive mixing beats one-shot offline search] The main risks are cost and accounting: extra phases, stable domain accounting, and transfer assumptions (similar buckets and horizons). With coarse or drifting buckets, online gains become hard to attribute.
[counter to Camp D: Ratios are second-order; quality/selection is first-order] Two counterexamples matter simultaneously: (1) “cleaner” can remove coverage needed for long-horizon training; Nemotron-CC explicitly argues against over-filtering [NemotronCC2024]; (2) at fixed quality, mixture ratios alone can still produce multi-pp downstream gaps, so ratios are not strictly second-order.
Setting RoPE base from 10000 directly to the target-window scale (e.g., ~500000 for 128K) is not a magic knob: if base is too small, low-frequency dimensions lack phase coverage at the target length, making far positions indistinguishable.
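A minimal sketch of the phase-coverage intuition above: check whether the longest per-dimension wavelength implied by a RoPE base exceeds the target window. The “longest wavelength ≥ target length” rule encoded here is an illustrative reading of the bound being discussed, not the cited paper’s exact criterion.

```python
import math

def longest_wavelength(base: float, d_head: int) -> float:
    """Wavelength (in positions) of the lowest-frequency rotary pair."""
    lowest_freq = base ** (-(d_head - 2) / d_head)   # inv_freq of the last rotary dimension
    return 2 * math.pi / lowest_freq

def base_covers_window(base: float, d_head: int, target_len: int) -> bool:
    return longest_wavelength(base, d_head) >= target_len

for base in (10_000, 500_000, 1_000_000):
    print(base, round(longest_wavelength(base, 128)), base_covers_window(base, 128, 131_072))
```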
At 512K–2M targets, a single global scaling (PI-style interpolation or a global NTK-aware scale) induces per-dimension mismatch across frequency bands; LongRoPE learns non-uniform per-dim scales via search and pairs them with longer fine-tuning to reach 2M-level windows.
“Advertised 128K” does not imply “effective 128K”: RULER finds many models effectively usable only up to ~32K across task families; acceptance should rely on cross-length task curves from RULER/LongBench/LV-Eval rather than PPL or single-point needle tests.
Bypass routes can be cheaper in complexity (e.g., linear-time Mamba, memory-augmented RMT, Infini-attention), but for recall-heavy multi-hop tracing they cannot be justified by throughput or max window alone; they need benchmark-matched comparisons on recall-heavy suites.
[counter to Camp A: native long-context (ABF + curriculum) is the cleanest route] Cost and data are hard constraints: continual pretraining needs long-text data and systems throughput, and is unfriendly to existing deployed models; in the 32K–128K range, retrofit routes are often sufficient.
[counter to Camp B: YaRN, not PI, should be the default for 32K–128K retrofitting] PI’s advantage is simplicity and low barrier: minimal implementation changes and fewer fine-tune steps, useful for quick “make the window run” validation; if tasks are dominated by simple retrieval, PI can be good enough.
[counter to Camp C: ≥512K needs LongRoPE-style per-dim search; global formulas fail] Search and long fine-tuning raise costs, and systems bottlenecks (KV cache/communication) may dominate before RoPE parameterization does; for many product tasks, marginal quality gains from per-dim search may not justify the cost.
[counter to Camp D: bypassing RoPE/attention (SSM/external memory/retrieval)] Cost advantages do not automatically translate to recall-heavy quality: summarization/write-back introduces information bottlenecks, and retrieval introduces recall errors that compound across multi-hop chains.
In the 32K–128K regime, positional extrapolation is no longer the main bottleneck; data distribution and evaluation protocol more often determine the ceiling of effective context [PI2023][YaRN2023][Fu2024DE128K][Gao2024EffectiveLongCtx].
Passing NIAH or perplexity is not enough to show that a model will use mid- and far-context evidence on real long tasks; RULER, LV-Eval, LongBench, NoCha, and RepoQA all provide counterexamples [RULER2024][LVEval2024][LongBench2023][NoCha2024][RepoQA2024].
Raising long-document ratio to roughly 25%+, controlling packing truncation, and using staged window growth are usually more reliable than simply adding more long tokens [Fu2024DE128K][Xiong2023EffectiveLongContext][Ding2024Packing][Xu2025UltraLong].
At 1M+, the problem shifts back to systems stability: kernels, parallelism, and positional-scaling error jointly determine whether training can preserve short-context ability [FA2][LongRoPE2024][Gemini15][Xu2025UltraLong].
[counter to Camp A: explicit positional design is required, and RoPE extrapolation is the main lever] Press et al. [ALiBi2021] already provide a non-RoPE counterexample for length extrapolation; Kazemnejad et al. [Kazemnejad2023PE] and Lu et al. [Lu2024ControlledLongContext] show positional design matters but is not the sole determinant of long-context utilization.
[counter to Camp B: scaling to million tokens is mostly a parallelism problem] Systems optimization does determine whether training enters a workable regime, but Hsieh et al. [RULER2024], Gao et al. [Gao2024EffectiveLongCtx], and Fu et al. [Fu2024DE128K] show that data and evaluation, not parallelism alone, determine effective context.
[counter to Camp C: NIAH/perplexity are sufficient; task benchmarks are contaminated] Zhou et al. [BenchmarkCheater2023] do warn that benchmarks can be contaminated, but that is not a reason to fall back to a single proxy. Hsieh et al. [RULER2024], Karpinska et al. [NoCha2024], and related work show task benchmarks expose failures that NIAH and perplexity miss.
[counter to Camp D: sparse, memory, or linear-attention alternatives will make this moot] Alternative architectures are attractive on complexity grounds, but public evidence still leans more toward feasibility and efficiency than head-to-head comparisons at matched scale on recall-heavy suites.
Decoupling load balancing from the main loss (bias EMA / sign-gradient updates) demotes aux-loss implementation sensitivity from a dominant to a secondary factor: stat scope (micro-batch vs. global), DP sync, and detach choices can change usage CV and specialization more than the balancing objective itself.
Fine-grained experts (64–128, up to 256) plus one shared expert is a structural way to lower the specialization barrier: the shared expert carries high-frequency common patterns, reducing pressure for routed experts to learn generic functions.
The main value of router z-loss is numerical stability rather than “smarter routing”: it prevents router-logit blow-up and reduces early traffic collapse to a few experts; in stacks that gate on token drop and dead experts, z-loss (often around 1e-3 in public recipes) remains a cheap safeguard.
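A minimal sketch of the router z-loss term described above: penalize the squared log-sum-exp of the router logits so they stay in a numerically safe range. The coefficient and tensor shapes are illustrative.

```python
import torch

def router_z_loss(router_logits: torch.Tensor, coeff: float = 1e-3) -> torch.Tensor:
    """router_logits: [num_tokens, num_experts] pre-softmax gate scores."""
    z = torch.logsumexp(router_logits, dim=-1)   # one scalar per token
    return coeff * (z ** 2).mean()

logits = 5.0 * torch.randn(4_096, 64, requires_grad=True)
loss = router_z_loss(logits)
loss.backward()   # gradients push logit magnitudes down, preventing blow-up
print(loss.item())
```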
“Dense→MoE upcycling is always cheaper” does not hold: upcycling scaling laws show early saturation regimes with “more experts without more quality” [Liew2025Upcycling]; ROI must include extra stabilization steps, all-to-all communication, and inference-side memory residency.
Evidence is insufficient that “learned routing/balancing is always necessary”: systematic experiments show frozen random routers can be close to learned routers in some settings [Fan2024EmpiricalMoEChoices]; but a second strong LLM-scale replication is still missing.
[counter to Camp A: MoE becomes the default backbone (dense kept for small models)] Counterarguments focus on full-lifecycle ROI: all-to-all, dispatch, and router stabilization can negate compute advantages in wall-clock and engineering cost; upcycling saturation further weakens the “MoE is always cheaper” framing [Liew2025Upcycling].
[counter to Camp C: learned routing/balancing is overrated (random/frozen routers suffice)] The counter is extrapolation risk: current evidence lacks a second strong LLM-scale replication and does not cover training control like DeepSeek-V3’s hard gates on dead experts and token drop.
[counter to Camp D: MoE is mainly for pretraining; post-training should densify] The counter is an evidence gap: there are few public LLM-scale matched comparisons of sparse SFT/RLHF vs dense SFT/RLHF; systems intuition alone is insufficient to decide whether post-training should stay sparse.
On modern transformer stacks, original µP is not “usable by default”; without fixing modules such as QK-Norm, tied embeddings, and residual scaling, LR transfer shows architecture-dependent drift, and a coord check usually reveals the mismatch.
When the SP recipe is fixed, empirical formulas plus 1–2 small sweeps often get target-scale LR/batch initialization within about 10%, but once aspect ratio, training duration, or inference-cost objectives change, the fit drifts substantially.
Under AdamW, weight decay is an independent control axis and should not be treated as a fixed background setting; in several practical settings, wd can affect LR transfer error more than the parameterization itself [Kosson2025WDMoreThanMuP]
Depth and precision are not secondary corrections to width transfer: when depth doubles, first run a small controlled comparison using depthwise rules; when moving from bf16/fp16 to fp8, include activation/gradient/weight RMS monitoring in the acceptance checklist.
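A minimal sketch of the RMS monitoring mentioned above, assuming a generic PyTorch model: forward hooks that log per-module activation RMS and flag values outside an expected range (weights and gradients would get analogous checks). Thresholds are placeholders.

```python
import torch

def attach_rms_monitors(model: torch.nn.Module, warn_above: float = 1e2, warn_below: float = 1e-3):
    """Register forward hooks on Linear layers that record activation RMS per module."""
    stats = {}

    def make_hook(name):
        def hook(_module, _inputs, output):
            if isinstance(output, torch.Tensor):
                rms = output.detach().float().pow(2).mean().sqrt().item()
                stats[name] = rms
                if rms > warn_above or rms < warn_below:
                    print(f"[rms-monitor] {name}: RMS={rms:.3e} outside expected range")
        return hook

    for name, module in model.named_modules():
        if isinstance(module, torch.nn.Linear):
            module.register_forward_hook(make_hook(name))
    return stats

model = torch.nn.Sequential(torch.nn.Linear(256, 1024), torch.nn.GELU(), torch.nn.Linear(1024, 256))
stats = attach_rms_monitors(model)
_ = model(torch.randn(8, 256))
print(stats)
```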
[counter to Camp A: Complete-P should be the default starting point, and coord check the gate] The counterargument is that modern training stacks keep changing modules, optimizers, precision, and norm design, and few teams actually satisfy these assumptions; under such drift, coord checks and small confirmation sweeps remain necessary even with Complete-P.
[counter to Camp B: empirical formulas plus small sweeps are enough; µP is overkill] The counterargument is that formula stability depends heavily on an unchanged recipe; once aspect ratio, training duration, or inference-cost objectives change, the optimum can shift substantially (Gemstones reports ~3×).
[counter to Camp C: end-to-end BO / automatic optimization will replace both] The counterargument is that end-to-end automation is still too expensive at LLM budgets, and starting from a poor initialization wastes large amounts of compute on exploration.
[counter to Camp D: transfer error is dominated by non-transferable hyperparameters] The counterargument is that if every hyperparameter is modeled separately, the simplicity of transfer disappears and the process collapses back into full search; the pragmatic middle ground is to align structural hyperparameters via parameterization/formulas and sweep wd/β₂ separately.
The main risk variable is not simply whether repetition exists, but how total exposure is distributed globally; cross-corpus overlap makes per-pool dedup underestimate true exposure [Elazar2023WhatsInMyBigData].
For benchmarks, PII, and copyrighted text, a default of 0–1 exposure is safer than mixing them into the main pool and training for multiple passes, because repeated exposure raises extractable memorization risk as model size and context len
Semantic dedup addresses a second redundancy layer that exact dedup misses, but it should sit on top of strong exact/near-exact dedup rather than replace it [SemDeDup2023][D42023].
[counter to Camp A: Dedup as much as possible; repetition is mostly noise] The weakness is that it can collapse “passive web redundancy” and “active multi-epoch training on a finite pool” into one bucket. Muennighoff et al. [Muennighoff2023DataConstrained] show the latter can be close to fresh-token-equivalent for a few epochs.
[counter to Camp C: The real battleground is semantic dedup; exact dedup is solved] The weakness is that exact/near-exact duplication remains a cheap and dense redundancy source. Lee et al. [Lee2021Dedup] already show that substring- and MinHash-level dedup yields measurable perplexity and leakage gains.
[counter to Camp D: For sensitive/eval/copyrighted data, zero repetition is the rule] The usual counterargument is that if the mixture ratio is tiny and the total token count is huge, the risk is negligible. Carlini et al. [Carlini2022Memorization] argue the opposite: extractable memorization grows with model size and context length, so small fractions of sensitive data are not automatically safe.
Within the same tokenizer/objective/model family, validation cross-entropy (log PPL) follows reproducible power-law fits over multiple orders of magnitude and is cost-effective for budgeting and early stopping; the extrapolation error is typically small enough for budgeting decisions within the family.
Compute-optimal conclusions are sensitive to training regime (overtraining vs not, duration assumptions) and fitting details; if PPL drives budgeting, training-duration assumptions and fit uncertainty must be reported, otherwise the same evidence can support different compute-optimal conclusions.
There is reproducible evidence that downstream scores can differ materially at near-identical pretraining loss; therefore the strong claim “same PPL implies same capability” is false, and PPL behaves more like a necessary condition than a sufficient one.
Compression/sparsification acceptance cannot be replaced by “PPL barely changed”: at high sparsity PPL can stay nearly flat while task scores drop, and behavioral distribution drift (e.g., JS divergence) tracks risk more closely.[KhanalCapo
Defining scaling targets directly on per-task scores can yield stable fits and extrapolations under regimes like overtraining; the more robust engineering decision variable is “task curve + extrapolation error,” not PPL as a task proxy.[Isi
[counter to Camp A: PPL remains the most reliable primary variable (at least in-loop)] Counterexamples where downstream differs at the same loss show that “PPL-driven selection” can fail under recipe tweaks, optimization-path changes, or post-training, making out-of-loop decisions unreliable on PPL alone.
[counter to Camp B: PPL is stage-1 only; stage-2 must use per-task scaling] Per-task modeling is expensive and panels drift; many task metrics have thresholds and discrete jumps, making extrapolation unstable and “curve-driven” workflows hard to operationalize.
[counter to Camp C: stop searching for a scalar; define quality via standardized panels] Multi-panels can devolve into metric sprawl without a unified budgeting/extrapolation framework; training still needs a cheap continuous signal for control loops, which panels cannot provide at that cadence.
[counter to Camp D: the issue is ontological—next-token loss is not the right objective] More complex objectives make stable scaling laws and reproducible budgeting harder; training still needs a monitor aligned with the optimized objective, and alignment between a new objective and product quality is itself unvalidated.
Under packing, naive concatenation with cross-doc visibility systematically underestimates training loss and changes the objective from within-document conditional likelihood to cross-document likelihood; per-doc causal masking keeps the objective aligned with within-document likelihood and with evaluation [Krell2021Packing].
Fewer truncations and better document integrity yield measurable LM gains, so split-then-pack is a safer default for over-length docs than truncate-and-drop or concat-then-chunk [Ding2024FewerTruncations].
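A minimal sketch of the split-then-pack default named above: over-length documents are split into window-sized pieces (never dropped), then packed into fixed-length sequences while keeping document boundaries for masking. The greedy next-fit packer over length-sorted pieces is an illustrative choice, not a specific paper’s algorithm.

```python
def split_then_pack(docs: list[list[int]], seq_len: int) -> list[list[list[int]]]:
    """Return sequences, each a list of document pieces whose lengths sum to <= seq_len."""
    pieces = []
    for doc in docs:
        for i in range(0, len(doc), seq_len):           # split over-length docs, keep the tail
            pieces.append(doc[i:i + seq_len])
    sequences, current, used = [], [], 0
    for piece in sorted(pieces, key=len, reverse=True):  # greedy next-fit over sorted pieces
        if used + len(piece) > seq_len:
            sequences.append(current)
            current, used = [], 0
        current.append(piece)                            # piece boundaries feed per-doc masking
        used += len(piece)
    if current:
        sequences.append(current)
    return sequences

packed = split_then_pack([list(range(10_000)), list(range(3_000)), list(range(900))], seq_len=4_096)
print([[len(p) for p in seq] for seq in packed])
```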
Long-context behaves like a late-stage purchase: public recipes often allocate 85–95% compute at 8K and 5–15% to a 128K mid-train; this requires stable, controllable long-sample construction and sampling in the data pipeline [Llama3Blog2024].
RoPE extrapolation is not a single trick: changing RoPE base (e.g., 10K→1M) and positional interpolation are different levers—one changes frequency distribution, the other remaps positions—and they differ in how much they rely on mid-training long-context data versus inference-time adjustment.
FIM has a mature default recipe for code models (StarCoder 50% rate), but making it universal changes training from pure causal to mixed objectives; without NL-only rate-sweep ablations, a safer default is on for code and off for NL [StarCoder2023].
[counter to Camp B: always-mixed lengths (anti-curriculum)] Public evidence leans toward curriculum-style acceleration and budgeted long-context; and without near-zero-padding packing, always-mixed distorts compute accounting and makes equa
[counter to Camp C: cross-doc visibility by default (beyond-boundaries c] It systematically underestimates loss and rewrites the objective, breaking semantic alignment between training and evaluation metrics; it also turns cross-doc depende
[counter to Camp C: cross-doc visibility by default (beyond-boundaries c] Krell et al. [Krell2021Packing] prove that cross-doc visibility systematically underestimates training loss, rewrites the pretraining objective, leading to inconsiste
[counter to Camp D: FIM/denoising-style objectives as default (infilling] Making FIM the default changes tokenizer conventions, data transforms, and evaluation framing; it has a mature recipe for code, but NL lacks no-regression rate sweeps
On exact copying/recall paradigms, pure SSMs more readily accumulate interference and blur: Jelassi et al. [Jelassi2024RepeatAfterMe] provides systematic evidence that Transformers outperform generalized SSMs on copying; Mamba also lags siz
The core of Hybrids is not “make attention cheaper” but “use less attention”: Jamba keeps downstream quality with a 1:7 (attention:Mamba)+MoE recipe while fitting 256K inference on a single 80GB GPU [Lieber2024Jamba]; Zamba goes further by
At ~128K context, end-to-end training throughput is dominated by kernel form: Samba reports ~4× faster training at 128K [Ren2024Samba]; SSD duality explains why matmul-friendly implementations can yield 2–8× training speed ranges by relatin
“Usable long context” cannot be substituted by perplexity: Lost in the Middle shows systematic degradation in using mid-context information [Liu2023LostInTheMiddle], so Hybrid/SSM gains must be compared on standardized long-context suites l
RWKV/linear recurrence is strongest for constant-memory inference and streaming, but matched-scale comparisons against Hybrids on addressable retrieval remain thin; theory suggests constant-memory recurrence may need stronger state structur
[counter to Camp A: Pure SSM will be the endgame for long context] Exact recall/copying and ICL look like structural gaps: Transformers more readily learn near-discrete retrieval on copying tasks [Jelassi2024RepeatAfterMe], and Mamba lags o
[counter to Camp B: Hybrids are the production default (tunable 1:3 to 1] Hybrids add implementation complexity: two kernel stacks, two numerical-stability regimes, and (with MoE) more complex parallelism and load balancing [Lieber2024Jamba
[counter to Camp C: No architecture change—Transformer + long-context ex] Two issues remain unresolved by “only changing attention form”: (i) length extrapolation is highly sensitive to positional encoding and positional descriptions [Kazem
[counter to Camp D: RWKV/linear RNN is the correct RNN revival path] Matched-scale exact-recall comparisons against Hybrids are missing, making “quality will catch up” hard to turn into testable engineering decisions; theory also suggests s
Under fixed HP-search budgets, many “new optimizer beats AdamW” gaps shrink materially; without reporting schedule family and trial counts, gaps often reflect tuning capacity rather than algorithmic advantage [Dahl2023AlgoPerf][Agarwal2020L
Muon’s deployability comes from hybrid routing, not wholesale replacement: orthogonalize updates only for hidden 2D weights via a Newton–Schulz approximation and fall back to AdamW for other tensors, concentrating failure modes into one tensor class.
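A minimal sketch of that routing idea: orthogonalize only hidden 2D weight updates and leave everything else to AdamW. The cubic Newton–Schulz iteration below is the textbook variant; Muon's released implementation uses a tuned quintic polynomial plus momentum, so the constants and step count here are illustrative assumptions.

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately map a 2D update g to its nearest (semi-)orthogonal factor U V^T."""
    assert g.ndim == 2
    x = g / (g.norm() + 1e-7)          # scale so singular values are < sqrt(3)
    transposed = x.shape[0] > x.shape[1]
    if transposed:
        x = x.T                        # iterate on the short-fat orientation for cheaper matmuls
    for _ in range(steps):
        x = 1.5 * x - 0.5 * x @ x.T @ x   # cubic Newton–Schulz step
    return x.T if transposed else x

def route_update(name: str, param: torch.Tensor, grad: torch.Tensor) -> torch.Tensor:
    """Hybrid routing: orthogonalized update for hidden 2D weights, raw grad (handled by AdamW) otherwise."""
    if param.ndim == 2 and "embed" not in name and "lm_head" not in name:
        return newton_schulz_orthogonalize(grad)
    return grad   # embeddings, norms, biases, and other non-2D tensors stay on AdamW
```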
SOAP compresses Shampoo’s “hard to tune / unstable” burden into one extra HP by running Adam’s momentum/adaptivity in Shampoo’s eigenbasis, achieving near-AdamW wall-clock and lower loss at 360M–1.3B [Vyas2024SOAP][Gupta2018Shampoo].
When VRAM is tight, low-state methods that don’t change the training loop (Apollo, Adam-mini) are often a safer first move than adding second-order matrix state; GaLore changes gradient representation and increases both upside and engineeri
[counter to Camp A: AdamW won’t be retired (highest default priority)] If second-order or hybrid routing can reliably win on wall-clock without increasing tuning degrees of freedom, the default should move; sticking to AdamW could be paying
[counter to Camp B: Muon is the next default (but only as a hybrid)] Public evidence is still skewed toward speedrun/mid-small scale, and systematic ablations on why non-2D tensors break are missing. At ≥7B/≥30B, long context, and distribut
[counter to Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Public ≥7B head-to-heads (equal-budget HPO + wall-clock) are still missing; μP-style LR transfer for second-order is not systematically validated, so tuning cost may
On copy/retrieval/dense entity binding, pure recurrence with fixed-size state shows a reproducible gap; recall/copy metrics surface it reliably while perplexity often does not [Zoology2023][Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL][Pa
Hybrids with “a few attention layers + many recurrent/SSM layers” are more stable in practice: attention handles exact addressing/copying, the rest handles routing/compression; Jamba and Griffin provide reusable instantiations [Lieber2024Ja
Linear attention and SSMs are mathematically mappable via duality/rewriting, but engineering equivalence depends on state form (vector vs matrix), reversibility under finite dimension, and bandwidth/numerical-stability costs [Dao2024Transfo
Subquadratic backbones need not be pretrained from scratch: Transformer→SSM/hybrid distillation and layer replacement can transfer interaction patterns, turning training budget from high-risk exploration into controlled retrofit [Bick2024Tr
[counter to Camp A: Pure SSM/RWKV will eventually replace Transformers] Counterexamples concentrate on exact addressing/copy: the fixed-state bottleneck is reproducible on copying and dense binding, and recall metrics surface it reliably; t
[counter to Camp B: Linear attention and SSMs are the same class and mut] In engineering, “rewritable” is not “interchangeable”: GLA’s matrix state shifts the budget from O(d) to O(d^2)-scale read/write and storage constants, where hardware
[counter to Camp C: Subquadratic models must be pretrained from scratch;] Recent evidence supports controlled retrofit: Bick et al. distill attention structure progressively [Bick2024TransformersToSSMs]; Wang et al. retrofit pretrained Tran
[counter to Camp D: The engineering optimum is hybrid; minimize attentio] The main weakness is the lack of a reproducible specification for the minimal attention configuration: current practice is recipe-driven, with insufficient systematic
Compute-optimal tokens/param is not a fixed constant; public refits already place the optimum anywhere from about 5 to about 100, driven mainly by training steps, batch/LR schedules, data quality, and deduplication [Kaplan2020ScalingLaws][H
Under data constraints, total tokens must be decomposed into fresh tokens × repetition; repetition is roughly comparable to fresh tokens up to ≤4 epochs, after which gains decay at a fitable rate [Muennighoff2023DataConstrained].
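A hedged illustration of the fresh-tokens × repetition decomposition. The saturating functional form follows the spirit of [Muennighoff2023DataConstrained], but `r_star` is a placeholder constant, not the paper's fitted value.

```python
import math

def effective_tokens(unique_tokens: float, epochs: float, r_star: float = 15.0) -> float:
    """Fresh-token-equivalent value of repeating `unique_tokens` for `epochs` passes.

    The first pass counts fully; later passes decay exponentially, so ~4 epochs is
    close to fresh data while 64 epochs is far from 64x.
    """
    repeats = max(epochs - 1.0, 0.0)
    return unique_tokens * (1.0 + r_star * (1.0 - math.exp(-repeats / r_star)))

for e in (1, 4, 16, 64):
    ratio = effective_tokens(100e9, e) / (100e9 * e)
    print(f"{e:3d} epochs -> {ratio:.2f} of the naive token count")
```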
Loss scaling does not directly yield per-task scaling; in the over-training regime, validation loss remains smoothly extrapolatable, while individual benchmark scores fluctuate substantially until a loss threshold is crossed [Gadre2024OverT
Per-task capability can be predicted, but only by modeling task heterogeneity explicitly; model ladders plus a two-stage regression reduce average error to about 1.9% on multiple-choice tasks, clearly outperforming direct extrapolation from
Data mixture is an independent scaling axis; at fixed model and token budgets, changing filtering, deduplication, and mixture alone can move downstream results by at least 7 points, large enough to dominate many parameter-scale increments [
[counter to Camp A: Kaplan-style — portable exponents, fixed-compute sho] Hoffmann et al. [Hoffmann2022Chinchilla] rebut [Kaplan2020ScalingLaws] by showing that the Kaplan setup folded undertraining into the inferred optimum; DeepSeek-AI et
[counter to Camp B: Chinchilla-style — balance N and D under fixed compu] Muennighoff et al. [Muennighoff2023DataConstrained] and Li et al. [Li2024DCLM] rebut the direct portability of “20” because token effectiveness depends on freshness,
[counter to Camp C: Data-mixture pragmatists — data is the first axis, g] The counterargument is that data experiments are noisy and hard to standardize, so model scaling is the cleaner first step. But Li et al. [Li2024DCLM] weaken exactly
[counter to Camp D: Against emergence-as-magic — many “emergent” effects] The opposing side points to capability jumps in cases like PaLM [Chowdhery2022PaLM] and the GPT-4 technical report [OpenAI2023GPT4TR], arguing that loss cannot explai
Any HumanEval/MBPP result without EvalPlus augmented tests systematically overestimates correctness; HumanEval is better treated as a regression check than the primary claim for “engineering capability gains”.[EvalPlus2023][Chen2021Codex][Austin2021MBPP]
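Because several camps argue from pass@k numbers, it is worth pinning down the unbiased estimator from the Codex evaluation [Chen2021Codex]; `n` is the number of generated samples and `c` the number that pass the (ideally EvalPlus-augmented) tests.

```python
import math

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: 1 - C(n - c, k) / C(n, k), with n samples and c passing."""
    if n - c < k:
        return 1.0
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)

# Example: 200 samples per problem, 37 pass the augmented tests
print(pass_at_k(200, 37, 1), pass_at_k(200, 37, 10))
```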
[counter to Camp A: HumanEval/MBPP is sufficient to represent coding abi] Two counterexamples are more direct. First, weak tests turn “passes samples” into “correct”; EvalPlus shows false positives are not a corner case [EvalPlus2023]. Seco
[counter to Camp B: SWE-bench Verified is the only trustworthy ground tr] Two issues distort the “only truth” reading. First, Verified remains sensitive to harness and search budget; agent-loop knobs change the effective search space [SWEbe
[counter to Camp C: patch-PPL/code BPB in pretraining is the best predic] Existing evidence favors “task constraints are needed” rather than “likelihood alone”: CRUXEval explicitly constrains execution semantics, and RepoBench/RepoCoder con
[counter to Camp D: deployment UX metrics reflect SWE-agent value better] If task-level ground truth is abandoned, UX metrics can devolve into a contest of “searches more / spends more”: CodeT and self-repair show search budget and selector
At fixed parameter scale, raising the continued-pretrain code ratio from a general mix (~25–30%) to ≥70% is more likely to change failure modes on SWE-bench-like tasks: from wrong localization/wrong abstraction level to last-mile detail errors.
[counter to Camp A: scaffolding and test-time compute are everything] Agentless [Agentless2024] shows many “agent gains” can be reproduced by simpler pipelines, suggesting the bottleneck is often base-model priors rather than scaffold compl
[counter to Camp B: RL and verifiers are the true drivers] Execution feedback does yield gains [RLEF2024], but correcting structural failures requires longer exploration chains and denser environment interaction, with costs amplified by rep
[counter to Camp C: just mix more code (code is all you need)] Textbooks Are All You Need [Textbooks2023] is a counterexample showing quality/structure changes sample efficiency; more importantly, SWE-bench’s target distribution includes is
[counter to Camp D: data shape first (repo/patch/process/execution first] The weakness in public evidence is the lack of strict ablations: many technical reports change model size, token scale, filtering, and post-training recipes together,
When mid-train budget is <10% of total compute, gains often look unstable: they are confounded with backbone undertraining and the distribution pull is weak; 10–30% more reliably produces a reproducible shift [Chinchilla2022][Phi3Report][Ll
[counter to Camp A: synthetic-first can be a primary route (especially u] The weak point is scale extrapolation: success on small models or narrow tasks does not imply a sustainable recipe for large general models. Without strong verifiers,
[counter to Camp B: web-heavy backbone + (real/synthetic) mid-train is t] The main challenge is budgeting and controls: public materials rarely provide explicit mid-train fractions and matched-budget filtering baselines, making it easy to m
[counter to Camp C: avoid synthetic as much as possible; stronger filter] Filtering alone struggles with target-distribution scarcity: long-context needs long texts and long-sequence training; math/reasoning needs high-density verifiable signals.
[counter to Camp D: synthetic scales almost without bound; collapse is m] Public evidence is closer to the opposite: recursive replacement loses tail modes first and narrows generation by generation [MAD2023][GoMAD2023]; and the condition f
Per-token PPL is systematically misleading across tokenizers because the denominator (token steps) changes; comparable evaluation should use BPB or character-string likelihood [Vieira2024Characters][Gastaldi2024Foundations].
Per-token PPL is not comparable across tokenizers because the denominator and reachable token-string set change; primary comparable metrics should be BPB/character-string likelihood [Vieira2024Characters][Gastaldi2024FoundationsTokenization].
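A minimal conversion sketch that makes the comparability point concrete; it assumes you already have the summed token-level NLL in nats over a fixed evaluation text, and the function names are illustrative.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Tokenizer-independent: normalize total negative log-likelihood by UTF-8 byte count."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / (math.log(2) * n_bytes)

def per_token_ppl(total_nll_nats: float, n_tokens: int) -> float:
    """Tokenizer-dependent: the denominator changes whenever the segmentation changes."""
    return math.exp(total_nll_nats / n_tokens)

# Two tokenizers with identical total NLL on the same text get the same BPB,
# but different per-token PPL if one segments the text into fewer tokens.
```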
The main shippable benefit of a 128K vocab is sequence shortening via bytes-per-token/fertility, not a vague “better language understanding”: an industrial report shows 0.02–0.04 nats lower loss and explicitly ties it to shorter non-English and code sequences [Dubey2024Llama3].
Bigger vocab is not monotonic: merging multi-digit numbers or date fragments into tokens creates 10–20 pp gaps on 3–5 digit, carry-sensitive arithmetic and temporal reasoning; single-digit numeral tokenization is more robust [Singh2024TokenizationCounts].
‘Bigger vocab is always better’ has stable counterexamples on digits/dates: merging multi-digit numerals or date fragments creates 10–20 pp gaps on 3–5 digit carry-sensitive arithmetic and temporal reasoning, with no clear automatic converg
Non-unique encodings are an underpriced stability risk: multiple token trajectories for the same surface string make reasoning and RL treat equivalent trajectories as different sequences, injecting inconsistency into the objective; larger v
[counter to Camp B: bigger vocab is always better; default to 256K+] Non-monotonicity is driven by reproducible failure modes: merging multi-digit numbers and date fragments yields 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments].
[counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Current evidence is stronger on evaluation/formalization than on a shippable replacement: character-string likelihood fixes comparability but does not automatically provid
[counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Tail tokens and non-unique encodings do harm stability [LandBartolo2024Magikarp][LiuEllis2026SayAnything], but shrinking vocab is not the only lever: forbidding digit
For decoder-only models ≤70B under typical serving constraints, replacing MHA with GQA is a low-risk default: KV-cache drops roughly linearly with the number of KV heads, while quality degradation tends to accelerate mainly near the MQA extreme (a single KV head).
In decoder-only inference, KV-cache scales roughly linearly with KV heads: reducing KV heads from h to ~h/8 (a typical GQA sweep point) cuts attention cache to ~1/8, while quality degradation tends to accelerate mainly near KV heads=1 (MQA)
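The linear dependence on KV heads is easy to see with a back-of-the-envelope calculator; the layer/head/sequence numbers below describe a hypothetical 70B-class configuration, not any specific model.

```python
def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   seq_len: int, batch: int, bytes_per_elem: int = 2) -> int:
    # 2x for K and V, stored per layer, per KV head, per position
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem

cfg = dict(n_layers=80, head_dim=128, seq_len=32_768, batch=8)
for name, kv_heads in [("MHA", 64), ("GQA-8", 8), ("MQA", 1)]:
    gib = kv_cache_bytes(n_kv_heads=kv_heads, **cfg) / 2**30
    print(f"{name}: {gib:.0f} GiB")   # roughly 640 / 80 / 10 GiB at BF16 for this toy config
```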
Loss spikes are not random noise: Wortsman et al. attribute them to attention-logit variance and output-norm growth, making QK-Norm a targeted mitigation; in contrast, sandwich norm is currently more of a correlated ingredient in the Gemma
Loss spikes are not random noise: Wortsman et al. tie them to attention-logit variance and output-norm growth, making QK-Norm a targeted suppressor; in contrast, sandwich norm still lacks broad independent ablations for pricing [Wortsman2023Instabilities].
With a stable base, depth-up-scaling / block expansion often behaves closer to pay-as-you-go than from-scratch: SOLAR supports 7B→10.7B depth growth with 200B tokens of continued pretraining [Kim2023Solar], and LLaMA Pro reduces disruption
With a stable base and target size not exceeding ~30B, depth-up-scaling / block expansion often behaves like “pay-as-you-grow”: SOLAR uses 200B tokens to support 7B→10.7B, and LLaMA Pro trains only inserted blocks to reduce disruption to ba
[counter to Camp A: architecture is mostly done; keep scaling from scrat] This logic fits regimes where training compute dominates, but it distorts serving-dominated regimes: under long context and high-concurrency inference, KV-cache and b
[counter to Camp A: architecture details are mostly constants; keep clea] This reading fits training-compute-dominated objectives, but can be misleading when serving cost dominates: under long context and high-concurrency inference, KV-cach
[counter to Camp B: the next backbone should move to SSM / RetNet / Mamb] The public weak spot is budget-matched, data-matched, evaluation-matched head-to-head comparisons at LLM scale under real serving constraints. Many claimed gains can
[counter to Camp C: a second scaling path should be default—grow first, ] Public evidence is skewed toward positive cases, with limited systematic negatives: which data distributions make growth introduce irrecoverable bias, whether compute
[counter to Camp D: QK-Norm / sandwich norm are optional details] Wortsman et al. [Wortsman2023Instabilities] localize loss spikes to attention-logit variance and output-norm growth, making QK-Norm a targeted intervention rather than “one m
[counter to Camp D: stability is mostly LR/optimizer/data; QK-Norm/sandw] Wortsman et al. [Wortsman2023Instabilities] provide an observable mechanism: attention-logit variance and output-norm growth trigger loss spikes and can be reproduced
On compositional long-context tasks, nominal-128K models often collapse to ~32K effective context; RULER’s variable tracking / multi-hop / aggregation separates this collapse from NIAH-style surface retrieval hits.[Hsieh2024RULER]
RoPE extrapolation (PI/YaRN-style) usually makes 32K stable, but provides weak supervision for actually using far tokens; controlled ablations show long-task recovery is limited without changing long-doc ratio and distribution preservation.
The 20+ pp “lost in the middle” drop is not just PE distortion: it resembles attention budget dilution under long noisy inputs, and is amplified by “same task, more irrelevant tokens” controls.[Liu2023LostInMiddle][Levy2024SameTaskMoreToken
Related-doc packing increases repetition/alignment event density, triggering induction-head-like copying circuits more often; this yields a testable mechanism chain for ICL and cross-span reference.[Shi2023InContextPretraining][Chan2022Data
[counter to Camp A: PE/extrapolation is enough; long context is mainly a] Refining c-6a2e99f979 / c-435bd5ac5f: PE makes it runnable, but does not imply compositional usability. RULER shows many nominal-128K models plateau at ~32K on compositional subtasks.
[counter to Camp B: data recipe is the main variable; long-doc ratio and] Countering the strong version “data explains everything”: even with the same data pool, sequence construction changes the prior of cross-span relatedness; related-doc
[counter to Camp C: packing/concatenation is underused; sequence constru] Refining the “zero-cost” framing in c-2d04dd042e: packing may not increase token count, but it adds retrieval/clustering pipelines, separator design, and harder dedup
[counter to Camp D: switch architectures to bypass Transformer long-rang] Countering the strong version “architecture switch yields better effective context”: existing evidence more often proves efficiency/max length, not consistent wins on
Moving from 32K to 128K vocab is reported alongside 0.02–0.04 nats lower training loss, with gains largely explained by improved bytes-per-token/fertility for non-English and code; such sequence shortening linearly reduces KV-cache footprint.
Public industrial evidence for moving from ~32K to 128K vocab can be stated as 0.02–0.04 nats lower training loss [Dubey2024Llama3]; gains are primarily from improved bytes-per-token/fertility on non-English and code, not from “better English understanding”.
128K vocab gains can be expressed in loss and systems terms: ~0.02–0.04 nats lower training loss [Dubey2024Llama3], largely from better bytes-per-token/fertility for non-English and code, which linearly reduces KV-cache footprint and improves serving throughput.
Larger vocabularies amplify tail debt: multiple LMs exhibit 3k–10k+ under-trained tokens that can be detected via scanning and repaired with short continued pretraining; tokenizers therefore require post-training governance rather than one-shot finalization.
3k–10k+ under-trained tail tokens are manageable training debt: they can be detected and repaired with short continued pretraining [LandBartolo2024Magikarp], so tokenizers need post-training workflows rather than one-shot finalization.
[counter to Camp B: bigger vocab is near-monotonic; default to 256K+] Public evidence supports “structure before size”: local merges for digits/dates create 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments], and larger vo
[counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] The main cost of tokenizer-free is longer sequences and higher systems sensitivity; in long-context and online settings, both tokenization/runtime cost and KV cache are hard constraints.
[counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Public evidence still lacks controlled quantification of “pruning → more stable RL”. What is more reliable today is to make tail tokens and non-unique encodings scannable and repairable.
Placing TP across IB is often worse than adding PP: TP’s per-layer collectives scale linearly with depth [Shoeybi2019Megatron], and MegaScale pins TP=8 within NVLink domains while reproducing 55.2% MFU for dense 175B at >10K GPUs [Jiang2024MegaScale].
Attention time share is a more reliable trigger for SP/CP: SP cuts recompute overhead from ~36% to ~2% without increasing communication order [Korthikanti2022SP], so enable SP when attention>30%; when attention>50% or L≥32K, CP (Ring/Ulysses-style) becomes the next lever.
The main gap for auto-parallel and FSDP-only is not feasibility but missing public head-to-head at 100B+, >1K GPUs with comparable MFU/topology: Alpa/GSPMD provide a compiler path [Zheng2022Alpa][Xu2021GSPMD], FSDP provides a low-intrusion
A 2026 “healthy MFU band” is a practical debugging threshold: dense 40–55% (MegaScale reports 55.2%) [Jiang2024MegaScale]; MoE 25–45% (public MoE system reports often fall here) [DeepSeek2024V3TechReport]. If far below, first inspect mesh/topology mapping and communication overlap.
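A sketch of the arithmetic behind these MFU bands, using the common ~6·N FLOPs-per-token approximation for dense decoder LMs; the peak per-GPU figure is a placeholder (roughly an H100-class BF16 number) that must be set for the actual hardware, and the throughput example is invented.

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int,
        peak_flops_per_gpu: float = 989e12) -> float:
    """Model FLOPs utilization with the ~6*N FLOPs-per-token approximation (ignores attention FLOPs)."""
    achieved = 6.0 * n_params * tokens_per_sec
    return achieved / (n_gpus * peak_flops_per_gpu)

# e.g. a dense 175B model pushing 1.1M tokens/s on 2,048 GPUs (illustrative numbers)
print(f"{mfu(175e9, 1.1e6, 2048):.1%}")
```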
[counter to Camp A: hand-tuned 4D with topology-aware mapping (Megatron/] The cost is engineering complexity and manual sweeps: mesh shape, topology binding, schedules, and kernels require explicit decisions; with rapidly changing architect
[counter to Camp B: auto-parallel (Alpa / GSPMD / compiler-search)] Counter to c-b301c4c17a / c-76982c2425 / c-c512acfaf7: public artifacts still lack matched-scale MFU head-to-head against hand-tuned 4D at 100B+ and >1K GPUs under comparab
[counter to Camp C: FSDP-only / ZeRO-only (low-intrusion first)] Counter to c-225cae10ca / refinement of c-1f3caad103: once TP/PP/CP becomes necessary, FSDP-only often pushes bottlenecks into cross-node synchronization and fragmentation/sch
[counter to Camp D: classic 3D (DP+TP+PP) is enough; SP/CP are optional] Counter to c-ab68e1f89c / c-ac979a61a5 / c-4086500505: when attention share rises above >30%, SP is often a low-risk win even before CP [Korthikanti2022SP]; when atten
Numeric deviations in fused attention are not “local errors”: they can accumulate across steps via optimizer state and gradient statistics and surface as loss spikes; kernel numerics must be part of stability regression and diagnostics [Gol
FP8 stability can show “late drift” at trillion-token scale, which short-run ablations cannot cover; specifying per-block scaling, FP32 masters, and accumulation paths as reproducible recipes (MXFP8/MXFP4) turns the problem into a testable
Long-context capability does not always require kernel rewrites: positional interpolation and RoPE scaling/continual pretraining can reach ~32K context with limited systems changes; the boundary for “kernel co-design is needed” should be wh
Reversing the optimization order (fusion/launch tuning before roofline classification) typically yields only marginal gains on memory-bound kernels; roofline-first more quickly decides whether to change structure, dataflow, or numeric contr
[counter to Camp A: algorithms and kernels must be co-designed (bytes/FL] Correction to c-aadcc38d4b / c-d75cab8d5c: high-level compilers and generated kernels can cover part of the performance delta, but they do not automatically handle “n
[counter to Camp B: PyTorch/graph-compiler level is sufficient; handwrit] Rebuttal to c-aadcc38d4b: the train-time key is not “can you generate a kernel,” but “does the generated kernel preserve the numeric contract.” FlashAttention stabili
[counter to Camp C: hardware will get faster; dense MHA + BF16 need not ] Rebuttal to c-a986f239ff / c-bf6e936d9d: even if compute gets cheaper, budgets get reallocated to more tokens and longer contexts; Chinchilla’s compute-optimal result
[counter to Camp D: non-NVIDIA hardware will catch up; CUDA will stop be] Correction to c-c34d4fa100 / c-7d9c971185: portability is easier for bandwidth-limited kernels, but best-in-class attention/GEMM performance often depends on hardware
The main lever in bulk filtering is proxy family choice, not threshold tuning: DCLM controlled ablations show 4–6 pp gaps (e.g., on MMLU) across filter families, while threshold sweeps are typically <1 pp.[DCLM2024]
[counter to Camp A: classifiers + ablation ladders are sufficient (contr] Refines c-2a5bd5b489: bulk+ladder covers a large surface area, but “95% of decisions” does not hold for code/math/reasoning. Proxy-to-target flips are more common on
[counter to Camp B: example-level influence/attribution is the main path] Counters c-230595d964 / c-bb37143a19: attribution can answer “which training snippets are associated with a behavior,” but not the net effect after deletion/weighting
[counter to Camp C: full causal identification is the future (IV/mediato] Refines c-cdd0499162 / c-e385bd2b45: confounding is real, but “use IV as the gate” is higher risk. Instrument assumptions are hard to audit on web data [CausalLL2024].
[counter to Camp D: skip measurement, rely on intuition and scale (cover] Counters c-0cc7a00d89: scaling plans provide floors, but “no tooling, rely on intuition” turns data bias into invisible risk. [DCLM2024]’s thousands of ablations show
Under exact softmax attention semantics, the first-order source of throughput differences shifted from “HBM I/O” to “parallel scheduling”: FA1 [Dao2022FA1] reduces I/O by never writing the L×L score matrix; FA2 [Dao2023FA2] changes work par
Decode bottlenecks are not solved by training-shaped attention kernels: Flash-Decoding [Dao2023FlashDecoding] restores parallelism by chunking KV; FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] productizes these ideas into a configur
Variant maintenance cost is the long-run driver: FlexAttention [Dong2024Flex] lifts variants into score_mod/mask_mod semantics and reaches throughput close to hand-written fused kernels on many masks; but when mask semantics change tile reu
FP8 risk is dominated by error paths rather than the format itself: FP8 formats and scaling assumptions are set by Micikevicius et al. [Micikevicius2022FP8], while Fujii et al. [Fujii2024FP8vsBF16] shows stability strongly depends on rescal
[counter to Camp A: FA1/FA2/FA3 already solved the core; what remains is] Counter to c-c3d509c165: variants and serving are not “derivatives”. FlexAttention [Dong2024Flex] and FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] shift the
[counter to Camp B: the mainline should move from hand-written CUDA to T] Refinement of c-fdcdf25f0c / c-f00a0f8184: lower authoring barrier does not imply peak is replaced. FA3 [Shah2024FA3] deeply integrates Hopper TMA/warp specialization
[counter to Camp C: attention should be replaced by SSM/linear/sparse st] Counter to c-f4696a7b06 / c-ff80f3a405 / c-bf5e97f8c9: complexity advantages do not automatically translate into a universal default. Softmax attention has a clear ex
[counter to Camp D: FA3 implies generation lock-in; critical paths shoul] Refinement of c-574ffb3b22 / c-0d086b357f: lock-in concentrates on the peak path of “FP8 + Hopper async movement”, not the entire attention ecosystem. FA2 [Dao2023FA2] remains a portable fallback off that peak path.
The practical engineering regime for offline w* search is often “proxy model ladders + regression/law extrapolation”, not complex utility estimators: in several settings, LLM-utility mixing is less stable than token-count / heuristic warm s
[counter to Camp B: Heuristics + curricula are more robust; 2–3 ablation] Heuristics can overfit to the current evaluation suite; when bucket definitions change (e.g., adding multilingual/long-tail topics) or objectives shift from average t
[counter to Camp C: Online/adaptive mixing beats one-shot offline ratios] Online methods add extra channels (per-domain loss, sampler state, delayed stats) that amplify risks under unstable accounting; when bucket definitions, dedup thresho
Fine-grained (≥64) + one shared expert more reliably yields zero-shot gains at matched active-parameter budgets (reported 1.8–3.4 pp), because it explicitly separates common components from routed experts, reducing identifiability conflicts
Dense→MoE upcycling has quantifiable ceilings and saturation regimes: it is more likely to pay off on token-rich checkpoints, with token-equivalent factors around 0.4–0.6×; ignoring this constraint leads to ROI misreads where adding experts
Systems implementation determines whether sparsity pays off: dispatch/all-to-all, token packing, and kernel scheduling can erase theoretical gains to near zero; therefore “is MoE cheaper” must be evaluated on the same training/serving stack
[counter to Camp A: MoE becomes the default backbone for frontier pretra] Systems realization and full-lifecycle costs must be counted: dispatch/all-to-all can erase sparsity gains; and upcycling is more common in real orgs, with saturation
[counter to Camp B: dense wins on full-lifecycle ROI; upcycling makes Mo] This argument is mostly about the upcycling path; when the goal is frontier from-scratch pretraining and the DeepSeek monitoring/balancing template is reproducible, M
[counter to Camp C: learned routing/balancing is overrated; random/froze] Conflating “router intelligence” with “congestion controllability” misleads engineering decisions: DeepSeek’s key is not a more complex router but bias EMA plus early
[counter to Camp D: MoE is mainly for pretraining; post-training should ] DeepSeek-V3 reports completing alignment while keeping the 256-expert structure with strong results, suggesting “MoE post-training is impossible” is more an engineeri
Setting RoPE base from 10k directly to the target-window scale (often ~500k for 128K) moves the learnable effective-context ceiling upward; otherwise low-frequency phase coverage is insufficient at target lengths, far positions become less
The gap between “advertised 128K” and “usable ~32K” is common under recall/aggregation/multi-hop suites like RULER; reporting only PPL or a single needle systematically overestimates usable context.[Hsieh2024RULER][Yuan2024LVEval][Zhang2024
At 512K–2M targets, the dominant error shifts from “phase overflow” to per-dim mismatch: different frequency bands need different extrapolation, and a single global scaling causes both under-covered low frequencies and over-compressed high frequencies.
Bypass routes (retrieval/memory/SSM/compression) can reduce complexity from O(n^2) toward linear or subquadratic, but often lag attention baselines on recall-heavy tasks; without head-to-head native-long-context controls, teams can mistake
[counter to Camp A: native long context (base-aligned + curriculum/ABF) ] Counter to c-7094e4510a / c-8b90a3a5a5: token budget and data-engineering requirements are real; for ≤128K, retrofits can reach similar effective-context curves at lower cost.
[counter to Camp B: for 32K–128K retrofits, default to YaRN, not PI] Refinement to c-457e01b168: even if YaRN is more stable, it does not mean “RoPE-only is enough.” Without long-sequence finetuning and distribution alignment, RULER can sti
[counter to Camp C: ≥512K needs LongRoPE-style per-dim search/learning; ] Counter to c-802b6e8f2c: search cost and engineering complexity are real, and below 512K the advantage may be unstable; per-dim search is better treated as a 512K+ th
[counter to Camp D: bypass RoPE/attention (SSM/external memory/retrieval] Counter to c-9e220ff288: complexity wins are not recall-heavy reliability. Zoology [Arora2023Zoology] shows many efficient models still lag attention on recall-sensitive tasks.
In the 64K–256K regime, perplexity/NIAH often fail to track task-based long-context evaluations: a model can be near-saturated on NIAH “needle” retrieval yet still fail RULER stressors and LongBench cross-span tasks with mid-context underuse.
When extending 32K→128K, the dominant uncertainty is often not the RoPE extrapolation formula but whether long documents actually appear in the training distribution: with low long-doc ratio, adding more long-sequence tokens tends to boost
Ultra-long scaling (≥1M) behaves like staged rollout rather than a one-shot jump: Xu et al.’s 128K→4M recipe binds continual pretraining and long-dependency SFT per stage and returns to task evaluation loops each time to avoid short-context regression.
The claim “dense O(L^2) attention is unsustainable, so sparse/memory/linear architectures must replace it” still lacks consistent task-level dominance: Zoology shows recall remains a hard constraint for efficient models; a more realistic di
[counter to Camp A: positional extrapolation is the main line; evaluatio] RULER [RULER2024] and Lost in the Middle [LostInTheMiddle2023] show that even if the model “fits,” it can systematically underuse mid-context evidence; Gao et al. [Ga
[counter to Camp B: long-sequence training is mainly a systems-paralleli] Systems solve throughput and memory, but not task utility. Gao et al. [Gao2024EffectiveLongCtx] shows that even when continued training reduces ppl, long-task gains c
[counter to Camp C: perplexity/NIAH are sufficient; task benchmarks are ] RULER [RULER2024] argues a single NIAH probe systematically overestimates usable context; Lost in the Middle [LostInTheMiddle2023] shows reproducible mid-context underuse.
[counter to Camp D: dense O(L^2) attention is unsustainable; alternative] Alternatives offer cost advantages, but task-level costs often show up as recall and fidelity. Zoology [Zoology2023] measures hard recall constraints in efficient models.
Making coord check / pre-activation RMS overlap a merge gate shifts LR-transfer debugging from post-hoc large runs to pre-hoc small diagnostics; with modern Transformer components, skipping this check often lets module-scale mismatches surface only in expensive large runs.
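A toy version of such a gate; real coord checks run on the actual Transformer blocks under the chosen parameterization (µP, Complete-P, and so on), so treat the model, widths, and tolerance below as stand-ins.

```python
import torch
import torch.nn as nn

def coord_check(widths=(256, 512, 1024), steps=3, tol=3.0):
    """Crude coord check: pre-activation RMS after a few steps should be O(1)
    and roughly overlap across widths; large drift flags a mis-scaled module."""
    stats = {}
    for w in widths:
        torch.manual_seed(0)
        model = nn.Sequential(nn.Linear(64, w), nn.GELU(), nn.Linear(w, 64))
        opt = torch.optim.SGD(model.parameters(), lr=1e-2)
        x = torch.randn(128, 64)
        for _ in range(steps):
            pre_act = model[0](x)                 # hidden-layer pre-activation, recorded only
            loss = model(x).pow(2).mean()
            opt.zero_grad(); loss.backward(); opt.step()
        stats[w] = pre_act.pow(2).mean().sqrt().item()
    rms = list(stats.values())
    assert max(rms) / min(rms) < tol, f"pre-activation RMS drifts across widths: {stats}"
    return stats
```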
“LR transfers across width” does not extrapolate to depth: under residual dynamics, the optimal LR can drift with depth, and whether width/depth limits commute depends on residual scaling; depth changes therefore require an explicit validation step.
Under a fixed SP recipe, empirical formulas/joint scaling laws often keep target-scale LR and batch starting-point error around ~10%; but once aspect ratio, precision, or schedule changes, the optimum can drift multiplicatively, requiring r
Under AdamW, wd is an independent axis: changing wd shifts stability boundaries and the optimal LR, so “transfer LR only” often misattributes errors; a more reliable approach treats (LR, wd, β₂) as a coupled surface and patches it with ~10–
[counter to Camp A: Complete-P as the default; formulas are a stopgap] Correcting c-7d1c22d4b6: even with Complete-P/u-µP patching modules and low precision, it does not follow that “all pretraining should switch to µP”. Lingle [Lingle2024E
[counter to Camp B: empirical formulas + small sweeps are enough; µP is ] Countering c-b807a6f58d: formulas are not a universal “accurate prediction without touching the stack”. Gemstones shows sensitivity to aspect ratio, schedules, and HP
[counter to Camp C: end-to-end automation (BO/auto-optimizers) will repl] Countering c-bec6705c6f: end-to-end search cannot easily bypass proxy fidelity and failure attribution. Small proxies can miss large-scale instabilities [Wortsman2023Instabilities].
[counter to Camp D: transfer error is dominated by non-transferable HPs,] Refining c-05156d3ad8 / c-bdb7a65e9e: wd/β₂ matter, but that does not make parameterization useless. Complete-P’s module-wise patches and coord check reduce structura
Under protocols that fix HP-search budgets and align schedule families, many reported optimizer advantages shrink to roughly half and rank flips occur; any A/B that does not report trial count, tunable HP set, and schedule family should not
For ≥70B dense training, AdamW’s practical edge is largely μP/LR transfer: it turns width scaling from “retune from scratch” into “light calibration,” often keeping sweep budget to ≤10% of total compute [Lingle2024muPTransfer][Noci2024Super
SOAP reduces Shampoo’s extra HPs from ~4 to 1 and reports near-AdamW wall-clock with lower loss at 360M–1.3B; the strong claim “second-order is always slower/harder to tune” no longer holds at mid-scale, but public ≥7B head-to-heads are still missing.
For VRAM-limited full-parameter training, prefer low-state methods that do not change the training loop: APOLLO replaces per-parameter second moments with per-tensor scalars and stays close to AdamW on 7B/13B; GaLore saves state via low-rank gradient projection, at the cost of changing the gradient representation.
[counter to Camp A: AdamW won’t be retired (highest default priority)] Counter to c-6cf8d6c199: under matched budgets, Muon/SOAP can still deliver visible gains in wall-clock or loss; extrapolating “default” into “irreplaceable” can block l
[counter to Camp B: Muon is the next default (but only as a hybrid)] Revision to c-c7ea2fc38c: public evidence for Muon on ≥7B–70B dense LLMs is still sparse, and matched-schedule + matched-budget head-to-heads are missing; until evidence c
[counter to Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Counter to over-extrapolating c-9c69cc911f: SOAP’s near-AdamW wall-clock at 360M–1.3B does not automatically extend to ≥7B; distributed communication, preconditioner
In scraped web pools, doc-level exact hashing misses long repeated substrings and near-duplicates; Lee et al. [Lee2021Dedup] show substring + MinHash near-dedup reduces PPL, verbatim memorization, and train-test leakage, implying repetition
“Uniform repetition ≤4 epochs is close to free” holds only under near-uniform exposure: Muennighoff et al. [Muennighoff2023DataConstrained] give a ~2–4 pass fresh-token-equivalent window; Hernandez et al. [Hernandez2022RepeatedData] show ho
Per-corpus dedup cannot bound total exposure: Elazar et al. [Elazar2023WhatsInMyBigData] observe overlap across public pools (C4, RedPajama, Dolma), so a cross-corpus fingerprint ledger (hash/MinHash/embedding) is needed to count how many times a given text is actually seen across the whole mixture.
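One way such a ledger could look, sketched with the datasketch library; the shingle size, LSH threshold, and bookkeeping scheme are assumptions, not a recipe from the cited papers.

```python
from datasketch import MinHash, MinHashLSH

def fingerprint(text: str, num_perm: int = 128) -> MinHash:
    """Character-shingle MinHash fingerprint for near-duplicate matching."""
    m = MinHash(num_perm=num_perm)
    for shingle in {text[i:i + 13] for i in range(max(len(text) - 12, 1))}:
        m.update(shingle.encode("utf-8"))
    return m

# One LSH index shared across corpora, so exposure is counted across pools, not per pool.
ledger = MinHashLSH(threshold=0.8, num_perm=128)
exposure_count: dict[str, int] = {}

def record(doc_id: str, text: str) -> int:
    """Return how many times near-duplicates of this content have been seen so far."""
    m = fingerprint(text)
    hits = ledger.query(m)                      # near-duplicate cluster members already recorded
    cluster = hits[0] if hits else doc_id
    exposure_count[cluster] = exposure_count.get(cluster, 0) + 1
    if not hits:
        ledger.insert(doc_id, m)
    return exposure_count[cluster]
```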
For benchmarks/PII/copyrighted text, exposure caps dominate mixture ratios: Carlini et al. [Carlini2022Memorization] show extractable memorization risk rises with repetition, model size, and longer context; Deng et al. [Deng2023BenchmarkCon
[counter to Camp A: Dedup-first and aggressive (exact → near → semantic)] Counter to c-841441e807 / c-bce7a179d3: in data-constrained high-quality pools, “dedup to the extreme” is not always feasible; Muennighoff et al. [Muennighoff2023DataConstrained] show repetition up to roughly 4 epochs is close in value to fresh tokens, so some controlled reuse is rational.
[counter to Camp B: Epochs-first (fill compute under data constraints)] Counter to c-766e70d1a1 / c-2c9c18bcfc: average epochs cannot substitute exposure-distribution control. Hernandez et al. [Hernandez2022RepeatedData] show hot-subset ove
[counter to Camp C: Semantic dedup is the main battleground (exact/MinHa] Refinement to c-8720d900a9 / c-2a25c7719f: semantic dedup does not address two key ledgers: long repeated substrings (fragment-level debt) and cross-corpus exposure (
[counter to Camp D: Zero exposure (or 0–1 exposure) for risky data overr] Refinement to c-b0ed1be495 / c-45889dd38a: “absolute zero” is often non-falsifiable operationally—risky text can enter via multiple versions and reused pools. The ove
Within a training loop with the same tokenizer/objective/model family, validation cross-entropy (log PPL) is the highest signal-per-cost metric for keep-training/anomaly/data-mix decisions; using it for external model selection mixes tokenizer and segmentation effects into the comparison.
Compute-optimality is an interval decision conditioned on training-duration assumptions: Kaplan-style vs Chinchilla-style optima can be explained by regime assumptions such as overtraining, so budgeting fits must report regime and fit uncertainty.
Raw PPL is not a common unit across tokenizers: segmentation changes information per token, so a “10% PPL drop” is not semantically comparable; external comparisons should at least use BPB/information-normalized metrics and validate on a la
Stage-2 changes (post-training/compression) systematically break “lower PPL ⇒ better tasks”: near-identical pretraining loss can yield different downstream outcomes,[HongLiu2022SameLossBetterDownstream] and pruning/sparsification can keep PPL nearly flat while task scores drop.
A more actionable signal for keep-training/ship decisions is per-task scaling laws plus extrapolation error: downstream metrics can be reliably fit in overtrained regimes,[Gadre2024OvertrainingDownstream] and model ladders reduce the cost of fitting such curves.
[counter to Camp A: PPL remains the primary variable (at least for train] The pushback is concentrated on non-comparability across stages/settings: Hong Liu et al. [HongLiu2022SameLossBetterDownstream] shows same loss can still yield different downstream behavior.
[counter to Camp B: PPL is stage-1 only; stage-2 must use per-task scali] The main cost is evaluation and statistical stability: per-task curves need dense model checkpoints and consistent protocols; if panels change frequently, extrapolati
[counter to Camp C: stop searching for a scalar; define quality via stan] Panels do not automatically solve decision-making: under limited budgets, ranking and stopping rules are still needed; without per-task extrapolation or explicit busi
[counter to Camp D: the issue is ontological—next-token loss is not the ] The pushback is engineering controllability: even if the final target is not next-token loss, stage 1 still needs a stable, differentiable, low-cost monitoring signal
In public refits, compute-optimal tokens/param does not converge to a single constant: for Transformer LMs the optimum can slide across 5–100 and is sensitive to batch-size schedules and data recipes [DeepSeek2024LLM].
Kaplan’s “fixed-compute favors larger models” and Chinchilla’s “favor more tokens” are not mutually exclusive truths: when undertraining is modeled explicitly and corrected via learning-curve extrapolation, the fixed-compute optimum shifts
At fixed model and token budgets, data filtering and mixtures alone can create ≥7-point downstream gaps—large enough to dominate small shifts in tokens/param [Li2024DCLM].
Loss scaling can extrapolate smoothly in over-training, but per-task benchmark scores wobble and only stabilize after thresholds; reaching ~1.9% per-task prediction error requires a two-step mapping (task perplexity→accuracy), not a direct loss→accuracy fit.
[counter to Camp A: Kaplan-style — portable exponents; under fixed compu] Countering c-861b5bafc8 / c-1f9ceebe32 / c-a34e28d5d3: once undertraining is modeled explicitly and corrected via IsoFLOP+extrapolation, the fixed-compute optimum shi
[counter to Camp B: Chinchilla-style — tokens/param≈20 as the default re] Countering c-0f12d82e0e / c-6669a9cdef / c-e2361a4007: tokens/param≈20 is often a good starting point but not a portable constant. DeepSeek-AI [DeepSeek2024LLM] widen
[counter to Camp C: Data-mixture pragmatists — get the data recipe right] Refining c-8a6b54a19e / c-b54170330d / c-f7286a59b8: data recipes can dominate, but “more curated is always better” is not stable. RefinedWeb provides a web-only coun
[counter to Camp D: Against emergence-as-magic — many “emergent” effects] Countering the “explain everything away” version of c-b6d5738eb8 / c-f19d4b4475 / c-2017e034b2: not all thresholds are metric artifacts. The GPT-4 technical report sh
Per-doc causal mask packing via FlashAttention2/3 varlen interface can achieve >98% packing ratio, reduce mask computation overhead to <3% throughput loss, and avoid 0.5–2% training loss underestimation caused by cross-doc contamination [Krell2021Packing].
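The metadata the varlen interface consumes is just cumulative document boundaries; a hedged sketch follows, with the flash-attn call left as a comment because the exact signature should be checked against the installed flash-attn 2.x version.

```python
import torch
# from flash_attn import flash_attn_varlen_func   # flash-attn 2.x varlen entry point

def build_cu_seqlens(doc_lens: list[int], device: str = "cpu") -> torch.Tensor:
    """Cumulative doc boundaries for one packed row, e.g. [5, 3, 4] -> [0, 5, 8, 12]."""
    cs = torch.cumsum(torch.tensor([0] + doc_lens), dim=0)
    return cs.to(device=device, dtype=torch.int32)

doc_lens = [5, 3, 4]
cu_seqlens = build_cu_seqlens(doc_lens)
max_seqlen = max(doc_lens)
# q, k, v are packed as (total_tokens, n_heads, head_dim); causal=True plus these
# boundaries gives per-document causal attention with no cross-doc contamination:
# out = flash_attn_varlen_func(q, k, v, cu_seqlens, cu_seqlens, max_seqlen, max_seqlen, causal=True)
```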
FIM is degradation-free only in code pretraining; enabling a 50% FIM rate on a pure-NL corpus leads to a stable 1–3 pp degradation on downstream NL tasks [StarCoder2023][CodeLlama2023].
Processing over-length documents with split-then-pack improves loss by 0.02–0.05 nats and long-document tasks by 3–6 pp compared to truncate-and-drop [Ding2024FewerTruncations].
[counter to Camp B: always-mixed lengths (anti-curriculum)] Under equal compute, always-mixed lengths over the full run cost 20–40% more wall-clock than a short-to-long curriculum, and the long-context evaluations of Llama 3 and Qwen2.5 show no performance advantage for always-mixed training.
[counter to Camp D: FIM/denoising-style objectives as default (infilling] Code Llama [CodeLlama2023] observes 1-3pp downstream task degradation after enabling FIM on pure NL corpus. Existing FIM benefits are only reproduced in code scenario
Compression ratio and mean token length do not fully explain tokenizer quality differences [Goldman2024UnpackingTokenization][Schmidt2024MoreThanCompression]; a more stable explanation is segmentation-induced inductive bias and long-tail tr
Local merges for digits/dates break compositional structure: 10–20 pp gaps appear on 3–5 digit carry-sensitive arithmetic and temporal reasoning [Singh2024TokenizationCounts][Bhatia2025DateFragments][Nogueira2021Arithmetic], and should not
Local merges for digits/dates create reproducible reasoning gaps: 10–20 pp differences on 3–5 digit carry-sensitive arithmetic and temporal reasoning [Singh2024TokenizationCounts][Bhatia2025DateFragments], so vocabulary structure can domina
Under-trained tail tokens are not rare edge cases: 3k–10k+ are observable across models and require scan-and-repair with short continued pretraining [LandBartolo2024Magikarp]; this makes tokenizer versioning and post-training governance a standing engineering requirement.
[counter to Camp B: bigger vocab is near-monotonic; default to 256K+] Refinement to c-f70d90a418 / c-83bd08f6d0: public evidence supports 128K yielding 0.02–0.04 nats lower loss [Dubey2024Llama3], but controlled 32K/64K/128K/256K+ fixed-com
[counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Counter to c-2b8f288a97 / c-f7fdd5ced2: tokenizer-free is mechanistically clean, but shipping requires stronger controlled comparisons—especially fixed-compute parity agai
[counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Refinement to c-d35e1355ac: the strongest public evidence today is that tail issues exist and are scannable/repairable [LandBartolo2024Magikarp], not that shrinking v
HumanEval/MBPP pass@k should not be the primary SFT-ready metric: EvalPlus shows weak tests create structural false positives; external comparisons should at least report EvalPlus, with raw HumanEval/MBPP demoted to a regression signal.[EvalPlus2023]
[counter to Camp A: HumanEval/MBPP is sufficient—cheap, stable, reproduc] Counter to c-3c1d1582fe / c-fc0eb62ec3 / c-90f8f3ed32: weak tests systematically inflate scores—EvalPlus shows solutions can fail under stronger tests;[EvalPlus2023]
[counter to Camp C: pretrain BPB/patch-PPL best predicts SWE; downstream] Correction to c-23dce50372 / c-867a4775a4 / c-b6d1926f11: public evidence does not establish a systematic correlation strong enough to let patch-PPL replace downstream evaluation.
[counter to Camp D: deployment UX/cost metrics reflect value better than] Counter to the strong version of c-c980753890 / c-88b722216c: UX/cost alone loses a success anchor—systems can appear more stable by being more conservative or more e
[counter to Camp A: synthetic-first (a primary route under data constrai] Counter to c-b93955c99f: public recipes do not show “synthetic can replace real-world coverage.” When synthetic becomes replacement rather than an incremental layer,
[counter to Camp B: web-heavy backbone + (real/synthetic) mid-train (a m] Correction to c-958f96c37a: gains from high-density synthetic are easily confounded with an undertrained backbone. If the backbone is not trained sufficiently, any la
[counter to Camp C: minimize synthetic; stronger filtering + more real d] Counter to c-1da7d58c52: filtering can only select from existing pools and cannot actively expand scarce distributions. For math/code/long context, public evidence lo
[counter to Camp D: synthetic scales almost without bound; collapse is m] Counter to c-8768256855 / c-fea6da20b0 / c-588cd52729: public evidence supports only “with an accumulate-real pool policy, degradation is not inevitable,” which is not equivalent to “synthetic scales without bound.”
[counter to Camp B: Transformer state cost is near its limit; move to re] Public evidence shows intra-Transformer “KV reduction + locality” already covers a large fraction of serving bills: ≤70B with default GQA reduces cache by h_kv/h [Ain
[counter to Camp C: default a second scaling path—grow first, then decid] Public negative results and boundary conditions are scarce: when growth breaks generality, makes alignment harder, or introduces new instabilities is underreported, w
Introducing process data such as commits/PRs/issues during pretraining can raise RL-stage sample efficiency by more than 3× and double the convergence speed of reward signals [SWERL2025][LingmaSWEGPT2024].
With the same base model and post-training configuration, optimizing training-data shape yields 1.8–2.5× the SWE-bench gain of pure scaffolding optimization [Agentless2024][SWEagent2024].
[counter to Camp A: Inference-time scaffolding and test-time compute are] Agentless [Agentless2024] achieves more than 85% of the performance of complex agents with a simple pipeline; when the base model lacks cross-file priors, no amount of scaffold complexity closes the gap.
[counter to Camp B: RL and verifiers are the true drivers] Experiments from SWERL [SWERL2025] show that when pretraining does not include commit-shaped data, the exploration space of RL expands by more than 10×, and sample efficiency drops correspondingly.
[counter to Camp C: just mix more code (code is all you need)] Comparative experiments from StarCoder2TheStackV22024 [StarCoder2TheStackV22024] show that under the same code token scale, the cross-file repair rate of models trained with repo-structured (process-shaped) data is markedly higher.
[counter to Camp D: data shape first (repo/patch/process/execution first] There is currently a lack of strictly controlled head-to-head ablations, making it impossible to quantify the individual gain of each data component. The pretraining
Treating “bigger vocab = better compression = better model” as default is misleading: compression correlates with quality but is insufficient [Goldman2024UnpackingTokenization], leaving room for non-compression mechanisms [Schmidt2024MoreThanCompression].
[counter to Camp B: bigger vocab is near-monotonic; default to 256K+] Two evidence lines push “monotonic” toward “structure first, size second”. (1) Compression correlates with quality but is insufficient [Goldman2024UnpackingTokenization],
[counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] The key bar for tokenizer-free is not “trainability” but “cost-effectiveness under matched FLOPs and latency budgets”. BLT fills part of the evidence gap [Pagnoni2024BLT],
[counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Controlled evidence that directly pins down “smaller vocab → more stable alignment” is still limited; much of it is mechanistic reasoning and systems plausibility. An
On compositional long-context tasks, many nominal-128K models plateau at roughly 32K effective context; this does not contradict high NIAH hit rates because RULER measures “can compose and use,” not just “can find” [Hsieh2024RULER].
The 20+ pp U-shaped drop in Lost in the Middle shows that effective context is not a monotonic function of window size; when evidence sits in the middle, attention allocation and training-distribution bias jointly amplify degradation [Liu2023LostInTheMiddle].
Related-document packing is not merely a throughput optimization; by increasing the density of cross-span repetition, alignment, and reference events, it provides weak supervision for ICL and long-context use, consistent with the mechanisms
Retrieval, external memory, and architecture switches are often more compute-efficient on retrieval-heavy tasks, but public evidence is still insufficient to show that they have replaced strong Transformer long-context models on effective-c
[counter to Camp A: PE / extrapolation is enough; long context is mainly] The rebuttal to c-2218c6a6ff / c-0e06feed14 is not that PE is useless, but that PE mainly solves “can run long.” Fu et al. [Fu2024DataEngineering] and Xiong et al. [X
[counter to Camp B: the data recipe is the main variable; long-document ] What needs correction in c-acb79e4a69 is that data is not the whole story. Shi et al. [Shi2023InContextPretraining] and Staniszewski et al. [Staniszewski2023Structure
[counter to Camp C: packing / concatenation is underestimated; sequence ] c-b51a8309a9 needs a careful correction: the mechanism is plausible, but direct causal evidence is still limited. The crawler’s open question states this explicitly:
[counter to Camp D: changing the architecture or system boundary is more] The key rebuttal to c-083d546514 / c-f5078308ed is the evaluation lens. Public evidence is stronger on maximum length, complexity, or retrieval-heavy tasks than on ef
Writing out-of-doc Z back into training context C and masking C from loss (100% evidence masked) targets prior-filling more directly than merely increasing window length; otherwise the model treats evidence text as target y, and repeated-data memorization effects dominate over prior-filling.
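A minimal sketch of the masking arithmetic, assuming an HF-style cross-entropy that ignores label −100; tokenization and EOS handling are elided, and the token ids are placeholders.

```python
import torch

IGNORE_INDEX = -100  # ignored by torch.nn.CrossEntropyLoss(ignore_index=-100) and HF loss helpers

def build_example(evidence_ids: list[int], target_ids: list[int]):
    """Evidence Z goes into the context but contributes zero loss; only y is a target."""
    input_ids = torch.tensor(evidence_ids + target_ids)
    labels = torch.tensor([IGNORE_INDEX] * len(evidence_ids) + target_ids)
    return input_ids, labels

input_ids, labels = build_example(evidence_ids=[11, 42, 7, 7], target_ids=[5, 9, 2])
# The model conditions on all 7 tokens, but cross-entropy is computed on the last 3 only,
# which is the "100% evidence masked" setting rather than plain next-token loss over Z + y.
```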
[counter to Camp A: keep Classic NTP + scale; hallucination is mostly so] This explains average metric gains but under-explains “window is enough yet hallucination persists”: long-form factual errors accumulate with length [LongFormFactuali
[counter to Camp B: HDP / retrieval-aware pretraining—make retrieval and] Main risks are cost and unclear controls: training-time retrieval adds system complexity and lacks apples-to-apples comparisons against inference-time RAG under matched budgets.
[counter to Camp C: Method-2 rewriting / reverse prompt-plan—structured ] Negative transfer risks are subtle: distribution shift, templating, and “cleaner-looking but detail-losing” rewrites. Also, when tasks require citable evidence (citat
[counter to Camp D: trajectory distillation / self-reflection first—CoT ] Two negative results should be treated as default constraints: imitating stronger-model outputs does not imply capability transfer [FalsePromise2023]; CoT may not tra
[counter to Camp A: Long context only requires engineering length scalin] Experiments from this camp are all based on unstructured stitched long context, without structured augmentation. Stack v2 [2402.19173] shows that a 32k context model
[counter to Camp B: Hyper-Doc pretraining is a collection of scattered m] Methods from this camp are all single-cell implementations in the 2×3 matrix, only covering specific task scenarios. The 4D framework can integrate all existing methods.
[counter to Camp C: Inference-time RAG can completely replace training-t] Inference-time RAG has problems such as high retrieval overhead, middle information loss, and context length limits. Stack v2 [2402.19173] shows that a code model wit
After replacing exact-match-style discrete metrics with continuous surrogates, many “jumps” from Wei et al. [Wei2022Emergent] collapse into monotonic curves; only bends that survive this swap should be treated as candidate real thresholds [Schaeffer2023Mirage].
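The metric-swap argument can be reproduced in a few lines with purely illustrative numbers: a per-token accuracy that improves smoothly along a made-up scale axis yields an exact-match curve that looks like a sudden jump.

```python
import numpy as np

scales = np.logspace(0, 4, 20)                 # arbitrary "model scale" axis
per_token_acc = 1.0 - 0.5 * scales ** -0.3     # smooth power-law improvement (illustrative)
seq_len = 30
exact_match = per_token_acc ** seq_len          # all 30 tokens must be right at once

for s, p, em in zip(scales[::4], per_token_acc[::4], exact_match[::4]):
    print(f"scale={s:9.1f}  per-token={p:.3f}  exact-match={em:.3f}")
# per-token accuracy climbs gradually; exact match sits near zero and then "emerges" late.
```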
For downstream extrapolation, pretraining loss is a more transferable state variable than compute; it aligns trajectories more consistently across architectures, token budgets, and dense/sparse settings [Du2024LossPerspective].
If the decision is data-mix screening, observational scaling is often cheap enough; if the decision is architecture or mid-train recipe, an in-house model ladder should take priority, otherwise extrapolation error is dominated by recipe mismatch.
A single power law is not the default truth but the default approximation; when task curves are multi-phase, slope-reversing, or thresholded, a broken law is the more honest fit and the better basis for ship/no-ship risk bands [Caballero2023Broken].
Pretraining curves are not the same as product curves; once SFT, preference optimization, or distillation enters, the loss-to-task mapping can reorder by task family, so final acceptance needs at least one extra transfer calibration layer [
[counter to Camp A: Downstream abilities are fundamentally threshold-emergent] The rebuttal has two layers. First, Schaeffer et al. [Schaeffer2023Mirage] show that many jumps come from discrete metrics. Second, Du et al. [Du2024LossPerspective] relocate the remaining thresholds onto the loss axis rather than the compute axis.
[counter to Camp B: The compute axis is already sufficient; there is no ] Du et al. [Du2024LossPerspective] revise this view: compute is closer to resource accounting than to learning state. Under architecture, token-budget, or sparsity changes, loss aligns downstream trajectories more consistently than compute does.
[counter to Camp C: Observational scaling over public models is sufficient] Bhagia et al. [Bhagia2024ModelLadders] do not argue that observational methods are useless, but that they are more fragile under recipe shift. If the public pool lacks models trained under the target recipe, extrapolation inherits that recipe gap.
[counter to Camp D: A single power law covers most tasks; broken laws ar] The correction from Caballero et al. [Caballero2023Broken] is that when slope changes come from real mechanism shifts, the error of a single power law is not random noise but systematic bias.
The main gain from V2 to V3 does not come from MLA or MoE alone, but from the coupled package of MLA, fine-grained MoE, shared experts, and a more aggressive training stack; reproducing any one component in isolation usually does not recover the full gain.
MLA is not an unconditional replacement for GQA: it pays off most in long-context, KV-limited decode regimes; at larger batch and shorter context, the extra projections and latent path eat into the gain [DeepSeekAI2024V2][Mistral7B2023].
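A back-of-the-envelope comparison that makes the KV-limited-decode argument concrete; the head counts, dimensions, layer count, and the 128k context below are illustrative placeholders, not any model's published configuration:

```python
# Per-token KV-cache size, per layer, in a given dtype. GQA caches keys+values
# for n_kv_heads; an MLA-style cache stores one compressed latent vector plus a
# small decoupled RoPE key. All dimensions below are illustrative placeholders.
BYTES_PER_ELEM = 2          # fp16 / bf16

def gqa_cache_per_token(n_kv_heads: int, head_dim: int) -> int:
    return 2 * n_kv_heads * head_dim * BYTES_PER_ELEM      # K and V

def mla_cache_per_token(latent_dim: int, rope_key_dim: int) -> int:
    return (latent_dim + rope_key_dim) * BYTES_PER_ELEM    # shared latent + RoPE key

layers, ctx = 60, 128_000
gqa = layers * ctx * gqa_cache_per_token(n_kv_heads=8, head_dim=128)
mla = layers * ctx * mla_cache_per_token(latent_dim=512, rope_key_dim=64)
print(f"GQA KV cache @128k ctx: {gqa / 2**30:.1f} GiB")
print(f"MLA KV cache @128k ctx: {mla / 2**30:.1f} GiB")
# The gap is why MLA pays off most in KV-limited, long-context decode; the extra
# up/down projections it requires are the cost side of the trade at short context.
```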
The stable default in DeepSeekMoE is closer to “finer experts plus 1–2 shared experts” than to increasing the shared ratio indiscriminately; too many shared experts collapse toward dense behavior, while too few make training and generalization harder.
Aux-loss-free balancing is not truly free: it reduces direct interference with the language-model loss, but moves part of the complexity into bias updates, a sequence-level backstop term, and harder-to-reproduce training dynamics [DeepSeekAI2024V3].
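A minimal sketch of the bias-update side of aux-loss-free balancing, assuming DeepSeek-V3-style behavior where a per-expert bias enters only the top-k selection (not the output weighting) and is nudged against the measured load; shapes, the step size `gamma`, and the routing details are illustrative assumptions:

```python
# Sketch of aux-loss-free load balancing: a per-expert bias b_i is added to the
# affinity scores ONLY when picking the top-k experts, not when weighting their
# outputs, and b_i is nudged after each batch toward equalizing expert load.
import torch

def route(scores: torch.Tensor, bias: torch.Tensor, k: int):
    # scores: [tokens, experts] token-expert affinities; bias: [experts]
    topk = torch.topk(scores + bias, k, dim=-1).indices             # selection uses bias
    gate = torch.zeros_like(scores).scatter(-1, topk, 1.0)
    weights = torch.softmax(scores.masked_fill(gate == 0, float("-inf")), dim=-1)
    return topk, weights                                             # weighting ignores bias

def update_bias(bias: torch.Tensor, topk: torch.Tensor, n_experts: int, gamma: float = 1e-3):
    load = torch.bincount(topk.flatten(), minlength=n_experts).float()
    # Overloaded experts get their bias pushed down, underloaded experts up.
    return bias - gamma * torch.sign(load - load.mean())

if __name__ == "__main__":
    tokens, n_experts, k = 1024, 16, 2
    bias = torch.zeros(n_experts)
    scores = torch.randn(tokens, n_experts)
    topk, weights = route(scores, bias, k)
    bias = update_bias(bias, topk, n_experts)
```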
The novelty of R1 is not the slogan that “RL beats SFT,” but an executable RL-first pipeline: DeepSeekMath’s GRPO removes the critic, R1-Zero shows pure RL can start, and R1 then uses cold-start data and multi-stage RL to repair readability and language mixing.
[counter to Camp A: MLA will become a general replacement for GQA] The counterpoint uses the GQA route: Mistral 7B [Mistral7B2023] shows that simpler KV sharing is already sufficient in many short-context, high-batch regimes, and MLA’s extra projections are not worth their cost there.
[counter to Camp B: many-expert plus shared experts is the stable endgame] The counterpoint has two parts: Mixtral [Mixtral2024] achieves strong results with coarser experts, and Lu et al. [NotAllExpertsEqual2024] further show that expert utilization is highly uneven, so some experts can be pruned with little loss.
[counter to Camp C: data quality mainly comes from curated mixtures, not] The counterpoint comes from three directions: RefinedWeb [RefinedWeb2023] argues that high-quality web-only data can match curated corpora; Dolma [Dolma2024] argues t
[counter to Camp D: the main path for reasoning has shifted from SFT/RLHF] The counterpoint is that process supervision and conventional alignment have not stopped working. Lightman et al. [LetsVerify2023], Math-Shepherd [MathShepherd2023],
We present LongLoRA, an efficient fine-tuning approach that extends the context sizes of pre-trained large language models.
Beyond 32K, single-needle NIAH is near saturation for a range of frontier models; RULER’s multi-key, variable-tracking, and aggregation subtasks still open 20–30+ point gaps, so “can find one needle” is not a sufficient proxy for effective context length.
A small subset of retrieval heads carries most long-range retrieval; on LLaMA-2, Mistral, and Yi, masking the top 5% retrieval heads drives NIAH-style performance close to random, showing that long-context factuality depends on sparse speci
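A hedged sketch of the head-ablation probe behind this kind of claim, assuming a Llama-style HuggingFace model where the concatenated head outputs can be zeroed just before `o_proj`; the model name and the (layer, head) list are placeholders, and a real retrieval-head study first scores heads for copy-from-context behavior before masking the top few percent:

```python
# Sketch: ablate chosen attention heads by zeroing their slice of the
# concatenated head outputs right before the output projection (o_proj).
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "meta-llama/Llama-2-7b-hf"            # placeholder; any Llama-style model
HEADS_TO_MASK = {(14, 3), (15, 11), (20, 7)}  # (layer, head) pairs, illustrative

model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
tok = AutoTokenizer.from_pretrained(MODEL)
head_dim = model.config.hidden_size // model.config.num_attention_heads

def make_pre_hook(heads):
    def pre_hook(module, args):
        hidden = args[0].clone()              # [batch, seq, num_heads * head_dim]
        for h in heads:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,)
    return pre_hook

for layer_idx, layer in enumerate(model.model.layers):
    heads = [h for (l, h) in HEADS_TO_MASK if l == layer_idx]
    if heads:
        layer.self_attn.o_proj.register_forward_pre_hook(make_pre_hook(heads))

inputs = tok("The passkey is 7421. ... What is the passkey?", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=8)
print(tok.decode(out[0], skip_special_tokens=True))
```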
[counter to Camp A: NIAH can still serve as the primary long-context metric] The counter comes from [RULER2024], [LongBenchV22024], [BABILong2024], and [Goldman2024RetrievalOnly]: single-needle retrieval covers only one path, while tracking, aggregation, and multi-key retrieval expose much larger gaps.
[counter to Camp B: Long context is mostly a retrieval problem, and RAG ] The counter comes from [LongBenchV22024], [BABILong2024], [LooGLE2023], and [ZeroSCROLLS2023]: once the task requires cross-span aggregation, comparison, timeline tracking, or global synthesis, retrieve-then-read pipelines stop being an adequate substitute for reading the full context.
[counter to Camp C: Lost-in-the-middle is mainly a positional-encoding problem] The counter comes from [LostMiddle2023], [StreamingLLM2023], [SinkEmergence2024], and [DataEngineering128K2024]: the U-shape mixes at least positional extrapolation limits, attention-sink dynamics, and training-data composition, so it is not a pure positional-encoding artifact.
[counter to Camp D: Long-context capability is distributed across the whole network] The counter comes from [Wu2024RetrievalHead] and [Olsson2022InductionHeads]: once masking a small set of retrieval heads can drive NIAH close to random, and induction heads form a similarly small, identifiable circuit, the picture of uniformly distributed long-context capability does not hold.
[counter to Camp A: Global filtering and deduplication are already strong enough] Programming Every Example [ProgrammingEveryExample2024] does not dispute that filtering works; it disputes that filtering is sufficient for high-value tail samples. G
[counter to Camp B: Low-value data should be pruned directly; repair is ] Programming Every Example [ProgrammingEveryExample2024] revises this claim: not all low-quality samples should be repaired; only those that are high-value but locally flawed deserve repair.
[counter to Camp C: A model-generated loop can directly take over quality control] Programming Every Example [ProgrammingEveryExample2024] needs stronger constraints in pre-training: in alignment, “more like the preferred answer” is not the same as objectively better pre-training text.
[counter to Camp D: High-density synthetic data is a more direct path than cleaning web data] Programming Every Example [ProgrammingEveryExample2024] pushes back on coverage: synthetic data works better for narrow skills and strongly formatted tasks, but open-domain coverage of the long tail still leans on real web data.
In public evidence, depth-looping has moved beyond toy models to 3.5B-scale, 800B-token pretraining; the claim that shared recurrence only works in toy settings is directly contradicted by Huginn [Geiping2025Huginn].
At matched FLOPs, loop gains look closer to “1.1–1.3× compute for roughly 1× parameter-equivalent” than to “parameter sharing nearly replaces dense scaling for free”; Huginn, MoR, and depth-vs-width counter-evidence jointly support this reading.
The clearest value proposition of looping is at inference time: increasing loop count on the same checkpoint yields monotonic gains on reasoning benchmarks, making it a test-time compute knob parallel to, but not equivalent to, explicit CoT
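A toy sketch of the loop-count knob, assuming a Huginn-style prelude / weight-shared core / coda split; it omits input re-injection and stochastic unrolling, and the sizes are arbitrary, so it only demonstrates that loop count is a free inference-time parameter on a fixed checkpoint:

```python
# Toy sketch of a depth-looped block: the same weight-tied "core" is applied
# r times between a prelude and a coda, so r becomes an inference-time compute
# knob. The plain TransformerEncoderLayer core and dimensions are illustrative.
import torch
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, vocab=32000, d_model=256, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab, d_model)                        # "prelude"
        self.core = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.head = nn.Linear(d_model, vocab)                            # "coda"

    def forward(self, ids: torch.Tensor, loops: int) -> torch.Tensor:
        h = self.embed(ids)
        for _ in range(loops):                                           # same weights every pass
            h = self.core(h)
        return self.head(h)

model = LoopedLM()
ids = torch.randint(0, 32000, (1, 16))
with torch.no_grad():
    for r in (1, 4, 16):                                                 # more loops = more test-time compute
        print(r, model(ids, loops=r).shape)
```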
Latent-looping and depth-looping are not the same proposition: the former mainly competes with explicit CoT token budgets, while the latter mainly competes with dense scaling or adaptive depth; failing to separate them leads to wrong attrib
[counter to Camp A: Looping can largely substitute for adding parameters] The counter comes from Levine et al. [Levine2020DepthWidth], Liu et al. [Liu2020VeryDeepTransformers], and Kaplan et al. [Kaplan2020ScalingLaws]: self-attention may need width and total parameters as much as effective depth, so weight-shared loops cannot fully substitute for added parameters.
[counter to Camp B: Fixed loop counts are enough; adaptive routing only ] The counter comes from MoR [Bae2025MoR], ACT [Graves2016ACT], Depth-Adaptive Transformer [Elbayad2019DepthAdaptive], LayerSkip [Elhoushi2024LayerSkip], and gate-training difficulties: adaptive loop counts do help, but the halting/routing signal is hard to train reliably.
[counter to Camp C: The real loop should live in latent state, not in the token stream] The counter is not that latent reasoning fails, but that it answers a different question. Huginn [Geiping2025Huginn], RRT [Bae2024RRT], and MoR [Bae2025MoR] care about reusing depth and parameters under a fixed token budget, not about moving the chain of thought out of tokens and into latent state.
[counter to Camp D: Looping gains come from recurrent inductive bias, no] The counter comes from Liu et al. [Liu2022AutomataShortcuts], Csordás et al. [Csordas2021SystematicGeneralization], and Furrer et al. [Furrer2020CompositionalGeneralization]: recurrence-flavored inductive bias does not by itself deliver compositional or length generalization.
puzzling phenomena like ``aha moments'', ``length-scaling'', and entropy dynamics
high-quality training data is scarce... decontaminated evaluation of Software Engineering Agents
[counter to Camp A: Outcome-only RLVR is enough] Liu et al. [R1ZeroCritical2025] refute the broad extrapolation: without matched base model, data, distillation, and inference compute, RL alone cannot be credited for the gain. From Reasoning
[counter to Camp B: Process and step rewards] PRMs introduce bias from a second model, and public experiments still lack direct comparisons of outcome, process, step, turn, and attribution rewards under identical models, tasks, and budgets.
[counter to Camp C: Attribution and causal credit] Attribution scores do not automatically equal causal contribution. Tree-structured CA [TreeCA2025] and CARL [CARL2025] are better suited to structured or replayable trajectories; in open web environments, trajectories are hard to replay and attribution scores are noisier and harder to validate.
[counter to Camp D: Agentic RL versus non-RL agent baselines] ReAct [ReAct2022], Toolformer [Toolformer2023], and WebGPT [WebGPT2021] show that prompting, self-supervised tool traces, imitation, and human feedback remain strong controls. Without matched comparisons against these non-RL baselines, agentic-RL gains are easy to overstate.
enables Reinforcement Learning (RL)-based training of Large Language Models (LLMs) for any AI agent
Reusing model parameters through recursive refinement to extract stronger performance from large multimodal models.
RoPE allows length extrapolation but decays attention at extreme lengths.