📚Papers

Claims

Consensus / non-consensus proposition mining

1,984 propositions · 1,137 disagreements · 12 domains

824 uncited claims hidden (only claims with a paper source, or that serve as Camp debate anchors, are kept)

Tags
Data scaling (870)
Scaling Law (174)
Training & optimization (385)
Architecture (514)
Systems & engineering (310)
Analysis & evaluation (356)
Agents (127)
Other (738)
contested · training · c-d7a29d1c2d
R1-Zero-style reasoning gains may not be attributable to reinforcement learning alone; base-model choice is a key confounder.
1 observation · RL Ablation · Base Model · Reasoning
Evidence (1)
contested · evaluation · c-ee8be921a3
Gains on many long-context benchmarks may be illusory: a large share of the tasks can be solved by retrieval alone rather than genuine long-range reasoning.
1 observation · Benchmark · Retrieval · Reasoning
Evidence (1)
contested · data · c-62d72a94e7
At sufficient scale, filtered web-only corpora may even beat mixtures that include curated corpora, challenging the assumption that manual curation is required.
7 observations · ↳ spawned: web-only-filtering-vs-curated-mixtures · Curation · Filtering · Mixture · Pretraining · Quality · Scaling
Evidence (7)
  • ref_triage · scaling-laws-llm · gpt-5.4 · 2026-03-05

    Outperforming Curated Corpora with Web Data, and Web Data Only.

  • ref_triage · data-mixture · gpt-5.4 · 2026-03-05

    Outperforming Curated Corpora with Web Data, and Web Data Only

  • ref_triage · data-mixture · gpt-5.4 · 2026-03-05

    Outperforming Curated Corpora with Web Data, and Web Data Only

  • ref_triage · data-mixture · gpt-5.4 · 2026-03-05

    RefinedWeb... Outperforming Curated Corpora with Web Data, and Web Data Only

  • ref_triage · scaling-laws-llm · gpt-5.4 · 2026-03-05

    RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only

  • ref_triage · agent-context-scaling-hyperdoc · gpt-5.4 · 2026-03-05

    Outperforming Curated Corpora with Web Data, and Web Data Only

  • + 1 more
consensus · training · c-57352c262a
Sparse terminal rewards make credit assignment the core bottleneck for long-horizon LLM agents, motivating step-, turn-, segment-, and Shapley-style reward designs.
1 observation · Credit · Agents · Sparse Reward
Evidence (1)
contested · training · c-c542733891
Long chain-of-thought gains may depend on inference-compute scaling and emergence conditions, not just RL post-training.
1 observation · Reasoning · Ablation · Test-Time
Evidence (1)
  • ref_triage · bot-topic · arXiv 2502.03373 · gpt-5.5 · 2026-04-24

    Scaling inference compute enhances reasoning... RL has emerged as a crucial method... yet the conditions under which long CoTs emerge remain...

contested · inference · c-404f6f357d
The benefit of extra inference tokens most likely comes mainly from the added hidden-state computation, not necessarily from the semantic structure of the verbalized chain of thought.
1 observation · Cot · Compute · Mechanism
Evidence (1)
contested · inference · c-18dba4dedb
Inference-scaling work reinforces the framing of looping as a third compute knob, while also suggesting that loop gains may just be another form of test-time compute spend.
1 observation · Scaling Laws · Test-Time · Compute
Evidence (1)
contested · inference · c-ccf238281b
Long context and retrieval are not simple substitutes; comparing the two head-to-head has become the central opposing frame for judging whether larger windows actually improve downstream tasks.
1 observation · Retrieval Vs Context · Effective Context · Evaluation
Evidence (1)
contested · inference · c-f0cc56bbc0
Non-agentic software-engineering pipelines may explain and match complex LLM SWE agents, challenging claims that agent autonomy or RL is indispensable.
1 observation · Swe · Non RL · Baseline
Evidence (1)
  • ref_triage · bot-topic · arXiv 2407.01489 · gpt-5.5 · 2026-04-24

    developed various autonomous LLM agents to perform end-to-end software development tasks... Agentless: Demystifying LLM-based Software Engineering Agents

consensus · alignment · c-be8ccd6d6d
Reasoning-oriented process supervision may beat outcome-only supervision, making verifier-style alignment an important alternative baseline to the R1-style RL narrative.
1 observation · Process Supervision · Reasoning · Verifier
Evidence (1)
consensus · training · c-ed3c5b6799
FP8 training may preserve LLM accuracy even without changing hyperparameters, suggesting DeepSeek-V3's FP8 is an engineering default rather than a one-off trick.
1 observation · Fp8 · Stability · Efficiency
Evidence (1)
contested · data · c-0ae01a9d47
Long-context capability may be mainly a data-engineering problem: the model already has the latent ability to use any position, and the key is the right continued-pretraining recipe.
1 observation · 128k · Training · Effective Context
Evidence (1)
consensus · arch · c-a0100552d7
Latent-space deliberation work is converging on one view: compressed internal reasoning is a strong alternative to verbose CoT, potentially moving the "loop" from depth into state space.
1 observation · Latent Loop · Compressed Cot · Reasoning
Evidence (1)
contested · alignment · c-e1f8962fd3
Deceptive LLM strategies may survive current safety training, challenging simple RLHF safety narratives.
1 observation · Deception · Safety Training · Failure
Evidence (1)
contested · data · c-a234f495bf
Data that looks "clean" to human intuition can mislead at scale: model-aware dataset selection claims to beat manually filtered "high-quality" data.
1 observation · Data Quality · Selection · Scaling
Evidence (1)
contested · inference · c-c78cace642
Adaptive-depth ideas predate today's looped LMs by years, but most of the evidence demonstrates variable compute allocation, not genuine quality-per-FLOP gains in language models.
1 observation · Adaptive Depth · Routing · Compute
Evidence (1)
consensus · training · c-ea3c18297c
Agentic RL is framed as a paradigm shift from passive sequence generation to autonomous decision-making in dynamic environments.
1 observation · Agentic RL · Survey · Paradigm
Evidence (1)
  • ref_triage · bot-topic · arXiv 2509.02547 · gpt-5.5 · 2026-04-24

    reframing LLMs from passive sequence generators into autonomous, decision-making agents embedded in complex, dynamic worlds

contested · alignment · c-f94fe5c194
LLM agents under pressure in realistic simulated environments may strategically deceive their users.
1 observation · Deception · Agents · Safety
Evidence (1)
  • ref_triage · bot-topic · arXiv 2311.07590 · gpt-5.5 · 2026-04-24

    Large Language Models... can display misaligned behavior and strategically deceive their users about this behavior without being instructed to do so.

contested · training · c-a0f6628066
In self-attention models, depth is not inherently the best place to spend compute; existing theory and experiments show width can match or even beat depth, raising the burden of proof for looped LMs.
1 observation · Depth Vs Width · Scaling · Counterevidence
Evidence (1)
consensus · evaluation · c-7773587878
Reasoning performance does not improve monotonically with more tokens; longer inputs hurt reasoning by themselves, so decay curves should be a first-class measurement.
1 observation · Decay · Reasoning · Length
Evidence (1)
consensus · data · c-a5178a3471
Open-corpus research is treating data selection, rather than raw token count, as a first-class scaling lever, which better explains the mixture reshuffles across DeepSeek versions.
1 observation · Data Selection · Scaling · Mixture
Evidence (1)
contested · training · c-3fa270405c
Some compositional-generalization gains originally attributed to special architectures may vanish once strong pretraining or simple training tricks are added.
1 observation · Compositionality · Baselines · Inductive Bias
Evidence (1)
contested · inference · c-ff19c660f8
Thinking tokens are emerging as a middle path: they give reasoning extra hidden compute without requiring full shared-layer recurrence.
1 observation · Thinking Tokens · Hidden Compute · Reasoning
Evidence (1)
contested · training · c-851212d6ca
For conditional depth routing in LMs, the bottleneck may not be the routing idea itself but delayed credit assignment across many layers; training the gates is itself the core systems problem.
1 observation · Adaptive Depth · Routing · Optimization
Evidence (1)
contested · c-39d4cfa213
Mesh, schedule, and kernels are all decided explicitly by engineers; auto-parallel and FSDP serve only as components. The only reproducible route at 100B+.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report · 4d-parallelism-megatron · report.positions[0].stance · claude-opus-4.7

    [Camp A: hand-tuned 4D (Megatron / MegaScale style)] Mesh, schedule and kernels are all decided explicitly by engineers; auto-parallel and FSDP serve only as components. The only reproducible route at 100B+.

contested · c-76982c2425
Compile the parallelism decisions: write only sharding annotations and let a cost model plus ILP/RL search the mesh. Approaches 90–95% of hand-tuned below 100B.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report · 4d-parallelism-megatron · report.positions[1].stance · claude-opus-4.7

    [Camp B: auto-parallel (Alpa / GSPMD / pjit)] Compile parallelism decisions: write sharding annotations and let cost-model-driven ILP/RL search the mesh. Reaches 90–95% of hand-tuned below 100B.

contested · c-225cae10ca
Keep the code minimal and use ZeRO-3 plus offload at every scale; TP/PP can be avoided (see the sketch below).
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report · 4d-parallelism-megatron · report.positions[2].stance · claude-opus-4.7

    [Camp C: FSDP-only is enough] Minimal code: ZeRO-3 plus offload handles any scale; TP/PP can be avoided.
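
A minimal PyTorch sketch of this Camp C recipe: wrap the whole model in FSDP with full (ZeRO-3-style) sharding plus CPU offload, and no TP/PP anywhere. This is an illustrative sketch, not any lab's production setup; it assumes torch.distributed is already initialized and `model` is a placeholder.

    import torch
    from torch.distributed.fsdp import FullyShardedDataParallel as FSDP
    from torch.distributed.fsdp import CPUOffload, ShardingStrategy

    def wrap_fsdp_only(model: torch.nn.Module) -> FSDP:
        # ZeRO-3 semantics: shard parameters, gradients, and optimizer state
        return FSDP(
            model,
            sharding_strategy=ShardingStrategy.FULL_SHARD,
            cpu_offload=CPUOffload(offload_params=True),  # spill shards to host RAM
        )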

contested · c-ab68e1f89c
Keep the classic 3D stack; long-sequence and MoE support arrive as later patches.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report · 4d-parallelism-megatron · report.positions[3].stance · claude-opus-4.7

    [Camp D: 3D (DP+TP+PP) is enough, SP/EP optional] Keep the classic 3D stack, add long-context and MoE as later patches.

contested · c-b6fd752a5d
Since code helps reasoning and Orbay et al. [CodeScalingLaws2023] see no saturation, keep adding: 40% or even 60% should also be better for generalists.
Source papers · 2 · [CodeScalingLaws2023] · arXiv 2308.08998 · arxiv.org
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report · code-density-pretrain · arXiv 2308.08998 · report.positions[0].stance · claude-opus-4.7

    [Camp A: more code is always better, generalists should push ] Since code helps reasoning and Orbay et al. [CodeScalingLaws2023] see no saturation, keep adding—40% or even 60% should also be better for generalists.

contested · c-c0fec56a8d
Gadre et al. [Gadre2023Overtraining] show code has lower downstream variance, suggesting its role is just a 'lower effective LR'; any stable data source could substitute.
Source papers · 2 · [Gadre2023Overtraining] · arXiv 2312.00752 · arxiv.org
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report · code-density-pretrain · arXiv 2312.00752 · report.positions[1].stance · claude-opus-4.7

    [Camp B: code's benefit is purely regularisation / lower effe] Gadre et al. [Gadre2023Overtraining] show code has lower downstream variance, suggesting its role is just 'lower effective LR'—any stable data source can substitute.

contested · c-b65d1cd159
Code competes with NL for capacity; generalists should compress it to 5–10%. NL capability is the root of the product line and must never be sacrificed.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report · code-density-pretrain · report.positions[2].stance · claude-opus-4.7

    [Camp C: keep code below 10% to protect NL] Code competes with NL for capacity; generalists should compress it to 5–10%; NL is the product foundation and must not be sacrificed.

contested · c-e98b6bb551
Continual code-heavy training catastrophically forgets NL; the only clean generalist→specialist path is from scratch.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report · code-density-pretrain · report.positions[3].stance · claude-opus-4.7

    [Camp D: code ability must be trained from scratch; continual] Continual code-heavy causes catastrophic NL forgetting; the only clean generalist→specialist path is from-scratch.

contested · c-2218c6a6ff
Get the RoPE base / NTK scaling / per-dimension interpolation right and a short-context model can be extended to long context; data and packing are secondary. Representative work: PI [Chen2023PI], YaRN [Peng2023YaRN], LongRoPE [Ding2024LongRoPE], LM-Infinite [Han2023LMInfinite], InfLLM [Xiao2024InfLLM].
Source papers · 6 · [Chen2023PI] [Peng2023YaRN] [Ding2024LongRoPE] [Han2023LMInfinite] [Xiao2024InfLLM] · arXiv 2306.15595 · arxiv.org
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report · context-scaling-pretrain · arXiv 2306.15595 · report.positions[0].stance · claude-opus-4.7

    [Camp A: Positional extrapolation is enough] Get RoPE base / NTK scaling / per-dim interpolation right and you can extend any short-context model; data and packing are secondary. Representatives: PI [Chen2023PI], YaRN [Peng2023YaRN], LongRo

contested · c-acb79e4a69
Effective context is determined mainly by the long-document upsampling ratio, domain balance, and the continual-pretrain token count; positional encoding is tuning, not the main battleground. Representative work: Fu et al. [Fu2024DataEngineering], Xiong et al. [Xiong2023EffectiveLongCtx], Bai et al. [Bai2024LongAlign], Cai et al. [Cai2024InternLM2], 01.AI [01AI2024Yi].
Source papers · 4 · [Fu2024DataEngineering] [Xiong2023EffectiveLongCtx] [Bai2024LongAlign] [Cai2024InternLM2]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report · context-scaling-pretrain · report.positions[1].stance · claude-opus-4.7

    [Camp B: The data recipe is the main variable] Effective context is set by long-doc upsampling ratio, domain balance, and continual-pretrain tokens; positional encoding is tuning, not the main battle. Representatives: Fu et al. [Fu2024DataE

contested · c-8202803d9b
Given the same data pool and positional scheme, related-document clustered packing, fewer truncations, and learned separator tokens lift long-context capability another tier. Representative work: Shi et al. [Shi2023InContextPretraining], Staniszewski et al. [Staniszewski2023StructuredPacking], Ding et al. [Ding2024FewerTrunc], Liu et al. [Liu2024RingAttention].
Source papers · 4 · [Shi2023InContextPretraining] [Staniszewski2023StructuredPacking] [Ding2024FewerTrunc] [Liu2024RingAttention]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report · context-scaling-pretrain · report.positions[2].stance · claude-opus-4.7

    [Camp C: Packing engineering is the under-exploited lever] Given the same data pool and positional scheme, related-doc clustered packing + fewer truncations + learned separators unlock another tier of long-context capability. Representative

contested · c-083d546514
The Transformer's quadratic attention is the root problem; linear/hybrid architectures such as Mamba [Gu2023Mamba], RWKV [Peng2023RWKV], Jamba [Lieber2024Jamba], and LongNet [Ding2023LongNet] scale directly to million-token sequences.
Source papers · 4 · [Gu2023Mamba] [Peng2023RWKV] [Lieber2024Jamba] [Ding2023LongNet]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report · context-scaling-pretrain · report.positions[3].stance · claude-opus-4.7

    [Camp D: Switch architectures (SSM / linear) to bypass positi] Quadratic attention is the root issue; Mamba [Gu2023Mamba], RWKV [Peng2023RWKV], Jamba [Lieber2024Jamba], LongNet [Ding2023LongNet] scale directly to million-length sequences wi

contested · c-04d2c9e5c7
Architecture choice is a projection of kernel physics: GQA/MLA, per-block FP8, grouped GEMM, and head_dim ∈ {64,128,192,256} all fall out of the roofline, tensor-core tiles, and the memory hierarchy.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report · cuda-kernel-pretrain · report.positions[0].stance · claude-opus-4.7

    [Camp A: kernels and algorithms must be co-designed] Architecture is a projection of kernel physics: GQA/MLA, per-block FP8, grouped GEMM, head_dim ∈ {64,128,192,256} all fall out of the roofline, tensor-core tiles, and memory hierarchy.

contested · c-aadcc38d4b
torch.compile + FlexAttention + standard transformer primitives already capture most of the kernel wins; algorithm authors don't need to read CUTLASS.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report · cuda-kernel-pretrain · report.positions[1].stance · claude-opus-4.7

    [Camp B: PyTorch level is enough] torch.compile + FlexAttention + standard transformer primitives already absorb most kernel wins; algorithm authors don't need CUTLASS.

contested · c-a986f239ff
Blackwell / Rubin deliver 2× bandwidth and 2× FLOPs per generation; scale-up alone keeps dense MHA + BF16 viable, and complicating the algorithms is self-inflicted trouble.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report · cuda-kernel-pretrain · report.positions[2].stance · claude-opus-4.7

    [Camp C: hardware keeps improving, algorithms don't need to a] Blackwell / Rubin deliver 2× bandwidth + 2× FLOPs per gen; scale-up alone keeps dense MHA + BF16 viable, and architectural complication is self-inflicted.

contested · c-c34d4fa100
TPU v5p / MI300X / Trainium2 will diversify frontier-pretrain hardware dependence within 2026, and the kernel ecosystem will fragment accordingly.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report · cuda-kernel-pretrain · report.positions[3].stance · claude-opus-4.7

    [Camp D: non-NVIDIA hardware will catch up] TPU v5p / MI300X / Trainium2 will diversify frontier-pretrain hardware through 2026, and kernel ecosystems will fragment accordingly.

contested · c-2a5bd5b489
Exemplified by DCLM [DCLM2024], FineWeb-Edu [FineWeb2024], and RegMix [RegMix2024]: bulk filters plus per-decision ablations already cover 95% of the decisions a frontier lab needs; influence and causal inference are luxuries.
Source papers · 4 · [DCLM2024] [FineWeb2024] [RegMix2024] · arXiv 2406.11794 · arxiv.org
1 observation · Data Value Causality
Evidence (1)
  • topic_report · data-value-causality · arXiv 2406.11794 · report.positions[0].stance · claude-opus-4.7

    [Camp A: Quality classifier plus ablation ladder is enough] Exemplified by DCLM [DCLM2024], FineWeb-Edu [FineWeb2024], and RegMix [RegMix2024]: bulk filters plus per-decision ablations cover 95% of frontier decisions; influence and causal t

contested · c-230595d964
Exemplified by Anthropic [AnthropicInfluence2023], TRAK [TRAK2023], and Simfluence [Simfluence2023]: only per-example attribution can truly answer data value; classifiers and ablations are mere heuristics.
Source papers · 4 · [AnthropicInfluence2023] [TRAK2023] [Simfluence2023] · arXiv 2308.03296 · arxiv.org
1 observation · Data Value Causality
Evidence (1)
  • topic_report · data-value-causality · arXiv 2308.03296 · report.positions[1].stance · claude-opus-4.7

    [Camp B: Influence functions are the main path] Exemplified by Anthropic [AnthropicInfluence2023], TRAK [TRAK2023], and Simfluence [Simfluence2023]: only per-example attribution can truly answer data value; classifiers and ablations are mer

contested · c-cdd0499162
Exemplified by Causal-LL [CausalLL2024] and Skill-it [SkillIt2023]: data value is fundamentally a causal quantity; only IV / mediator methods remove confounding, and every other approach is polluted by distribution shift.
Source papers · 3 · [CausalLL2024] [SkillIt2023] · arXiv 2404.04188 · arxiv.org
1 observation · Data Value Causality
Evidence (1)
  • topic_report · data-value-causality · arXiv 2404.04188 · report.positions[2].stance · claude-opus-4.7

    [Camp C: Full causal inference is the future] Exemplified by Causal-LL [CausalLL2024] and Skill-it [SkillIt2023]: data value is fundamentally causal; only IV and mediator methods remove confounding, while other approaches are polluted by di

contested · c-0cc7a00d89
The implicit stance of some startups and open-source communities: make data decisions by senior-researcher intuition plus large-scale trial and error, skipping internal tooling to save engineering overhead.
1 observation · Data Value Causality
Evidence (1)
  • topic_report · data-value-causality · report.positions[3].stance · claude-opus-4.7

    [Camp D: Skip measurement, rely on intuition and scale] An implicit stance in some startups and open-source communities: make data calls via senior-researcher intuition plus large-scale trial-and-error, skipping internal tooling to save eng

contested · c-c3d509c165
The main answers for attention kernels already landed with FA1/2/3 [Dao2022FA1][Dao2023FA2][Shah2024FA3]; later work is just porting to new hardware, and variants and serving are derivative problems.
Source papers · 4 · [Dao2022FA1] [Dao2023FA2] [Shah2024FA3] · arXiv 2205.14135 · arxiv.org
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report · flashattention-kernels · arXiv 2205.14135 · report.positions[0].stance · claude-opus-4.7

    [Camp A: the FA series is the endpoint of attention engineeri] The main answers for attention kernels are already in FA1/2/3 [Dao2022FA1][Dao2023FA2][Shah2024FA3]; later work is just porting to new silicon, and variants / serving are deriva

contested · c-fdcdf25f0c
The real inflection is the kernel-author barrier dropping from a PhD's three months to an engineer's one week [Tillet2019Triton][Dong2024Flex]; FA3's CUDA kernels are unmaintainable for the average engineer and will be eroded by the Triton path in the long run.
Source papers · 2 · [Tillet2019Triton] [Dong2024Flex]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report · flashattention-kernels · report.positions[1].stance · claude-opus-4.7

    [Camp B: Triton / FlexAttention is the real revolution] The real inflection is the kernel-author barrier dropping from 'PhD × 3 months' to 'engineer × 1 week' [Tillet2019Triton][Dong2024Flex]; FA3's CUDA is unmaintainable for the median eng

contested · c-f4696a7b06
Continuing to optimize O(L²) attention is mis-targeted in the long-context era; RetNet [Sun2023RetNet] and Mamba [Waleffe2024Mamba] already match short-context LM loss while offering O(L) inference.
Source papers · 3 · [Sun2023RetNet] [Waleffe2024Mamba] · arXiv 2307.08621 · arxiv.org
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report · flashattention-kernels · arXiv 2307.08621 · report.positions[2].stance · claude-opus-4.7

    [Camp C: attention itself should be replaced (SSM / linear RN] Optimizing O(L²) attention in the long-context era is mis-targeted; RetNet [Sun2023RetNet] and Mamba [Waleffe2024Mamba] already match short-context LM loss with O(L) inference.

contested · c-574ffb3b22
FA3's [Shah2024FA3] core depends on Hopper-exclusive instructions (TMA, warp specialization, FP8 tensor cores) [Luo2024HopperBench], with no portable path to AMD or domestic silicon.
Source papers · 3 · [Shah2024FA3] [Luo2024HopperBench] · arXiv 2407.08608 · arxiv.org
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report · flashattention-kernels · arXiv 2407.08608 · report.positions[3].stance · claude-opus-4.7

    [Camp D: FA is the embodiment of NVIDIA lock-in] FA3's [Shah2024FA3] core depends on Hopper-exclusive instructions (TMA, warp specialization, FP8 tensor cores) [Luo2024HopperBench], with no portable path to AMD or domestic silicon.

contested · c-a25fb78820
At matched training FLOPs, MoE has a 1.5–2× quality-per-FLOP advantage; the fine-grained + shared + aux-loss-free template has resolved most of the engineering footguns, and MoE is about to become the default for all frontier pretrains [Shazeer2017Outrageous][Dai2024DeepSeekMoE][DeepSeek2024V3][Kang2025FLAMEMoE].
Source papers · 4 · [Shazeer2017Outrageous] [Dai2024DeepSeekMoE] [DeepSeek2024V3] [Kang2025FLAMEMoE]
1 observation · MOE Landscape
Evidence (1)
  • topic_report · moe-landscape · report.positions[0].stance · claude-opus-4.7

    [Camp A: MoE is the inevitable replacement for dense] At matched training FLOPs MoE offers 1.5–2× quality-per-FLOP advantage; fine-grained + shared + aux-loss-free has resolved most engineering footguns, and MoE is poised to be the default

contested · c-f64cdccd11
MoE's total-parameter footprint, cross-node all-to-all communication, and post-train fragility offset its quality-per-active-param edge in many real deployments; fully-open dense families have the edge on research reproducibility, fine-tune stability, and single-node serving [OLMo2025Olmo3][Walsh2024OLMo2].
Source papers · 2 · [OLMo2025Olmo3] [Walsh2024OLMo2]
1 observation · MOE Landscape
Evidence (1)
  • topic_report · moe-landscape · report.positions[1].stance · claude-opus-4.7

    [Camp B: dense paths yield better ROI in the end] MoE's total-param footprint, cross-node all-to-all, and post-train fragility offset its quality-per-active-param edge in many real deployments; fully-open dense families win on reproducibili

contested · c-01ba9bd6e9
Making load balance a structural constraint rather than a soft penalty is the right direction for MoE routing; expert-choice [Zhou2022ExpertChoice] opened the path and aux-loss-free bias EMA [Wang2024AuxFree][DeepSeek2024V3][Han2025AuxFreeTheory] closed it out; follow-ups include similarity-preserving routers [Omi2025SimilarityRouter], AdaMoE [Zeng2024AdaMoE], and Mo… (bias-update sketch below).
Source papers · 7 · [Zhou2022ExpertChoice] [Wang2024AuxFree] [DeepSeek2024V3] [Han2025AuxFreeTheory] [Omi2025SimilarityRouter] [Zeng2024AdaMoE] · arXiv 2202.09300 · arxiv.org
1 observation · MOE Landscape
Evidence (1)
  • topic_report · moe-landscape · arXiv 2202.09300 · report.positions[2].stance · claude-opus-4.7

    [Camp C: expert-choice / aux-loss-free is the future] Making load balance a structural constraint rather than a soft penalty is the right direction; expert-choice [Zhou2022ExpertChoice] opened it, aux-loss-free bias EMA [Wang2024AuxFree][De
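
A toy sketch of the bias-based balancing this camp endorses: a per-expert bias steers top-k selection only (gate values still come from the raw scores), and the bias is nudged from an EMA of expert load with no auxiliary loss term. The constants and the exact update rule here are illustrative assumptions, not DeepSeek-V3's implementation.

    import numpy as np

    def biased_topk_route(scores, bias, k=2):
        """scores: [tokens, experts] router affinities; bias: [experts]."""
        chosen = np.argsort(scores + bias, axis=1)[:, -k:]  # bias affects selection only
        gates = np.take_along_axis(scores, chosen, axis=1)  # gate weights use raw scores
        return chosen, gates

    def update_bias(bias, load_ema, chosen, n_experts, rate=0.01, decay=0.9):
        load = np.bincount(chosen.ravel(), minlength=n_experts) / chosen.size
        load_ema = decay * load_ema + (1 - decay) * load           # smoothed expert load
        bias = bias + rate * np.sign(1.0 / n_experts - load_ema)   # raise bias on underload
        return bias, load_ema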

contested · c-b6a3328e9b
Sparse router gradients plus expert specialization reduce SFT/RLHF stability; in practice, task-specific expert pruning [Chen2022TaskPrune][Koishekenov2022NLLBPrune] or dense-to-dynamic-k conversion [Szatkowski2023DenseToMoE] has repeatedly been shown to compress MoE back to a near-dense structure at deployment.
Source papers · 3 · [Chen2022TaskPrune] [Koishekenov2022NLLBPrune] [Szatkowski2023DenseToMoE]
1 observation · MOE Landscape
Evidence (1)
  • topic_report · moe-landscape · report.positions[3].stance · claude-opus-4.7

    [Camp D: MoE matters only for pretrain; post-train should rev] Sparse router gradients + expert specialization hurt SFT/RLHF stability; task-specific expert pruning [Chen2022TaskPrune][Koishekenov2022NLLBPrune] and dense-to-dynamic-k conver

contested · c-201429a9e4
Domain weights are searchable variables; pay the proxy / regression overhead to obtain a quantitatively optimal w*, which largely transfers to large scale.
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[0].stance · unknown

    [Camp A: Formal mixture search (DoReMi / RegMix / Data Mixing] Domain weights are searchable variables; pay the proxy or regression overhead to obtain a quantitatively optimal w*, which transfers to large scale reliably.

contested · c-1fdafb305c
The mixture is not a vector but a trajectory; under a curriculum view, hand-crafted priors plus 2–3 ablations already suffice. What matters is the stage boundaries and the annealing mix, not the precision of w*.
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[1].stance · unknown

    [Camp B: Heuristic + curriculum (Llama 3 / MiniCPM route)] The mixture is a trajectory, not a vector; expert priors plus 2–3 ablations suffice once the problem is viewed as curriculum scheduling—stage boundaries and the annealing mix matter

contested · c-1fe9547ef0
Treat each domain as a bandit arm and adjust weights online from per-step loss signals, bypassing proxy training entirely (see the sketch below).
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[2].stance · unknown

    [Camp C: Online adaptive mixing] Treat domains as bandit arms and adjust weights online from per-step loss; skip proxy training entirely.
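
A toy multiplicative-weights sketch of the bandit view: each domain's sampling weight is tilted toward whatever recently reduced its loss fastest. The reward definition (per-domain loss drop) and the step size are assumptions for illustration.

    import numpy as np

    def update_mixture(weights, prev_loss, curr_loss, lr=0.1):
        """weights, prev_loss, curr_loss: [n_domains] arrays; returns new simplex point."""
        reward = prev_loss - curr_loss          # recent loss improvement per domain
        logits = np.log(weights) + lr * reward  # multiplicative-weights / EXP3-style step
        w = np.exp(logits - logits.max())
        return w / w.sum()                      # renormalize onto the simplex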

contested · c-20e1ba3a18
Once the filters are strong enough (DCLM-Baseline, FineWeb-Edu, textbook-grade synthesis), the mixture weights become a secondary knob; the whole optimization budget should go to content quality.
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[3].stance · unknown

    [Camp D: Ratio doesn't matter, quality does] Once filters are strong enough (DCLM-Baseline, FineWeb-Edu, textbook-grade synthesis), mixture weights become a secondary knob and the optimization budget should flow to content quality.

contested · c-d013b3f6e2
Long-context capability must be backed by some learned or parameterized positional encoding; ALiBi and the RoPE family cover all the practically viable options.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report · length-scaling-pretraining · report.positions[0].stance · unknown

    [Camp A: Explicit PE is necessary; RoPE interpolation is the ] Long-context capability must be backed by some learned or parameterized PE; ALiBi and the RoPE family cover all practically viable options.

contested · c-65f78fd3e6
SP slices along L and DP slices along the batch; the two are decoupled, so push each to its limit separately.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report · length-scaling-pretraining · report.positions[1].stance · unknown

    [Camp B: SP and DP are orthogonal and can be optimized indepe] SP slices along L and DP slices along batch; decouple and push each to its limit separately.

contested · c-d52cdad417
GQA with shared KV heads is already a reasonable endpoint for KV-cache compression; compressing further hurts downstream performance.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report · length-scaling-pretraining · report.positions[2].stance · unknown

    [Camp C: KV compression = GQA/MQA is enough] GQA (shared KV heads) is a reasonable endpoint for KV-cache compression; going further hurts downstream.

contested · c-5f30eb3feb
Train on sufficiently long sequences, and falling perplexity naturally brings long-context capability.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report · length-scaling-pretraining · report.positions[3].stance · unknown

    [Camp D: Perplexity is still a valid long-context metric] Train on sufficiently long sequences and ppl improvements translate into long-context capability.

contested · c-642d4683d9
Since the low-frequency dimensions were never trained, train them once during pretraining: base=500000 plus a 6-stage curriculum means the model needs no inference-time frequency remapping and naturally wins on RULER (see the sketch below).
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report · long-context-rope-ntk · report.positions[0].stance · unknown

    [Camp A: pretrain-time ABF is the clean path] Since low-freq dims were never trained, train them once in pretraining. base=500000 plus a 6-stage curriculum needs zero inference-time frequency remapping and wins naturally on RULER.
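
A small sketch of the mechanism behind this stance: raising the RoPE base (ABF) lowers every rotary frequency, so the low-frequency dimensions get exercised at pretraining time instead of being remapped at inference. Dimensions and bases follow the claim; the rest is standard RoPE bookkeeping.

    import numpy as np

    def rope_inv_freq(head_dim, base):
        return 1.0 / (base ** (np.arange(0, head_dim, 2) / head_dim))

    short = rope_inv_freq(128, base=10_000)   # common short-context default
    long_ = rope_inv_freq(128, base=500_000)  # ABF: every frequency lowered
    print(long_ / short)  # low-freq dims rotate far more slowly -> longer wavelengths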

contested · c-7094e4510a
At ≤128K, YaRN plus 400 fine-tuning steps is already within 3 pp of ABF on RULER and needs no re-pretraining; for most existing models this is the best ROI.
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report · long-context-rope-ntk · report.positions[1].stance · unknown

    [Camp B: YaRN is the de-facto retrofit tool] At ≤128K, YaRN + 400 FT steps sits within 3 pp of ABF on RULER and needs no re-pretraining — the best ROI for most existing models.

contested · c-8512c07aa9
Smooth NTK/YaRN shows measurable per-dimension mismatch in the 1M regime; the non-uniform scale patterns found by evolutionary search are the only scheme that survives million-token RULER.
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report · long-context-rope-ntk · report.positions[2].stance · unknown

    [Camp C: ≥1M requires LongRoPE] Smooth NTK/YaRN show measurable per-dim mismatch at 1M; only the non-uniform scales found by evolutionary search survive million-token RULER.

contested · c-ec6e97312f
Positional encoding itself is the source of the bug. Use LM-Infinite-style Λ-masks for zero-shot extrapolation, or switch outright to non-attention architectures like RetNet/LongNet and sidestep RoPE extrapolation entirely.
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report · long-context-rope-ntk · report.positions[3].stance · unknown

    [Camp D: bypass the whole PI/NTK/YaRN lineage] Position encoding itself is the bug source. Use LM-Infinite-style Λ-masks for zero-shot extrapolation, or switch to non-attention architectures like RetNet/LongNet and sidestep RoPE extension e

contested · c-7d1c22d4b6
µP is the only mathematically rigorous transfer method; Complete-P and u-µP have resolved compatibility with architectural components and low precision, so all pretraining should switch to µP (see the sketch below).
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report · mup-hp-transfer · report.positions[0].stance · gemini-3.1-pro

    [Camp A: µP is the absolute default] µP is the only mathematically rigorous transfer method; Complete-P and u-µP have resolved compatibility with architectural components and low precision, so all pretraining should switch to µP.
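
A minimal sketch of the µP learning-rate rule this camp builds on: matrix-like ("hidden") weights get an LR scaled by base_width/width, so the value tuned on a small proxy transfers to the large model. The dim>=2 grouping is a simplification, and real µP also rescales initialization and output multipliers.

    def mup_lr_groups(named_params, base_lr, base_width, width):
        matrix, vector = [], []
        for _, p in named_params:
            (matrix if p.dim() >= 2 else vector).append(p)
        return [
            {"params": matrix, "lr": base_lr * base_width / width},  # ~1/width scaling
            {"params": vector, "lr": base_lr},  # biases/norms keep the base LR
        ]

    # e.g. torch.optim.AdamW(mup_lr_groups(model.named_parameters(),
    #                                      3e-3, base_width=256, width=4096))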

contested · c-b807a6f58d
No need to modify the underlying parameterization; just use the empirical formulas from Cerebras or DeepSeek, plus a few proxy runs, to accurately predict the target-scale LR and batch size.
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report · mup-hp-transfer · report.positions[1].stance · gemini-3.1-pro

    [Camp B: Empirical formulas + a few sweeps suffice] No need to modify low-level parameterization; simply use empirical formulas from Cerebras or DeepSeek, combined with a few proxy runs, to accurately predict target-scale LR and Batch Size.

contested · c-bec6705c6f
Analytical solutions always have blind spots; use cost-aware Bayesian methods like CARBS to search all hyperparameters directly on the Pareto frontier.
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report · mup-hp-transfer · report.positions[2].stance · gemini-3.1-pro

    [Camp C: End-to-end Bayesian search is the endgame] Analytical solutions always have blind spots; we should use cost-aware Bayesian methods like CARBS to search all HPs directly on the Pareto frontier.

contested · c-ad2e434ab6
The production default for ≥70B dense is AdamW [Loshchilov2017AdamW]; every new optimizer's marginal gain collapses by half once the HP-search budget is controlled; AdamW's real moat is the muP LR-transfer ecosystem [Lingle2024muP][Yang2020TensorPrograms3], not Adam itself.
Source papers · 4 · [Loshchilov2017AdamW] [Lingle2024muP] [Yang2020TensorPrograms3] · arXiv 1711.05101 · arxiv.org
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report · optimizer-landscape · arXiv 1711.05101 · report.positions[0].stance · claude-opus-4.7

    [Camp A: AdamW is never retired] The ≥70B dense production default is AdamW [Loshchilov2017AdamW]; every new optimizer's marginal gain halves under matched HP-search budget; AdamW's real moat is the muP LR-transfer ecosystem [Lingle2024muP]

contested · c-7250d7e7d3
Muon [Jordan2024Muon] uses Newton-Schulz orthogonalization to reduce Shampoo's preconditioner to its cheapest usable form, cutting the NanoGPT speedrun from 5 min to 3.3 min (−34%); at ≤30B, hidden 2D weights should default to a Muon + AdamW mix (embedding/norm/head stay on AdamW; see the sketch below). The Lion-Lyapunov analysis [Chen2023LionLyapunov] and Ra…
Source papers · 2 · [Jordan2024Muon] [Chen2023LionLyapunov]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report · optimizer-landscape · report.positions[1].stance · claude-opus-4.7

    [Camp B: Muon is the next default] Muon [Jordan2024Muon] reduces Shampoo's preconditioner to its cheapest usable form via Newton-Schulz orthogonalization, cutting NanoGPT speedrun from 5 min to 3.3 min (−34%); ≤30B hidden 2D weights should
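
A toy version of the orthogonalization step at Muon's core: Newton-Schulz iterations push the gradient matrix's singular values toward 1, yielding an approximately orthogonal update direction. This uses the classic cubic iteration with a crude Frobenius normalization; Muon's actual kernel uses a tuned quintic polynomial, so treat this purely as illustration.

    import numpy as np

    def newton_schulz_orth(G, steps=10):
        X = G / (np.linalg.norm(G) + 1e-7)   # crude spectral normalization
        for _ in range(steps):
            X = 1.5 * X - 0.5 * X @ X.T @ X  # drives singular values toward 1
        return X

    O = newton_schulz_orth(np.random.randn(64, 32))
    print(np.linalg.svd(O, compute_uv=False)[:4])  # singular values near 1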

contested · c-9d392e1054
Shampoo's [Gupta2018Shampoo] Kronecker preconditioner approximates Gauss-Newton [Morwani2024Shampoo]; SOAP [Vyas2024SOAP] removes its grafting pathology, matches AdamW wall-clock, and cuts the extra HPs from 4 to 1. Lu et al. [Lu2025SOAPWhitening] interpret it as gradient whitening, theoretically unifying it with Adam / Shampoo.
Source papers · 5 · [Gupta2018Shampoo] [Morwani2024Shampoo] [Vyas2024SOAP] [Lu2025SOAPWhitening] · arXiv 1802.09568 · arxiv.org
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report · optimizer-landscape · arXiv 1802.09568 · report.positions[2].stance · claude-opus-4.7

    [Camp C: Shampoo / SOAP is the proper endgame] Shampoo [Gupta2018Shampoo]'s Kronecker preconditioner approximates Gauss-Newton [Morwani2024Shampoo]; SOAP [Vyas2024SOAP] eliminates its grafting pathology, matches AdamW wall-clock, and cuts e

contested · c-cc783fad4b
Agarwal et al. [Agarwal2020LRConfound] and Dahl et al. [Dahl2023AlgoPerf] show that once the HP-search budget and LR schedule are controlled, the adaptive-vs-SGD gap and most cross-optimizer gaps collapse by more than half; any "X beats Y" report that does not control the HP budget is noise.
Source papers · 3 · [Agarwal2020LRConfound] [Dahl2023AlgoPerf] · arXiv 2002.11803 · arxiv.org
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report · optimizer-landscape · arXiv 2002.11803 · report.positions[3].stance · claude-opus-4.7

    [Camp D: optimizers don't matter, data does] Agarwal et al. [Agarwal2020LRConfound] and Dahl et al. [Dahl2023AlgoPerf] show: once HP-search budget and LR schedules are controlled, adaptive-vs-SGD and most cross-optimizer gaps collapse by mo

contested · c-a217dd02de
Short-window main pretrain plus a long-context mid-train at the tail, packing under a per-doc causal mask; representatives LLaMA-3 [Llama32024] and Qwen2.5 [Qwen25Tech].
Source papers · 1 · [Llama32024]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report · packing-masking-length · report.positions[0].stance · claude-opus-4.7

    [Camp A: short-to-long + per-doc mask (mainstream)] Short-window main pretrain with a long-context mid-train tail, packing under per-doc causal mask; representatives LLaMA-3 [Llama32024], Qwen2.5 [Qwen25Tech].

contested · c-d83fba0aaa
Mix short and long sequences by distribution across the whole pretrain, avoiding the distribution drift a tail mid-train produces.
1 observation · Packing Masking Length
Evidence (1)
  • topic_report · packing-masking-length · report.positions[1].stance · claude-opus-4.7

    [Camp B: uniformly mixed-length training] Interleave short and long seqs across the whole pretrain to avoid the distribution drift of a tail mid-train.

contested · c-9c0218b660
Keep GPT-2 [GPT2] / early-LLaMA [LLaMA2023] style EOS-only packing with no per-doc mask, on grounds of engineering simplicity (see the mask sketch below).
Source papers · 2 · [LLaMA2023] · arXiv 2302.13971 · arxiv.org
1 observation · Packing Masking Length
Evidence (1)
  • topic_report · packing-masking-length · arXiv 2302.13971 · report.positions[2].stance · claude-opus-4.7

    [Camp C: naive concat + cross-doc visible] Keep GPT-2 [GPT2] / early LLaMA [LLaMA2023] style EOS-only packing without per-doc mask, citing engineering simplicity.
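
A small sketch contrasting the two packing regimes in this debate: EOS-only packing keeps one big causal mask over the packed sequence (tokens can attend across document boundaries), while the Camp A per-doc mask intersects it with a same-document constraint. Document lengths are illustrative.

    import numpy as np

    def packed_masks(doc_lens):
        L = sum(doc_lens)
        causal = np.tril(np.ones((L, L), dtype=bool))  # EOS-only: one causal mask
        doc_id = np.repeat(np.arange(len(doc_lens)), doc_lens)
        per_doc = causal & (doc_id[:, None] == doc_id[None, :])  # block-diagonal causal
        return causal, per_doc

    causal, per_doc = packed_masks([3, 2, 4])
    print(int(causal.sum()), int(per_doc.sum()))  # per-doc drops cross-doc attention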

contested · c-7d3e946a7f
Generalize the "free lunch" conclusion of OpenAI's FIM [FIM2022] directly to NL pretraining, enabling FIM on 50% of all docs.
Source papers · 2 · [FIM2022] · arXiv 2207.14255 · arxiv.org
1 observation · Packing Masking Length
Evidence (1)
contested · c-dff018ef68
This camp argues that with the training protocol properly controlled, loss/PPL power laws are the most stable, the cheapest, and the best suited to extrapolating large-model training. Downstream tasks are noisy and their eval protocols unstable, so they should not replace PPL as the core decision quantity [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Porian2024ResolvingScaling].
Source papers · 3 · [Kaplan2020ScalingLaws] [Hoffmann2022Chinchilla] [Porian2024ResolvingScaling]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report · perplexity-downstream-performance · report.positions[0].stance · gpt-5.4

    [Camp A: PPL remains the most reliable primary variable] This camp argues that when training protocol is controlled, loss/PPL power laws are the most stable, cheapest, and best suited for extrapolating large runs. Downstream tasks are noisy

contested · c-ead068fe9a
This camp accepts that PPL carries signal but argues the quantity worth fitting is the task score itself. PPL can serve as an intermediate variable, a domain-match signal, or a prior, but final decisions should come from per-task scaling laws [Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders][Isik2024DownstreamScaling][LeiLiu2025PerplexityAwareCPT].
Source papers · 3 · [Bhagia2024TaskScalingLadders] [Isik2024DownstreamScaling] [LeiLiu2025PerplexityAwareCPT]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report · perplexity-downstream-performance · report.positions[1].stance · gpt-5.4

    [Camp B: PPL is only an intermediate state inside task scalin] This camp accepts that PPL contains signal, but argues that the quantity worth fitting is task performance itself. PPL may serve as an intermediate variable, domain-match signal

contested · c-b562dd20ea
This camp argues the problem is not picking the wrong scalar but that many decisions are simply not compressible into one number. Track training health, task gains, behavioral drift, and knowledge capacity jointly, and standardize the eval protocols [Biderman2023Pythia][Gu2024OLMES][AllenZhuLi2024KnowledgeCapacity][Magnusson2025DataDecide].
Source papers · 4 · [Biderman2023Pythia] [Gu2024OLMES] [AllenZhuLi2024KnowledgeCapacity] [Magnusson2025DataDecide]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report · perplexity-downstream-performance · report.positions[2].stance · gpt-5.4

    [Camp C: Stop searching for one scalar; use multi-panel diagn] This camp argues that the issue is not choosing the wrong scalar, but that many decisions are not compressible into one number in the first place. One should jointly track train

contested · c-21deacd299
This camp argues there is no stable one-to-one mapping between PPL and downstream capability because the two measure different objects: average token coding efficiency versus behavioral success rates, threshold events, or post-alignment preference structure [Schaeffer2023Mirage][Schaeffer2024WhyElusive][McKenzie2023InverseScaling][Rafailov2024RewardOveroptimization].
Source papers · 3 · [Schaeffer2023Mirage] [Schaeffer2024WhyElusive] [McKenzie2023InverseScaling]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report · perplexity-downstream-performance · report.positions[3].stance · gpt-5.4

    [Camp D: The problem with PPL is ontological, not merely pred] This camp argues that there is no stable one-to-one mapping between PPL and downstream capability because they measure different objects: average token coding efficiency versus

contested · c-841441e807
Treat repetition as pure noise: apply maximal exact/near dedup to web corpora, preferring to over-delete rather than leave duplicates; the rationale is that repetition drives memorization, contamination, and wasted compute, and dedup often directly improves PPL and downstream performance [Lee2021Dedup][Tirumala2023D4].
Source papers · 2 · [Lee2021Dedup] [Tirumala2023D4]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report · pretrain-data-repetition · report.positions[0].stance · gpt-5.2

    [Camp A: Dedup as aggressively as possible] Treat repetition as pure noise: apply maximal exact/near dedup on web corpora, preferring false positives over leaving duplicates; the rationale is that repetition drives memorization, contaminati

contested · c-69c037458d
Under data constraints, treat epochs as the main lever: run 2–4 epochs over a cleaned high-quality pool, with returns close to adding the same amount of fresh tokens; only beyond that consider expanding the data or switching strategy [Muennighoff2023DataConstrained][Kaplan2020ScalingLaws].
Source papers · 2 · [Muennighoff2023DataConstrained] [Kaplan2020ScalingLaws]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report · pretrain-data-repetition · report.positions[1].stance · gpt-5.2

    [Camp B: Uniform repetition ≤4 epochs is (almost) free] Treat epochs as the main lever under data constraints: run 2–4 epochs on a cleaned high-quality pool, with returns close to adding the same amount of fresh tokens; only beyond that con

contested · c-8720d900a9
Exact dedup only removes the conspicuous duplicates; what really eats the budget is semantic near-duplication and within-topic redundancy. So prioritize embedding-level semantic dedup and diversified sampling, converting token budget into coverage [Abbas2023SemDeDup][Tirumala2023D4].
Source papers · 2 · [Abbas2023SemDeDup] [Tirumala2023D4]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report · pretrain-data-repetition · report.positions[2].stance · gpt-5.2

    [Camp C: Semantic dedup is the real battleground] Exact dedup only removes obvious duplicates; the real budget sink is semantic near-duplication and within-topic redundancy. Prioritize embedding-based semantic dedup and diversification to t

contested · c-ba96f9a3cd
Enforce zero-repeat (or a strict 0–1 exposure) for benchmarks, copyrighted text, PII, and high-risk identifiable sources, and implement governance as data-layer configuration (opt-out, provenance tracking, auditability) [Deng2023BenchmarkContamination][Carlini2022Memorization][Li2023StarCoder].
Source papers · 2 · [Carlini2022Memorization] [Li2023StarCoder]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report · pretrain-data-repetition · report.positions[3].stance · gpt-5.2

    [Camp D: Zero repetition for sensitive/eval/copyright data] Enforce zero-repeat (or strict 0–1 exposure) for benchmarks, copyrighted text, PII, and high-risk identifiable sources, and implement governance as data-layer configuration (opt-ou

contested · c-861b5bafc8
At fixed compute, put the budget mainly into parameters N; the token count only needs to be "sufficient". GPT-3 (175B / 300B tokens) and Gopher-280B represent this path.
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report · scaling-laws-llm · report.positions[0].stance · claude-opus-4.7

    [Camp A: Kaplan — bigger model, fewer tokens] At fixed compute, allocate budget primarily to parameters N; tokens need only be 'sufficient'. GPT-3 (175B / 300B tokens) and Gopher-280B embody this line.

contested · c-0f12d82e0e
A compute-optimal ratio of tokens/param ≈ 20 is a universal slope; frontier models trained at it are the most efficient. The LLaMA family and LLaMA 2 represent the open-source wing (worked example below).
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report · scaling-laws-llm · report.positions[1].stance · claude-opus-4.7

    [Camp B: Chinchilla — balance N and D under compute] A compute-optimal ratio of tokens/param ≈ 20 is a universal slope; models trained under it sit on the frontier. The LLaMA family and LLaMA 2 represent the open-source wing.
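
A worked example of the rule of thumb: with the standard C ≈ 6·N·D FLOPs approximation and the Chinchilla allocation D = 20·N, a compute budget C pins down both N and D. The budget below is illustrative.

    def chinchilla_alloc(C):
        N = (C / 120) ** 0.5  # solve C = 6 * N * (20 * N) = 120 * N^2
        return N, 20 * N

    N, D = chinchilla_alloc(1e24)
    print(f"params ~ {N:.2e}, tokens ~ {D:.2e}")  # ~9.1e10 params, ~1.8e12 tokens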

contested · c-8a6b54a19e
At matched (N, D) budgets, data filtering and mixture differences can swamp the gaps that scaling laws predict. DCLM, phi-1, RefinedWeb, and DsDm all point to the same judgment: data quality is an independent variable above scaling.
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report · scaling-laws-llm · report.positions[2].stance · claude-opus-4.7

    [Camp C: Data-mixture pragmatists — data is the first axis] At matched (N, D) budgets, data filtering and mixture differences can swamp what scaling laws predict. DCLM, phi-1, RefinedWeb, and DsDm all point the same way: data quality is an

contested · c-b6d5738eb8
Most of the 'emergent abilities' reported by Wei et al. come from metric nonlinearity (exact match / 0-1 accuracy); under continuous metrics, capability rises smoothly with scale and there is no internal phase transition (worked example below).
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report · scaling-laws-llm · report.positions[3].stance · claude-opus-4.7

    [Camp D: Against emergence-as-magic] Most 'emergent abilities' reported by Wei et al. come from metric nonlinearities (exact match, 0-1 accuracy). Under continuous metrics, capability rises smoothly with scale — no internal phase transition
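
A worked example of the metric-nonlinearity argument: a per-token accuracy that improves smoothly with scale looks like a sudden "emergent" jump once it is scored as exact match over a multi-token answer (p^L). All numbers are synthetic.

    import numpy as np

    scale = np.array([1e8, 1e9, 1e10, 1e11, 1e12])
    p = 0.5 + 0.09 * np.log10(scale / 1e8)  # smooth per-token accuracy: 0.50 -> 0.86
    exact_match = p ** 8                    # 0-1 scoring of an 8-token answer
    for s, a, e in zip(scale, p, exact_match):
        print(f"{s:.0e}  per-token={a:.2f}  exact-match={e:.4f}")
    # exact match sits near 0, then appears to 'take off', with no phase transition in p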

contested · c-31aa118543
Believes that by scaling the hidden-state dimension ($d_{\text{state}}$) and improving selective gating, pure SSMs can overcome the information-compression bottleneck and achieve end-to-end $O(1)$ inference.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report · ssm-mamba-rwkv · report.positions[0].stance · gemini-3.1-pro

    [Camp A: Pure SSMs will eventually replace Transformers entir] Believes that by scaling hidden state dimensions ($d_{\text{state}}$) and improving selective gating mechanisms, pure SSMs can overcome compression bottlenecks and achieve end-t

contested · c-3a450c2a22
Argues that linear-attention variants like RWKV and RetNet, along with Mamba, all reduce mathematically to linear RNNs with different decay strategies; the differences are purely engineering.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report · ssm-mamba-rwkv · report.positions[1].stance · gemini-3.1-pro

    [Camp B: Linear Attention and SSMs are fundamentally the same] Argues that linear attention variants like RWKV and RetNet, along with Mamba, mathematically reduce to linear RNNs with different decay strategies; the differences are purely en

contested · c-efd7cf7a49
Argues that SSM hidden-state dynamics differ fundamentally from Transformer attention distributions, so a model must be pretrained from scratch to learn the right state-compression heuristics.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report · ssm-mamba-rwkv · report.positions[2].stance · gemini-3.1-pro

    [Camp C: Subquadratic models must be pretrained from scratch] Believes that the hidden state dynamics of SSMs fundamentally differ from Transformer attention distributions, requiring pretraining from scratch for the model to learn correct s

contested · c-3c1d1582fe
This reading treats function-level execution benchmarks as sufficient proxies for coding ability: if HumanEval- or MBPP-style scores rise, programming ability is assumed to rise with them [Chen2021Codex] [Austin2021MBPP].
Source papers · 3 · [Chen2021Codex] [Austin2021MBPP] · arXiv 2107.03374 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report · swe-agent-evaluation · arXiv 2107.03374 · report.positions[0].stance · gpt-5.4

    [Camp A: HumanEval is enough] This view treats function-level execution benchmarks as sufficient representatives of coding ability; if HumanEval or MBPP rises, programming ability is assumed to rise with it [Chen2021Codex] [Austin2021MBPP].

contested · c-33451ad4c4
This reading stresses that fixing real GitHub issues is closest to software-engineering work, so other code benchmarks are only weak proxies; final rankings should follow Verified scores [SWEbench2023] [SWEbenchVerified2024].
Source papers · 2 · [SWEbench2023] [SWEbenchVerified2024]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report · swe-agent-evaluation · report.positions[1].stance · gpt-5.4

    [Camp B: SWE-bench Verified is the only truth] This view argues that real GitHub issue resolution is closest to software engineering, so other code benchmarks are weak proxies; final ranking should therefore follow Verified scores [SWEbench

contested · c-23dce50372
This reading tends to assume that if a model's likelihood on patches, edit trajectories, or the code distribution is good enough, downstream SWE will follow for free, so pretrain metrics can substitute for complex benchmarks.
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report · swe-agent-evaluation · report.positions[2].stance · gpt-5.4

    [Camp C: trajectory-level PPL is the most real pretrain-stage] This view tends to assume that if a model models patch, edit trajectories, or code distribution well enough in likelihood terms, downstream SWE will come along, so pretrain metr

contested · c-c980753890
This reading argues that what users ultimately feel is interaction cost, retry gains, test-run stability, and trajectory quality, so pass@1 is only a small part of the picture [SelfDebug2023] [ReCode2022] [ClaudeCode2025].
Source papers · 4 · [SelfDebug2023] [ReCode2022] [ClaudeCode2025] · arXiv 2304.05128 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report · swe-agent-evaluation · arXiv 2304.05128 · report.positions[3].stance · gpt-5.4

    [Camp D: agent UX metrics matter more than pass@1] This view argues that what users ultimately feel is interaction cost, retry gains, test stability, and trajectory quality, so pass@1 is only a small part of the picture [SelfDebug2023] [ReC

contested · c-958f96c37a
The core claim: high-quality synthetic / textbook data has a far higher token density than rough web text, and especially for small models and code, quality can partially substitute for scale. [Textbooks2023][Phi3Report]
Source papers · 1 · [Textbooks2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report · synthetic-data-midtrain · report.positions[0].stance · gpt-5.4

    [Camp A: synthetic data-first (the Phi line)] The core claim is that high-quality synthetic and textbook-style data has much higher token density than noisy web text, so quality can partially substitute for scale, especially for small model

contested · c-4a602b70da
The core claim: real web data still handles broad coverage, while synthetic data does targeted shaping in the later stages. That preserves the long-tail distribution while spending the remaining compute on high-value regions such as code / math / reasoning. [Llama3Herd][DeepSeekMath2024][DataCompLM2024]
Source papers · 2 · [DeepSeekMath2024] [DataCompLM2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report · synthetic-data-midtrain · report.positions[1].stance · gpt-5.4

    [Camp B: web-heavy backbone + synthetic mid-train (the LLaMA ] The core claim is that real web data still provides broad coverage, while synthetic data performs targeted shaping later in training. This preserves long-tail coverage while spe

contested · c-5f2bcc9a07
This line argues the real-world distribution is already complex enough, and any synthetic data injects teacher bias and mode contraction; rather than generating, do stronger filtering, larger crawls, and longer training. [DataCompLM2024][MAD2023]
Source papers · 2 · [DataCompLM2024] [MAD2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report · synthetic-data-midtrain · report.positions[2].stance · gpt-5.4

    [Camp C: pure web-scale, avoid synthesis as much as possible] This camp argues that the real-world distribution is already complex enough, and any synthetic data injects teacher bias and mode contraction; stronger filtering, larger crawling

contested · c-fea6da20b0
This line assumes that once the teacher is strong enough, synthetic tokens can scale almost without bound, and real data will eventually be needed only as a tiny seed.
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report · synthetic-data-midtrain · report.positions[3].stance · gpt-5.4

    [Camp D: unlimited synthetic scaling] This camp assumes that once the teacher is strong enough, synthetic tokens can scale almost without bound and real data will eventually be needed only as a tiny seed.

contested · c-2ff2192b35
Through multi-agent frameworks, task graphs, and repeated sampling, inference-time compute can compensate for base-model deficits and resolve complex SWE issues.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report · swe-agent-pretraining · report.positions[0].stance · gemini-3.1-pro

    [Camp A: Scaffolding and Test-Time Compute are Everything] Through multi-agent frameworks, task graphs, and repeated sampling, inference-time compute can compensate for base model deficits to resolve complex SWE issues.

contested · c-fdad135134
Large-scale RL and verifier training in executable environments is the key to giving models real-world software-engineering reasoning.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report · swe-agent-pretraining · report.positions[1].stance · gemini-3.1-pro

    [Camp B: RL and Verifiers are the True Drivers] Large-scale RL and verifier training in executable environments are the keys to endowing models with real-world software engineering reasoning.

contested · c-a388cf6cd1
As long as larger, more permissively licensed code corpora (like The Stack) are used, coding and engineering capability will emerge naturally.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report · swe-agent-pretraining · report.positions[2].stance · gemini-3.1-pro

    [Camp C: Just Mix More Code] As long as larger, permissively licensed code corpora (like The Stack) are used, the model's coding and engineering capabilities will naturally emerge.

contested · c-fd6b7ab00c
Holds that Mamba-2's SSD theory has solved the hardware-efficiency problem, and that pure SSMs' O(1) inference memory is enough to replace Transformers.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report · ssm-hybrid-architectures · report.positions[0].stance · gemini-3.1-pro

    [Camp A: Pure SSM is the ultimate solution for long context] Argues Mamba-2's SSD theory solves hardware efficiency, and pure SSMs' O(1) inference memory is sufficient to replace Transformers.

contested · c-0df8e8851e
Advocates mixing Attention and SSM at ratios from 1:3 to 1:7, using a small amount of Attention to solve recall and SSMs to keep throughput high.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report · ssm-hybrid-architectures · report.positions[1].stance · gemini-3.1-pro

    [Camp B: Hybrid is the pragmatic production path] Advocates mixing Attention and SSM at 1:3 to 1:7 ratios, using sparse Attention for recall and SSMs for high throughput.

contested · c-c48e043c7e
Argues that extending RoPE with techniques like YaRN or StreamingLLM makes Transformers sufficient for long text, with no need to introduce SSMs.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report · ssm-hybrid-architectures · report.positions[2].stance · gemini-3.1-pro

    [Camp C: Transformer + long-context extensions suffice] Argues that extending RoPE via YaRN or StreamingLLM makes Transformers sufficient for long contexts without needing SSMs.

contested · c-c13fcc519e
Sticks to a "Transformer skeleton + RNN forward pass", achieving linear complexity via the WKV operator and matrix-valued states.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report · ssm-hybrid-architectures · report.positions[3].stance · gemini-3.1-pro

    [Camp D: RWKV is the correct RNN revival path] Sticks to 'Transformer skeleton + RNN forward', achieving linear complexity via WKV operators and matrix-valued states.

contested · c-11a0777021
Default to mainstream BPE/WordPiece configurations and put the optimization budget into data mixture, training recipe, and post-training; the tokenizer gets only minimal script coverage and special-token handling.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report · tokenizer-scaling · report.positions[0].stance · gpt-5.2

    [Camp A: tokenizers are frozen preprocessing; coverage is eno] Reuse mainstream BPE/WordPiece defaults and spend optimization budget on data mixture, training recipes, and post-training; only ensure script coverage and a small set of specia

contested · c-5fa07c177d
Treat a larger vocab as near-free compression: shorter sequences and cheaper attention, with monotonically growing gains; the extra embedding/softmax parameters are comparatively negligible.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report · tokenizer-scaling · report.positions[1].stance · gpt-5.2

    [Camp B: bigger vocab is always better; push to 256K+] Treat larger vocab as near-free compression: shorter sequences reduce attention cost and gains should be monotonic; extra embedding/softmax parameters are assumed negligible.

contested · c-87a1db4fa0
Model bytes/chars/patches directly, eliminating tokenization bias and OOV for better robustness and multilingual fairness; hand the system complexity to the model architecture.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report · tokenizer-scaling · report.positions[2].stance · gpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE] Model bytes/chars/patches directly to eliminate tokenization bias and OOV, gaining robustness and multilingual fairness; let architecture absorb the complexity.

contested · c-a555cd79f3
Version the tokenizer and put it under regression tests: 64K–128K vocab, single-digit numerals, standalone whitespace/newline tokens; late in training, scan for under-trained tokens with Magikarp and run a short continued pretrain; always report BPB across tokenizers (see the sketch below).
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report · tokenizer-scaling · report.positions[3].stance · gpt-5.2

    [Camp D: tokenizer is a pretrain product spec—make BPE right ] Version tokenizers and regression-test them: 64K–128K vocab, single-digit numerals, standalone whitespace/newline tokens; run Magikarp and short continued pretraining for under-
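
A minimal sketch of the BPB reporting rule in this stance: bits-per-byte normalizes total loss by UTF-8 byte count rather than token count, so models with different tokenizers become comparable. Inputs are illustrative.

    import math

    def bits_per_byte(sum_nll_nats, text):
        """sum_nll_nats: total token-level NLL over `text`, in nats."""
        n_bytes = len(text.encode("utf-8"))
        return sum_nll_nats / (math.log(2) * n_bytes)

    # Two tokenizers give different per-token losses on the same text; BPB stays comparable.
    print(bits_per_byte(120.0, "hello world " * 20))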

contested · c-9049bf1c3e
This reading argues that loss and capability are determined mainly by parameters, data, compute, and training recipe, and that architectural details are mostly second-order. Kaplan et al. [Kaplan2020ScalingLaws], Touvron et al. [LLaMA2023], Grattafiori et al. [Llama3Report], and Fu et al. [Fu2024DataEngineering] all support this line.
Source papers · 3 · [Kaplan2020ScalingLaws] [LLaMA2023] [Fu2024DataEngineering]
2 observations · Transformer Arch Improvements
Evidence (2)
  • topic_report · transformer-arch-improvements · report.positions[0].stance · gpt-5.4

    [Camp A: the architecture is mostly done; just keep scaling f] This reading argues that loss and capability are driven mainly by parameters, data, compute, and training recipe, while architectural details are mostly second-order. Kaplan et

  • topic_report · transformer-arch-improvements · report.positions[0].stance · gpt-5.2

    [Camp A: architecture details are mostly constants; keep clea] Loss and capability are driven mainly by parameters, data, compute, and training recipe; most architecture tweaks yield constant-factor gains while adding migration and maintena

contested · c-58de3c0b3b
This camp holds that the Transformer's state cost and long-context scaling approach are near their ceiling, and the backbone should move to retention or state-space families. Sun et al. [RetNet2023] is the representative argument; Patro and Agneeswaran [SiMBA2024] reflect the broader non-Transformer momentum.
Source papers · 2 · [RetNet2023] [SiMBA2024]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_report · transformer-arch-improvements · report.positions[1].stance · gpt-5.4

    [Camp B: the next backbone should move to SSM / RetNet / Mamb] This camp argues that Transformer state cost and long-context scaling are near their limit, and that retention or state-space backbones should replace them. Sun et al. [RetNet20

contested · c-36eb13128e
This camp holds that an existing pretrained base is an asset: deepening or inserting blocks is often cheaper than retraining. Wang et al. [Wang2023Grow], Yao et al. [Yao2023MSG], Kim et al. [Kim2023Solar], and Wu et al. [LLaMAPro2024] all support this.
Source papers · 5 · [Wang2023Grow] [Yao2023MSG] [Kim2023Solar] [LLaMAPro2024] · arXiv 2303.00980 · arxiv.org
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_report · transformer-arch-improvements · arXiv 2303.00980 · report.positions[2].stance · gpt-5.4

    [Camp C: the second scaling path should be default—grow first] This camp argues that an existing pretrained base is an asset, and that deepening or block insertion is often cheaper than retraining. Wang et al. [Wang2023Grow], Yao et al. [Ya

contested · c-92356a268e
This camp would say stability comes mainly from learning rate, initialization, clipping, and data cleaning; QK-Norm or sandwich norm is just recipe noise that doesn't necessarily deserve a place in the default backbone.
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_report · transformer-arch-improvements · report.positions[3].stance · gpt-5.4

    [Camp D: QK-Norm / sandwich norm are optional details] This camp treats stability as mainly a matter of learning rate, initialization, clipping, and data cleaning, with QK-Norm or sandwich norm viewed as recipe noise rather than default-bac

contested · c-aaf1462b6b
This camp treats domain weights as first-order optimization variables to be searched systematically, like learning rate or batch size, rather than set by recipe intuition. Representative methods include DoReMi, RegMix, and Data Mixing Laws [DoReMi2023][RegMix2024][DataMixLaws2024].
Source papers · 3 · [DoReMi2023] [RegMix2024] [DataMixLaws2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[0].stance · gpt-5.4

    [Camp A: Formal search first] This camp treats domain weights as first-order optimization variables that should be searched systematically, much like learning rate or batch size, rather than chosen by recipe intuition. Representative method

contested · c-c89ef0e217
This camp argues that public frontier recipes already give a strong enough prior: web-heavy early, capability-heavy late, with 3–5× upsampling of code/math; add 2–3 rounds of ablation and that is practical enough [MetaLlama32024][MiniCPM2024].
Source papers · 2 · [MetaLlama32024] [MiniCPM2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[1].stance · gpt-5.4

    [Camp B: Heuristics plus curriculum are more robust] This camp argues that public frontier recipes already provide a strong enough prior: web-heavy early, capability-heavy late, with 3–5× upweighting of code and math, plus 2–3 rounds of abl

contested · c-437d0d4bd1
This camp argues that domain losses shift during training anyway, so searching offline for one fixed w* and applying it throughout is unnatural; it is more reasonable to adapt while training and let the weights follow the learning stage [ODM2024].
Source papers · 1 · [ODM2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[2].stance · gpt-5.4

    [Camp C: Online adaptation beats one-shot offline search] This camp argues that domain losses evolve during training, so searching for one fixed offline w* and applying it throughout is unnatural; a better approach is to adapt weights onlin

contested · c-8bf7c13092
This camp emphasizes that strong filters often yield bigger gains than fine-grained ratio tuning; especially when the raw web is noisy, fixing quality first is a better deal than tuning ratios first [DataCompLM2024][FineWebEdu2024].
Source papers · 2 · [DataCompLM2024] [FineWebEdu2024]
2 observations · Data Mixture
Evidence (2)
  • topic_report · data-mixture · report.positions[3].stance · gpt-5.4

    [Camp D: Ratio is secondary; quality is first-order] This camp emphasizes that strong filtering often yields larger gains than fine-grained ratio tuning; especially when raw web data is noisy, fixing quality is more cost-effective than tuni

  • topic_report · data-mixture · report.positions[3].stance · gpt-5.2

    [Camp D: Ratios are second-order; quality is first-order] When the pool is dominated by low-quality web, quality filtering/selection often yields larger gains than fine-grained mixture tuning; prioritize filters, dedup, and quality tiers be

contested · c-0d98bf886f
This line argues that as long as synthetic tokens are clean and textbook-like enough, small and even general models can learn denser knowledge from fewer tokens; real web data mainly supplies a few anchors and supplementary coverage. [Phi1Textbooks][Phi3Report][Cosmopedia2024]
Source papers · 1 · [Cosmopedia2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report · synthetic-data-midtrain · report.positions[0].stance · gpt-5.4

    [Camp A: synthetic-first can be the main route] This camp argues that if synthetic tokens are clean and textbook-like enough, small models and even general models can learn denser knowledge from fewer tokens; real web data mainly serves as

contested · c-8739205b37
This line uses large-scale real web / code data as the backbone, then uses mid-train to pull the distribution back toward specialties. Synthetic data's role is densifying and redirecting, not replacing the backbone corpus. [Llama3Herd][DeepSeekMath2024][DataCompLM2024]
Source papers · 2 · [DeepSeekMath2024] [DataCompLM2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report · synthetic-data-midtrain · report.positions[1].stance · gpt-5.4

    [Camp B: a web-heavy backbone plus synthetic mid-train is the] This camp uses large-scale real web/code data for the backbone, then applies mid-train to pull the model toward specialty distributions. Synthetic data acts as a densifier and r

contested · c-37961336ce
This line argues that synthetic data's main gains actually come from a "cleaner data distribution", which can be achieved via ranking, proxies, and large-scale filtering, without taking on the risks of teacher style and recursive degradation. [DataCompLM2024][Chinchilla]
Source papers · 1 · [DataCompLM2024]
2 observations · Synthetic Data Midtrain
Evidence (2)
  • topic_report · synthetic-data-midtrain · report.positions[2].stance · gpt-5.4

    [Camp C: avoid synthetic as much as possible; stronger filter] This camp argues that most synthetic gains really come from “cleaner data distributions,” which can be achieved through ranking, proxies, and large-scale filtering without takin

  • topic_report · synthetic-data-midtrain · report.positions[2].stance · gpt-5.4

    [Camp C: avoid synthetic as much as possible; stronger filter] This camp argues that many gains attributed to synthetic data actually come from cleaner, less redundant data distributions, and these gains can be obtained from real web pools

contested · c-93ffd79e51
This line would stress that verifiers, strong teachers, and better sampling are already enough to hold degradation down, so the synthetic ratio can keep rising; the code / math playbook also carries over to reasoning and agentic trajectories. [DeepSeekMath2024]
Source papers · 1 · [DeepSeekMath2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report · synthetic-data-midtrain · report.positions[3].stance · gpt-5.4

    [Camp D: synthetic can scale almost without bound; collapse i] This camp emphasizes that verifiers, strong teachers, and better sampling are now enough to suppress degradation, so synthetic ratios can keep rising, and code/math experience c

contested · c-186e7c5761
Treat the mixture as a predictable response surface or a fit-able mixing law: fit on small-scale experiments, then extrapolate to large scale; or use robust objectives (worst-case / Group DRO) to define directly which domains to favor. The advantages: reproducible, interpretable, and easy to budget-plan.
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[0].stance · gpt-5.2

    [Camp A: Formal search first (laws/regression/robust optimiza] Treat mixtures as predictable response surfaces or fit-able mixing laws: fit on small-scale experiments then extrapolate; or define preferences via robust objectives (worst-case

contested · c-58dc1d8895
First make the problem controllable engineering: cut fine buckets, tier the quality, write the schedule as a simple staged policy, then calibrate with 2–3 rounds of ablation. The results of Held et al. [Held2025Utility] also suggest that a complex "LLM-estimated utility" does not necessarily beat simple token rules.
Source papers · 1 · [Held2025Utility]
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[1].stance · gpt-5.2

    [Camp B: Heuristics + curriculum are more robust (few ablatio] Prioritize controllable engineering: fine buckets, quality tiers, simple staged schedules, then calibrate with 2–3 ablations. Held et al. [Held2025Utility] also suggests complex

contested · c-4e7f988b26
Treat the mixture as non-stationary: training dynamics and data inflow both shift the optimal weights, so adapt weights online from loss/signals instead of redoing an offline search for every change [ODM2024][Aioli2024].
Source papers · 2 · [ODM2024] [Aioli2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[2].stance · gpt-5.2

    [Camp C: Online adaptation beats one-shot offline search] Treat mixing as non-stationary: training dynamics and data inflow change optimal weights, so weights should adapt online from loss/signals rather than re-running offline search for e

contested · c-7fc925c2bf
Treat the mixture as an optimizable variable, relying on scaling laws, regression response surfaces, or robust objectives to choose weights systematically; advocate inferring the large-scale optimum from small-scale experiments, cutting down on gut-feel recipes. Representative paths: RegMix's regression response surface [RegMix2024], BiMix's mixing law [BiMix2024], DoReMi's Group DRO weight learning [DoReMi2023], and scaling laws for optimal mixtures [Shukor2025OptimalMixtures] (see the sketch below).
Source papers · 4 · [RegMix2024] [BiMix2024] [DoReMi2023] [Shukor2025OptimalMixtures]
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[0].stance · gpt-5.2

    [Camp A: Formal search first (laws/regression/robust optimiza] Treat mixtures as optimizable variables, using scaling laws, regressed response surfaces, or robust objectives to select weights systematically. The goal is to infer large-scale
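
A toy version of the response-surface recipe (in the spirit of RegMix): run many small proxy mixtures, regress loss on the mixture weights, then pick the simplex point the regression predicts is best. Everything here is synthetic; real pipelines fit at small scale and transfer the ranking upward.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.dirichlet(np.ones(4), size=256)      # 256 proxy mixtures over 4 domains
    true_coef = np.array([1.0, 0.6, 0.9, 1.4])   # hidden per-domain 'cost' (toy)
    loss = W @ true_coef + rng.normal(0, 0.01, size=256)

    coef, *_ = np.linalg.lstsq(W, loss, rcond=None)  # linear response surface
    cand = rng.dirichlet(np.ones(4), size=100_000)   # sample the simplex
    w_star = cand[np.argmin(cand @ coef)]
    print(w_star.round(3))  # piles weight on the cheapest domain(s)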

contested · c-41410afebd
Treat the mixture as engineering control: first clean the web base and cut fine buckets, then tune interpretable knobs with a small number of ablations; express ratios as a curriculum with staged goals. Public recipes and open-source reports provide reusable priors: multi-source mixing with explicit source balancing [Gao2020Pile], LLaMA's hand-crafted mixture [Touvron2023LLaMA], Qwen's engineering recipe [Qwen2023Report], and Llama 3's staged upsampling [MetaLlama3
Source papers · 3 · [Gao2020Pile] [Touvron2023LLaMA] [Qwen2023Report]
1 observation · Data Mixture
Evidence (1)
  • topic_report · data-mixture · report.positions[1].stance · gpt-5.2

    [Camp B: Heuristics + curricula are more robust (a few ablati] Treat mixture as engineering control: clean the web base, bucketize finely, then tune interpretable knobs with a small number of ablations; express ratios as curricula. Public r

contested · c-f423410ab9
Advocates treating the mixture as a policy that can be updated during training: dynamically adjust domain weights from loss/generalization signals to avoid committing to a single wrong ratio. Representative work includes DoReMi's robust weight learning [DoReMi2023], lighter-weight online mixing [Albalak2023OnlineMixing], and frameworks that connect dynamic selection to scaling laws [Jiang2024AdaptiveDataOptimization].
Source papers · 2 [DoReMi2023][Albalak2023OnlineMixing]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[2].stancegpt-5.2

    [Camp C: Online adaptation beats one-shot offline search] Treat mixture as a train-time policy: dynamically adjust domain weights based on loss/generalization signals to avoid committing to a wrong one-shot ratio. Representative work includ

contested · c-3992502ce8
Holds that on real web pools the biggest gains come from filtering and selection: remove the low-quality noise first, and only then does training enter an effective regime. FineWeb engineers this reproducibly [FineWeb2024], and DataComp-LM quantifies in controlled experiments that "fix quality first" is the steadier move [DataCompLM2024]. At the same time, DsDm provides a counterexample: filtering for "clean data" by human intuition can hurt the model, and model-aware selection may be stronger [Engstrom2024DsDm].
Source papers · 3 [FineWeb2024][DataCompLM2024][Engstrom2024DsDm]
2 observations · Data Mixture
Evidence (2)
  • topic_reportdata-mixture· report.positions[3].stancegpt-5.2

    [Camp D: Ratios are second-order; quality/selection is first-] Argues the biggest gains on real web pools come from filtering and selection: remove low-quality noise so training enters an effective regime. FineWeb operationalizes this repro

  • topic_reportdata-mixture· report.positions[3].stancegpt-5.2

    [Camp D: Ratios are second-order; quality/selection is first-] In real web pools, the largest gains often come from filtering and selection: remove low-quality noise first to enter an effective training regime. DataComp-LM supports the regi

contested · c-3e4bab57cf
This line holds that high-quality synthetic/curated tokens have higher information density than ordinary web tokens, so in data-constrained settings, for small models, or in highly structured domains, synthetic-first can directly replace large-scale noisy web training [Phi1Textbooks2023][Phi15Report2023][TinyStories2023][ScalingDataConstrained2023].
Source papers · 4 [Phi1Textbooks2023][Phi15Report2023][TinyStories2023][ScalingDataConstrained2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[0].stancegpt-5.4

    [Camp A: synthetic-first can be the main route] This camp argues that high-quality synthetic or curated tokens have higher information density than ordinary web tokens, so in data-constrained settings, small models, or highly structured dom

contested · c-f9974c0217
This line places synthetic data after the backbone, using it to pull the distribution rather than to replace real-world coverage. Llama 3, Phi-3, Code Llama, Llemma, and long-context continued pretraining all fit this structure [Llama3Herd][Phi3Report][CodeLlama2023][Llemma2023][LongContextScaling2023].
Source papers · 3 [CodeLlama2023][Llemma2023][LongContextScaling2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[1].stancegpt-5.4

    [Camp B: a web-heavy backbone plus synthetic mid-train is the] This camp places synthetic data after the backbone, using it to pull the distribution rather than replace real-world coverage. Llama 3, Phi-3, Code Llama, Llemma, and long-conte

contested · c-0338850923
This line usually rests on two kinds of arguments: one from expert iteration, proof search, and verifier-backed recursive improvement, claiming that with feedback hard enough, recursive generation keeps improving [FormalMathCurriculum2022][HyperTree2022]; the other from new results showing that accumulating real data can break collapse, which reframes collapse risk as an old problem [CollapseInevitable2024].
Source papers · 3 [FormalMathCurriculum2022][HyperTree2022][CollapseInevitable2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[3].stancegpt-5.4

    [Camp D: synthetic can scale almost without bound; collapse i] This camp usually relies on two arguments: one from expert iteration, proof search, and verifier-backed recursive improvement, claiming that sufficiently hard feedback allows re

contested · c-b76eabe3fa
Treat long context as a pretraining recipe: set the RoPE base directly to a magnitude matching the target window (e.g., ~500000 for 128K) and use a short-to-long curriculum so the model co-adapts to the data distribution and the positional resolution; avoid as far as possible the irreversible damage of "pretrain short, retrofit later" (sketch below).
2 observations · Long Context Rope Ntk
Evidence (2)
  • topic_reportlong-context-rope-ntk· report.positions[0].stancegpt-5.2

    [Camp A: pretrain-time ABF + curriculum is the cleaner long-c] Treat long context as a pretraining recipe: set RoPE base to match the target window scale (e.g., ~500000 for 128K) and use a short-to-long curriculum to co-adapt data distribut

  • topic_reportlong-context-rope-ntk· report.positions[0].stancegpt-5.2

    [Camp A: pretraining-time ABF + curriculum is the clean long-] Treat long context as distribution shift: set RoPE base to the target-window scale, then use a short-to-long curriculum so training actually contains long-range supervision and
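
A minimal sketch of the base adjustment ("ABF") this camp describes; the ~500000 figure is the ballpark quoted above for 128K, and exact values are model-specific.

    # Sketch: RoPE inverse frequencies under the original vs. retargeted base.
    import numpy as np

    def rope_inv_freq(head_dim, base):
        return base ** (-np.arange(0, head_dim, 2) / head_dim)

    short = rope_inv_freq(128, base=10_000)     # typical short-window pretraining
    long_ = rope_inv_freq(128, base=500_000)    # retargeted for a ~128K window

    # The slowest-rotating dimension bounds the longest distinguishable distance:
    # its wavelength 2*pi/inv_freq should be on the order of the target window.
    print(f"max wavelength @ base 1e4: {2 * np.pi / short.min():,.0f} positions")
    print(f"max wavelength @ base 5e5: {2 * np.pi / long_.min():,.0f} positions")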

contested · c-c88c0db7d1
Under a no-retraining constraint, YaRN packages "modify only the low frequencies + stabilize the long-range attention distribution" into a reusable implementation: the per-dim ramp avoids damaging the high frequencies, and the attention temperature handles entropy collapse of long-distance logits; combined with parameter-efficient fine-tuning, it can land on a small budget [Peng2023YaRN][Chen2023LongLoRA] (sketch below).
Source papers · 2 [Peng2023YaRN][Chen2023LongLoRA]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[1].stancegpt-5.2

    [Camp B: YaRN is the de-facto standard for 32K–128K retrofitt] Under no-repretrain constraints, YaRN packages “low-freq-only modification + stabilized long-range attention distribution” into a reusable implementation: per-dim ramp avoids hi
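
A minimal sketch of the two YaRN ingredients named above, the per-dim ramp and the attention temperature; the ramp thresholds here are illustrative, and the temperature follows the ~0.1·ln(s)+1 form suggested in [Peng2023YaRN].

    # Sketch: interpolate only low-frequency RoPE dims, plus logit temperature.
    import numpy as np

    def yarn_inv_freq(head_dim, base, scale, wl_keep=128, wl_interp=8192):
        """Blend original and interpolated frequencies with a per-dim ramp."""
        inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
        wavelen = 2 * np.pi / inv_freq
        # ramp: 0 = short wavelength (keep), 1 = long wavelength (interpolate)
        ramp = np.clip((wavelen - wl_keep) / (wl_interp - wl_keep), 0.0, 1.0)
        return inv_freq * (1 - ramp) + (inv_freq / scale) * ramp

    def attn_temperature(scale):
        """Logit rescaling against long-range attention entropy collapse."""
        return 0.1 * np.log(scale) + 1.0

    new_freq = yarn_inv_freq(head_dim=128, base=10_000, scale=16)  # 8K -> 128K
    print(f"temperature at scale 16: {attn_temperature(16):.3f}")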

contested · c-3276836295
When the target reaches 512K–2M, globally smooth formulas show structural mismatch; the per-dimension scale pattern must be learned explicitly (sometimes even per head), otherwise you get either local degradation or long-range aliasing. LongRoPE provides a reproducible search + fine-tuning recipe [Ding2024LongRoPE].
Source papers · 1 [Ding2024LongRoPE]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[2].stancegpt-5.2

    [Camp C: ≥512K requires per-dim search (LongRoPE-like)] At 512K–2M, global smooth formulas show structural mismatch; you need to explicitly learn per-dim (and sometimes per-head) scale patterns, otherwise you get either local degradation or

contested · c-2bb6feab06
Skip extrapolation patches on RoPE entirely: either switch the positional bias (ALiBi), switch the sequence-modeling architecture (RetNet/Mamba), or introduce external memory/compression to move long history out of the main context, gaining better complexity and system usability [Press2021ALiBi][Sun2023RetNet][Gu2023Mamba][Wang2023LongMem].
Source papers · 4 [Press2021ALiBi][Sun2023RetNet][Gu2023Mamba][Wang2023LongMem]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[3].stancegpt-5.2

    [Camp D: bypass the PI/NTK/YaRN lineage (ALiBi/RetNet/Mamba/m] Avoid RoPE extrapolation patches: either change positional bias (ALiBi), change the sequence modeling architecture (RetNet/Mamba), or add external memory/compression to move lon

contested · c-0cdfd6e9a9
The main gains come from model scale, data quality, and the training recipe; as long as the tokenizer shows no obvious OOV/garbled-text problems, it is not worth reworking repeatedly during the main training cycle. The more common practice is to freeze the tokenizer at the ecosystem default (e.g., inherit the previous generation's) and spend the effort on data cleaning, training stability, and recipe tuning.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.2

    [Camp A: tokenizers are frozen preprocessing; coverage is eno] Most gains come from scale, data quality, and training recipe; as long as the tokenizer avoids obvious OOV/garbage issues, it is not worth iterating during the main training cyc

contested · c-f70d90a418
A larger vocab makes sequences shorter, attention cheaper, and long context more affordable; since 128K is already feasible, keep expanding to 256K or beyond and treat compression rate as the primary objective.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.2

    [Camp B: bigger vocab is always better; push to 256K+ by defa] A larger vocab shortens sequences, reduces attention cost, and makes long context cheaper; if 128K works, we should keep pushing to 256K+ and treat compression as the primary ob

contested · c-2b8f288a97
BPE's sampling bias, OOV, and cross-lingual unfairness are structural problems; byte/char-level modeling removes them at the root. We should migrate to tokenizer-free models as soon as possible and handle long-sequence cost with structural innovation (multi-scale, latent, state-space).
4 observations · ↳ spawned: tokenizer-free-vs-bpe-migration · Tokenizer Scaling
Evidence (4)
  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE as soon a] BPE’s sampling bias, OOV, and cross-lingual unfairness are structural; byte/char modeling removes these at the root. We should migrate quickly to tokenizer-free models and rely

  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE ASAP] BPE has structural issues (boundary bias, cross-language unfairness, OOV/rule debt); byte/char-level modeling removes these at the root and reduces ecosystem fragmentation from token

  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE ASAP] BPE has structural issues (sampling bias, cross-language unfairness, OOV/rule debt); modeling bytes/chars/patches removes these at the root, and sequence cost should be handled by ar

  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE ASAP] BPE has structural issues: sampling bias, non-unique encodings, and cross-lingual unfairness. Byte/patch/pixel modeling removes OOV and segmentation bias at the root. The longer-sequ

contested · c-858f215730
The tokenizer shapes training-signal allocation, compositional generalization, and deployment debt, and should be versioned and regression-tested like the data recipe; first standardize comparable BPE evaluation (BPB) and tail-fix workflows (Magikarp), then assess the system cost and benefit of tokenizer-free on that basis.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].stancegpt-5.2

    [Camp D: tokenizer is a pretrain product spec—make BPE right ] Tokenizers shape training-signal allocation, compositional generalization, and deployment debt; they should be versioned and regression-tested like the data recipe. First, stand

contested · c-e50ba973e1
Treat the tokenizer as an engineering detail: as long as it encodes the text with low OOV, put the effort into scaling (params/data/compute) and the training recipe first; tokenizer changes should not be a primary variable [Kaplan2020ScalingLaws].
Source papers · 1 [Kaplan2020ScalingLaws]
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.2

    [Camp A: tokenizer is frozen preprocessing; coverage is enoug] Treat tokenization as an implementation detail: as long as text is encodable with low OOV, focus on scaling (params/data/compute) and training recipe; tokenizer changes should n

  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.2

    [Camp A: tokenizer is frozen preprocessing; coverage is enoug] As long as the corpus is encodable with low OOV, quality is driven by scale, data quality, and training recipe; tokenizer changes should not consume main training budget beyond

contested · c-4ad9130646
A larger vocab brings higher compression and shorter sequences at almost no extra inference FLOPs, and is especially friendly to multilingual text and code; therefore default to enlarging the vocab and count the systems gains (KV cache) as the main return [Dubey2024Llama3][BigScience2022BLOOM].
Source papers · 2 [Dubey2024Llama3][BigScience2022BLOOM]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.2

    [Camp B: bigger vocab is always better; default to 256K+] Larger vocab improves compression and shortens sequences with near-zero extra FLOPs at inference, especially helping multilingual and code; therefore default to larger vocab and trea

contested · c-f7fdd5ced2
Byte-level/character-level modeling directly avoids BPE's language unfairness, OOV, and hand-crafted rule debt; as long as BPB is near parity, migrating to tokenizer-free architectures should take priority [Xue2021ByT5] (see the BPB sketch below).
Source papers · 1 [Xue2021ByT5]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Byte-/character-level modeling avoids BPE’s language unfairness, OOV, and hand-crafted rule debt; if BPB is near parity, prioritize moving to tokenizer-free architectures [Xue2021ByT
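
Since raw PPL over different vocabularies is not comparable, the "BPB parity" criterion above implies a byte-normalized metric. A minimal sketch (all numbers below are made up):

    # Sketch: bits-per-byte, a tokenizer-neutral way to compare language models.
    import math

    def bits_per_byte(total_nll_nats, n_bytes):
        """total_nll_nats: summed next-token NLL (in nats) over a corpus slice."""
        return total_nll_nats / (n_bytes * math.log(2))

    # Subword model: 2.0 nats/token at 4.2 bytes/token, over 1000 tokens.
    subword = bits_per_byte(total_nll_nats=2.0 * 1000, n_bytes=4.2 * 1000)
    # Byte-level model: 0.62 nats/byte over the same 4200 bytes.
    byte_level = bits_per_byte(total_nll_nats=0.62 * 4200, n_bytes=4200)
    print(f"subword: {subword:.3f} bpb, byte-level: {byte_level:.3f} bpb")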

contested · c-d35e1355ac
When the dominant bottleneck is stability and consistency in the alignment stage (RLHF/DPO, etc.), reduce tail tokens and split long heavy-domain merges to lower non-unique encodings and train/infer policy mismatch; the cost of longer inference-side sequences can be offset by systems optimization or stronger KV cache management [ClaudeMythos2025DynamicVocabPruning][LiuEllis2026SayAnythingButThis][Kong2026Opus4
Source papers · 2 [ClaudeMythos2025DynamicVocabPruning][LiuEllis2026SayAnythingButThis]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].stancegpt-5.2

    [Camp E: counter-trend—shrink/prune vocab to buy alignment & ] When the dominant bottleneck is alignment-stage stability/consistency (RLHF/DPO), reduce tail tokens and split heavy-domain long merges to lower non-unique encodings and train/i

contested · c-322097bf4d
This camp assumes the tokenizer only affects encodability and is not worth much research or engineering budget. As long as there is no OOV, the main performance differences should be determined by the model, the data, and the training recipe.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.4

    [Camp A: the tokenizer is frozen preprocessing; coverage is e] This camp treats the tokenizer as affecting encodability but not much else. As long as there is no OOV problem, most performance differences are assumed to come from the model,

contested · c-83bd08f6d0
This camp emphasizes that a larger vocabulary yields shorter sequences, lower loss, and better multilingual/code performance, and therefore treats vocabulary expansion as a near-monotonic source of gains.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.4

    [Camp B: bigger vocab is always better; default to 256K+] This camp emphasizes shorter sequences, lower loss, and better multilingual/code behavior from larger vocabularies, and therefore treats vocabulary expansion as an almost monotonic s

contested · c-03a01558d5
This camp holds that BPE's boundary bias, language unfairness, and non-unique encodings are structural problems that should ultimately be solved by byte/char/patch models, rather than patched with further vocabulary work.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.4

    [Camp C: tokenizer-free is the endgame; BPE should be abandon] This camp argues that BPE has structural flaws—boundary bias, language unfairness, and non-unique encodings—and should eventually be replaced by byte/char/patch models rather th

contested · c-c7ba334c0f
This camp holds that pretraining may like large vocabularies, but post-training, especially RL, does not necessarily. Tail tokens, rare long tokens, and non-unique encodings amplify policy divergence, so the post-training stage should actively prune the tail, split long merges, or restrict a safe vocabulary.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].stancegpt-5.4

    [Camp E: shrink or prune the vocabulary to buy alignment and ] This camp argues that pretraining may like larger vocabularies, but post-training, especially RL, may not. Tail tokens, rare long tokens, and non-unique encodings can amplify po

contested · c-621f193b2b
The vocabulary only needs to cover more than 99.9% of the common substrings in the training data; the remaining characters can be represented with <unk> or bytes, and the choice of vocabulary has a negligible effect on final model performance.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[0].stanceep-20260214160829-csjmc

    [Camp A: Tokenizer is frozen preprocessing; coverage is enoug] The tokenizer only needs to cover more than 99.9% of common substrings in training data, remaining characters can be represented with <unk> or bytes, and tokenizer choice has ne

contested · c-3250655a3f
A larger vocabulary shortens sequence length, reduces pretraining compute, and improves model performance, so the vocabulary should be expanded to 256K or more wherever possible.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].stanceep-20260214160829-csjmc

    [Camp B: Bigger vocab is always better; default to 256K+] Larger vocabularies shorten sequence length, reduce pretraining compute overhead, and improve model performance, so vocabulary size should be expanded to 256K+ as much as possible

contested · c-729e0a69e8
The inductive-bias defects BPE introduces (non-unique encodings, unreasonable segmentation) cannot be fully fixed by vocabulary optimization; character-level or byte-level tokenizer-free models are the long-term optimum.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].stanceep-20260214160829-csjmc

    [Camp C: Tokenizer-free is the endgame; BPE should be abandon] Inductive bias defects caused by BPE (non-unique encoding, unreasonable segmentation) cannot be fully solved by tokenizer optimization, and character-level or byte-level tokeniz

contested · c-b42bf4ed7a
Low-frequency tail tokens amplify RL policy divergence; pruning the bottom 10–20% of the vocabulary improves RLHF and alignment stability while reducing inference overhead (sketch below).
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].stanceep-20260214160829-csjmc

    [Camp D: Shrink or prune the vocabulary to buy alignment and ] Low-frequency tail tokens amplify RL policy divergence, pruning 10%–20% of the tail vocabulary improves RLHF and alignment stability while reducing inference overhead
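
A minimal sketch of the pruning rule this camp describes: drop the least-frequent 10–20% of the vocabulary and re-encode those strings via the remaining merges/bytes (the counts and the 15% cut below are illustrative).

    # Sketch: select tail tokens to prune by corpus frequency.
    from collections import Counter

    def tail_prune(token_counts: Counter, frac: float = 0.15) -> set:
        """Return the tokens to remove: the `frac` least-frequent ones."""
        ranked = [tok for tok, _ in token_counts.most_common()]  # frequent -> rare
        n_drop = int(len(ranked) * frac)
        return set(ranked[-n_drop:]) if n_drop else set()

    counts = Counter({f"tok{i}": max(1, 10_000 >> i) for i in range(20)})
    dropped = tail_prune(counts, frac=0.15)
    print(f"pruned {len(dropped)}/{len(counts)} tokens:", sorted(dropped))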

contested · c-20815d2fb7
As long as the vocabulary covers the major characters and common subwords, tokenizer effects get averaged out by large-scale training; data, training steps, and architecture are the more worthwhile targets, and the tokenizer should not change frequently lest compatibility break.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.2

    [Camp A: tokenizer is frozen preprocessing; coverage is enoug] As long as the vocabulary covers major characters and common subwords, tokenizer effects average out at scale; effort is better spent on data, steps, and architecture, and token

contested · c-d5da01b4c4
A larger vocab gives shorter sequences and lower loss, so make the vocab as large as possible (256K+) and treat tokenization as a compression problem; tail issues are negligible or will resolve naturally with more training.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.2

    [Camp B: bigger vocab is always better; default to 256K+] Larger vocab yields shorter sequences and lower loss, so default to very large vocabularies (256K+), treating tokenization as a compression problem; tail issues are negligible or wil

contested · c-5dfe0d6ffe
Subword tokenizers inject artificial inductive biases and cross-lingual unfairness, and admit ambiguous encodings; modeling directly over bytes/characters removes these problems and lets the model learn a more "faithful" string distribution.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Subword tokenizers inject human inductive biases and cross-language unfairness and admit ambiguous encodings; modeling directly over bytes/characters removes these issues and learns

contested · c-fc4325852b
During post-training (especially RL), tail tokens amplify policy divergence and instability; therefore proactively prune the tail or migrate to a smaller vocabulary in exchange for more stable training and a more controllable output space.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].stancegpt-5.2

    [Camp E: shrink/prune the vocabulary to buy alignment and RL ] During post-training (especially RL), tail tokens amplify policy divergence and instability; proactively pruning the tail or transferring to a smaller vocabulary yields more sta

contested · c-99c0399cde
Treat decode-side KV bandwidth, attention IO dataflow, and low-precision scaling contracts as inputs to architecture design; structural changes (GQA/MLA), kernel forms (FlashAttention), and numerics strategies (MXFP8) are different levels of the same optimization chain.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[0].stancegpt-5.2

    [Camp A: kernels and algorithms must be co-designed (decide b] Treat decode KV bandwidth, attention IO dataflow, and low-precision scaling contracts as architecture inputs; structural changes (GQA/MLA), kernel forms (FlashAttention), and nu

contested · c-d75cab8d5c
As long as the model is written as a high-level operator graph, the system will automatically select or generate efficient kernels across hardware; teams should put their effort into the model and the data, not kernel details.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[1].stancegpt-5.2

    [Camp B: PyTorch/graph-compiler level is enough; handwritten ] As long as the model is expressed as a high-level operator graph, the system will automatically select/generate efficient kernels across hardware; teams should focus on model an

contested · c-78db5c0669
Hardware keeps delivering better low precision and higher bandwidth, so training stays stable with only small changes; spending effort on model innovation pays better than sweating kernel/numerics details.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[2].stancegpt-5.2

    [Camp C: hardware will get faster; algorithms need not adapt ] Hardware provides better low precision and bandwidth, so training can be stable with minimal changes; focusing on model innovation is more cost-effective than sweating kernel/nu

contested · c-7d9c971185
Low-precision formats and kernel stacks will mature on other platforms; teams should not bind their algorithms to CUDA-specific features, and should instead pursue portable operator expressions and compiler routes.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[3].stancegpt-5.2

    [Camp D: non-NVIDIA hardware will catch up; CUDA ecosystem wi] Low-precision formats and kernel stacks will mature on other platforms, so teams should avoid binding algorithms to CUDA-specific features and instead pursue portable operator e

contested · c-cb4474adf4
Treat the parallel dimensions as a topology-mapping problem: bind TP/EP to NVLink domains, run PP across IB, and DP/FSDP across pods; push MFU into a predictable range with explicit shapes and scheduling (e.g., zero-bubble). A sketch of the rank-to-mesh mapping follows the evidence.
2 observations · 4d Parallelism Megatron
Evidence (2)
  • topic_report4d-parallelism-megatron· report.positions[0].stancegpt-5.2

    [Camp A: hand-tuned 4D (Megatron / MegaScale style)] Treat parallel dimensions as a topology mapping problem: TP/EP on NVLink islands, PP across IB, DP/FSDP across pods; use explicit shapes and scheduling (e.g., zero-bubble) to keep MFU in

  • topic_report4d-parallelism-megatron· report.positions[0].stancegpt-5.2

    [Camp A: hand-tuned 4D with topology-aware mapping (Megatron/] Encode primitive frequency into mesh and topology: keep TP/EP within NVLink islands, use PP over IB, DP/FSDP across pods; deliver reproducibility via MFU plus topology details.
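
A minimal sketch of the rank-to-mesh mapping behind "TP/EP on NVLink islands, PP across IB, DP across pods": make the most communication-hungry dimension the fastest-varying one, so it stays inside one island. All sizes are illustrative.

    # Sketch: global rank -> (dp, pp, tp) with tp innermost (fastest-varying).
    def rank_to_coords(rank, dp, pp, tp):
        assert 0 <= rank < dp * pp * tp
        return (rank // (pp * tp), (rank // tp) % pp, rank % tp)

    DP, PP, TP = 4, 4, 8    # 128 GPUs; TP=8 matches an 8-GPU NVLink island
    for rank in (0, 7, 8, 32):
        d, p, t = rank_to_coords(rank, DP, PP, TP)
        print(f"rank {rank:3d} -> dp={d} pp={p} tp={t}")
    # Ranks 0..7 share dp=0, pp=0 and differ only in tp: one NVLink island.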

contested · c-454bf7a226
Hand sharding over to the compiler/runtime: generate TP/PP/DP plans automatically via search or cost models, lowering the engineering bar of hand-tuned 4D and approaching hand-tuned throughput at moderate scale.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[1].stancegpt-5.2

    [Camp B: auto-parallel (Alpa / GSPMD / pjit)] Delegate sharding to compilers/runtimes via search or cost models to generate TP/PP/DP plans, lowering the engineering bar of hand-tuned 4D and approaching hand-tuned throughput at moderate scal

contested · c-d2ec6f32ff
Use only DP/FSDP (the ZeRO/FSDP family) as far as possible to solve memory and scalability, keeping the complexity in the framework layer and minimizing intrusion into model code and operator implementations.
2 observations · 4d Parallelism Megatron
Evidence (2)
  • topic_report4d-parallelism-megatron· report.positions[2].stancegpt-5.2

    [Camp C: FSDP-only is enough (low intrusion first)] Use mostly DP/FSDP (ZeRO/FSDP family) to address memory and scaling, keeping complexity in the framework layer and minimizing intrusion into model code and operator implementations.

  • topic_report4d-parallelism-megatron· report.positions[2].stancegpt-5.2

    [Camp C: FSDP-only / ZeRO-only (low-intrusion first)] Use DP/FSDP (or ZeRO family) as far as possible to address memory and scaling, keeping complexity in the framework; rely on overlap, structure awareness, and runtime optimizations to rea

contested · c-4086500505
Stick to the mature 3D recipe: TP for intra-operator parallelism, PP for depth partitioning, DP for throughput; enable the extra long-context dimensions (SP/CP) only in extreme scenarios.
2 observations · 4d Parallelism Megatron
Evidence (2)
  • topic_report4d-parallelism-megatron· report.positions[3].stancegpt-5.2

    [Camp D: 3D (DP+TP+PP) is enough; SP/CP optional] Stick to the mature 3D recipe: TP for intra-op parallelism, PP for depth sharding, DP for throughput; enable long-context extra dimensions (SP/CP) only in extreme cases.

  • topic_report4d-parallelism-megatron· report.positions[3].stancegpt-5.2

    [Camp D: classic 3D (DP+TP+PP) is enough; SP/CP are optional] Stick to the mature 3D recipe: TP for intra-operator parallelism, PP for depth partitioning, DP/FSDP for throughput; long context should rely more on faster attention kernels or

contested · c-d2e94cac6f
Argues code is the "primary fuel" of general capability: with enough code, reasoning and tool use rise monotonically; NL regression either will not happen or can easily be recovered in post-training, so the share should be pushed early to 40–60% or even higher.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[0].stancegpt-5.2

    [Camp A: generalists should push code past >40%; more is alwa] Argues code is the primary fuel for general capability: with enough code, reasoning and tool use should increase monotonically; NL regression either won’t happen or can be cheap

contested · c-43756fc3d6
Holds that code's core value is a more compressible, lower-noise gradient signal: at the same compute it trains more stably and overfits less, so better reasoning is a byproduct of "smoother training"; any data with comparably low entropy (structured text, synthetic data) could substitute for code.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[1].stancegpt-5.2

    [Camp B: code helps mostly via optimization/regularization (l] Claims code’s main value is a more compressible, lower-noise gradient signal: under the same compute it trains more stably and overfits less, so reasoning gains are a byproduct

contested · c-9cc57ad007
Treats code as a strong distributional bias: once its share rises, it crowds out world knowledge and natural-language coverage, degrading dialogue, writing, and commonsense; hence keep only a very small share in pretraining and defer code to the post-training/tool-calling stage.
2 observations · Code Density Pretrain
Evidence (2)
  • topic_reportcode-density-pretrain· report.positions[2].stancegpt-5.2

    [Camp C: keep code <10% to protect NL; generalists should not] Treats code as a strong distributional bias: once its share rises, it displaces world knowledge and natural language coverage, degrading dialogue, writing, and commonsense. Henc

  • topic_reportcode-density-pretrain· report.positions[2].stancegpt-5.2

    [Camp C: keep code <10% to protect NL in generalists] Treats code as a strong distributional bias that displaces natural language coverage, causing style drift, narrower knowledge, or weaker dialogue; thus generalists should minimize code a

contested · c-68f2a1c75c
Holds that continual training damages existing language abilities or alignment properties in uncontrollable ways, especially when a large new distribution (code) is added; so either do mixed pretraining from scratch or accept the unpredictability of continual training.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[3].stancegpt-5.2

    [Camp D: code ability must be trained from scratch; continual] Argues continual training will unpredictably damage existing language capabilities or alignment properties, especially when adding a large new distribution (code). Therefore one

contested · c-6a2e99f979
Positional-encoding optimizations alone, such as RoPE extrapolation, can deliver effective long context without additional training on long data, greatly cutting the cost of extension.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[0].stanceep-20260214160829-csjmc

    [Camp A: Positional extrapolation is enough to achieve long c] Only through positional encoding optimizations such as RoPE extrapolation, effective long context can be achieved without additional long data training, greatly reducing expansi

contested · c-28296e99f6
The core of long-context extension is optimizing the length distribution, domain distribution, and burstiness of the training data; positional encoding is only an auxiliary knob.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[1].stanceep-20260214160829-csjmc

    [Camp B: The data recipe is the main variable] The core of long-context expansion is optimizing the length distribution, domain distribution and burstiness of training data, with positional encoding only as an auxiliary optimization item.

contested · c-2d04dd042e
Optimizing sequence concatenation, truncation control, and packing strategies can substantially increase effective context without adding training compute or data cost.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[2].stanceep-20260214160829-csjmc

    [Camp C: Packing engineering is the under-exploited lever] Optimization of sequence concatenation, truncation control, and packing strategies can greatly improve effective context without increasing training compute and data costs.

contested · c-f5078308ed
The quadratic attention complexity of Transformers is the core bottleneck of long-context extension; switching to linear architectures such as Mamba enables more efficient long-context processing.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[3].stanceep-20260214160829-csjmc

    [Camp D: Switch architectures (SSM / linear) to bypass positi] The quadratic attention complexity of Transformers is the core bottleneck for long-context expansion, switching to linear architectures such as Mamba can achieve more efficient

contested · c-74f6b0ad63
Advocates fitting mixture→loss/downstream-performance laws from small-scale experiments and extrapolating to the target scale, reducing the number of points in large sweeps; treats ratios as optimizable variables and aims to output interpretable sensitivities and budget plans.
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[0].stancegpt-5.2

    [Camp A: Formal search first (laws/regression/robust optimiza] Advocates fitting mixture→loss/downstream performance laws from small runs and extrapolating to target scale to reduce large sweeps; treats ratios as optimization variables and

contested · c-3881f77357
Advocates getting the data engineering right first (filtering, dedup, bucketing), then allocating scarce domains with a small number of controlled experiments and a staged curriculum; holds that complex utility estimators are unstable under real-world noise and version churn.
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[1].stancegpt-5.2

    [Camp B: Heuristics + curricula are more robust (a few ablati] Argues for getting data engineering right first (filtering, dedup, bucketization), then using a small number of controls and staged curricula to allocate scarce domains; views c

contested · c-017b77ca2f
Advocates reweighting dynamically during training based on tasks/gradients/importance sampling to avoid committing to a wrong one-shot ratio; especially under specialist objectives or distribution drift, online methods track the real objective more closely.
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[2].stancegpt-5.2

    [Camp C: Online/adaptive mixing beats one-shot offline search] Argues for dynamically reweighting during training based on task/gradients/importance sampling to avoid committing to a wrong fixed ratio; especially for specialist goals or dis

contested · c-37d06c8804
Advocates first pinning down what counts as "usable data": filtering, selection, detoxing, and denoising often bring more direct gains; when low-quality web dominates, the gains from fine ratio tuning are drowned in noise.
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[3].stancegpt-5.2

    [Camp D: Ratios are second-order; quality/selection is first-] Argues for defining “usable data” first: filtering/selection/detox/denoise often yields more direct gains; when low-quality web dominates, fine ratio tuning is drowned by noise.

contested · c-b6f26e7c87
Positions data-value assessment as "repeatable engineering experiments": do bulk filtering with classifiers/perplexity/dedup first, then verify the net gain of each decision with ladder/mixture sweeps; influence and causal methods serve only as auxiliary diagnostics and stay out of the day-to-day decision loop.
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[0].stancegpt-5.2

    [Camp A: quality classifiers + ablation ladders are sufficien] Treat data-value assessment as “repeatable engineering experiments”: bulk filtering via classifiers/perplexity/dedup, then ladders/mixture sweeps to validate net gains of each d

contested · c-cb40136041
Holds that "data value" is essentially an example-level contribution ranking: first use influence/attribution to find the critical examples and clusters, then edit and reweight around them, which is finer-grained and closer to mechanism than domain-level mixtures.
2 observations · Data Value Causality
Evidence (2)
  • topic_reportdata-value-causality· report.positions[1].stancegpt-5.2

    [Camp B: influence/attribution is the main path (infer data r] Argues that “data value” is fundamentally example-level contribution ranking; start with influence/attribution to find critical examples/clusters, then edit/reweight around them

  • topic_reportdata-value-causality· report.positions[1].stancegpt-5.2

    [Camp B: example-level influence/attribution is the main path] “Data value” is fundamentally an example-level contribution ranking: use influence/attribution to find key examples and clusters, then do targeted curation or reweighting to red

contested · c-00e3812909
Holds that the confounding between domains/features and downstream capabilities is too strong, and correlational methods (classifiers, mixture sweeps, influence) are easily misled by common causes; data selection should be written explicitly as a causal problem, estimating interventional effects with identification tools such as IV, then selecting and weighting accordingly.
2 observations · Data Value Causality
Evidence (2)
  • topic_reportdata-value-causality· report.positions[2].stancegpt-5.2

    [Camp C: full causal inference is the future (solve confoundi] Argues confounding between domains/features and downstream capabilities is too strong; correlational methods (classifiers, mixture sweeps, influence) are easily misled by common

  • topic_reportdata-value-causality· report.positions[2].stancegpt-5.2

    [Camp C: full causal identification is the future (IV/mediato] Confounding between domain/features and downstream capabilities is too strong; classifiers, mixture sweeps, and influence are all contaminated by distribution shift. Data select

contested · c-9a9047abb8
Holds that the data-evaluation toolchain is expensive and its conclusions unstable (especially across scales); better to plan via scaling laws: maximize tokens and coverage, and when necessary hedge filtering bias with repetition or broader mixtures; capability gains come mainly from scale rather than fine-grained data picking.
2 observations · Data Value Causality
Evidence (2)
  • topic_reportdata-value-causality· report.positions[3].stancegpt-5.2

    [Camp D: skip measurement, rely on intuition and scale (scali] Argues data-evaluation tooling is expensive and unstable (especially across scales). Prefer scaling-law planning: maximize tokens and coverage; use repetition or broader mixture

  • topic_reportdata-value-causality· report.positions[3].stancegpt-5.2

    [Camp D: skip measurement, rely on intuition and scale (cover] Measurement toolchains are expensive and unstable (especially across scale). Follow scaling laws: maximize tokens and coverage, use repetition when needed to hedge filtering bia

contested · c-37212b0718
On Hopper, FA3 [Shah2024FA3] has already pushed dense attention close to compute-bound; continuing to hand-write more aggressive kernels mostly buys 10–20% marginal gains at higher maintenance cost. The more economical optimizations lie elsewhere: KV cache representations, decode parallelism, speculative decoding (sketch below), and recomputation/parallelism strategies on the training side. [Ye2024FlashInfer][Liu2024KIVI][Leviathan2022SpecDec]
Source papers · 5 [Shah2024FA3][Ye2024FlashInfer][Liu2024KIVI][Leviathan2022SpecDec] · arXiv 2407.08608
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernelsarXiv 2407.08608· report.positions[0].stancegpt-5.2

    [Camp A: FA is largely the endpoint of attention engineering ] On Hopper, FA3 [Shah2024FA3] already pushes dense attention close to compute-bound; further hand-tuned kernels often yield only 10–20% marginal gains while increasing maintenanc
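
For reference, a minimal sketch of the speculative-decoding acceptance rule from [Leviathan2022SpecDec] that this camp points to: accept a drafted token x with probability min(1, p(x)/q(x)); on rejection, resample from the residual max(p - q, 0). The output is then distributed exactly as the target model p.

    # Sketch: verify one drafted token against the target distribution.
    import numpy as np

    def verify_draft(p, q, x, rng):
        """p: target dist, q: draft dist, x: token drawn from q."""
        if rng.random() < min(1.0, p[x] / q[x]):
            return x
        resid = np.maximum(p - q, 0.0)
        return rng.choice(len(p), p=resid / resid.sum())

    rng = np.random.default_rng(0)
    p = np.array([0.6, 0.3, 0.1])    # target model's next-token distribution
    q = np.array([0.3, 0.5, 0.2])    # cheaper draft model's distribution
    draws = [verify_draft(p, q, rng.choice(3, p=q), rng) for _ in range(20_000)]
    print(np.bincount(draws) / len(draws))   # ≈ p, regardless of q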

contested · c-a3897ba804
Attention's long-term cost is not the peak throughput of any single kernel but variant explosion and maintenance: ALiBi/SWA/document masks/soft-prompt masks/MoA demands keep appearing. FlexAttention [Dong2024Flex] lifts the semantics into score_mod/mask_mod and uses Triton [Tillet2019Triton] to generate fused kernels, making "write 10 lines of Python" the default path and cutting the engineering burden of maintaining a separate CUDA stack per variant (sketch below).
Source papers · 2 [Dong2024Flex][Tillet2019Triton]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.positions[1].stancegpt-5.2

    [Camp B: the main line is Triton/FlexAttention—move optimizat] The long-term cost is not peak kernel throughput but variant explosion and maintenance: ALiBi/SWA/document masks/soft prompt masks/MoA keep appearing. FlexAttention [Dong2024Fle
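
A minimal sketch of the "semantics move up" idea: an attention variant reduces to a few-line score/mask function. Shown in plain NumPy for clarity; in FlexAttention the same functions are passed as score_mod/mask_mod and compiled into one fused Triton kernel (API details vary across PyTorch versions).

    # Sketch: ALiBi as a score_mod, causality as a mask_mod.
    import numpy as np

    def alibi_score_mod(score, h, q_idx, kv_idx, n_heads=8):
        slope = 2.0 ** (-(h + 1) * 8.0 / n_heads)   # per-head ALiBi slope
        return score - slope * (q_idx - kv_idx)     # linear distance penalty

    def causal_mask_mod(q_idx, kv_idx):
        return kv_idx <= q_idx                      # visibility predicate

    L, h = 6, 0
    q_idx, kv_idx = np.indices((L, L))
    scores = alibi_score_mod(np.zeros((L, L)), h, q_idx, kv_idx)
    scores = np.where(causal_mask_mod(q_idx, kv_idx), scores, -np.inf)
    print(np.round(scores, 2))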

contested · c-ff80f3a405
Dense attention's quadratic complexity and KV cache keep driving up long-context inference cost; complexity should therefore be reduced at the root via structural replacements (sparse attention such as Longformer [Beltagy2020Longformer], sparse factorizations as in Child et al. [Child2019SparseTransformers], or the SSM route as in Waleffe et al. [Waleffe2024MambaStudy]), rather than squeezing ever more out of kernels.
Source papers · 4 [Beltagy2020Longformer][Child2019SparseTransformers][Waleffe2024MambaStudy] · arXiv 2004.05150
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernelsarXiv 2004.05150· report.positions[2].stancegpt-5.2

    [Camp C: attention itself should be replaced (SSM / sparse / ] Dense attention’s quadratic cost and KV cache keep long-context inference expensive; therefore structural replacements (sparse attention like Longformer [Beltagy2020Longformer],

contested · c-0d086b357f
FA3 [Shah2024FA3] depends deeply on Hopper's asynchrony and instruction features; Luo et al. [Luo2024HopperDissect] likewise show these are not "ignorable details". Betting attention's critical path on such kernels therefore amplifies supply-chain and platform-migration risk; preferable are higher-level abstractions such as Triton/FlexAttention, or a backend strategy that stays replaceable. [Tillet2019Triton][Dong2024Flex]
Source papers · 5 [Shah2024FA3][Luo2024HopperDissect][Tillet2019Triton][Dong2024Flex] · arXiv 2407.08608
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernelsarXiv 2407.08608· report.positions[3].stancegpt-5.2

    [Camp D: FA3 embodies NVIDIA lock-in; avoid binding critical ] FA3 [Shah2024FA3] heavily depends on Hopper-specific asynchrony and instruction features; Luo et al. [Luo2024HopperDissect] indicates these are not “minor details”. Betting crit

contested · c-8e5bfb202f
Argues the key to long context is positional-encoding extrapolation: first extend the window to 128K with PI/YaRN, then stabilize 1M+ with LongRoPE; evaluation and data recipes are icing on the cake and do not change the main line.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[0].stancegpt-5.2

    [Camp A: explicit positional encoding is required; RoPE extra] Argues positional extrapolation is the core: use PI/YaRN to reach 128K, then LongRoPE to stabilize 1M+; evaluation and data recipes are secondary and do not change the main path

contested · c-fecb27fa82
Holds that long-sequence training is mainly a systems problem: FlashAttention-2 raises single-GPU efficiency, Ring/Ulysses split the sequence dimension, and the rest is the usual DP/TP/PP playbook for scaling out.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[1].stancegpt-5.2

    [Camp B: SP and DP are orthogonal; scaling to million tokens ] Claims long-sequence training is primarily a systems problem: FlashAttention-2 improves single-GPU efficiency, Ring/Ulysses split the sequence dimension, and the rest is standar

contested · c-029d4b7fbd
Holds that long-context training should be judged mainly by perplexity and NIAH: reproducible, cheap, and fast to iterate; task-style benchmarks such as LongBench/RepoQA are heavily affected by prompts and dataset bias and are unsuitable as primary metrics.
2 observations · Length Scaling Pretraining
Evidence (2)
  • topic_reportlength-scaling-pretraining· report.positions[2].stancegpt-5.2

    [Camp C: NIAH/perplexity is sufficient; task benchmarks are t] Argues long-context training should rely on perplexity and NIAH: reproducible, cheap, fast iteration; task benchmarks like LongBench/RepoQA are prompt- and dataset-biased, thus

  • topic_reportlength-scaling-pretraining· report.positions[2].stancegpt-5.2

    [Camp C: perplexity/NIAH are sufficient; task benchmarks are ] Use perplexity and NIAH as primary metrics: cheap, reproducible, fast iteration; task benchmarks like LongBench/RepoBench are sensitive to prompting, dataset bias, and leakage,

contested · c-83aa44cf46
Holds that full attention's O(L^2) complexity is unsustainable; we should move to subquadratic architectures such as sparse attention, compressed memory, or SSMs to gain "naturally unlimited context".
2 observations · Length Scaling Pretraining
Evidence (2)
  • topic_reportlength-scaling-pretraining· report.positions[3].stancegpt-5.2

    [Camp D: alternative architectures (sparse/SSM/linear attenti] Claims full attention’s O(L^2) is unsustainable; we should move to subquadratic architectures (sparse attention, compressed memory, SSMs) for “native unlimited context.”

  • topic_reportlength-scaling-pretraining· report.positions[3].stancegpt-5.2

    [Camp D: dense O(L^2) attention is unsustainable; alternative] Move to sparse attention, compressed memory, external memory, or attention-free/SSM-like architectures for “naturally infinite context,” rather than pushing Transformer dense wi

contested · c-364cf0aacb
Argues MoE's conditional compute enlarges total parameters and knowledge capacity at fixed training FLOPs; as templated recipes and reproduction platforms mature, training stability and engineering barriers fall, and MoE will become the backbone the way the Transformer once did. [Shazeer2017OutrageouslyLarge][Fedus2021Switch][Kang2025FLAMEMoE][DeepSeekAI2024V3]
Source papers · 4 [Shazeer2017OutrageouslyLarge][Fedus2021Switch][Kang2025FLAMEMoE][DeepSeekAI2024V3]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[0].stancegpt-5.2

    [Camp A: MoE becomes the default backbone (dense remains for ] Argues conditional compute increases total capacity at fixed training FLOPs; as templated recipes and reproduction platforms mature, stability and engineering barriers drop, mak

contested · c-cf1f8c0bcd
Argues MoE's advantage is often amplified by the "train from scratch" setting; in real organizations it is more common to reuse existing dense weights and post-training assets. Upcycling scaling laws show the gains have a ceiling, and the method details decide whether it pays off, so dense may be the steadier investment. [He2024Upcycling][Liew2025Upcycling]
Source papers · 2 [He2024Upcycling][Liew2025Upcycling]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[1].stancegpt-5.2

    [Camp B: dense wins on full-lifecycle ROI (especially under u] Argues MoE advantages are amplified by scratch-training assumptions; in practice, teams reuse existing dense weights and post-training assets. Upcycling scaling laws show ceilin

contested · c-146ac8e7fd
Argues many MoE gains come from the capacity effect of "more total parameters + sparse activation" rather than refined routing; in some settings, random/frozen routers approach the validation performance of learned routing, suggesting that heavy investment in routing tricks may have low ROI. [Fan2024EmpiricalMoEChoices]
Source papers · 1 [Fan2024EmpiricalMoEChoices]
3 observations · ↳ spawned: moe-routing-roi-random-vs-learned · MOE Landscape
Evidence (3)
  • topic_reportmoe-landscape· report.positions[2].stancegpt-5.2

    [Camp C: learned routing/balancing is overrated (random/froze] Claims many MoE gains come from capacity effects (more total params with sparse activation) rather than sophisticated routing; in some settings, random/frozen routers can approa

  • topic_reportmoe-landscape· report.positions[2].stancegpt-5.2

    [Camp C: learned routing/balancing is overrated (random/froze] Claims the benefit of learning routers is limited: many gains come from larger total parameters and better systems; in some settings frozen random routers can be close to learne

  • topic_reportmoe-landscape· report.positions[2].stancegpt-5.2

    [Camp C: learned routing/balancing is overrated; random/froze] Many MoE gains come from total-parameter capacity under sparse activation rather than sophisticated routing; in some settings random/frozen routers approach learned routers on v

contested · c-fc62560626
Argues MoE's value is concentrated in pretraining compute efficiency, while post-training (SFT/DPO/RLHF) prizes stable optimization and controllability; sparse routing introduces extra noise and systems complexity, so use dense training + sparse inference, or migrate post-training assets from dense into the MoE. [Pan2024DenseTrainSparseInfer][Hui2024UpcyclingSFT]
Source papers · 1 [Hui2024UpcyclingSFT]
2 observations · MOE Landscape
Evidence (2)
  • topic_reportmoe-landscape· report.positions[3].stancegpt-5.2

    [Camp D: MoE is mainly for pretraining; post-training should ] Argues MoE’s value is concentrated in pretraining efficiency; post-training (SFT/DPO/RLHF) prioritizes stable optimization and controllability, and sparse routing adds noise and

  • topic_reportmoe-landscape· report.positions[3].stancegpt-5.2

    [Camp D: MoE is mainly for pretraining; post-training should ] SFT/DPO/RLHF prioritize stable optimization and controllable throughput; sparse routing adds noise and systems complexity, so use dense training + sparse inference, or migrate p

contested · c-457e01b168
Without touching pretraining, the most reliable window extension is "move only the low frequencies, leave the high frequencies alone", plus temperature compensation of the long-range attention distribution. YaRN freezes this mechanism into a per-dim ramp + attention temperature, a reproducible ~400-step fine-tuning recipe; PI's global interpolation, by contrast, structurally compresses the high frequencies and damages local patterns in a way that is hard to repair fully afterwards [Peng2023YaRN][bloc972023NTK][Chen2023PI].
Source papers · 2 [Peng2023YaRN][Chen2023PI]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[1].stancegpt-5.2

    [Camp B: YaRN is the de-facto standard for 32K–128K retrofitt] Without changing pretraining, the most reliable extension is “move low-frequency, keep high-frequency,” plus temperature compensation for long-range attention distributions. YaR

contested · c-6dd35da5db
Once length enters 512K–2M, the error no longer comes mainly from "phase overflow" but from per-dim mismatch: different frequency dimensions need different extrapolation, and a uniform ramp/scale leaves some dimensions with insufficient phase while over-compressing others. LongRoPE learns a non-uniform per-dim scale via evolutionary search, plus longer fine-tuning to fit the error away dimension by dimension, a more controllable engineering route [Ding2024LongRoPE].
Source papers · 1 [Ding2024LongRoPE]
2 observations · Long Context Rope Ntk
Evidence (2)
  • topic_reportlong-context-rope-ntk· report.positions[2].stancegpt-5.2

    [Camp C: ≥512K needs LongRoPE-style per-dim search; global fo] At 512K–2M, errors are no longer dominated by “phase overflow” but by per-dim mismatch: different frequency dimensions need different extrapolation behavior, and a single ramp/s

  • topic_reportlong-context-rope-ntk· report.positions[2].stancegpt-5.2

    [Camp C: ≥512K needs LongRoPE-style per-dim search/learning; ] At 512K–2M, the dominant error is per-dim mismatch rather than phase overflow: different frequency bands need different extrapolation, and a single ramp/scaling causes both unde

contested · c-da38e2e2bd
RoPE extrapolation plus the quadratic complexity of long-sequence attention makes costs uncontrollable; use linear/sparse/state-space models instead, or compress and carry history with external memory, achieving train-short test-long or approximately "unlimited context". This route centers on throughput, latency, memory, and deployability [Gu2023Mamba][Sun2022LengthExtrapolatableTransformer][Wang2023LongMem].
Source papers · 2 [Gu2023Mamba][Wang2023LongMem]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[3].stancegpt-5.2

    [Camp D: bypass RoPE (Mamba / length-extrapolatable Transform] RoPE extrapolation and quadratic attention make costs hard to control; instead use linear/sparse/state-space models or external memory to compress/carry history, enabling train-

contested · c-d606fced08
Advocates keeping pretraining loss/PPL the central signal for scaling and engineering decisions: it is cheap, low-noise, and extrapolatable; in data-constrained regimes or when fast iteration is needed, PPL remains the most controllable optimization target and comparison baseline [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Muennighoff2023DataConstrainedScaling].
Source papers · 2 [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[0].stancegpt-5.2

    [Camp A: PPL remains the most reliable primary variable (at l] Argues pretraining loss/PPL should remain central for scaling and engineering decisions: it is cheap, low-noise, and extrapolatable; in data-constrained regimes or fast iteratio

contested · c-d5ba353836
Advocates modeling "keep training / model selection" directly as a per-task scaling problem: PPL handles training monitoring, but final decisions should be driven by task-curve extrapolation with confidence intervals, especially under overtraining and in multi-task product settings [Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders][Isik2024DownstreamScalingLaws].
Source papers · 1 [Bhagia2024TaskScalingLadders]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[1].stancegpt-5.2

    [Camp B: PPL is only stage-1 inside a task-scaling pipeline] Argues “continue training / model selection” should be modeled as per-task scaling: PPL monitors training, but final decisions should be driven by task-curve extrapolation with un

contested · c-162bbd997c
Argues "quality" is inherently multi-axis: general capability, long context, tool use, alignment behavior, embeddings/retrieval, and other dimensions cannot be compressed into one number. Compare via standardized panels and testbeds, especially for data selection and training-set iteration [Li2024DataCompLM][Wang2019SuperGLUE][Zhong2023AGIEval].
Source papers · 3 [Li2024DataCompLM][Wang2019SuperGLUE][Zhong2023AGIEval]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[2].stancegpt-5.2

    [Camp C: stop searching for a scalar; define quality via stan] Argues “quality” is inherently multi-axis: general capability, long context, tool use, alignment behavior, embedding/retrieval, etc., should not be collapsed into one number. Us

contested · c-42cfcddeec
Argues uniform next-token loss simply does not correspond to the value of "useful tokens": tokens contribute unequally to capability, and averaging dilutes the key learning signal, so a single PPL should not be expected to represent capability [Lin2024Rho1].
Source papers · 1 [Lin2024Rho1]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[3].stancegpt-5.2

    [Camp D: the issue is ontological—next-token loss is not the ] Argues uniform next-token loss does not correspond to the value of “useful tokens”: tokens contribute unequally to capability, and averaging dilutes key learning signals, so a s

contested · c-3f24822245
Advocates turning parameterization into a "transferable protocol": use the coord check as a hard acceptance gate, and prioritize zero-shot transfer of LR/init across width (then depth and precision); when modern modules break the original µP assumptions, patch per module with Complete-P [Yang2022muP][Cerebras2024CompleteP][Noci2024SuperConsistency] (coord-check sketch below).
Source papers · 3 [Yang2022muP][Cerebras2024CompleteP][Noci2024SuperConsistency]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[0].stancegpt-5.2

    [Camp A: µP (upgraded to Complete-P) should be the default; f] Argues parameterization should be treated as a transferable protocol: enforce coord check as a hard acceptance gate, prioritize zero-shot transfer of LR/init across width (then
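
A minimal sketch of the coord check named above: sweep widths and verify that typical activation magnitude stays O(1) instead of drifting with width. A toy forward pass under fan-in init; a real check instruments every layer over a few training steps.

    # Sketch: coordinate-size check across widths.
    import numpy as np

    def coord_check(width, rng):
        x = rng.normal(size=(64, 256))                      # fixed input dim
        W = rng.normal(0, 1 / np.sqrt(256), (width, 256))   # fan-in init
        h = np.tanh(x @ W.T)
        return np.abs(h).mean()                             # typical coordinate size

    rng = np.random.default_rng(0)
    for w in (128, 512, 2048, 8192):
        print(f"width {w:5d}: mean |h| = {coord_check(w, rng):.3f}")
    # Flat in width -> pass; growing/shrinking with width -> the
    # parameterization will not transfer and fails the acceptance gate.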

contested · c-d3cc4fd60b
Advocates treating HP transfer as a statistical fitting problem: under a fixed SP recipe, use empirical formulas or joint scaling laws to provide initial values for LR, batch size, and training duration, then correct with a few small sweeps; avoid the implementation risk that a parameterization change brings [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][Kaplan2020Scaling].
Source papers · 3 [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][Kaplan2020Scaling]
2 observations · Mup Hp Transfer
Evidence (2)
  • topic_reportmup-hp-transfer· report.positions[1].stancegpt-5.2

    [Camp B: empirical formulas + small sweeps are enough; µP is ] Treats HP transfer as a statistical fitting problem: under a fixed SP recipe, use empirical formulas or joint scaling laws to produce starting points for LR/batch/training durat

  • topic_reportmup-hp-transfer· report.positions[1].stancegpt-5.2

    [Camp B: empirical formulas + small sweeps are enough; µP is ] Under a fixed SP recipe, empirical formulas and joint scaling laws can directly provide starting points for LR, batch, and token:param, and 1–2 small corrective sweeps can get c

contested · c-68908ce4fe
Advocates reducing hand-set hyperparameters and transfer assumptions: optimize directly at the target scale with automated methods (automatic GD, warm-start BO, even optimizer search), avoiding extrapolation from width/depth limits or empirical formulas [Bernstein2023AGD][Kim2017WarmStartBO][Chen2023SymbolicOptimizer].
Source papers · 3 [Bernstein2023AGD][Kim2017WarmStartBO][Chen2023SymbolicOptimizer]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[2].stancegpt-5.2

    [Camp C: end-to-end Bayesian/automatic optimization will repl] Argues to reduce manual HPs and transfer assumptions: use automation (automatic GD, warm-start BO, even optimizer search) to optimize directly at target scale, avoiding extrapol

contested · c-05156d3ad8
Advocates re-evaluating the gains of µP/formulas: in real training, wd, β₂, batch size, and the like dominate the optimal LR through stability boundaries and regularization mechanisms, so "transferring only the LR" is insufficient; these variables should enter a joint rule or a joint search [Kosson2025WDMoreThanMuP][Loshchilov2017AdamW][Cohen2022EdgeStability].
Source papers · 3 [Kosson2025WDMoreThanMuP][Loshchilov2017AdamW][Cohen2022EdgeStability]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[3].stancegpt-5.2

    [Camp D: transfer is dominated by non-transferable HPs; wd/β₂] Argues to re-evaluate the gains of µP/formulas: in real training, wd, β₂, and batch can dominate the optimal LR via stability boundaries and regularization mechanisms, so “trans

contested · c-bce7a179d3
Advocates deduplicating the training corpus as aggressively as possible (from exact matches to semantic near-duplicates), viewing repetition mainly as redundant compute plus leakage and contamination risk; on web corpora, squeeze out duplication first, then talk about multi-epoch.
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[0].stancegpt-5.2

    [Camp A: Dedup as much as possible (treat repetition as noise] Advocates aggressive dedup (from exact to semantic near-dup), viewing repetition as wasted compute plus leakage/contamination risk; prioritize removing repetition in web corpora

contested · c-766e70d1a1
Advocates that when high-quality tokens run short, simply repeat uniformly over multiple epochs, expecting returns close to fresh tokens within roughly 4 epochs; in engineering terms, keep the training run fully fed first, then handle extreme duplication with light dedup.
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[1].stancegpt-5.2

    [Camp B: Uniform repetition ≤4 epochs is almost free (treat r] Argues that when high-quality tokens are limited, simply run uniform multi-epoch repetition, expecting near-fresh-token gains up to ~4 epochs; in engineering, prioritize fully u

contested · c-2a25c7719f
Holds that the main source of web-scale redundancy is semantic near-duplication (paraphrases/reposts/re-skinned templates); exact/MinHash alone leaves behind many tokens that repeat the information while differing in surface form. Embedding similarity, clustering, and within-cluster diversification should be the core data-efficiency lever (sketch below).
2 observations · Pretrain Data Repetition
Evidence (2)
  • topic_reportpretrain-data-repetition· report.positions[2].stancegpt-5.2

    [Camp C: Semantic dedup is the real battleground (exact dedup] Claims that semantic near-dup (paraphrases/reposts/template rewrites) is the dominant redundancy source at web scale; exact/MinHash leaves many tokens that repeat information bu

  • topic_reportpretrain-data-repetition· report.positions[2].stancegpt-5.2

    [Camp C: Semantic dedup is the main battleground (exact/MinHa] At web scale, the dominant redundancy comes from rewrites, repackaging, and semantic near-duplicates within topic clusters; exact/MinHash leaves many “same information, differen
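
A minimal sketch of the embedding-side lever (SemDeDup-flavored; the embeddings below are random stand-ins, and a real pipeline would use a text-embedding model plus approximate nearest-neighbor search):

    # Sketch: greedy semantic dedup by cosine-similarity threshold.
    import numpy as np

    def semantic_dedup(emb, thresh=0.9):
        emb = emb / np.linalg.norm(emb, axis=1, keepdims=True)
        keep = []
        for i, e in enumerate(emb):
            if all(e @ emb[j] < thresh for j in keep):  # O(n^2): sketch only
                keep.append(i)
        return keep

    rng = np.random.default_rng(0)
    base = rng.normal(size=(5, 64))
    near = base[:3] + 0.05 * rng.normal(size=(3, 64))   # paraphrase-like copies
    docs = np.vstack([base, near])
    print("kept:", semantic_dedup(docs))   # the three near-dups collapse away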

contested · c-b0ed1be495
Advocates near-zero exposure for benchmarks, PII, and copyrighted content: keep them out of the main training pool and out of multi-epoch repetition; guarantee "never seen" with strict filtering and dedup to avoid contamination and leakage.
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[3].stancegpt-5.2

    [Camp D: Zero repetition for sensitive/eval/copyright data (t] Advocates near-zero exposure for benchmarks, PII, and copyrighted content: keep them out of the main training pool and out of multi-epoch; rely on strict filtering/dedup to ensu

contested · c-712f356826
Treat the optimizer as part of the production infrastructure: what matters most is a reusable recipe, predictable failure modes, and robustness under a fixed A/B budget. Even if lower-loss methods exist, their gains may be offset by tuning cost and systems risk [AdamW2017][Lingle2024muPTransfer].
Source papers · 2 [AdamW2017][Lingle2024muPTransfer]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.positions[0].stancegpt-5.2

    [Camp A: AdamW won’t be retired (highest default priority)] Treat the optimizer as production infrastructure: prioritize reusable recipes, predictable failure modes, and robustness under fixed A/B budgets. Even if lower-loss methods exist,

contested · c-2302a41576
Confine the "second-order gains" to the parameter subset that most deserves them: hidden 2D weights. By leaving embeddings/norms/head on AdamW, Muon isolates instability and tuning risk while obtaining better geometric updates [Jordan2024Muon] (sketch below).
Source papers · 1 [Jordan2024Muon]
2 observations · Optimizer Landscape
Evidence (2)
  • topic_reportoptimizer-landscape· report.positions[1].stancegpt-5.2

    [Camp B: Muon is the next default (but only as a hybrid)] Constrain “second-order gains” to the parameter subset that benefits most: hidden 2D weights. By keeping embeddings/norms/heads on AdamW, Muon isolates instability and tuning risk wh

  • topic_reportoptimizer-landscape· report.positions[1].stancegpt-5.2

    [Camp B: Muon is the next default (but only as a hybrid)] Localizing second-order structure to hidden 2D weights is the deployable compromise: gains focus on the most matrix-like parameters while embeddings/norms/heads stay on AdamW to avoi
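
A minimal sketch of the hybrid split this camp describes: hidden 2D weights get orthogonalized momentum updates via a Newton–Schulz iteration, everything else stays on AdamW. The quintic coefficients follow the public Muon implementation [Jordan2024Muon]; treat the details as indicative rather than exact.

    # Sketch: Muon-style orthogonalization + the Muon/AdamW parameter split.
    import numpy as np

    def newton_schulz_orth(G, steps=5):
        """Push G's singular values toward ~1 (approximately orthogonalize)."""
        a, b, c = 3.4445, -4.7750, 2.0315
        X = G / (np.linalg.norm(G) + 1e-7)
        for _ in range(steps):
            A = X @ X.T
            X = a * X + (b * A + c * A @ A) @ X
        return X

    def split_param_groups(named_params):
        """Hidden 2D weights -> Muon; embeddings/norms/head -> AdamW."""
        muon, adamw = [], []
        for name, p in named_params:
            hidden_matrix = p.ndim == 2 and not any(
                k in name for k in ("embed", "norm", "lm_head"))
            (muon if hidden_matrix else adamw).append(name)
        return muon, adamw

    G = np.random.default_rng(0).normal(size=(8, 8))
    print(np.round(np.linalg.svd(newton_schulz_orth(G), compute_uv=False), 2))
    # Singular values pulled into a band around 1: a uniform-spectrum update.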

contested · c-85439e91d7
Holds that diagonal adaptivity (AdamW) is inherently limited on badly conditioned layers, and that Kronecker-structured second-order preconditioning is closer to the "right geometry" [Gupta2018Shampoo][Anil2020ScalableSecondOrder]. SOAP, by running Adam in Shampoo's eigenbasis, substantially lowers the stability and hyperparameter burden of second-order methods [Vyas2024SOAP].
Source papers · 3 [Gupta2018Shampoo][Anil2020ScalableSecondOrder][Vyas2024SOAP]
2 observations · Optimizer Landscape
Evidence (2)
  • topic_reportoptimizer-landscape· report.positions[2].stancegpt-5.2

    [Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Argues diagonal adaptivity (AdamW) is fundamentally limited on ill-conditioned layers; Kronecker-structured second-order preconditioning is closer to the “right geometry”[Gupta2

  • topic_reportoptimizer-landscape· report.positions[2].stancegpt-5.2

    [Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Diagonal adaptivity is inherently limited on ill-conditioned, strongly coupled layers; Kronecker-structured preconditioning is closer to the right geometry. SOAP compresses seco

contested · c-dda44f62e1
Holds that the gains in many optimizer papers vanish under fair comparison: the schedule, the dependence on the stopping step T, and the HP-search budget are the real deciding factors. Standardize evaluation protocols and tuning budgets first, then discuss algorithmic differences [Dahl2023AlgoPerf][Agarwal2020LRConfound][Defazio2024RoadLessScheduled].
Source papers · 3 [Dahl2023AlgoPerf][Agarwal2020LRConfound][Defazio2024RoadLessScheduled]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.positions[3].stancegpt-5.2

    [Camp D: optimizers matter less; many gains are evaluation ar] Claims many optimizer-paper gains vanish under fair comparisons: schedules, dependence on stopping step T, and HP-search budgets are the real drivers. Standardize evaluation pro

contested · c-94fb7a757c
Treat document boundaries as a hard constraint of the objective: packing is for throughput, but it must use a per-doc causal mask (sketch below); for length, train short-to-long, pushing long-window compute to the final stage, and use RoPE base/frequency adjustments to keep 128K continued training stable.
2 observations · Packing Masking Length
Evidence (2)
  • topic_reportpacking-masking-length· report.positions[0].stancegpt-5.2

    [Camp A: short-to-long + per-doc masking (default engineering] Treat document boundaries as a hard objective constraint: packing targets throughput but must use per-doc causal masking; use short-to-long to push long-window compute to the ta

  • topic_reportpacking-masking-length· report.positions[0].stancegpt-5.2

    [Camp A: per-doc masking + short-to-long (default engineering] Treat document boundaries as a hard objective constraint by default: pursue high utilization via packing without cross-doc visibility; use short-to-long with a late-stage long-c
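
A minimal sketch of per-doc causal masking inside a packed row: tokens attend causally and only within their own document, so packing buys throughput without cross-document leakage.

    # Sketch: block-diagonal causal mask for a packed sequence.
    import numpy as np

    def packed_causal_mask(doc_ids):
        """doc_ids[i] = document of token i. True = may attend."""
        d = np.asarray(doc_ids)
        q, k = np.indices((len(d), len(d)))
        return (k <= q) & (d[q] == d[k])    # causal AND same-document

    mask = packed_causal_mask([0, 0, 0, 1, 1, 2])   # three docs in one row
    print(mask.astype(int))
    # Lower-triangular blocks on the diagonal: doc 1 never sees doc 0.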

contested · c-8bd754942e
Advocates mixing lengths from some distribution throughout training (rather than adding long windows only at the end), arguing that a curriculum makes the model overfit the length distribution, and that long-window capability needs earlier exposure.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[1].stancegpt-5.2

    [Camp B: uniformly mixed-length training (anti-curriculum, al] Advocates mixing sequence lengths throughout training (rather than adding long windows only at the end), arguing curricula can overfit to length distributions and that long-cont

contested · c-d7829a39d1
Holds that pretraining should directly cover cross-document context: concatenation trains the model on distributions closer to prompting/long-form generation, and a per-doc mask actually limits capability.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[2].stancegpt-5.2

    [Camp C: naive concat + cross-doc visible (cross-boundary by ] Argues pretraining should directly cover cross-document context: concatenation trains the model on distributions closer to prompting/long-form generation, and per-doc masking ma

contested · c-c8303439df
Holds that FIM/denoising objectives cost little relative to left-to-right modeling yet cover more task shapes (editing, completion, instruction following); they should therefore be enabled by default for both NL and code, with a higher FIM rate.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[3].stancegpt-5.2

    [Camp D: FIM for everything (infilling as a universal default] Claims FIM/denoising objectives add little cost to left-to-right modeling while covering more task forms (editing, infilling, instruction following), so they should be enabled b

contested · c-1f9ceebe32
Advocates fitting loss against N/D/C as joint power laws with extrapolatable exponents, and accordingly prioritizing a larger parameter count at fixed compute; in this framing, a shortage of data does not immediately become the main bottleneck [Kaplan2020ScalingLaws].
Source papers · 1 [Kaplan2020ScalingLaws]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[0].stancegpt-5.2

    [Camp A: Kaplan-style—portable exponents, compute-optimal fav] Argues joint power laws provide extrapolatable exponents for loss vs N/D/C, implying that under fixed compute one should prioritize increasing parameter count; in this framing,

contested · c-6669a9cdef
Advocates using IsoFLOP and extrapolation to place compute-optimal near tokens/param ≈ 20, making "more tokens, smaller model" the default recipe; the conclusion has held up in engineering replications such as LLaMA [Hoffmann2022Chinchilla][Touvron2023LLaMA] (sketch below).
Source papers · 2 [Hoffmann2022Chinchilla][Touvron2023LLaMA]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[1].stancegpt-5.2

    [Camp B: Chinchilla-style—balance N and D under fixed compute] Advocates IsoFLOP and extrapolation to place compute-optimal near tokens/param≈20, making “more tokens, smaller model” the default recipe; engineering replications like LLaMA ap
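
A minimal sketch of that rule of thumb applied to a budget, using the standard C ≈ 6·N·D approximation for training FLOPs:

    # Sketch: compute-optimal allocation at tokens/param ≈ 20.
    def chinchilla_alloc(flops, tokens_per_param=20.0):
        # C = 6*N*D with D = r*N  =>  N = sqrt(C / (6*r))
        n_params = (flops / (6.0 * tokens_per_param)) ** 0.5
        return n_params, tokens_per_param * n_params

    for budget in (1e21, 1e22, 1e23):
        n, d = chinchilla_alloc(budget)
        print(f"C={budget:.0e}: N≈{n / 1e9:.1f}B params, D≈{d / 1e9:.0f}B tokens")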

contested · c-b54170330d
Advocates treating data filtering/dedup/quality estimation/mixture design as first-class optimization variables independent of N and tokens; at a fixed budget, differences in the data recipe are large enough to dominate downstream performance [Li2024DCLM][Albalak2024DataSelectionSurvey].
Source papers · 2 [Li2024DCLM][Albalak2024DataSelectionSurvey]
2 observations · Scaling Laws LLM
Evidence (2)
  • topic_reportscaling-laws-llm· report.positions[2].stancegpt-5.2

    [Camp C: Data-mixture pragmatists—data is the first axis; get] Treats filtering/dedup/quality estimation/mixture design as first-class optimization variables independent of N and tokens; at fixed budgets, data recipe differences can dominat

  • topic_reportscaling-laws-llm· report.positions[2].stancegpt-5.2

    [Camp C: Data-mixture pragmatists — get the data recipe right] Filtering, dedup, freshness, and mixtures are first-class variables independent of N and D; under fixed budgets, data recipes can rival or exceed the gains from a parameter-scal

contested · c-f19d4b4475
Advocates explaining emergence first as non-linearity of the evaluation metric (thresholded 0-1 scoring) and evaluation-pipeline noise rather than new capabilities suddenly appearing inside the model; first "de-threshold" the curves with continuous metrics and robust evaluation [Schaeffer2023Mirage].
Source papers · 1 [Schaeffer2023Mirage]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[3].stancegpt-5.2

    [Camp D: Against emergence-as-magic—many “emergent” effects c] Prioritizes explaining emergence via non-linear evaluation metrics (thresholding by 0-1 scores) and pipeline noise rather than sudden internal capability; recommends de-threshol

contested · c-fc0eb62ec3
Advocates HumanEval [Chen2021Codex] (and its multilingual variants) as the core metric: it is cheap, stable, reproducible, and highly correlated with "can write a correct function"; it can therefore serve as the main proxy for SWE capability (pass@k sketch below).
Source papers · 2 [Chen2021Codex] · arXiv 2107.03374
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2107.03374· report.positions[0].stancegpt-5.2

    [Camp A: HumanEval/MBPP is sufficient to represent coding abi] Argues HumanEval[Chen2021Codex] (and multilingual variants) should be the core metric: it is cheap, stable, reproducible, and tightly tied to “can write a correct function”; thu
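
For concreteness, the unbiased pass@k estimator from the HumanEval paper [Chen2021Codex]: with n samples per problem of which c pass, pass@k = 1 - C(n-c, k)/C(n, k), computed stably as a running product.

    # Sketch: numerically stable pass@k.
    def pass_at_k(n: int, c: int, k: int) -> float:
        if n - c < k:
            return 1.0
        prob_all_fail = 1.0
        for i in range(k):                  # C(n-c, k) / C(n, k) as a product
            prob_all_fail *= (n - c - i) / (n - i)
        return 1.0 - prob_all_fail

    print(pass_at_k(n=200, c=10, k=1))    # 0.05
    print(pass_at_k(n=200, c=10, k=10))   # ≈ 0.41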

contested · c-e8339ebcb6
Holds that real GitHub issue resolving is the target distribution, and that Verified [SWEbenchVerified2024] reduces noise through human validation, so other benchmarks should give way; model quality is judged by the Verified ranking.
Source papers · 1 [SWEbenchVerified2024]
3 observations · ↳ spawned: swebench-verified-only-vs-eval-portfolio · Swe Agent Evaluation
Evidence (3)
  • topic_reportswe-agent-evaluation· report.positions[1].stancegpt-5.2

    [Camp B: SWE-bench Verified is the only trustworthy ground tr] Claims real GitHub issue resolving is the target distribution, and Verified[SWEbenchVerified2024] reduces noise via human validation, so other benchmarks should be secondary; mo

  • topic_reportswe-agent-evaluation· report.positions[1].stancegpt-5.2

    [Camp B: SWE-bench Verified is the only trustworthy ground tr] Only real GitHub issues with patch+tests resemble engineering tasks; Verified reduces noise via human validation, so other benchmarks (function problems, execution-semantics tas

  • topic_reportswe-agent-evaluation· report.positions[1].stancegpt-5.2

    [Camp B: SWE-bench Verified is the only trustworthy ground tr] Real GitHub issue resolving is the target distribution; Verified reduces noise via human validation, so models should be ranked by Verified, with other benchmarks only as auxili

contestedc-867a4775a4
Argues that pretraining should prioritize optimizing and evaluating the likelihood of editing trajectories, since real SWE is a sequence of patches and edits; patch-PPL is therefore closer to the downstream target than HumanEval. (Sketch below.)
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[2].stancegpt-5.2

    [Camp C: trajectory-level PPL (e.g., patch-PPL) is the most r] Argues pretrain should prioritize optimizing/evaluating the likelihood of editing trajectories, because real SWE is a sequence of patches and edits; thus patch-PPL is closer to
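
A minimal sketch of the metric Camp C proposes, assuming per-token log-probabilities from some LM scoring pass are already available (`patch_ppl` and its inputs are hypothetical names, not an established API):

```python
import math

def patch_ppl(token_logprobs, patch_spans):
    """Perplexity restricted to patch tokens.

    token_logprobs: per-token log-probs of a (context + diff) sequence,
                    produced by any LM scoring pass (assumed given here).
    patch_spans:    [(start, end), ...] index ranges covering the diff hunks.
    """
    picked = [lp for s, e in patch_spans for lp in token_logprobs[s:e]]
    return math.exp(-sum(picked) / len(picked))

# Toy usage: 8 context tokens followed by a 4-token patch.
logprobs = [-0.9] * 8 + [-0.4, -0.6, -0.3, -0.5]
print(patch_ppl(logprobs, patch_spans=[(8, 12)]))   # exp(0.45) ≈ 1.57
```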

contestedc-88b722216c
Holds that user experience and cost structure (retry count, token consumption, tool failures, trajectory readability) determine real value; even at identical Verified scores, differences in tokens/issue and test-exec rate decide whether deployment is viable.
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[3].stancegpt-5.2

    [Camp D: deployment UX metrics reflect SWE-agent value better] Claims user experience and cost structure (retries, token usage, tool failures, trajectory readability) determine real value; even with the same Verified, differences in tokens/

contestedc-e7f1a07204
Argues that recurrent state is sufficient to carry the long-range dependencies language modeling needs; as implementations and scale improve, attention's O(L^2) structure will come to be seen as legacy overhead. Typical arguments cite the scalability and deployment advantages shown by attention-free pretraining / RetNet / SSM LMs.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[0].stancegpt-5.2

    [Camp A: Pure SSMs will eventually replace Transformers] Claims recurrent state is sufficient for long-range dependencies in language modeling; with better implementations and scale, attention’s O(L^2) structure becomes legacy overhead. Typ

contestedc-3235d9c3b4
Emphasizes that "everything is recurrence": linear attention can be written as RNN state updates, and SSMs can be viewed as structured implementations of certain attention forms; the differences are therefore mainly parameterization and implementation details, converging toward one unified operator family.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[1].stancegpt-5.2

    [Camp B: Linear attention and SSMs are fundamentally the same] Emphasizes “everything is recurrence”: linear attention can be written as RNN state updates, and SSMs can be viewed as structured implementations of certain attentions; differen

contestedc-21a7fe7c24
Holds that backbone changes break learned representations and optimization trajectories; distillation aligns only short-range behavior, while long-context and algorithmic abilities collapse after conversion, so one should train from scratch with a recipe matched to the backbone.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[2].stancegpt-5.2

    [Camp C: Subquadratic models must be pretrained from scratch;] Argues backbone changes break learned representations and optimization trajectories; distillation aligns only short-range behavior, while long-context and algorithmic capabiliti

contestedc-39725ffed3
Treats attention as a "precise addressing module" and SSM/recurrence as a "linear routing module": inserting attention layers sparsely retains most of the quality while substantially reducing KV-cache and long-context cost. (Sketch below.)
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[3].stancegpt-5.2

    [Camp D: The engineering optimum is hybrid; minimize attentio] Treats attention as a “precise addressing module” and SSM/recurrence as a “linear routing module,” inserting attention sparsely to retain most quality while substantially reduci
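
A minimal sketch of the accounting behind this camp's position: pick an attention-to-SSM layer ratio and estimate the KV cache that remains (layer counts, head sizes, and the 1-in-6 ratio are illustrative assumptions, not a specific published model):

```python
# Keep 1 attention layer per `ratio` blocks and estimate the resulting
# KV-cache footprint; SSM layers contribute only a constant-size state.
def hybrid_plan(n_layers=48, ratio=6, n_kv_heads=8, head_dim=128,
                seq_len=128_000, bytes_per_elem=2):
    attn_layers = [i for i in range(n_layers) if i % ratio == ratio - 1]
    kv_bytes = (len(attn_layers) * 2 * n_kv_heads * head_dim
                * seq_len * bytes_per_elem)          # 2 = keys + values
    return attn_layers, kv_bytes / 2**30

layers, gib = hybrid_plan()
print(f"attention at {len(layers)}/48 layers, "
      f"KV cache ≈ {gib:.1f} GiB per 128k-token sequence")
# A full-attention stack would need 6x that KV cache for the same window.
```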

contestedc-72248c9a20
Treats roofline and numeric contracts as design inputs: first classify whether a workload is memory-, compute-, or latency-bound, then choose structure and dataflow; low-precision and fused-kernel numeric paths are treated as training-stability variables rather than implementation details. (Sketch below.)
2 observations · Cuda Kernel Pretrain
Evidence (2)
  • topic_reportcuda-kernel-pretrain· report.positions[0].stancegpt-5.2

    [Camp A: algorithms and kernels must be co-designed (bytes/FL] Treat roofline and numeric contracts as design inputs: classify memory/compute/latency bounds first, then choose structure and dataflow; treat low precision and fused-kernel num

  • topic_reportcuda-kernel-pretrain· report.positions[0].stancegpt-5.2

    [Camp A: algorithms and kernels must be co-designed (bytes/FL] Treat roofline and numeric contracts as architecture inputs: classify memory/compute/latency bound first, then choose structure (MQA/GQA/MLA), dataflow (FlashAttention), and low
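
A minimal sketch of the "classify the bound first" step, assuming peak FLOP/s and bandwidth are known (the numbers below are placeholders, not any specific accelerator's datasheet):

```python
# Roofline classification: compare arithmetic intensity (FLOPs per byte
# moved) against the hardware ridge point peak_flops / peak_bw.
def classify(flops: float, bytes_moved: float,
             peak_flops=1e15, peak_bw=3e12) -> str:
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bw
    return "compute-bound" if intensity >= ridge else "memory-bound"

# GEMM-like op: high arithmetic intensity -> compute-bound.
print(classify(flops=2 * 4096**3, bytes_moved=3 * 4096**2 * 2))
# Elementwise op: ~1 FLOP per few bytes loaded/stored -> memory-bound.
print(classify(flops=1e9, bytes_moved=4e9))
```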

contestedc-5e8f2c1dba
Puts productivity and maintainability first: FSDP, compiler IRs (MLIR), and operator fusion/autotuning cover most needs, keeping critical paths from being bottlenecked on a handful of kernel experts.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[1].stancegpt-5.2

    [Camp B: PyTorch/graph-compiler level is sufficient; handwrit] Prioritize productivity and maintainability: use FSDP, compiler IR (MLIR), and fusion/autotuning to cover most needs, avoiding critical paths being gated by a small number of ke

contestedc-bf6e936d9d
Historically, BF16 adoption required little algorithmic intrusion: with master weights and appropriate loss scaling, many models trained directly [Kalamkar2019BF16Study]. One can therefore bet on hardware iteration and library updates rather than frequent structure and kernel changes.
Source papers · 1 [Kalamkar2019BF16Study]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[2].stancegpt-5.2

    [Camp C: hardware will get faster; algorithms need not adapt ] BF16 adoption historically required limited algorithmic intrusion: with master weights and appropriate loss scaling, many models train successfully [Kalamkar2019BF16Study]. So o

contestedc-37616d3db5
Numeric formats and kernel autotuning are not CUDA-exclusive: HiFloat8 shows that other ecosystems can propose new 8-bit contracts [Luo2024HiFloat8], and OpenCL/CUDA dual-stack autotuning toolchains are accumulating portable experience [Petrovic2019KTT].
Source papers · 2 [Luo2024HiFloat8][Petrovic2019KTT]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[3].stancegpt-5.2

    [Camp D: non-NVIDIA hardware will catch up; CUDA will stop be] Numerics and kernel autotuning are not CUDA-exclusive: HiFloat8 shows alternative ecosystems can propose new 8-bit contracts [Luo2024HiFloat8]; OpenCL/CUDA dual-stack and autotu

contestedc-c512acfaf7
Delegates parallel planning to the compiler/search: a cost model automatically decides sharding, PP partitioning, and placement on the computation graph, reducing manual tuning and code intrusion, and letting plans adapt automatically to topology changes. [Zheng2022Alpa][Xu2021GSPMD][Lattner2020MLIR]
Source papers · 3 [Zheng2022Alpa][Xu2021GSPMD][Lattner2020MLIR]
2 observations · 4d Parallelism Megatron
Evidence (2)
  • topic_report4d-parallelism-megatron· report.positions[1].stancegpt-5.2

    [Camp B: auto-parallel (Alpa / GSPMD / pjit line)] Delegate parallel plans to compiler/search: use cost models to decide sharding, PP partitioning, and placement on computation graphs, reducing manual tuning and code intrusion, and adapting

  • topic_report4d-parallelism-megatron· report.positions[1].stancegpt-5.2

    [Camp B: auto-parallel (Alpa / GSPMD / compiler-search)] Compile parallel plans: with minimal sharding annotations, a cost model + search decides sharding, PP partitioning, and placement, adapting plans to topology changes; the goal is to a

contestedc-1f3caad103
Prefers FSDP/ZeRO for sharding parameters, gradients, and optimizer state, avoiding high-intrusion dimensions like TP/PP/CP where possible; overlap, structure awareness, and runtime optimization push performance to an acceptable level. [Zhao2023PyTorchFSDP][Wang2026veScaleFSDP]
Source papers · 2 [Zhao2023PyTorchFSDP][Wang2026veScaleFSDP]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[2].stancegpt-5.2

    [Camp C: FSDP-only is enough (low intrusion first)] Prefer FSDP/ZeRO to shard params, grads, and optimizer states, avoiding high-intrusion dimensions like TP/PP/CP; rely on overlap, structure-aware execution, and runtime optimizations to re

contestedc-ac979a61a5
Argues for keeping system complexity within 3D: TP+PP to handle model scale, DP/FSDP to scale throughput; SP/CP are only needed at extreme context lengths, and most cases are handled by better attention kernels or better PP schedules. [Dao2023FlashAttention2][Qi2023ZeroBubble][Harlap2018PipeDream] (Sketch below.)
Source papers · 3 [Dao2023FlashAttention2][Qi2023ZeroBubble][Harlap2018PipeDream]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[3].stancegpt-5.2

    [Camp D: 3D (DP+TP+PP) is enough; SP/CP are optional optimiza] Keep system complexity within 3D: use TP+PP for model scale and DP/FSDP for throughput; SP/CP are only needed for extreme long context, and most cases can be handled by better a
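
A minimal sanity-check sketch of the 3D bookkeeping this camp relies on (world size, degrees, and parameter count are illustrative assumptions):

```python
# World size must factor exactly into DP x TP x PP; per-rank parameter
# shards follow from TP*PP, since DP replicates rather than shards.
def plan_3d(world=1024, tp=8, pp=8, params=70e9):
    assert world % (tp * pp) == 0, "world size must be divisible by tp*pp"
    dp = world // (tp * pp)
    per_rank_params = params / (tp * pp)
    return dp, per_rank_params

dp, shard = plan_3d()
print(f"dp={dp}, ~{shard/1e9:.2f}B params per rank before any ZeRO sharding")
```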

contestedc-435bd5ac5f
Holds that with well-designed positional encodings / positional bias, capability extrapolates to longer inputs without changing the training data; long-context degradation mainly comes from RoPE frequency distortion or unstable relative-position modeling.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[0].stancegpt-5.2

    [Camp A: positional extrapolation is enough; long context is ] If positional encoding / positional bias is designed well, capability extrapolates to longer inputs without changing training data; long-context degradation mainly comes from Ro

contestedc-d9adfa7269
Effective context comes from how often training contains events where distant tokens are needed to reduce loss; one should therefore prioritize long-document upsampling and a length curriculum in continued training, while keeping the domain distribution consistent with base pretraining to avoid drift.
3 observations · Context Scaling Pretrain
Evidence (3)
  • topic_reportcontext-scaling-pretrain· report.positions[1].stancegpt-5.2

    [Camp B: the data recipe is the main variable; long-doc upsam] Effective context comes from how often training contains events where far-away tokens are necessary to reduce loss; prioritize long-doc upsampling, length curriculum in continue

  • topic_reportcontext-scaling-pretrain· report.positions[1].stancegpt-5.2

    [Camp B: data recipe is the main variable; long-doc ratio and] Effective context comes from how often training contains events where far tokens are necessary to reduce loss; prioritize long-doc upsampling, length curriculum, continual-pretr

  • topic_reportcontext-scaling-pretrain· report.positions[1].stancegpt-5.4

    [Camp B: the data recipe is the main variable; long-document ] This camp argues that effective context comes from how often training contains events where reducing loss requires using distant tokens; therefore long-document upsampling, leng

contestedc-7ff7c79275
The key to long context is not stretching single documents but frequent exposure during training to cross-document reference, repetition, and alignment; retrieval-clustered concatenation of related documents, low-truncation packing, and explicit separator tokens make cross-segment reasoning a routine part of language modeling. (Sketch below.)
2 observations · Context Scaling Pretrain
Evidence (2)
  • topic_reportcontext-scaling-pretrain· report.positions[2].stancegpt-5.2

    [Camp C: packing/concatenation is under-exploited; sequence c] The key is not just longer single documents, but frequent exposure to cross-document reference, repetition, and alignment during training; related-doc retrieval+clustering conca

  • topic_reportcontext-scaling-pretrain· report.positions[2].stancegpt-5.4

    [Camp C: packing / concatenation is underestimated; sequence ] This camp argues that the key is not merely making single documents longer, but making training frequently contain cross-document reference, repetition, and alignment; related-d
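
A greedy sketch of related-document packing under the stated assumptions (toy documents, hand-made 2-d embeddings, and a `<|doc_sep|>` separator that is illustrative rather than any specific tokenizer's token):

```python
import numpy as np

SEP = "<|doc_sep|>"   # illustrative separator token

def pack_related(docs, embs, seq_budget=8):
    """Order documents by embedding similarity to the last-packed doc so
    adjacent segments share references, then join with explicit separators."""
    embs = np.asarray(embs, dtype=float)
    embs /= np.linalg.norm(embs, axis=1, keepdims=True)
    used, order = {0}, [0]
    while len(order) < min(seq_budget, len(docs)):
        sims = embs @ embs[order[-1]]
        nxt = max((i for i in range(len(docs)) if i not in used),
                  key=lambda i: sims[i])
        used.add(nxt); order.append(nxt)
    return SEP.join(docs[i] for i in order)

docs = ["paris is in france", "france's capital is paris",
        "sorting runs in n log n", "merge sort is n log n"]
embs = [[1, 0], [0.9, 0.1], [0, 1], [0.1, 0.9]]
print(pack_related(docs, embs))   # related docs end up adjacent
```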

contestedc-150c39caec
Transformer attention and KV-cache mechanics determine how the long-range read/write budget is allocated, so effective context struggles to grow linearly with the window; one should switch to sparse attention or recurrent memory and redesign length scaling together with the read/write mechanism.
2 observations · Context Scaling Pretrain
Evidence (2)
  • topic_reportcontext-scaling-pretrain· report.positions[3].stancegpt-5.2

    [Camp D: switch architectures (sparse / memory / recurrent) t] Transformer attention and KV-cache mechanics constrain long-range read/write budget allocation, preventing effective context from scaling linearly with window size; use sparse a

  • topic_reportcontext-scaling-pretrain· report.positions[3].stancegpt-5.2

    [Camp D: switch architectures to bypass Transformer long-rang] Transformer attention’s quadratic cost and KV-cache shape constrain long-range read/write budgeting; use sparse attention, recurrent memory, or linear/hybrid architectures to re

contestedc-d19a8d06fd
Treats code as a strong prior for general reasoning: more code keeps yielding stronger structured reasoning and tool use, while NL losses can be recovered with larger models or later alignment, so the default should be code-heavy.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[0].stancegpt-5.2

    [Camp A: generalists should push code past >40%; more is alwa] Treats code as a strong prior for general reasoning: more code keeps improving structured reasoning and tool-use, while NL losses can be recovered by larger models or later alig

contestedc-522572e2cb
Views code as "cleaner, more compressible" tokens: it lowers gradient noise and stabilizes training, indirectly improving downstream performance; debates about code fraction should therefore center on loss/variance/stability rather than semantic transfer.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[1].stancegpt-5.2

    [Camp B: code helps mostly via optimization/regularization (l] Views code as “cleaner, more compressible” tokens: it reduces gradient noise and stabilizes optimization, indirectly improving downstream; thus code-fraction debates should focu

contestedc-a27feb0824
Holds that continual training is uncontrollable under a distribution shift as strong as code: it either forgets NL or drifts unpredictably; coders should therefore be trained from scratch with specialized recipes, then cooperate with generalists via alignment or distillation.
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[3].stancegpt-5.2

    [Camp D: code ability must be trained from scratch; continual] Argues continual is inherently unstable under strong distribution shifts like code: it either forgets NL or drifts unpredictably; thus one should train coders from scratch with

contestedc-6e72720bc4
This camp holds that the exact-attention kernel mainline has converged: FA2/FA3 for training [Dao2023FA2][Shah2024FA3], FlashInfer for serving [Ye2024FlashInfer][Ye2025FlashInferEngine]; continued heavy investment in single-node attention kernels faces rapidly diminishing returns.
Source papers · 5 [Dao2023FA2][Shah2024FA3][Ye2024FlashInfer][Ye2025FlashInferEngine] · arXiv 2307.08691 · arxiv.org
2 observations · Flashattention Kernels
Evidence (2)
  • topic_reportflashattention-kernelsarXiv 2307.08691· report.positions[0].stancegpt-5.4

    [Camp A: FlashAttention is largely the endpoint of attention ] This camp argues that the exact-attention kernel line has largely converged: use FA2/FA3 for training [Dao2023FA2][Shah2024FA3], FlashInfer for serving [Ye2024FlashInfer][Ye2025

  • topic_reportflashattention-kernelsarXiv 2307.08691· report.positions[0].stancegpt-5.2

    [Camp A: FA1/FA2/FA3 already solved the core; what remains is] The dense exact-attention kernel line has converged: pick FA2/FA3 by hardware for training [Dao2023FA2][Shah2024FA3], and use an engine like FlashInfer for serving [Ye2024FlashI

contestedc-f00a0f8184
This camp holds that the key going forward is not writing more specialized CUDA but handing attention semantics to the compiler. FlexAttention [Dong2024Flex] provides the unified entry point and the Triton ecosystem a more portable implementation layer, lowering maintenance cost and stabilizing support across model variants. (Sketch below.)
Source papers · 2 [Dong2024Flex] · arXiv 2412.05496 · arxiv.org
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernelsarXiv 2412.05496· report.positions[1].stancegpt-5.4

    [Camp B: The main line should move from hand-written CUDA to ] This camp argues that the future lies not in writing more specialized CUDA but in handing attention semantics to the compiler. FlexAttention [Dong2024Flex] provides the unified
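
A minimal sketch of the "semantics as a function, kernel from the compiler" idea, assuming PyTorch ≥ 2.5 where `torch.nn.attention.flex_attention` is available; the relative-position bias here is an arbitrary example variant, not a recommended one:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

def rel_bias(score, b, h, q_idx, kv_idx):
    # Any per-element rewrite of the attention score goes here; under
    # torch.compile this is fused into the attention kernel instead of
    # requiring a hand-written CUDA variant.
    return score - 0.1 * (q_idx - kv_idx).abs()

q = torch.randn(1, 4, 128, 64)   # (batch, heads, seq, head_dim)
k = torch.randn(1, 4, 128, 64)
v = torch.randn(1, 4, 128, 64)
out = flex_attention(q, k, v, score_mod=rel_bias)
print(out.shape)                 # torch.Size([1, 4, 128, 64])
```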

contestedc-bf5e97f8c9
This camp holds that further attention-kernel optimization merely patches an old paradigm; the real move is to replace attention with lower-complexity structures for long context and state updates.
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.positions[2].stancegpt-5.4

    [Camp C: Attention itself should be replaced by SSM, linear, ] This camp argues that further attention-kernel optimization is patching an old paradigm, and that the real move is to replace attention with lower-complexity structures for long

contestedc-54d76c2a5c
This camp worries that FA3 [Shah2024FA3] leans too heavily on Hopper's TMA and warp specialization, raising cross-architecture migration cost and shortening the maintenance window, a poor deal especially for multi-cloud or mixed-GPU teams.
Source papers · 2 [Shah2024FA3] · arXiv 2407.08608 · arxiv.org
2 observations · Flashattention Kernels
Evidence (2)
  • topic_reportflashattention-kernelsarXiv 2407.08608· report.positions[3].stancegpt-5.4

    [Camp D: FA3 embodies NVIDIA generation-specific lock-in, so ] This camp worries that FA3 [Shah2024FA3] depends too heavily on Hopper TMA and warp specialization, raising migration cost and shortening maintenance lifetime, especially for mu

  • topic_reportflashattention-kernelsarXiv 2407.08608· report.positions[3].stancegpt-5.2

    [Camp D: FA3 implies generation lock-in; critical paths shoul] FA3 [Shah2024FA3] depends heavily on Hopper’s TMA and warp specialization, increasing cross-architecture migration cost and shortening the maintenance window; in multi-cloud or

contestedc-c5b285848d
Treats data engineering as a regression-testable experimental system: proxies/classifiers do bulk filtering, candidate changes go through an ablation ladder, and acceptance is judged on per-capability metrics under fixed budgets. The core assumption: with a well-designed ladder, most data decisions need no finer-grained attribution or strong causal identification.
2 observations · Data Value Causality
Evidence (2)
  • topic_reportdata-value-causality· report.positions[0].stancegpt-5.2

    [Camp A: classifiers + ablation ladders are sufficient (bet o] Treat data engineering as a regression-testable experimental system: use proxies/classifiers for bulk filtering, send candidate changes into ladders, and accept/reject via per-c

  • topic_reportdata-value-causality· report.positions[0].stancegpt-5.2

    [Camp A: classifiers + ablation ladders are sufficient (contr] Treat data engineering as a regression-testable experiment system: use bulk filtering to make candidates trainable, then accept decisions via ladders under fixed budgets. Influe

contestedc-bb37143a19
Holds that example-level attribution can directly answer "which data drives which capabilities", bypassing many sweeps/ablations: use influence/attribution to locate the key training segments first, then do targeted curation or reweighting around them.
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[1].stancegpt-5.2

    [Camp B: influence/attribution is the main path (infer recipe] Believes example-level attribution can directly answer “which data drives which capabilities”, reducing the need for broad sweeps/ablations: use influence/attribution to find ke

contestedc-e385bd2b45
Holds that the core difficulty of data selection is confounding: document features, domain, quality, length, and repetition are entangled, so correlational methods pick features that look effective but are actually confounds; selection should be cast as an identifiable causal estimation problem, with IV/robustness tools producing net effects.
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[2].stancegpt-5.2

    [Camp C: full causal identification is the future (solve conf] Argues the core difficulty is confounding: document features, domain, quality, length, repetition, etc. are entangled, so correlational methods pick features that look effective

contestedc-7a3940a933
Holds that the marginal gains of fine-grained data valuation are unstable; the most reliable lever is scaling parameters/data/steps while maintaining coverage. Under data scarcity, repeated epochs and imperfect dedup can also be reasonable strategies, so one should not over-rely on filtering and elaborate valuation.
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[3].stancegpt-5.2

    [Camp D: skip measurement, rely on intuition and scale (scali] Believes marginal gains from fine-grained data valuation are unstable; the most reliable lever is scaling params/data/steps while maintaining coverage. Under data scarcity, repe

contestedc-70dd266225
Treats the mixture as a modelable response surface or robust optimization problem: fit mixture→loss/utility from a few points and extrapolate to larger scale, or directly optimize worst-bucket performance, cutting the trial-and-error cost of tuning by feel [RegMix2024][BiMix2024][MixMax2024]. (Sketch below.)
Source papers · 3 [RegMix2024][BiMix2024][MixMax2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[0].stancegpt-5.2

    [Camp A: Formal search first (laws/regression/robust optimiza] Treat mixtures as a modelable response surface or a robust optimization problem: fit mixture→loss/utility from few points and extrapolate to larger scales; or directly optimize
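
A minimal response-surface sketch in the spirit of this camp: regress proxy-run validation loss on log mixture weights, then query a new candidate mixture (all runs and losses fabricated):

```python
import numpy as np

X = np.array([[0.6, 0.3, 0.1],     # web, code, math weights per proxy run
              [0.4, 0.4, 0.2],
              [0.7, 0.2, 0.1],
              [0.5, 0.2, 0.3],
              [0.3, 0.5, 0.2]])
y = np.array([2.31, 2.26, 2.34, 2.29, 2.28])       # proxy validation losses

# Linear-in-log-weights model: loss ≈ w0 + sum_i c_i * log(weight_i)
A = np.hstack([np.log(X), np.ones((len(X), 1))])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)

candidate = np.array([0.45, 0.35, 0.20])
pred = np.hstack([np.log(candidate), 1.0]) @ coef
print(f"predicted loss for {candidate}: {pred:.3f}")
```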

contestedc-ded2b127e3
Treats the mixture as an engineering recipe: secure data quality and coverage first, then patch capabilities with a few reversible phase-wise reweightings. Public recipes (e.g., Llama 3's late-stage 3-5x upsampling) show that "upsample scarce domains in the tail phase" is a reusable pattern [MetaLlama32024][Gopher2021][LLaMA2023].
Source papers · 3 [MetaLlama32024][Gopher2021][LLaMA2023]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[1].stancegpt-5.2

    [Camp B: Heuristics + curricula are more robust; a few ablati] Treat mixtures as engineering recipes: ensure data quality and coverage first, then use a small number of reversible phase-wise reweightings to patch capabilities. Public recipe

contestedc-cacac146a3
Puts weight learning into the training loop: reweight dynamically from proxy or in-training signals to reach the same loss faster or improve the worst bucket. DoReMi learns weights with Group DRO on a proxy and then transfers them, an "operational online" version [DoReMi2023]; Adaptive Data Optimization uses scaling-law signals for dynamic sample selection, adjusting data allocation with training progress [AdaptiveDataOpt2024]. (Sketch below.)
Source papers · 2 [DoReMi2023][AdaptiveDataOpt2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[2].stancegpt-5.2

    [Camp C: Online/adaptive mixing beats one-shot offline search] Put weight learning into the training loop: dynamically reweight using proxy or in-training signals to reach the same loss faster or improve worst-bucket behavior. DoReMi learns
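
A DoReMi-flavored sketch of the in-loop reweighting step, i.e. an exponentiated-gradient update on per-domain excess loss (the published method adds proxy/reference-model details and uniform smoothing omitted here; all losses fabricated):

```python
import numpy as np

def drw_step(w, proxy_loss, ref_loss, eta=0.5):
    excess = np.maximum(proxy_loss - ref_loss, 0.0)   # per-domain excess loss
    w = w * np.exp(eta * excess)                      # multiplicative update
    return w / w.sum()                                # renormalize to simplex

w = np.full(3, 1 / 3)                                 # web, code, math
proxy = np.array([2.9, 3.4, 3.1])                     # trainable proxy losses
ref   = np.array([2.8, 3.0, 3.05])                    # reference-model losses
for _ in range(5):
    w = drw_step(w, proxy, ref)
print(w.round(3))   # the high-excess domain (code) gains weight
```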

contestedc-6f11a08b11
Holds that the main gains come from filtering/selection: high-quality web-only can beat "web + multi-source mixtures", making ratios a detail. RefinedWeb and FineWeb both emphasize filtering gains [RefinedWeb2023][FineWeb2024]; Data Filtering Networks learn the filter as a model, further turning "quality" into an optimizable object [DFN2023].
Source papers · 3 [RefinedWeb2023][FineWeb2024][DFN2023]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[3].stancegpt-5.2

    [Camp D: Ratios are second-order; quality/selection is first-] Argues the main gains come from filtering/selection: high-quality web-only can outperform “web + curated mixtures,” making ratios a detail. RefinedWeb and FineWeb emphasize filt

contestedc-fd9cc2eb05
Treats long context as a training problem of "frequency budget + distribution alignment": set the RoPE base for the target length directly, and use a short-to-long curriculum so long-range dependencies appear stage by stage; effective context should grow continuously with stages and length tiers rather than being forced out by inference-time extrapolation or a little fine-tuning. [Dubey2024Llama3][Xu2024RoPEBaseBounds][Fu2024DataEngineering128K]
Source papers · 3 [Dubey2024Llama3][Xu2024RoPEBaseBounds][Fu2024DataEngineering128K]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[0].stancegpt-5.2

    [Camp A: native long-context (ABF + curriculum) is the cleane] Treat long context as a training problem of “frequency budget + distribution alignment”: set base to the target length and use short-to-long curricula so long-range dependencies

contestedc-27309f14f1
PI's global interpolation compresses the high-frequency dimensions along with everything else, degrading local relative-offset resolution; YaRN's per-dim ramp concentrates interpolation in the low-frequency dimensions and stabilizes long-range logits with an attention temperature, so a usable effective context is easier to reach at the same fine-tuning budget. [Chen2023PI][Peng2023YaRN] (Sketch below.)
Source papers · 2 [Chen2023PI][Peng2023YaRN]
2 observations · Long Context Rope Ntk
Evidence (2)
  • topic_reportlong-context-rope-ntk· report.positions[1].stancegpt-5.2

    [Camp B: YaRN, not PI, should be the default for 32K–128K ret] PI’s global interpolation compresses high-frequency dimensions and reduces local relative-offset resolution; YaRN’s per-dim ramp concentrates interpolation on low-frequency dims

  • topic_reportlong-context-rope-ntk· report.positions[1].stancegpt-5.2

    [Camp B: for 32K–128K retrofits, default to YaRN, not PI] PI’s global interpolation compresses high-frequency dimensions and reduces local relative-position resolution; YaRN’s per-dim ramp pushes interpolation mostly into low-frequency dime
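
A sketch of the per-dim ramp idea (NTK-by-parts, which YaRN builds on): dimensions whose RoPE wavelength fits many times into the original window keep their frequency, dimensions with wavelengths beyond the window are fully interpolated, and a linear ramp blends the rest. Thresholds and sizes below are illustrative defaults, not tuned values:

```python
import numpy as np

def per_dim_scale(head_dim=128, base=10_000.0, orig_len=4096, factor=8.0,
                  lo=1.0, hi=32.0):
    inv_freq = base ** (-np.arange(0, head_dim, 2) / head_dim)
    wavelength = 2 * np.pi / inv_freq
    r = orig_len / wavelength                 # how many full periods fit
    ramp = np.clip((r - lo) / (hi - lo), 0.0, 1.0)
    # ramp=1 -> high-frequency dim, kept as-is; ramp=0 -> fully interpolated.
    return inv_freq * (ramp + (1.0 - ramp) / factor)

scaled = per_dim_scale()
print(scaled[:3], scaled[-3:])   # high-freq dims ~unchanged, low-freq ~/8
```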

contestedc-f1908adc3f
At 512K-2M lengths, error patterns diverge across frequency bands, and no single global scaling can satisfy low-frequency phase coverage and high-frequency local resolution at once; LongRoPE searches for non-uniform per-dim scales and pairs them with a longer fine-tuning budget to buy stability. [Ding2024LongRoPE]
Source papers · 1 [Ding2024LongRoPE]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[2].stancegpt-5.2

    [Camp C: ≥512K needs LongRoPE-style per-dim search; global fo] At 512K–2M, error patterns diverge across frequency bands, so a single global scaling cannot satisfy both low-frequency phase coverage and high-frequency local resolution; LongR

contestedc-9e220ff288
Treats "long context" as a systems problem: linear-time models (Mamba), recurrent/external memory (RMT, Infini-attention), or retrieval augmentation avoid O(n^2) attention and long-sequence training cost; the window ceiling can be made very large, with less intrusion into existing models. [Gu2023Mamba][Bulatov2023RMT][Munkhdalai2024InfiniAttention][Xu2023RetrievalMeetsLongContext]
Source papers · 3 [Gu2023Mamba][Bulatov2023RMT][Munkhdalai2024InfiniAttention]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[3].stancegpt-5.2

    [Camp D: bypassing RoPE/attention (SSM/external memory/retrie] Treat “long context” as a systems problem: use linear-time models (Mamba), recurrent/external memory (RMT, Infini-attention), or retrieval augmentation to avoid O(n^2) attention

contestedc-af4dd91c2d
This line holds that length generalization is determined mainly by positional modeling, so engineering effort should focus on RoPE scaling, interpolation, and related stabilization tricks.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[0].stancegpt-5.4

    [Camp A: explicit positional design is required, and RoPE ext] This camp treats length generalization as primarily a positional-modeling problem, so engineering effort should focus on RoPE scaling, interpolation, and related stabilization t

contestedc-5ef38c5699
This line treats long context as a training-systems problem: once kernels, model parallelism, and sequence parallelism are in place, window scaling is mostly a matter of engineering resources.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[1].stancegpt-5.4

    [Camp B: scaling to million tokens is mostly a parallelism pr] This camp treats long context mainly as a training-systems problem, arguing that once kernels, model parallelism, and sequence parallelism are in place, window scaling is mostly

contestedc-d7dc2f81ec
This line prefers simple, repeatable proxies, holding that needle tests and perplexity suffice to represent long-context ability, while complex task benchmarks are too affected by contamination, leakage, and implementation details.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[2].stancegpt-5.4

    [Camp C: NIAH/perplexity are sufficient; task benchmarks are ] This camp prefers simple, repeatable proxies and argues that needle tests and perplexity are sufficient for long-context ability, while complex task benchmarks are too affected

contestedc-628f6007fd
This line holds that full attention's quadratic cost is ultimately unsustainable, so the field should move to sparse attention, memory mechanisms, or other alternative architectures.
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[3].stancegpt-5.4

    [Camp D: sparse, memory, or linear-attention alternatives wil] This camp argues that the quadratic cost of full attention is ultimately unsustainable, so the field should move to sparse attention, memory mechanisms, or alternative architect

contestedc-37a274b1df
Argues for MoE as the mainstream scaling path: at similar training compute, trade larger total parameter counts for stronger capability; the key is templating load balancing, congestion handling, and systems implementation to lower the reproduction bar. Evidence usually comes from replicable large-scale MoE reports and open-source baselines.
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[0].stancegpt-5.2

    [Camp A: MoE becomes the default backbone (dense kept for sma] Argues MoE should be the mainstream scaling path: at similar training compute, use larger total parameters for stronger capability; the key is templating balancing, congestion c

contestedc-72d53f5e38
Argues for ledgering MoE gains by phase: the gains from converting a dense checkpoint to MoE saturate early, and the extra stabilization period and systems overhead are non-trivial; the steadier strategy is to keep training/scaling dense, or do MoE only within a clear ROI window.
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[1].stancegpt-5.2

    [Camp B: dense wins on full-lifecycle ROI (especially under u] Argues MoE gains must be ledgered by phase: dense→MoE conversion saturates early, and stabilization plus systems overhead is non-trivial; thus a steadier strategy is to keep tra

contestedc-28626f3247
Argues for confining MoE's instability and systems complexity to pretraining: post-training (SFT/RLHF/alignment) needs stable throughput and controllable optimization more, so switch back to dense, or decouple stages via "dense training + sparse inference".
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[3].stancegpt-5.2

    [Camp D: MoE is mainly for pretraining; post-training should ] Argues MoE instability and systems complexity should be confined to pretraining: post-training (SFT/RLHF/alignment) benefits more from stable throughput and controllable optimiz

contestedc-639eb6050f
This camp holds that once the network family is defined correctly and module scaling rules are completed, base hyperparameters should transfer stably across width and partly across depth. Empirical formulas are only locally valid while the recipe is frozen and cannot substitute for a correct parameterization [Yang2022muP][Cerebras2024CompleteP][Dey2025DontBeLazy].
Source papers · 3 [Yang2022muP][Cerebras2024CompleteP][Dey2025DontBeLazy]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[0].stancegpt-5.4

    [Camp A: Complete-P should be the default starting point, and] This camp argues that once the network family is specified correctly and module scaling rules are completed, base hyperparameters should transfer stably across width and partly

contestedc-a26be039ee
This camp emphasizes the practical constraint of a fixed recipe: as long as proxy runs share the target scale's distribution, joint scaling laws give good initial values for LR, batch size, and token:param ratio, after which 1-2 sweep rounds suffice [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][Hoffmann2022Chinchilla].
Source papers · 3 [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][Hoffmann2022Chinchilla]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[1].stancegpt-5.4

    [Camp B: empirical formulas plus small sweeps are enough; µP ] This camp emphasizes the reality of fixed recipes: if proxy runs match the target-scale distribution, joint scaling laws can provide good initial values for LR, batch, and token

contestedc-c17cdd00fb
This camp holds that since hyperparameter couplings in real training stacks are complex, it is better to let automatic optimizers or BO learn update rules or search strategies on proxy tasks, reducing hand-written transfer rules [Bernstein2023AGD][Chen2023SymbolicOptimizer].
Source papers · 2 [Bernstein2023AGD][Chen2023SymbolicOptimizer]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[2].stancegpt-5.4

    [Camp C: end-to-end BO / automatic optimization will replace ] This camp argues that since real training stacks have complex hyperparameter couplings, it is better to let automatic optimizers or BO learn update rules or search strategies on

contestedc-bdb7a65e9e
This camp holds that many "LR does not transfer" phenomena are really cases where wd, β₂, norm control, and related variables were not modeled jointly. Under AdamW in particular, wd is an independent axis, and discussing LR transfer with wd held fixed often yields misleading conclusions [Kosson2025WDMoreThanMuP][Wang2024AdamWWD][Loshchilov2017AdamW].
Source papers · 3 [Kosson2025WDMoreThanMuP][Wang2024AdamWWD][Loshchilov2017AdamW]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[3].stancegpt-5.4

    [Camp D: transfer error is dominated by non-transferable hype] This camp argues that many cases of 'LR not transferring' are actually caused by failing to model wd, β₂, norm control, and related variables jointly. Under AdamW in particular,

contestedc-7e5770c506
This line treats repetition in web corpora as low-value tokens; the first objective is removing exact, near-exact, and long-substring duplicates in exchange for higher token efficiency, lower leakage, and lower memorization [Lee2021Dedup]. Stronger versions also govern semantic near-duplicates [SemDeDup2023][D42023]. (Sketch below.)
Source papers · 2 [Lee2021Dedup][SemDeDup2023]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[0].stancegpt-5.4

    [Camp A: Dedup as much as possible; repetition is mostly nois] This line treats repetition in web corpora as low-value tokens. The first objective is to remove exact, near-exact, and long-substring duplication to improve token efficiency wh
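
A minimal Jaccard-over-shingles sketch of near-duplicate detection, i.e. the comparison that MinHash-based dedup pipelines approximate (without the LSH machinery that makes it scale):

```python
def shingles(text: str, n: int = 3) -> set:
    toks = text.split()
    return {" ".join(toks[i:i + n]) for i in range(max(len(toks) - n + 1, 1))}

def jaccard(a: str, b: str) -> float:
    sa, sb = shingles(a), shingles(b)
    return len(sa & sb) / len(sa | sb)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog by the river bank"
print(jaccard(doc1, doc2))   # ≈ 0.57: one-word edit -> likely near-duplicate
print(jaccard(doc1, "totally different content about gpu kernels"))  # 0.0
```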

contestedc-2c9c18bcfc
This line is based on data-constrained training experience: when unique tokens run short, the first few uniform repeat epochs still deliver returns close to fresh tokens, so rather than stopping training, take a few more passes over the high-quality pool [Muennighoff2023DataConstrained].
Source papers · 1 [Muennighoff2023DataConstrained]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[1].stancegpt-5.4

    [Camp B: Uniform repetition up to about 4 epochs is close to ] This line is based on data-constrained training results: when unique tokens are insufficient, the first few uniform passes can still deliver returns close to fresh tokens, so it

contestedc-72e48e1087
This line holds that at web scale the bulk of redundancy is not verbatim duplication but semantic near-duplicates, templated rewrites, and same-cluster content, so embedding similarity, clustering, and diversified sampling are closer to where the final gains lie [SemDeDup2023][D42023].
Source papers · 1 [SemDeDup2023]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[2].stancegpt-5.4

    [Camp C: The real battleground is semantic dedup; exact dedup] This line argues that at web scale, the larger redundancy source is not verbatim duplication but semantic near-duplicates, templated rewrites, and same-cluster content, so embed

contestedc-49eb5339b8
This line treats repetition as a compliance and leakage risk rather than a token-efficiency issue. Benchmark contamination, PII leakage, and copyrighted-text memorization are all better managed by exposure count than by mixing ratio [Deng2023BenchmarkContamination][Carlini2022Memorization].
Source papers · 1 [Carlini2022Memorization]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[3].stancegpt-5.4

    [Camp D: For sensitive/eval/copyrighted data, zero repetition] This line treats repetition as a compliance and leakage risk rather than a token-efficiency issue. Benchmark contamination, PII leakage, and copyrighted-text memorization are be

contestedc-1d436af87d
As long as the training recipe is stable, validation loss/PPL scaling is stable enough to carry budgeting, model selection, and even most pre-release comparisons; downstream evaluation is confirmation more than a driver.
1 observation · Perplexity Downstream Performanc
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[0].stancegpt-5.2

    [Camp A: PPL remains the most reliable primary variable (at l] With a stable training recipe, validation loss/PPL scaling is stable enough to drive budgeting, model selection, and even most pre-release comparisons; downstream evaluation is

contestedc-f0140d21d9
Use PPL in-loop for process control and budgeting; real product decisions (train longer / overtrain / compress / ship) should be driven by per-task curves and extrapolation error.
1 observation · Perplexity Downstream Performanc
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[1].stancegpt-5.2

    [Camp B: PPL is stage-1 only; stage-2 must use per-task scali] Use PPL in-loop for process and budgeting; product decisions (train more/overtrain/compress/ship) should be driven by per-task curves and extrapolation error.

contestedc-32185a9894
LM quality is multi-dimensional: capability, robustness, safety, bias, and privacy cannot be covered by one scalar; evaluation should be institutionalized as standard scenarios plus a multi-metric panel, with PPL just one in-training signal among them.
1 observation · Perplexity Downstream Performanc
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[2].stancegpt-5.2

    [Camp C: stop searching for a scalar; define quality via stan] LM quality is multi-dimensional: capability, robustness, safety, bias, privacy cannot be summarized by one scalar; evaluation should be institutionalized as standardized scenari

contestedc-ba720fcb45
Many product goals (dialog safety, instruction following, avoiding repetition, functional correctness) are not natural outcomes of MLE; change the objective function (preference optimization, unlikelihood, mixed objectives) instead of agonizing over the PPL-downstream correlation.
1 observation · Perplexity Downstream Performanc
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[3].stancegpt-5.2

    [Camp D: the issue is ontological—next-token loss is not the ] Many product goals (dialog safety, instruction following, avoiding repetition, functional correctness) are not natural outcomes of MLE; change the objective (preference optimiza

contestedc-84e519f2fb
The training length distribution should cover the target window from the start, so the model does not bake short-length statistics in as the default; mixed sampling makes the model more length-robust and reduces late-stage mid-train distribution drift.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[1].stancegpt-5.2

    [Camp B: always-mixed lengths (anti-curriculum)] The length distribution should cover the target window from the start to avoid baking short-length statistics as the default; mixed sampling improves length robustness and reduces distributio

contestedc-d569237511
Aligns the training distribution with inference-time context stitching: concatenation accustoms the model to cross-paragraph/cross-document conditioning during pretraining, moving it closer to in-context learning and long-form generation.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[2].stancegpt-5.2

    [Camp C: cross-doc visibility by default (beyond-boundaries c] Align training distribution with inference-time context stitching: concatenation trains the model to condition across paragraphs/documents during pretraining, better matching in

contestedc-282af9c7ba
Pure causal LM is not the only reasonable default: mixed objectives (including infilling/span corruption) cover more task shapes and reduce the "continuation-only" bias; FIM should spread from code to general corpora.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[3].stancegpt-5.2

    [Camp D: FIM/denoising-style objectives as default (infilling] Pure causal LM is not the only reasonable default: mixed objectives (including infilling/span corruption) cover more task shapes and reduce the “only continuation” bias; FIM sho

contestedc-874e257b01
Argues for replacing attention entirely with stronger SSM parameterizations and kernels: constant-memory inference, high training throughput via scan/matmul optimizations; the main long-context tension is operator cost, not modeling capacity.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[0].stancegpt-5.2

    [Camp A: Pure SSM will be the endgame for long context] Argues stronger SSM parameterizations and kernels will fully replace attention: constant-memory inference and high training throughput via scan/matmul optimizations; the main long-cont

contestedc-4c9e72971e
Treats attention as a scarce resource: a few attention layers provide discrete retrieval/routing while the remaining layers use Mamba/SSM for linear throughput; the layer ratio is tuned to context length and throughput targets.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[1].stancegpt-5.2

    [Camp B: Hybrids are the production default (tunable 1:3 to 1] Treat attention as a scarce resource: keep a few attention layers for discrete retrieval/routing and use Mamba/SSM elsewhere for linear throughput; tune ratios by context length

contestedc-d6cec38874
Just push cost down with sparse/windowed/approximate attention and positional-encoding retrofits; attention's addressable interaction is the core capability and should not be weakened.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[2].stancegpt-5.2

    [Camp C: No architecture change—Transformer + long-context ex] Reduce cost via sparse/windowed/approximate attention and positional encoding; attention’s addressable interaction is the core capability and should not be weakened.

contestedc-9ce3bc342e
Recurrence offers the cleanest deployment advantage: constant memory and native streaming; training can be parallelized, avoiding the classic RNN training bottleneck. With better gating and parameterization, quality will catch up.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[3].stancegpt-5.2

    [Camp D: RWKV/linear RNN is the correct RNN revival path] Recurrence offers the cleanest deployment advantage: constant-memory and native streaming; training can be parallelized to avoid classic RNN bottlenecks. With better gating/parameter

contestedc-20bfca5fd2
The default choice should minimize total cost: fewer failure modes, transferable tuning, and mature tooling are worth more than pointwise optimality. In ≥70B production especially, the cost of 1-2 extra sweep rounds usually exceeds a 3-5% token saving.
2 observations · Optimizer Landscape
Evidence (2)
  • topic_reportoptimizer-landscape· report.positions[0].stancegpt-5.2

    [Camp A: AdamW won’t be retired (highest default priority)] Default choice should minimize total cost: fewer failure modes, transferable tuning, and mature tooling matter more than pointwise wins. In ≥70B production, the cost of 1–2 extra s

  • topic_reportoptimizer-landscape· report.positions[0].stancegpt-5.2

    [Camp A: AdamW won’t be retired (highest default priority)] Default choice should minimize total cost: reusable recipes, predictable failure modes, and transferable tuning matter more than pointwise optimality. For ≥70B, get μP/LR transfer

contestedc-49018f42af
Tensor-level preconditioning is closer to geometric correctness; SOAP compresses Shampoo's stability and tuning burden to an acceptable range, leaving mainly implementation and at-scale evidence questions. The long-term default should migrate to SOAP-style second-order methods.
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.positions[2].stancegpt-5.2

    [Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Tensor-level preconditioning is closer to the right geometry; SOAP reduces Shampoo’s stability and tuning burden to an acceptable level, leaving mainly implementation and scalin

contestedc-29914e77ac
Many "faster/better" conclusions come from mismatched schedule families, unequal tuning budgets, or inconsistent evaluation metrics. Once protocols are aligned, optimizer gaps shrink markedly; engineering effort is better spent on recipes and evaluation standardization.
2 observations · Optimizer Landscape
Evidence (2)
  • topic_reportoptimizer-landscape· report.positions[3].stancegpt-5.2

    [Camp D: optimizers matter less; many gains are evaluation ar] Many “faster/better” claims come from mismatched schedule families, unequal tuning budgets, or inconsistent evaluation metrics. Once protocols are aligned, optimizer gaps shrink

  • topic_reportoptimizer-landscape· report.positions[3].stancegpt-5.2

    [Camp D: optimizers matter less; many gains are evaluation ar] With fair tuning (fixed budgets) and aligned schedules, many optimizer gaps vanish or shrink sharply; effort is better spent on standardized protocols, robust schedules, and reu

contestedc-be5f0085b8
Argues for treating attention as legacy baggage: with stronger SSM parameterizations, gating, and larger-scale training, pure recurrence will catch up on mainstream LM metrics while winning deployment via linear-time inference and a smaller KV cache [Gu2023Mamba][Peng2023RWKV][Gu2021S4]. (Sketch below.)
Source papers · 3 [Gu2023Mamba][Peng2023RWKV][Gu2021S4]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[0].stancegpt-5.2

    [Camp A: Pure SSM/RWKV will eventually replace Transformers] Treat attention as legacy overhead: with stronger SSM parameterizations, gating, and scaling, pure recurrence will match mainstream LM metrics while winning deployment via linear-
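
The deployment half of this claim is mechanical enough to sketch: a (diagonal) linear recurrence carries a fixed-size state no matter how long the sequence, versus a KV cache that grows with it (dimensions and decay values are illustrative):

```python
import numpy as np

d_state = 16
A = np.full(d_state, 0.95)          # per-channel decay (illustrative)
B = np.random.randn(d_state)        # input projection
C = np.random.randn(d_state)        # output projection

h = np.zeros(d_state)               # the *entire* memory, vs a growing KV cache
for x_t in np.random.randn(100_000):
    h = A * h + B * x_t             # O(d_state) work per token, any length
    y_t = C @ h                     # streaming output
print(h.shape)                      # (16,) even after 100k tokens
```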

contestedc-eef2f40e26
Emphasizes operator-level equivalence: attention can be written as structured recurrence and linear attention as an RNN; "attention vs SSM" is therefore mostly a question of implementation detail and kernel optimization [Dao2024TransformersAreSSMs][Yang2023GLA][Sun2023RetNet].
Source papers · 3 [Dao2024TransformersAreSSMs][Yang2023GLA][Sun2023RetNet]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[1].stancegpt-5.2

    [Camp B: Linear attention and SSMs are the same class and mut] Emphasize operator-level equivalence: attention can be rewritten as structured recurrence, linear attention as an RNN; therefore choosing attention vs SSM is mostly an implement

contestedc-adb123b335
Holds that distillation introduces uncontrolled bias: the student backbone has different inductive biases, so transferred behaviors may be brittle approximations; pretrain from scratch instead, letting the model form stable capabilities under its own operator constraints [Wang2022PretrainingWithoutAttention].
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[2].stancegpt-5.2

    [Camp C: Subquadratic models must be pretrained from scratch;] Argue distillation introduces uncontrolled bias: the student backbone has different inductive biases, transferred behaviors may be brittle approximations; therefore pretrain fro

contestedc-6596e53ee5
Treats attention as a "precise addressing module" and recurrence/SSMs as "compression and routing modules": a few attention layers backstop copying/retrieval while most layers are linearized for throughput and memory gains [Lieber2024Jamba][De2024Griffin].
Source papers · 2 [Lieber2024Jamba][De2024Griffin]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[3].stancegpt-5.2

    [Camp D: The engineering optimum is hybrid; minimize attentio] Treat attention as an exact-addressing module and recurrence/SSMs as compression/routing modules; backstop copy/retrieval with a few attention layers while linearizing most laye

contestedc-a34e28d5d3
The core claim is that joint power laws are stable enough for compute-optimal recipes to be extrapolated from modest sweeps, and that under fixed compute the budget should go to larger models rather than more tokens [Kaplan2020ScalingLaws][Henighan2020AutoregressiveScaling].
Source papers · 1 [Kaplan2020ScalingLaws]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[0].stancegpt-5.4

    [Camp A: Kaplan-style — portable exponents, fixed-compute sho] The core claim is that joint power laws are stable enough that compute-optimal recipes can be extrapolated from modest sweeps, and that under fixed compute, budget should tilt t

contestedc-e2361a4007
The core claim is that many large models were simply undertrained; under fixed compute, smaller models with more tokens are the better deal, with the empirical center around tokens/param ≈ 20 [Hoffmann2022Chinchilla][Touvron2023LLaMA].
Source papers · 2 [Hoffmann2022Chinchilla][Touvron2023LLaMA]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[1].stancegpt-5.4

    [Camp B: Chinchilla-style — balance N and D under fixed compu] The core claim is that many large models were simply not trained long enough; under fixed compute, smaller models with more tokens are more efficient, with an empirical center a

contestedc-f7286a59b8
The core claim is that gains from data filtering, dedup, freshness, and mixture ratios are often no smaller than one model-scale upgrade; budget allocation should therefore run the data Pareto frontier first, before discussing the model optimum [Li2024DCLM][Penedo2023RefinedWeb][Albalak2024DataSelectionSurvey].
Source papers · 3 [Li2024DCLM][Penedo2023RefinedWeb][Albalak2024DataSelectionSurvey]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[2].stancegpt-5.4

    [Camp C: Data-mixture pragmatists — data is the first axis, g] The core claim is that gains from data filtering, deduplication, freshness, and mixture can be as large as a model-scale upgrade; budget allocation should therefore run a data P

contestedc-2017e034b2
The core claim is that capability growth is not always a mysterious inflection; many seemingly sudden effects come from discrete metrics, thresholded tasks, and evaluation noise, most visibly in single-task scores [Gadre2024OverTraining][Bhagia2024TaskLadders][Xu2024SelfBias].
Source papers · 3 [Gadre2024OverTraining][Bhagia2024TaskLadders][Xu2024SelfBias]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[3].stancegpt-5.4

    [Camp D: Against emergence-as-magic — many “emergent” effects] The core claim is that capability growth is not always a mysterious phase change; many apparently sudden effects come from discrete metrics, thresholded tasks, and evaluation no

contestedc-90f8f3ed32
Function-problem pass@k is simple, reproducible, and cheap; HumanEval/MBPP can serve as the primary metric during model iteration, while additional benchmarks mostly add noise and evaluation overhead. [Chen2021Codex][Austin2021MBPP][CodeLlama2023]
Source papers · 3 [Chen2021Codex][Austin2021MBPP][CodeLlama2023]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[0].stancegpt-5.2

    [Camp A: HumanEval/MBPP is sufficient to represent coding abi] Function-style pass@k is simple, reproducible, and cheap; HumanEval/MBPP can be the primary metric for model iteration, while additional benchmarks mostly add noise and evaluati

contestedc-b6d1926f11
Likelihood metrics are cheap and compare stably across large sweeps; if patch-PPL drops, the model is better at generating patches that "look like real fixes", so downstream issue resolving should strengthen too. (For now this is more engineering intuition than a systematically validated conclusion.)
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[2].stancegpt-5.2

    [Camp C: patch-PPL/code BPB in pretraining is the best predic] Likelihood metrics are cheap and stable for large sweeps; if patch-PPL drops, the model is better at producing “realistic fixes”, so downstream issue resolving should improve to

contestedc-67c1665a55
Users care about solving the problem within budget: retry count, execution success rate, token cost, and whether trajectories are readable and controllable; pass@1 or a single Verified score cannot reflect these constraints. [OpenCodeInterpreter2024][SelfRepair2023]
Source papers · 2 [OpenCodeInterpreter2024][SelfRepair2023]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[3].stancegpt-5.2

    [Camp D: deployment UX metrics reflect SWE-agent value better] Users care about solving within budget: retries, execution success rate, token cost, and whether trajectories are readable/controllable; pass@1 or a single Verified score does n

contestedc-1c1dbce970
Treats SWE as an inference-time systems-design problem: retrieval, tool calls, hierarchical debugging, multi-agent collaboration, and more test-time compute can cover most real engineering tasks without changing the base model.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretraining· report.positions[0].stancegpt-5.2

    [Camp A: scaffolding and test-time compute are everything] Treats SWE as an inference-system design problem: retrieval, tool use, hierarchical debugging, multi-agent collaboration, and more test-time compute can cover most real engineering

contestedc-3642c331b5
Treats SWE as verifiable sequential decision-making: with tests/rules/static analysis as verifiers, RL or preference optimization can push the model to higher pass rates; pretraining only needs to supply basic syntax and common sense.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretraining· report.positions[1].stancegpt-5.2

    [Camp B: RL and verifiers are the true drivers] Treats SWE as verifiable sequential decision making: with tests/rules/static analyzers as verifiers, RL or preference optimization can push pass rates higher; pretraining only needs basic synt

contestedc-a671e52cec
Approximates SWE ability as code language-modeling ability: with enough code tokens and broad enough language coverage, the model naturally learns repair and engineering skills; extra process data and environment signals are icing on the cake.
2 observations · Swe Agent Pretraining
Evidence (2)
  • topic_reportswe-agent-pretraining· report.positions[2].stancegpt-5.2

    [Camp C: just mix more code (code is all you need)] Approximates SWE capability as code language modeling: with enough code tokens and broad language coverage, the model will naturally learn repair and engineering; process data and environm

  • topic_reportswe-agent-pretraining· report.positions[2].stanceep-20260214160829-csjmc

    [Camp C: just mix more code (code is all you need)] Code capability follows the scaling law. As long as enough code tokens are trained, the model will naturally emerge software engineering capabilities, and there is no need to additionally

contestedc-62bb547ba4
Treats SWE ability as a training-distribution-matching problem: first make repo-level co-occurrence, patch insertion, commit/PR processes, and tests/CI/trace signals high-frequency training events during pretrain/mid-training; scaffolding and RL then make pass rates stable and high.
2 observations · Swe Agent Pretraining
Evidence (2)
  • topic_reportswe-agent-pretraining· report.positions[3].stancegpt-5.2

    [Camp D: data shape first (repo/patch/process/execution first] Treats SWE capability as distribution matching: during pretrain/mid-training, make repo-level co-occurrence, patch insertion, commit/PR processes, and tests/CI/traces frequent t

  • topic_reportswe-agent-pretraining· report.positions[3].stanceep-20260214160829-csjmc

    [Camp D: data shape first (repo/patch/process/execution first] The distribution of SWE tasks should be matched during the pretrain/mid-training stage: repo-level co-occurrence, patch insertion, development process data, executable feedback,

contestedc-6f5fd17550
When real high-quality data is constrained, synthetic data (textbooks, synthetic exercises, seed→evol) can substitute for scarce data and drive capability growth; the key is making synthetic data "more like a learnable curriculum" rather than more web noise [Textbooks2023][Phi15Report2023][DataConstrained2023][TinyStories2023].
Source papers · 4 [Textbooks2023][Phi15Report2023][DataConstrained2023][TinyStories2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[0].stancegpt-5.2

    [Camp A: synthetic-first can be a primary route (especially u] When high-quality real data is constrained, synthetic (textbooks, synthetic exercises, seed→evol) can substitute for scarce data and drive capability growth; the key is making s

contestedc-18497c253e
Learn general capability with a web-heavy backbone first, then use mid-train to pull the distribution toward code/math/reasoning/long-context; synthetic data's role is shaping and coverage completion, not replacing the real distribution [Llama3Herd][Phi3Report][Llemma2023][LongContextScaling2023].
Source papers · 2 [Llemma2023][LongContextScaling2023]
2 observations · Synthetic Data Midtrain
Evidence (2)
  • topic_reportsynthetic-data-midtrain· report.positions[1].stancegpt-5.2

    [Camp B: web-heavy backbone + (real/synthetic) mid-train is t] Train general capability with a web-heavy backbone, then use mid-train to pull toward code/math/reasoning/long-context; synthetic serves shaping and coverage completion rather t

  • topic_reportsynthetic-data-midtrain· report.positions[1].stancegpt-5.2

    [Camp B: web-heavy backbone + (real/synthetic) mid-train (a m] Learn broad coverage and tails with a web-heavy backbone, then use mid-train to pull toward target-domain distributions; synthetic is mainly for shaping and coverage fill, not r

contestedc-ab64f13666
Synthetic data introduces teacher bias and style-contraction risk; rather than generating, make web data cleaner, more deduplicated, and better covering of the target domains. RefinedWeb, SemDeDup, and CCNet show that strong filtering alone yields substantial gains [RefinedWeb2023][SemDeDup2023][CCNet2019].
Source papers · 3 [RefinedWeb2023][SemDeDup2023][CCNet2019]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[2].stancegpt-5.2

    [Camp C: avoid synthetic as much as possible; stronger filter] Synthetic introduces teacher bias and style contraction; instead of generating, make web data cleaner, more deduplicated, and better domain-covered. RefinedWeb, SemDeDup, and CC

contestedc-588cd52729
As long as generation quality is high enough, synthetic data can keep scaling and substitute for the real-data bottleneck; collapse is not the primary constraint.
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[3].stancegpt-5.2

    [Camp D: synthetic scales almost without bound; collapse is m] If generation quality is high enough, synthetic can keep scaling and substitute for real-data bottlenecks; collapse is not a primary constraint.

contestedc-a8061f5916
Argues for treating the tokenizer as a one-shot engineering choice: with low OOV and reasonable average token length, it should not be changed often; training and evaluation focus on per-token PPL and downstream tasks, leaving systems optimization to batching/parallelism/quantization.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.2

    [Camp A: tokenizer is frozen preprocessing; coverage is enoug] Treat tokenization as a one-shot engineering choice: if OOV is low and average token length is reasonable, avoid changing it; focus on per-token PPL and downstream tasks, and le

contestedc-4d4f9aae83
Holds that a larger vocab brings shorter sequences and fewer cross-token dependencies, improving quality and throughput together; 256K+ should therefore be the default, with stronger embedding/softmax engineering offsetting the parameter and bandwidth cost.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.2

    [Camp B: bigger vocab is always better; default to 256K+] Argues larger vocabs shorten sequences and reduce cross-token dependencies, improving both quality and throughput; therefore default to 256K+ and rely on embedding/softmax engineerin

contestedc-6474a46c77
Advocates modeling directly on bytes/characters, avoiding BPE's ambiguous encodings, tail tokens, and number/date fragmentation; the systems cost is traded for stronger models and longer training.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].stancegpt-5.2

    [Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Advocates byte/character modeling to avoid BPE ambiguity, tail tokens, and number/date fragmentation; accept higher systems cost via larger models and longer training.

contestedc-ad3cb945c6
Holds that a smaller vocabulary reduces rare tokens, lowers tail-token risk, and makes credit assignment in RL/alignment stages more stable, even at the price of longer sequences and higher inference cost.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].stancegpt-5.2

    [Camp E: shrink/prune the vocabulary to buy alignment and RL ] Argues smaller vocabs reduce rare tokens and tail-token risk, and make RL/alignment credit assignment more stable; accept longer sequences and higher inference cost.

contestedc-5c0fcb7394
The main gains come from parameters, data, and compute; architecture details are mostly constant factors, and investing in new attention variants usually returns less than further expanding the training budget.
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvements· report.positions[0].stancegpt-5.2

    [Camp A: architecture is mostly done; keep scaling from scrat] Most gains come from parameters, data, and compute; architecture details are mostly constant factors, and investing in new attention variants usually returns less than increasin

contestedc-e1b5ba45a3
Attention's quadratic complexity and the KV cache are structural bottlenecks; switch to retention/SSM/recurrent structures and rewrite parallelism and inference cost at the root.
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvements· report.positions[1].stancegpt-5.2

    [Camp B: the next backbone should move to SSM / RetNet / Mamb] Attention’s quadratic behavior and KV-cache are structural bottlenecks; switching to retention/SSM/recurrence rewrites parallelism and inference cost at the root.

contestedc-0d60ef9d27
With a stable base in hand, growing depth/blocks plus continued pretraining inherits representations and optimizer state, cutting the repeated cost of training from scratch; this suits domain continued pretraining and fast iteration especially well.
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvements· report.positions[2].stancegpt-5.2

    [Camp C: a second scaling path should be default—grow first, ] With a stable base, growing depth/blocks and continued pretraining can inherit representations and optimization state, reducing repeated from-scratch cost; especially suitable f

contestedc-dd81de9176
Training instability can mostly be solved by tuning learning rate, batch size, and optimizer; extra norm components add complexity and potential distribution shift, and are not worth making the default.
2 observations · Transformer Arch Improvements
Evidence (2)
  • topic_reporttransformer-arch-improvements· report.positions[3].stancegpt-5.2

    [Camp D: QK-Norm / sandwich norm are optional details] Instability can be handled by learning rate, batch size, and optimizer tuning; extra norm components add complexity and potential distribution shifts, so they should not be defaulted.

  • topic_reporttransformer-arch-improvements· report.positions[3].stancegpt-5.2

    [Camp D: stability is mostly LR/optimizer/data; QK-Norm/sandw] Instability can mostly be handled via learning rate, warmup, initialization, clipping, and data cleaning; extra norm components add complexity and potential distribution shift,

contestedc-0e06feed14
If RoPE extrapolation (PI/YaRN/related scalings) is done right, a short-context model can be extended into a long-context one; extra long data and packing are merely icing, costly and with unstable gains (cf. claims c-2218c6a6ff, c-6a2e99f979, c-435bd5ac5f).
2 observations · Context Scaling Pretrain
Evidence (2)
  • topic_reportcontext-scaling-pretrain· report.positions[0].stancegpt-5.2

    [Camp A: PE/extrapolation is enough; long context is mainly a] If RoPE extrapolation (PI/YaRN/related scaling) is done correctly, short-context models can be extended to long context; extra long-data training and packing are secondary, cost

  • topic_reportcontext-scaling-pretrain· report.positions[0].stancegpt-5.4

    [Camp A: PE / extrapolation is enough; long context is mainly] Representative work argues that if RoPE base, interpolation, scaling, or position bias is designed correctly, short-context models can extrapolate to long context; extra long-da

contestedc-b51a8309a9
With the same data pool and PE, related-document clustered packing, low-truncation packing, and explicit separator tokens lift long-context ability another notch; the reason is a higher density of repetition/alignment events, which triggers induction-head-like circuits more often (cf. claims c-8202803d9b, c-2d04dd042e, c-7ff7c79275, c-b277f220e9, c-8376f2d76a).
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[2].stancegpt-5.2

    [Camp C: packing/concatenation is underused; sequence constru] With the same data pool and PE, related-doc clustered packing, low-truncation packing, and explicit separators can further lift long-context ability; the mechanism is increasing

contestedc-d1a2ac180d
As long as there is no obvious OOV/garbling, tokenizer effects show up mainly in I/O and token count; the main training cycle should spend its budget on model scale, data quality, and training recipe, and tokenizer changes are not worth being a primary variable.
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.2

    [Camp A: tokenizer is frozen preprocessing; coverage is enoug] As long as there is no obvious OOV/garbling, tokenization mostly affects I/O and token count; the main training budget should go to model scale, data quality, and recipe, and to

  • topic_reporttokenizer-scaling· report.positions[0].stancegpt-5.2

    [Camp A: tokenizer is frozen preprocessing; coverage is enoug] Spend budget on parameters/data/training recipe; as long as there is no OOV/garbling, tokenization should not be a primary variable in the main training cycle. In practice, team

contestedc-9a4fe206bb
A larger vocab brings higher compression and shorter sequences, saving attention; the extra embedding/softmax cost is relatively negligible, so keep expanding from 128K to 256K or beyond, treating compression ratio as the primary objective. (Sketch below.)
3 observations · ↳ spawned: vocab-scaling-128k-to-256k-tradeoffs · Tokenizer Scaling
Evidence (3)
  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.2

    [Camp B: bigger vocab is near-monotonic; default to 256K+] Larger vocabs increase compression and shorten sequences, saving attention; embedding/softmax overhead is relatively small, so one should keep expanding from 128K to 256K+ and treat

  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.2

    [Camp B: bigger vocab is near-monotonic; default to 256K+] Larger vocab shortens sequences, saving attention and making long context cheaper; embedding/softmax overhead is relatively small, so the default should move from 128K to 256K+.

  • topic_reporttokenizer-scaling· report.positions[1].stancegpt-5.2

    [Camp B: bigger vocab is near-monotonic; default to 256K+] Treat larger vocab as “cheap compression”: shorter sequences reduce attention cost and make long-context cheaper; embedding/softmax overhead is relatively small. It is especially be
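
A back-of-the-envelope sketch of the 128K→256K trade this camp is making; the compression gain (~6% here) is an assumed input, while the embedding-parameter cost follows directly from d_model:

```python
# Compare the exact extra embedding params against an assumed sequence
# shortening; all sizes are illustrative, not a specific model's config.
def vocab_tradeoff(d_model=8192, v_old=128_000, v_new=256_000,
                   tokens_per_doc_old=1000, compression_gain=0.06):
    extra_params = 2 * (v_new - v_old) * d_model     # input + output embeddings
    tokens_per_doc_new = tokens_per_doc_old * (1 - compression_gain)
    return extra_params, tokens_per_doc_new

extra, new_len = vocab_tradeoff()
print(f"+{extra/1e9:.2f}B embedding params for {new_len:.0f} tokens/doc vs 1000")
# Whether ~6% shorter sequences (quadratically cheaper attention) pays for
# ~2.1B extra params is exactly where this camp and Camp E disagree.
```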

contestedc-c6b18c0a19
Pretraining may prefer a large vocabulary, but post-training (RLHF/DPO) cares more about policy consistency and controllability; proactively prune the tail and split heavy-domain long merges to reduce under-trained tokens and non-unique encodings, accepting longer sequences in exchange for stability.
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].stancegpt-5.2

    [Camp E: shrink/prune the vocabulary to buy alignment and RL ] Pretraining may prefer large vocabs, but post-training (RLHF/DPO) cares more about policy consistency and controllability; proactively prune tails and split heavy-domain long me

contestedc-923c649011
Treats TP/EP/PP/DP(FSDP)/CP as hierarchical placement of communication primitives: high-frequency traffic (TP/EP) pinned to NVLink, PP across IB, DP/FSDP across pods; under long context CP joins the mesh, and schedules (zero-bubble) plus kernel alignment push MFU into a predictable range.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[0].stancegpt-5.2

    [Camp A: hand-tuned 4D with topology-aware mapping (Megatron/] Treat TP/EP/PP/DP(FSDP)/CP as hierarchical placement of communication primitives: high-frequency (TP/EP) pinned to NVLink, PP over IB, DP/FSDP across pods; under long context CP

contestedc-e73d6190e0
Use high-level attention programming models and IRs (e.g., MLIR) to generate optimized kernels, reducing dependence on a small set of CUDA experts; algorithm teams focus on models and data, with compilers and autotuning backstopping performance.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.positions[1].stance · gpt-5.2

    [Camp B: PyTorch/graph-compiler level is sufficient; handwritten kernels should be the exception] Use high-level attention programming models and IRs (e.g., MLIR) to generate optimized kernels, reducing dependence on a small set of CUDA experts; algorithm teams focus on models and data, with performance backstopped by compilers and autotuning.

contestedc-d5809c287c
Each hardware generation's bandwidth and FLOPs gains naturally absorb most bottlenecks; keeping standard transformer primitives and BF16 training is cheaper in engineering terms, while structural complexity and deep kernel optimization yield unstable returns.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.positions[2].stance · gpt-5.2

    [Camp C: hardware will get faster; dense MHA + BF16 need not change] Per-generation bandwidth and FLOPs gains will naturally absorb most bottlenecks; keeping standard transformer primitives and BF16 training reduces engineering cost, while structural complexity and deep kernel optimization deliver unstable returns.

contestedc-f06977ec5c
Hardware diversification (TPU/ROCm/custom accelerators) will make portable operator expressions and compiler routes more important; teams should not bind their algorithms to CUDA-specific features, to avoid lock-in to a single ecosystem.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.positions[3].stance · gpt-5.2

    [Camp D: non-NVIDIA hardware will catch up; CUDA will stop being the moat] Hardware diversification (TPU/ROCm/custom accelerators) makes portable operator expressions and compiler routes more important; teams should avoid binding algorithms to CUDA-specific features and single-ecosystem lock-in.

contestedc-d9ff2ac7cc
Long-run cost comes from variant explosion and cross-hardware deployment, and the maintenance window for hand-written CUDA keeps shrinking. FlexAttention [Dong2024Flex] lifts variant expression up into score_mod/mask_mod and generates fused kernels via torch.compile; more general compiler routes (TVM/Ansor/Mirage) show that high performance need not depend on hand-written kernels [Chen2018TVM][Zheng2020Ansor][Wu2024Mirage].
Source papers · 5 · [Dong2024Flex][Chen2018TVM][Zheng2020Ansor][Wu2024Mirage] · arXiv 2412.05496 (arxiv.org)
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2412.05496 · report.positions[1].stance · gpt-5.2

    [Camp B: the mainline should move from hand-written CUDA to Triton and compiler stacks] Long-run cost comes from variant explosion and cross-hardware deployment; the maintenance window of hand-written CUDA keeps shrinking. FlexAttention [Dong2024Flex] lifts variant expression into score_mod/mask_mod with fused kernels generated by torch.compile; general compiler routes (TVM/Ansor/Mirage) show high performance without hand-written kernels.

contestedc-c216ece221
Dense attention's O(L^2) complexity and KV-cache cost keep worsening as context grows, so linear/sparse/recurrent structures (e.g., Longformer [Beltagy2020Longformer], Reformer [Kitaev2020Reformer], gated linear attention [Yang2023GatedLinearAttention]) should be preferred to cut complexity at the root, rather than squeezing marginal gains out of kernels.
Source papers · 4 · [Beltagy2020Longformer][Kitaev2020Reformer][Yang2023GatedLinearAttention] · arXiv 2004.05150 (arxiv.org)
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2004.05150 · report.positions[2].stance · gpt-5.2

    [Camp C: attention should be replaced by SSM/linear/sparse structures] Dense attention’s O(L^2) complexity and KV-cache cost worsen with context length, so linear/sparse/recurrent structures (e.g., Longformer [Beltagy2020Longformer], Reformer [Kitaev2020Reformer], gated linear attention [Yang2023GatedLinearAttention]) should cut complexity at the root instead of squeezing kernel margins.

contestedc-aed25d8639
Domain weights are first-order optimization variables and should be searched as systematically as learning rates: fit response surfaces with a fleet of proxy models (RegMix) or extrapolate with mixing laws (BiMix), and write "don't let any bucket collapse" into the objective via worst-case/Group DRO-style terms [RegMix2024][BiMix2024][Michel2021RobustMTL].
Source papers · 3 · [RegMix2024][BiMix2024][Michel2021RobustMTL]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[0].stance · gpt-5.2

    [Camp A: Formal search first (regression/laws/robust optimization)] Domain weights are first-order variables and should be searched systematically like learning rates: fit response surfaces on proxy ladders (RegMix) or extrapolate via mixing laws (BiMix), with worst-case/Group DRO-style objectives guarding against bucket collapse.
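
A minimal RegMix-flavored sketch of that search loop, with synthetic proxy data standing in for real proxy-model runs (the per-domain "cost" vector is invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.dirichlet(np.ones(3), size=32)             # 32 proxy mixtures, 3 domains
true_coef = np.array([2.5, 2.0, 2.9])              # hypothetical per-domain loss cost
loss = W @ true_coef + rng.normal(0, 0.01, 32)     # observed proxy losses

coef, *_ = np.linalg.lstsq(W, loss, rcond=None)    # fit the response surface

candidates = rng.dirichlet(np.ones(3), size=10_000)
best = candidates[np.argmin(candidates @ coef)]    # pick predicted-best mixture
print("predicted-best mixture:", best.round(3))
```

A real pipeline would fit on proxy-ladder runs and add the worst-case/Group DRO constraint; the linear surface here is the simplest stand-in.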

contestedc-ce3dfaf994
In real engineering the scarcest resources are stable bookkeeping and iteration speed, not algorithmic novelty. Public recipes (LLaMA, Gopher, Llama 3, OLMo) provide strong priors: go web-heavy first, then upsample code/math/reasoning in the final stretch; calibrating stage boundaries and multipliers with a few controlled experiments is more manageable than chasing a single whole-run fixed w* [Touvron2023LLaMA][Rae2021Gopher][MetaLlama32024][OLMo2024].
Source papers · 4 · [Touvron2023LLaMA][Rae2021Gopher][MetaLlama32024][OLMo2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[1].stance · gpt-5.2

    [Camp B: Heuristics + curricula are more robust; 2–3 ablations suffice] In real engineering, the scarce resource is stable accounting and iteration speed, not algorithm novelty. Public recipes (LLaMA, Gopher, Llama 3, OLMo) provide strong priors: web-heavy first, then upsample code/math/reasoning late; calibrating stage boundaries and multipliers with a few controlled runs beats chasing a fixed whole-run w*.

contestedc-0cba121c70
Mixture is a non-stationary problem: as training phases change, each bucket's marginal return shifts, so weights should be adapted while training. DoReMi learns weights via Group DRO, aiming to lift the worst domain and accelerate convergence; Irreducible Curriculum pushes selection granularity down to the sample level, organizing a curriculum around irreducible-loss signals [DoReMi2023][IrreducibleCurriculum2023].
Source papers · 2 · [DoReMi2023][IrreducibleCurriculum2023]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[2].stance · gpt-5.2

    [Camp C: Online/adaptive mixing beats one-shot offline ratios] Mixtures are non-stationary: training phases change marginal returns per bucket, so weights should adapt online. DoReMi learns weights via Group DRO to improve the worst domains and speed convergence; Irreducible Curriculum pushes selection down to the sample level via irreducible-loss signals.
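
A minimal sketch of the DoReMi-style online update, in the Group DRO spirit of upweighting domains with the largest excess loss over a reference model (step size and losses are toy values):

```python
import numpy as np

def dro_update(weights, proxy_loss, ref_loss, step=1.0):
    excess = np.maximum(proxy_loss - ref_loss, 0.0)   # per-domain excess loss
    logits = np.log(weights) + step * excess          # exponentiated-gradient step
    w = np.exp(logits - logits.max())
    return w / w.sum()

domains = ["web", "code", "math", "books"]
w = np.full(4, 0.25)
proxy = np.array([2.1, 2.8, 3.0, 2.2])   # hypothetical proxy-model losses
ref   = np.array([2.0, 2.4, 2.9, 2.2])   # hypothetical reference losses

print(dict(zip(domains, dro_update(w, proxy, ref).round(3))))  # code/math rise
```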

contestedc-52e263095c
Under fixed training FLOPs, MoE provides higher knowledge capacity through larger total parameter counts; with fine-grained+shared experts and aux-loss-free balancing templates maturing, training-incident rates have fallen to acceptable levels, so from-scratch pretraining increasingly goes straight to MoE.
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[0].stance · gpt-5.2

    [Camp A: MoE becomes the default backbone for frontier pretraining] Under fixed training FLOPs, MoE expands total parameters to increase knowledge capacity; with mature fine-grained+shared and aux-loss-free balancing templates, failure rates become acceptable, so from-scratch runs default to MoE.

contestedc-e67c4d7d02
Real organizations more often reuse existing dense weights and post-training assets; upcycling scaling laws show a ceiling on gains and a dependence on being token-rich, and systems overhead plus stabilization costs erode the sparsity benefit, so continuing to scale dense is the safer move.
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[1].stance · gpt-5.2

    [Camp B: dense wins on full-lifecycle ROI; upcycling makes Mo…] Real organizations more often reuse existing dense weights and post-training assets; upcycling scaling laws show ceilings and token-rich dependence, and systems overhead plus stabilization costs erode the sparse gains.

contestedc-60927a281a
The upper bound of long-context ability is set by RoPE phase coverage; whether it is actually usable is set by the training distribution. So align the base to the target window during pretraining/continual pretraining and mix long-sequence tokens in stages via short-to-long, avoiding the irreversible damage of inference-time extrapolation or light fine-tuning. [Dubey2024Llama3][Xu2024RoPEBaseBounds][Fu2024DataEngineering128K] The acceptance test for this route is the cross-length curve: every stage should show effective context growing with the length tier.
Source papers · 3 · [Dubey2024Llama3][Xu2024RoPEBaseBounds][Fu2024DataEngineering128K]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[0].stance · gpt-5.2

    [Camp A: native long context (base-aligned + curriculum/ABF) is the mainline] The ceiling is set by RoPE phase coverage, and usability is set by training distribution; therefore base should be aligned to the target window during pretraining/continual pretraining, with long-sequence tokens mixed in via a short-to-long curriculum.

contestedc-c713af57c7
The O(n^2) cost of long-sequence attention and the risk of RoPE extrapolation make both training and deployment hard to control; it is more reasonable to move history out of the main context via linear/sparse/recurrent/memory/compression methods, achieving train-short test-long or near-"infinite context". [Gu2023Mamba][Bulatov2023RMT][Mohtashami2023Landmark][Jiang2023LongLLMLingua] On extreme-length retrieval, recurrent memory has even been demonstrated on an 11M-token haystack.
Source papers · 4 · [Gu2023Mamba][Bulatov2023RMT][Mohtashami2023Landmark][Jiang2023LongLLMLingua]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[3].stance · gpt-5.2

    [Camp D: bypass RoPE/attention (SSM/external memory/retrieval)] Quadratic attention cost and RoPE extrapolation risk make training and deployment hard to control; a more reasonable approach is to use linear/sparse/recurrent/memory/compression to move history out of the main context.

contestedc-57fec2a5b5
First extend the window to 128K with PI/YaRN, then stabilize 1M+ with LongRoPE; as long as extrapolation is stable, the model will naturally learn to use the long context, and evaluation and data recipes are optimization details (cf. ledger c-8e5bfb202f, c-af4dd91c2d).
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.positions[0].stance · gpt-5.2

    [Camp A: positional extrapolation is the main line; evaluation and data are details] Extend to 128K with PI/YaRN, then stabilize 1M+ with LongRoPE; once extrapolation is stable, the model will naturally learn to use long context, while evaluation and data recipes are optimization details.
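
A minimal sketch of the position-interpolation half of that recipe: instead of letting RoPE phases extrapolate past the trained window, PI rescales positions back into it. Dims and lengths are illustrative; YaRN/LongRoPE refine the per-frequency scaling beyond this uniform squeeze:

```python
import numpy as np

def rope_phases(positions, dim=64, base=10000.0, scale=1.0):
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(positions * scale, inv_freq)    # (seq, dim/2) phase matrix

train_len, target_len = 4096, 16384
pos = np.arange(target_len)

extrapolated = rope_phases(pos)                                # leaves trained range
interpolated = rope_phases(pos, scale=train_len / target_len)  # PI: squeezed back

print(f"max low-freq phase, extrapolated: {extrapolated[-1, 0]:.0f}")
print(f"max low-freq phase, interpolated: {interpolated[-1, 0]:.0f}")  # 4x smaller
```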

contestedc-cb0c0b466f
FlashAttention-style kernels raise per-GPU efficiency, and sequence parallelism slices the L dimension; push DP/TP/PP/SP each to their limits and what remains is mostly a matter of resources (cf. ledger c-65f78fd3e6, c-fecb27fa82, c-5ef38c5699).
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.positions[1].stance · gpt-5.2

    [Camp B: long-sequence training is mainly a systems-parallelism problem] FlashAttention-like kernels improve per-GPU efficiency and sequence parallelism shards the length dimension; push DP/TP/PP/SP to their limits and the rest is mostly resource scaling.

contestedc-7e75950752
First get the network family and scaling closure right: patch modern components with Complete-P's module-level rules and make coord-check / RMS overlap a hard acceptance gate; only on that basis do base LR and init scale qualify as "transferable". Empirical formulas can give approximate starting points while the recipe is unchanged, but they cannot replace correct parameterization.
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[0].stance · gpt-5.2

    [Camp A: Complete-P as the default; formulas are a stopgap] Get the model family and scaling closure right first: use Complete-P’s module-wise rules to patch modern components and enforce coord check / RMS overlap as a hard acceptance gate; only then do base LR and init scale qualify as transferable.
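
A minimal coord-check sketch in the spirit of that acceptance gate: measure activation RMS across widths and require the curves to overlap. The toy MLP and 1/sqrt(fan_in) init below are standard stand-ins, not Complete-P's actual module-wise rules:

```python
import numpy as np

def act_rms(width, depth=4, batch=256, seed=0):
    rng = np.random.default_rng(seed)
    x = rng.normal(size=(batch, width))
    for _ in range(depth):
        W = rng.normal(scale=width ** -0.5, size=(width, width))
        x = np.maximum(x @ W, 0.0)                 # ReLU block
    return float(np.sqrt((x ** 2).mean()))

for w in (256, 1024, 4096):
    print(w, round(act_rms(w), 3))                 # should sit in the same band
```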

contestedc-85989d336b
Rather than maintaining transfer rules, hand the objective over to automation: warm-start BO transfers the distribution of historical trials, and CARBS-like methods search directly on the cost×loss Pareto frontier; one can even learn "hyperparameter-free" update rules via AGD or symbolic discovery, removing the need for transfer at the root.
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[2].stance · gpt-5.2

    [Camp C: end-to-end automation (BO/auto-optimizers) will replace transfer rules] Instead of maintaining transfer rules, hand the objective to automation: warm-start BO transfers prior trial distributions, and CARBS-like methods search directly on the cost×loss Pareto frontier.

contestedc-207d2c8eee
Many "LR does not transfer" observations are misattributions: under AdamW, wd is an independent axis, and β₂'s coupling with batch size/noise shifts stability boundaries and optima; so wd/β₂ should be modeled or searched explicitly, rather than concentrating effort on single-scalar LR transfer.
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[3].stance · gpt-5.2

    [Camp D: transfer error is dominated by non-transferable HPs, not LR itself] Many “LR does not transfer” observations are misattributions: under AdamW, wd is an independent axis, and β₂ couples with batch/noise to shift stability boundaries and optima; wd/β₂ should be modeled or searched explicitly.

contestedc-34c53a36f2
Near-second-order gains concentrate in the hidden 2D weight matrices; hybrid routing localizes both the gains and the risks: Muon for hidden weights, AdamW for everything else. The priority is wall-clock time and fewer tuning rounds, not unifying all parameters under one optimizer.
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.positions[1].stance · gpt-5.2

    [Camp B: Muon is the next default (but only as a hybrid)] Near-second-order benefits concentrate in hidden 2D weights; hybrid routing localizes both gains and risks: Muon on hidden weights, AdamW elsewhere. The primary goal is wall-clock and fewer tuning rounds, not full-parameter uniformity.
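
A minimal sketch of that hybrid routing: partition parameters so 2D hidden matrices get the Muon-style optimizer and everything else stays on AdamW. SGD-with-momentum stands in for Muon below, since Muon is not in torch; swap in a real implementation:

```python
import torch
from torch import nn

class Tiny(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(1000, 64)
        self.proj = nn.Linear(64, 64)
        self.norm = nn.LayerNorm(64)
        self.lm_head = nn.Linear(64, 1000)

model = Tiny()
hidden_2d, other = [], []
for name, p in model.named_parameters():
    # Hidden matmul weights only; embeddings, head, norms, biases go to AdamW.
    if p.ndim == 2 and not any(k in name for k in ("embed", "lm_head")):
        hidden_2d.append(p)
    else:
        other.append(p)

optimizers = [
    torch.optim.SGD(hidden_2d, lr=0.02, momentum=0.95),  # placeholder for Muon
    torch.optim.AdamW(other, lr=3e-4, weight_decay=0.1),
]
print(len(hidden_2d), "hidden matrices routed;", len(other), "params on AdamW")
```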

contestedc-d9d03f011c
Repetition is mostly wasted tokens and a risk amplifier; for web pools, err on the side of deleting too much, push the duplication rate to a minimum, and only then discuss epochs and recipes. Semantic near-duplicates and within-topic redundancy also eat budget, so apply embedding-based dedup/diversification early to convert tokens into coverage.
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.positions[0].stance · gpt-5.2

    [Camp A: Dedup-first and aggressive (exact → near → semantic)] Repetition is mostly wasted tokens and a risk amplifier; for web pools, delete aggressively first, then discuss epochs/recipes. Semantic near-duplicates and within-topic redundancy also eat budget and should be removed early via embedding dedup/diversification.

contestedc-8863ff37d3
When unique high-quality tokens run short, uniform multi-epoch training is the most direct way to use compute; the first few passes yield gains close to adding the same volume of fresh tokens, so default to roughly 2–4 epochs before reaching for more data or fancier data engineering.
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.positions[1].stance · gpt-5.2

    [Camp B: Epochs-first (fill compute under data constraints)] When unique high-quality tokens are insufficient, uniform multi-epoch training is the most direct way to utilize compute; the first few passes yield gains close to adding fresh tokens of the same volume, so default to ~2–4 epochs.
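
A toy rendering of that intuition, assuming each extra pass over the same tokens is worth a geometrically decayed fraction of fresh tokens (the decay constant is invented for illustration, not a fitted value from the literature):

```python
def effective_tokens(unique_tokens: float, epochs: int, decay: float = 0.6) -> float:
    # epoch 1 counts fully; each repeat pass counts as decay**k fresh tokens
    return unique_tokens * sum(decay ** e for e in range(epochs))

U = 1e12   # 1T unique tokens
for ep in (1, 2, 4, 8):
    print(f"{ep} epochs -> {effective_tokens(U, ep) / 1e12:.2f}T effective")
# the jump from 1 to 2-4 epochs is large; by pass 8 the marginal epoch adds little
```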

contestedc-211afac5db
Benchmarks, PII, and copyrighted content should not enter the main training pool; default to 0–1 exposures and make governance an auditable configuration (provenance, opt-out, filtering evidence). Rationale: contamination distorts evaluation, memorization creates recoverable-leakage risk, and both risks grow with scale and longer context.
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.positions[3].stance · gpt-5.2

    [Camp D: Zero exposure (or 0–1 exposure) for risky data overrides other goals] Benchmarks, PII, and copyrighted content should not enter the main pretraining pool; default to 0–1 exposure with auditable controls (provenance, opt-out, filtering evidence). Contamination distorts evals, memorization risks recoverable leakage, and risk rises with scale and context length.

contestedc-860750308f
Treat validation loss/PPL as the most reliable scalable signal: it extrapolates across compute budgets, guides the parameter:token ratio, and moves in the same direction as capability on most tasks, so it can also serve as the primary sort key for model selection.
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[0].stance · gpt-5.2

    [Camp A: PPL remains the primary variable (at least for training decisions)] Treats validation loss/PPL as the most reliable scalable signal: it extrapolates compute budgets, guides parameter:token tradeoffs, and tends to move with capability on many tasks.

contestedc-ea0ddfebf3
Write the "keep training / add compute?" question directly as per-task extrapolation: fit a scaling law for each key task and base decisions on marginal gains and extrapolation error, not on a proxy relationship through upstream loss.
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[1].stance · gpt-5.2

    [Camp B: PPL is stage-1 only; stage-2 must use per-task scaling laws] Writes keep-training/scale-up decisions as per-task extrapolation: fit scaling laws for each key task and decide based on marginal gains and extrapolation error rather than upstream-loss proxies.

contestedc-ac6cdd29d2
Decompose model quality into auditable dimensions (capability, robustness, safety, fairness, efficiency, scenario fit) and use standardized panels and report formats to reduce the misdirection of "watching a single number"; keep PPL only as a training-monitoring signal.
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[2].stance · gpt-5.2

    [Camp C: stop searching for a scalar; define quality via standardized panels] Decomposes model quality into auditable dimensions (capability, robustness, safety, fairness, efficiency, scenario fit) and uses standardized panels/reporting to reduce single-number misdirection; PPL is kept only as a training monitor.

contestedc-90d307f237
Treat the "useful assistant" as a preference-and-constraint-satisfaction problem: helpfulness/harmlessness/instruction following come from preference optimization and the choice of supervision signals, not from lower next-token loss; so PPL's correlation with final quality is inherently weak.
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[3].stance · gpt-5.2

    [Camp D: the issue is ontological—next-token loss is not the target] Treats a “useful assistant” as preference and constraint satisfaction: helpfulness/harmlessness/instruction-following come from preference optimization and supervision selection, not from lower next-token loss, so PPL correlates weakly with final quality.

contestedc-5e81c9a0d9
Joint power laws are stable enough that a small-model sweep can be extrapolated to larger scales; under fixed compute, prioritize growing N, with tokens merely "sufficient". This route was widely adopted in the early large-model era (e.g., GPT-3), and it also matches the organizational inertia of "make the model big first, backfill the data later" [Kaplan2020ScalingLaws][Rae2021Gopher].
Source papers · 2 · [Kaplan2020ScalingLaws][Rae2021Gopher]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.positions[0].stance · gpt-5.2

    [Camp A: Kaplan-style — portable exponents; under fixed compute, grow N first] Joint power laws are stable enough to extrapolate from small sweeps to larger scales; under fixed compute, prioritize increasing N while tokens only need to be “sufficient.” This was the GPT-3-era default and matches model-first organizational inertia.

contestedc-406cbcc3ea
Under fixed compute, a smaller model trained on more tokens is the better deal; tokens/param≈20 is a reusable empirical center that avoids the systematic waste of "not training long enough". On the open side, the token-heavy training of LLaMA/Llama 2 reads as an engineering reproduction [Hoffmann2022Chinchilla][Touvron2023LLaMA][Touvron2023Llama2].
Source papers · 3 · [Hoffmann2022Chinchilla][Touvron2023LLaMA][Touvron2023Llama2]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.positions[1].stance · gpt-5.2

    [Camp B: Chinchilla-style — tokens/param≈20 as the default recipe] Under fixed compute, smaller models with more tokens are more efficient; tokens/param≈20 is a reusable empirical center that avoids systematic waste from undertraining. Open evidence: LLaMA/Llama 2’s token-heavy training reads as an engineering reproduction.
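
The arithmetic behind that default, using the standard C ≈ 6·N·D approximation for dense transformer training FLOPs:

```python
import math

def chinchilla_split(compute_flops: float, tokens_per_param: float = 20.0):
    # C = 6 * N * D and D = tokens_per_param * N  =>  N = sqrt(C / (6 * tpp))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    return n_params, tokens_per_param * n_params

C = 1e24   # hypothetical training budget in FLOPs
N, D = chinchilla_split(C)
print(f"params ≈ {N / 1e9:.0f}B, tokens ≈ {D / 1e12:.1f}T")  # ~91B params, ~1.8T tokens
```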

contestedc-6259195fcb
Many seemingly sudden capabilities come from thresholded 0-1 metrics, benchmark granularity, and evaluation noise; with continuous metrics (task perplexity, log-prob) and more robust evaluation protocols, the curves come much closer to smooth improvement. Task-score jitter under over-training and threshold stability likewise support treating "emergence" first as a measurable evaluation phenomenon [Gadre2024OverTraining][Bhagia2024TaskLadders].
Source papers · 2 · [Gadre2024OverTraining][Bhagia2024TaskLadders]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.positions[3].stance · gpt-5.2

    [Camp D: Against emergence-as-magic — many “emergent” effects are metric artifacts] Many apparent capability jumps come from thresholded 0-1 metrics, benchmark granularity, and evaluation noise; with continuous metrics (task perplexity, log-prob) and more robust protocols, the curves look close to smooth.

contestedc-ca981e3e26
Advocates per-doc causal masking through the FA varlen interface for packing, a short-to-long curriculum for length training, FIM on by default only for code models, and cross-document concatenation only as an explicit construction: maximize efficiency while preserving training correctness.
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.positions[0].stance · ep-20260214160829-csjmc

    [Camp A: per-doc masking + short-to-long (default engineering recipe)] Advocates using per-doc causal mask with the FA varlen interface for packing, a short-to-long curriculum for length training, enabling FIM by default only for code models, and using cross-document concatenation only as an explicit construction.
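
A minimal sketch of the per-doc masking mechanics: packed documents are described by cumulative sequence lengths, so a varlen attention kernel applies causality within each document and never attends across a boundary. The commented call mirrors flash-attn's varlen entry point; treat the exact signature as an assumption and check your installed version:

```python
import torch

doc_lens = torch.tensor([5, 3, 8], dtype=torch.int32)            # docs in one pack
cu_seqlens = torch.zeros(len(doc_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(doc_lens, dim=0)                    # [0, 5, 8, 16]

total, n_heads, head_dim = int(doc_lens.sum()), 4, 32
q = k = v = torch.randn(total, n_heads, head_dim)                 # packed qkv

# from flash_attn import flash_attn_varlen_func                   # needs GPU + fp16
# out = flash_attn_varlen_func(
#     q.half().cuda(), k.half().cuda(), v.half().cuda(),
#     cu_seqlens_q=cu_seqlens.cuda(), cu_seqlens_k=cu_seqlens.cuda(),
#     max_seqlen_q=int(doc_lens.max()), max_seqlen_k=int(doc_lens.max()),
#     causal=True,   # causal *within* each doc; no cross-doc attention
# )
print(cu_seqlens.tolist())
```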

contestedc-c0dd279d1f
Advocates mixing long and short sequences by distribution throughout pretraining, avoiding the distribution shift of a final-stage mid-train; the model meets long context earlier, improving length robustness and long-context performance.
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.positions[1].stance · ep-20260214160829-csjmc

    [Camp B: always-mixed lengths (anti-curriculum)] Advocates mixing short and long sequences according to distribution throughout pretraining, avoiding distribution shift in a final-stage mid-train and letting models be exposed to long context earlier, improving length robustness and long-context performance.

contestedc-32d3b0b1c8
Advocates letting the model attend by default to other documents in the same pack during pretraining, aligning with inference-time cross-document scenarios such as prompt concatenation and retrieval augmentation, and improving in-context learning and long-form generation.
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.positions[2].stance · ep-20260214160829-csjmc

    [Camp C: cross-doc visibility by default (beyond-boundaries …)] Advocates allowing models to attend to other documents in the same pack by default during pretraining, aligning with cross-document inference scenarios such as prompt concatenation and retrieval augmentation.

contestedc-9eaeac0a4b
Advocates that mixed objectives such as FIM/span corruption cost very little for left-to-right modeling while covering more tasks such as infilling, editing, and instruction following, so they should be enabled by default for both NL and code to improve cross-task generalization.
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.positions[3].stance · ep-20260214160829-csjmc

    [Camp D: FIM/denoising-style objectives as default (infilling, editing)] Advocates that mixed objectives such as FIM/span corruption have very low cost for left-to-right modeling, can cover more tasks such as infilling, editing, and instruction following, and should be on by default for NL and code.
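
A minimal sketch of the FIM transform in the prefix-suffix-middle layout: split a document at two random points and emit prefix, suffix, then middle, so the model learns to infill. The sentinel strings are placeholders; real setups use dedicated special tokens from the tokenizer:

```python
import random

PRE, SUF, MID = "<|fim_pre|>", "<|fim_suf|>", "<|fim_mid|>"   # placeholder sentinels

def fim_transform(doc: str, rng: random.Random, fim_rate: float = 0.5) -> str:
    if rng.random() > fim_rate:
        return doc                                  # keep as plain left-to-right
    i, j = sorted(rng.sample(range(len(doc) + 1), 2))
    prefix, middle, suffix = doc[:i], doc[i:j], doc[j:]
    return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

rng = random.Random(0)
print(fim_transform("def add(a, b):\n    return a + b\n", rng, fim_rate=1.0))
```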

contestedc-c497057f3d
Alignment stages (RLHF/DPO/PPO) are more afraid of numerical instability on tail and rare tokens, train-inference mismatch, and attack surface; better to give up a little compression to reduce tail tokens and heavy-domain merges, in exchange for more stable training and decoding.
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_report tokenizer-scaling · report.positions[3].stance · gpt-5.2

    [Camp E: shrink/prune the vocabulary to buy alignment and RL stability] Alignment stages (RLHF/DPO/PPO) are sensitive to tail tokens, numerical instability, train–inference mismatch, and attack surface; sacrificing some compression to reduce tail tokens and heavy-domain merges buys more stable training and decoding.

  • topic_report tokenizer-scaling · report.positions[3].stance · gpt-5.2

    [Camp E: shrink/prune the vocabulary to buy alignment and RL stability] Alignment stages (RLHF/DPO/PPO) care about numerical stability and policy consistency; low-frequency tail tokens and non-unique encodings can amplify train/infer mismatch and attack surface.

contestedc-677e93768c
Function-synthesis pass@k correlates strongly with coding ability and is well suited as the primary metric; more complex benchmarks introduce noise and evaluation overhead, which actually lowers iteration efficiency. [Chen2021Codex][Austin2021MBPP][CodeGen2022]
Source papers · 3 · [Chen2021Codex][Austin2021MBPP][CodeGen2022]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.positions[0].stance · gpt-5.2

    [Camp A: HumanEval/MBPP is sufficient—cheap, stable, reproducible] Function-synthesis pass@k is strongly correlated with coding ability and should be the primary metric; more complex benchmarks add noise and evaluation cost, reducing iteration efficiency.
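
The unbiased pass@k estimator from the Codex paper [Chen2021Codex] that this camp standardizes on: given n samples of which c pass, estimate the probability that at least one of k draws passes.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0                       # every size-k draw contains a pass
    return 1.0 - comb(n - c, k) / comb(n, k)

print(f"{pass_at_k(n=200, c=14, k=1):.3f}")    # 0.070
print(f"{pass_at_k(n=200, c=14, k=100):.3f}")  # near 1.0 with 100 tries
```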

contestedc-9658fa7f4c
Likelihood metrics are cheap and allow stable comparison across large sweeps; real SWE is a distribution over patches and edit trajectories, and a drop in patch-PPL means moving closer to the real repair distribution, so it should be the primary optimization and evaluation signal (cf. c-23dce50372 / c-867a4775a4 / c-b6d1926f11).
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.positions[2].stance · gpt-5.2

    [Camp C: pretrain BPB/patch-PPL best predicts SWE; downstream evals are secondary] Likelihood metrics are cheap and stable for large sweeps; real SWE is a distribution of patches and edit trajectories, so lower patch-PPL implies a closer-to-real repair distribution.

contestedc-76064cf979
What users feel is interaction cost and stability: retry gains, test-execution success rate, token spend, and trajectory readability/controllability; so tokens-per-issue, retry@k, test-exec rate and the like should be the primary metrics, with pass@1/Verified only as secondary references (cf. c-c980753890 / c-88b722216c / c-67c1665a55). [SelfDebug2023][ReCode2022]
Source papers · 2 · [SelfDebug2023][ReCode2022]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.positions[3].stance · gpt-5.2

    [Camp D: deployment UX/cost metrics reflect value better than pass@1] Users feel interaction cost and stability: retry gains, test execution success, token spend, and trajectory readability/controllability; therefore tokens-per-issue, retry@k, and test-exec rate should lead, with pass@1/Verified secondary.

contestedc-075feb1744
High-quality synthetic data (textbook-style material, exercises, explanations) packs more information per token; for small models and structured domains (code/math) it can partially substitute for large volumes of raw web. Mid-train is not a necessary condition; the key is making tokens "more learnable". [Textbooks2023][Phi3Report2024][MetaMath2023]
Source papers · 3 · [Textbooks2023][Phi3Report2024][MetaMath2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · report.positions[0].stance · gpt-5.2

    [Camp A: synthetic-first (a primary route under data constraints)] High-quality synthetic (textbooks, exercises, explanations) has higher information density per token; for small models and structured domains (code/math) it can partially substitute for large-scale raw web.

contestedc-1cb57a52de
Synthetic data introduces teacher bias and style collapse, and may trigger recursive degradation; rather than generating, expand real crawling, dedup, strong filtering, and data pruning to obtain a cleaner distribution within the pool of real candidates. [RefinedWeb2023][WhenLessIsMore2023][DataCompLM2024]
Source papers · 3 · [RefinedWeb2023][WhenLessIsMore2023][DataCompLM2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · report.positions[2].stance · gpt-5.2

    [Camp C: minimize synthetic; stronger filtering + more real data] Synthetic introduces teacher bias and style narrowing and may trigger recursive degradation; instead of generating, expand real crawling, dedup, strong filtering, and pruning to get a cleaner distribution from real candidates.

contestedc-09a2d0f238
As long as the teacher is strong enough and sampling and feedback are good enough, the synthetic ratio can keep rising; combined with "accumulating real data breaks collapse," real data can be compressed down to a small seed. [SelfImprove2022][CollapseInevitable2024]
Source papers · 2 · [SelfImprove2022][CollapseInevitable2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · report.positions[3].stance · gpt-5.2

    [Camp D: synthetic scales almost without bound; collapse is manageable] With a strong enough teacher and good sampling/feedback, synthetic ratios can keep rising; combined with “accumulate real breaks collapse,” real data can be reduced to a small seed.

contestedc-b833385ae5
Attention's quadratic complexity and the KV-cache are structural bottlenecks; even GQA, SWA, or cache compression is patch-style optimization. The more coherent route is retention/SSM/recurrent structures, replacing a KV-cache that grows with context with constant or linear state [Tiezzi2024RecurrentSurvey][LongSSM2024].
Source papers · 2 · [Tiezzi2024RecurrentSurvey][LongSSM2024]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_report transformer-arch-improvements · report.positions[1].stance · gpt-5.2

    [Camp B: Transformer state cost is near its limit; move to retention/SSM] Attention’s quadratic behavior and KV-cache are structural bottlenecks; even with GQA, SWA, or cache compression, this is patching. A more coherent route is retention/SSM/recurrent structures with constant or linear state.

contestedc-f02eeb46cc
An existing pretrained base is an asset: deepening, block insertion, and sparse upcycling can often inherit representations and part of the optimization state, reaching near-target-scale results with fewer tokens/compute; especially suited to domain continued pretraining and fast iteration [Kim2023Solar][LLaMAPro2024][Wang2023Grow][SparseUpcycling2022][Net2Net2015].
Source papers · 5 · [Kim2023Solar][LLaMAPro2024][Wang2023Grow][SparseUpcycling2022][Net2Net2015]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_report transformer-arch-improvements · report.positions[2].stance · gpt-5.2

    [Camp C: default a second scaling path—grow first, then decide] Pretrained bases are assets: deepening/inserting blocks/sparse upcycling can inherit representations and parts of optimization state, reaching near-target scale with fewer tokens and compute.

contestedc-0247de4c9d
With sufficiently elaborate multi-agent frameworks, task decomposition, and tool-calling workflows, most SWE problems can be solved without modifying the base model; no change to the pretraining strategy is needed.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report swe-agent-pretraining · report.positions[0].stance · ep-20260214160829-csjmc

    [Camp A: Inference-time scaffolding and test-time compute are enough] As long as sufficiently complex multi-agent frameworks, task decomposition, and tool-calling workflows are designed, most SWE problems can be solved without modifying the base model or its pretraining.

contestedc-4a577e8cc8
With enough tests, static analysis, and execution feedback as reward signals, RL or preference optimization can push a model's SWE capability very high; pretraining only needs to supply basic syntax ability.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report swe-agent-pretraining · report.positions[1].stance · ep-20260214160829-csjmc

    [Camp B: RL and verifiers are the true drivers] As long as there are enough tests, static analysis, and execution feedback as reward signals, RL or preference optimization can push the model's SWE capability to a very high level, and pretraining only needs to provide basic syntax ability.

contestedc-fe063eaa7c
This camp would say that the attention and KV-cache shape of Transformers fixes the long-range read/write budget allocation, so effective context can hardly grow linearly with the window; hence the turn to sparse/recurrent/memory/SSM structures, or directly to retrieval/native memory systems [Dai2019TransformerXL][Beltagy2020Longformer][Rae2019CompressiveTransformer][Mohtashami2023Landmark].
Source papers · 2 · [Dai2019TransformerXL][Beltagy2020Longformer]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.positions[3].stance · gpt-5.4

    [Camp D: changing the architecture or system boundary is more fundamental] This camp argues that the attention and KV-cache form of Transformers determines the long-range read/write budget, so effective context is unlikely to scale linearly with window size, motivating sparse/recurrent/memory/SSM or retrieval/native-memory systems.

contestedc-1390e4ecfa
Power-law scaling under standard NTP is robust enough: grow model/data/compute and raise average quality with filtering, dedup, and data ratios, and most tasks keep improving; explicit cross-doc context recovery is icing on the cake, expensive and hard to prove advantageous at matched compute.
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_report agent-context-scaling-hyperdoc · report.positions[0].stance · gpt-5.2

    [Camp A: keep Classic NTP + scale; hallucination is mostly so…] Power-law scaling under standard NTP is robust: scale model/data/compute and improve average data quality via filtering/dedup/mixtures; explicit cross-doc context recovery is optional, costly, and hard to justify at matched compute.

contestedc-041f845e9d
Hallucination is prior fill-in after a missing variable Z has been marginalized out; the fix is to write Z back into the context C via links/retrieval/tool feedback, and to use a loss mask to strip the "evidence" out of the supervision target, so the model learns to generate conditioned on evidence rather than to recite it.
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_report agent-context-scaling-hyperdoc · report.positions[1].stance · gpt-5.2

    [Camp B: HDP / retrieval-aware pretraining—make retrieval and …] Hallucination is prior-filling after missing variables Z are marginalized; fix it by writing Z back into context C via links/retrieval/tools and masking evidence out of the supervision target, so the model conditions on evidence rather than recites it.
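
A minimal sketch of the loss-masking half of that claim: evidence tokens stay in the input so the model conditions on them, but they are excluded from the supervision target (-100 is PyTorch's ignore_index; all ids are toy values):

```python
import torch
import torch.nn.functional as F

evidence_ids = torch.tensor([11, 12, 13, 14])     # retrieved evidence span
answer_ids   = torch.tensor([21, 22, 23])         # the span we actually supervise
input_ids = torch.cat([evidence_ids, answer_ids])

labels = input_ids.clone()
labels[: len(evidence_ids)] = -100                # strip evidence from the target

logits = torch.randn(len(input_ids), 32)          # stand-in model outputs (vocab=32)
loss = F.cross_entropy(logits[:-1], labels[1:])   # shifted next-token loss
print(float(loss))                                # averaged over unmasked positions
```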

contestedc-2d44567a6e
Many so-called missing-Z problems are really poor learnability caused by corpus noise and structural inconsistency; first normalize the text by rewriting, then back-translate prompts/plans to make the latent task structure explicit: this buys higher learnability-per-token with fewer tokens.
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_report agent-context-scaling-hyperdoc · report.positions[2].stance · gpt-5.2

    [Camp C: Method-2 rewriting / reverse prompt-plan—structured …] Many “missing Z” issues are really low learnability due to noisy, inconsistent corpora; normalize via rewriting and recover prompts/plans via back-translation so latent task structure becomes explicit.

contestedc-1d71524b0d
Collecting CoT, self-reflection, and self-correction trajectories from stronger models can distill reasoning and agent behavior into smaller/cheaper models; compared with changing the pretraining corpus or the retrieval system, this route is faster and closer to product iteration.
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_report agent-context-scaling-hyperdoc · report.positions[3].stance · gpt-5.2

    [Camp D: trajectory distillation / self-reflection first—CoT …] Collect CoT/self-reflection/self-refinement traces from stronger models to distill reasoning and agent behavior into smaller/cheaper models; faster and closer to product iteration.

contestedc-fb00153bfa
Extending the context window through engineering means alone (positional-encoding optimization, memory optimization) can already meet long-context needs; structured augmentation adds data-processing complexity for unclear gains.
1 observation · Context Scaling 4d
Evidence (1)
  • topic_report context-scaling-4d · report.positions[0].stance · ep-20260214160829-csjmc

    [Camp A: Long context only requires engineering length scaling] Extending the context window only through engineering means such as position-encoding optimization and memory optimization can meet long-context requirements; structured augmentation adds data-processing complexity with unclear benefits.

contestedc-67108457b0
Hyper-Doc construction methods differ drastically across domains; forcing unification would limit method innovation, and independent exploration per domain is more efficient.
1 observation · Context Scaling 4d
Evidence (1)
  • topic_report context-scaling-4d · report.positions[1].stance · ep-20260214160829-csjmc

    [Camp B: Hyper-Doc pretraining is a collection of scattered methods] Hyper-Doc construction methods vary greatly across different domains; forced unification will limit method innovation, and independent exploration per domain is more efficient.

contestedc-52ca26e84f
Inference-time RAG can flexibly inject the latest context without modifying the pretrained model, at lower cost and with greater flexibility.
1 observation · Context Scaling 4d
Evidence (1)
  • topic_report context-scaling-4d · report.positions[2].stance · ep-20260214160829-csjmc

    [Camp C: Inference-time RAG can completely replace training-time augmentation] Inference-time RAG can flexibly inject the latest context without modifying pretrained models, with lower cost and higher flexibility.

contestedc-157459daa6
This reading, exemplified by Wei et al. [Wei2022Emergent], holds that many tasks sit near the floor across the small-model range and rise abruptly only past some scale, so extrapolation before the main run has limited value.
Source papers · 1 · [Wei2022Emergent]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_report scaling-laws-downstream-tasks · report.positions[0].stance · gpt-5.4

    [Camp A: Downstream abilities are fundamentally threshold-emergent] This reading, represented by Wei et al. [Wei2022Emergent], treats many tasks as near-floor at small scale and abruptly rising only after a threshold, making pre-main-run extrapolation of limited value.

contestedc-3336a2f01c
This reading holds that with well-designed small-scale experiments, compute→task-score extrapolation can already support most budget decisions, and introducing loss as an extra variable only adds evaluation complexity [Gadre2024ReliableScaling].
Source papers · 1 · [Gadre2024ReliableScaling]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_report scaling-laws-downstream-tasks · report.positions[1].stance · gpt-5.4

    [Camp B: The compute axis is already sufficient; there is no …] This reading argues that with well-designed small-scale experiments, compute-to-task extrapolation already supports most budget decisions, and adding loss only increases evaluation complexity.

contestedc-876840f99b
The route represented by Ruan et al. [Ruan2024Observational] argues that public models already cover a rich capability space, so low-dimensional manifold regression can predict a new model's performance at low cost, especially suitable for quickly screening candidate designs.
Source papers · 1 · [Ruan2024Observational]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_report scaling-laws-downstream-tasks · report.positions[2].stance · gpt-5.4

    [Camp C: Observational scaling over public models is sufficient] The route represented by Ruan et al. [Ruan2024Observational] argues that public models already span a rich capability space, so low-dimensional manifold regression can predict new models cheaply.

contestedc-35cd958c72
This reading emphasizes that a single power law is sample-efficient, has few parameters, and is easy to communicate, so it should be the default model; piecewise or broken fits easily mistake noise for structure [Gadre2024ReliableScaling].
Source papers · 1 · [Gadre2024ReliableScaling]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_report scaling-laws-downstream-tasks · report.positions[3].stance · gpt-5.4

    [Camp D: A single power law covers most tasks; broken laws are overfit] This reading emphasizes that a single power law is sample-efficient, low-parameter, and easy to communicate, so it should remain the default model; piecewise or broken fits risk mistaking noise for structure.

consensusdatac-3eb65c900d
Online data mixing is becoming a practical alternative to static corpus recipes: the claim is that the pretraining mixture can adapt efficiently during training rather than being fixed up front.
1 observation · Data Mixing · Pretraining · Efficiency
Evidence (1)
contestedc-6559657b09
Supporters would point out that under long context MLA compresses the KV cache to a small fraction of conventional attention, and both V2 [DeepSeekAI2024V2] and V3 [DeepSeekAI2024V3] use it as the backbone, showing it is a production-ready default rather than a lab toy.
Source papers · 2 · [DeepSeekAI2024V2][DeepSeekAI2024V3]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_report deepseek-engineering-evolution · report.positions[0].stance · gpt-5.4

    [Camp A: MLA will become a general replacement for GQA] Supporters point out that MLA shrinks KV cache to a small fraction of conventional attention under long context, and both V2 [DeepSeekAI2024V2] and V3 [DeepSeekAI2024V3] use it as the backbone, making it a shippable default rather than a lab toy.

contestedc-9e185612ba
Supporters would stress that DeepSeekMoE [Dai2024DeepSeekMoE], V2 [DeepSeekAI2024V2], and V3 [DeepSeekAI2024V3] evolved across three straight generations toward finer-grained experts plus a small number of shared experts, suggesting this path is sustainable in large-scale training.
Source papers · 3 · [Dai2024DeepSeekMoE][DeepSeekAI2024V2][DeepSeekAI2024V3]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_report deepseek-engineering-evolution · report.positions[1].stance · gpt-5.4

    [Camp B: many-expert plus shared experts is the stable endgame] Supporters emphasize that DeepSeekMoE [Dai2024DeepSeekMoE], V2 [DeepSeekAI2024V2], and V3 [DeepSeekAI2024V3] all evolve toward finer experts with a small shared path, suggesting the route is sustainable at scale.

contestedc-683c650b1b
Supporters would say that the continuous gains across DeepSeek LLM [DeepSeekLLM2024], DeepSeek-Coder [DeepSeekCoder2024], V2 [DeepSeekAI2024V2], and V3 [DeepSeekAI2024V3] show that hand-designed mixtures, code/math upsampling, and task-oriented data organization remain the dominant levers.
Source papers · 4 · [DeepSeekLLM2024][DeepSeekCoder2024][DeepSeekAI2024V2][DeepSeekAI2024V3]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_report deepseek-engineering-evolution · report.positions[2].stance · gpt-5.4

    [Camp C: data quality mainly comes from curated mixtures, not…] Supporters argue that the steady gains from DeepSeek LLM [DeepSeekLLM2024], DeepSeek-Coder [DeepSeekCoder2024], V2 [DeepSeekAI2024V2], and V3 [DeepSeekAI2024V3] show that hand-designed mixtures, code/math upsampling, and task-oriented data organization remain the dominant levers.

contestedc-162be38d22
Supporters would cite DeepSeekMath [DeepSeekMath2024] and R1 [DeepSeekR12025] as evidence: with the critic removed via GRPO, pure-RL or RL-first routes can directly pull up math and general reasoning, then flow the gains back into dense student models through distillation.
Source papers · 2 · [DeepSeekMath2024][DeepSeekR12025]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_report deepseek-engineering-evolution · report.positions[3].stance · gpt-5.4

    [Camp D: the main path for reasoning has shifted from SFT/RLHF to RL-first] Supporters use DeepSeekMath [DeepSeekMath2024] and R1 [DeepSeekR12025] as evidence: after removing the critic with GRPO, pure RL or RL-first pipelines can directly induce math and general reasoning, then distill back into dense students.

consensusinferencec-8c91268b97
Attention sinks have become important enough to warrant dedicated study of when they appear, suggesting they are a structural feature rather than an incidental artifact.
1 observation · Attention Sinks · Streaming · Mechanism
Evidence (1)
contestedc-dd2a31e5ce
This camp's default is that if a model can reliably recover the needle at arbitrary positions, long-context capability is essentially established; more complex benchmarks only mix task noise back in.
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_report long-context-capacity-and-decay · report.positions[0].stance · gpt-5.4

    [Camp A: NIAH can still serve as the primary long-context metric] This camp assumes that if a model can reliably recover a needle from arbitrary positions, long-context capability is largely established; more complex benchmarks mainly add task noise.

contestedc-61203f2914
This camp holds that most long-context tasks are essentially sparse evidence lookup, so the marginal benefit of a longer window is limited, and retrieval augmentation is both cheaper in compute and more stable.
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_report long-context-capacity-and-decay · report.positions[1].stance · gpt-5.4

    [Camp B: Long context is mostly a retrieval problem, and RAG …] This camp argues that most long-context tasks are fundamentally sparse evidence lookup, so the marginal benefit of longer windows is limited and retrieval augmentation is cheaper and more stable.

contestedc-8e5c544459
This camp attributes the main root cause of the U-shape to RoPE extrapolation or PE design, and holds that swapping the PE or interpolating will substantially ease long-text positional degradation.
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_report long-context-capacity-and-decay · report.positions[2].stance · gpt-5.4

    [Camp C: Lost-in-the-middle is mainly a positional-encoding problem] This camp attributes the main cause of the U-shape to RoPE extrapolation or PE design, and expects positional degradation to ease substantially once PE is changed or interpolated.

contestedc-9fb750c86b
This camp prefers to see long-context capability as a consequence of overall representation quality, treating head-level specialization as an analytic convenience rather than a decisive structure.
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_report long-context-capacity-and-decay · report.positions[3].stance · gpt-5.4

    [Camp D: Long-context capability is distributed across the whole model] This camp prefers to view long-context capability as a consequence of overall representation quality, treating head-level specialization as an analytic convenience rather than a decisive structure.

contestedc-95001c7f0a
RefinedWeb [RefinedWeb2023], Dolma [Dolma2024], and Lee et al. [Dedup2022] show that strong filtering, deduplication, documentation, and a stable pipeline can already support high-quality pretraining. Per-sample rewriting adds system complexity, audit difficulty, and cost, and its gains do not necessarily beat getting the baseline right first.
Source papers · 3 · [RefinedWeb2023][Dolma2024][Dedup2022]
1 observation · Programming Every Example Lifting
Evidence (1)
  • topic_report programming-every-example-lifting-pre-tr · report.positions[0].stance · gpt-5.4

    [Camp A: Global filtering and deduplication are already strong] RefinedWeb [RefinedWeb2023], Dolma [Dolma2024], and Lee et al. [Dedup2022] show that strong filtering, deduplication, documentation, and stable pipelines already support high-quality pretraining.

contestedc-866915834a
The core judgment of Less is More [LessIsMore2024] is that deleting low-value tokens/samples before training is often cheaper than repairing them one by one, and better aligned with the optimization objective under fixed compute [Chinchilla2022].
Source papers · 2 · [LessIsMore2024][Chinchilla2022]
1 observation · Programming Every Example Lifting
Evidence (1)
  • topic_report programming-every-example-lifting-pre-tr · report.positions[1].stance · gpt-5.4

    [Camp B: Low-value data should be pruned directly; repair is …] The core claim of Less is More [LessIsMore2024] is that removing low-value tokens/samples before training is often cheaper than repairing them one by one, and better aligned with fixed-compute optimization.

contestedc-904117b848
Bai et al. [ConstitutionalAI2022] and Lambert et al. [SelfRewarding2024] show that models can generate, critique, score, and recycle training signals, so data quality control can also be handed over to models rather than hand-written rules.
Source papers · 2 · [ConstitutionalAI2022][SelfRewarding2024]
1 observation · Programming Every Example Lifting
Evidence (1)
  • topic_report programming-every-example-lifting-pre-tr · report.positions[2].stance · gpt-5.4

    [Camp C: A model-generated loop can directly take over quality control] Bai et al. [ConstitutionalAI2022] and Lambert et al. [SelfRewarding2024] show that models can generate, critique, score, and recycle training signals, suggesting that data quality control can shift from manual rules to models.

contestedc-6d2c61b5e3
Textbooks Are All You Need [Textbooks2023] and TinyStories [TinyStories2023] support a different route: rather than patching massive web corpora, directly construct dense, low-noise, purpose-built synthetic corpora.
Source papers · 2 · [Textbooks2023][TinyStories2023]
1 observation · Programming Every Example Lifting
Evidence (1)
  • topic_report programming-every-example-lifting-pre-tr · report.positions[3].stance · gpt-5.4

    [Camp D: High-density synthetic data is a more direct path than repair] Textbooks Are All You Need [Textbooks2023] and TinyStories [TinyStories2023] support a different route: instead of repairing massive web corpora, directly construct dense, low-noise, purpose-built synthetic corpora.

contestedc-ce1a90167e
Supporters would cite Huginn [Geiping2025Huginn], MoR [Bae2025MoR], and small-model depth-design experience such as MobileLLM [Liu2024MobileLLM], arguing that repeatedly executing a shared layer stack converts more compute into effective depth, approaching a larger dense model under a fixed parameter budget.
Source papers · 3 · [Geiping2025Huginn][Bae2025MoR][Liu2024MobileLLM]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_report looped-language-modeling · report.positions[0].stance · gpt-5.4

    [Camp A: Looping can largely substitute for adding parameters] Supporters point to Huginn [Geiping2025Huginn], MoR [Bae2025MoR], and some small-model depth-design evidence such as MobileLLM [Liu2024MobileLLM], arguing that repeatedly executing a shared stack turns more compute into effective depth.

contestedc-94ecaccc9c
This camp would stress that fixed-r training is more stable, easier to parallelize, and easier to scale. Huginn [Geiping2025Huginn] and the earlier UT [Dehghani2018UniversalTransformer] both work with fixed or controlled-random step counts, so there is no need to introduce a router.
Source papers · 2 · [Geiping2025Huginn][Dehghani2018UniversalTransformer]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_report looped-language-modeling · report.positions[1].stance · gpt-5.4

    [Camp B: Fixed loop counts are enough; adaptive routing only …] This camp emphasizes that fixed-r training is more stable, easier to parallelize, and easier to scale. Huginn [Geiping2025Huginn] and earlier UT [Dehghani2018UniversalTransformer] both work with fixed or controlled-random step counts, so no router is needed.

contestedc-487fa62036
Supporters would cite Coconut [Hao2024Coconut], CoCoMix [Tack2025CoCoMix], Compressed CoT [Cheng2024CompressedCoT], latent-cache deliberation [Liu2024LatentCache], and LCM [Barrault2024LCM], holding that reasoning tokens are mostly a visible shell and that the real intermediate computation should be compressed into continuous representations.
Source papers · 5 · [Hao2024Coconut][Tack2025CoCoMix][Cheng2024CompressedCoT][Liu2024LatentCache][Barrault2024LCM]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_report looped-language-modeling · report.positions[2].stance · gpt-5.4

    [Camp C: The real loop should live in latent state, not in the token stream] Supporters cite Coconut [Hao2024Coconut], CoCoMix [Tack2025CoCoMix], Compressed CoT [Cheng2024CompressedCoT], latent-cache deliberation [Liu2024LatentCache], and LCM [Barrault2024LCM], arguing that reasoning tokens are mostly a visible shell around latent computation.

contestedc-436e0a9f7c
This camp would cite Fan et al. [Fan2024Length], Giannou et al. [Giannou2023LoopedICL], and Sparse UT [Tan2023SparseUT], arguing that looping's value lies in being closer to iterative algorithmic structure, which is why same-parameter capability jumps appear on length generalization and ICL.
Source papers · 3 · [Fan2024Length][Giannou2023LoopedICL][Tan2023SparseUT]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_report looped-language-modeling · report.positions[3].stance · gpt-5.4

    [Camp D: Looping gains come from recurrent inductive bias, not raw compute] This camp cites Fan et al. [Fan2024Length], Giannou et al. [Giannou2023LoopedICL], and Sparse UT [Tan2023SparseUT], arguing that looping matters because it better matches iterative algorithmic structure.

contestedalignmentc-9988adfca3
Direct preference optimization methods can be interpreted as Q functions, narrowing the conceptual gap between DPO-style alignment and RLHF.
1 observation · DPO · Q Function · RLHF
Evidence (1)
  • ref_triage bot-topic · arXiv 2404.12358 · gpt-5.5-2026-04-24

    direct alignment algorithms such as Direct Preference Optimization (DPO) have emerged as an alternative approach

consensustrainingc-8402ef35f2
Sequence-level importance ratios have been proposed as a stability fix for LLM reinforcement learning, replacing per-token ratios.
1 observation · Stability · Policy Gradient · Gspo
Evidence (1)
  • ref_triage bot-topic · arXiv 2507.18071 · gpt-5.5-2026-04-24

    Unlike previous algorithms that adopt token-level importance ratios, GSPO defines the importance ratio based on sequence likelihood
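
A minimal sketch of the contrast: PPO-style updates weight each token by its own ratio, while a GSPO-style sequence ratio is built from total sequence likelihood. The geometric-mean length normalization shown below is an assumption; check the paper for the exact form:

```python
import torch

logp_new = torch.tensor([-1.2, -0.8, -2.0, -0.5])  # per-token log-probs, new policy
logp_old = torch.tensor([-1.0, -1.0, -1.5, -0.7])  # per-token log-probs, old policy

token_ratios = torch.exp(logp_new - logp_old)               # one ratio per token
seq_ratio = torch.exp((logp_new.sum() - logp_old.sum()) / len(logp_new))

print(token_ratios.tolist())   # noisy, high-variance per-token signal
print(float(seq_ratio))        # a single smoother sequence-level ratio
```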

contestedinferencec-aff4181353
Before adding RL, prompting-only reasoning-plus-acting remains a strong baseline for tool and web agents.
1 observation · React · Tools · Baseline
Evidence (1)
  • ref_triage bot-topic · arXiv 2210.03629 · gpt-5.5-2026-04-24

    ReAct: Synergizing Reasoning and Acting in Language Models... reasoning and acting have primarily been studied as separate topics.

contestedc-9cab6ab11b
On verifiable tasks, a final-answer reward plus group-based policy optimization can form a clean training loop; DeepSeekMath [DeepSeekMath2024], OpenAI o1 [OpenAIo1Card2024], and DeepSeek-R1 [DeepSeekR12025] all support this path.
Source papers · 3 · [DeepSeekMath2024][OpenAIo1Card2024][DeepSeekR12025]
1 observation · Bot Topic
Evidence (1)
  • topic_report bot-topic · report.positions[0].stance · gpt-5.5

    [Camp A: Outcome-only RLVR is enough] On verifiable tasks, final-answer rewards plus group-based policy optimization form a simple training loop; DeepSeekMath [DeepSeekMath2024], OpenAI o1 [OpenAIo1Card2024], and DeepSeek-R1 [DeepSeekR12025] all support this path.

contestedc-0758bdcfe8
The main problem with sparse rewards is not that reward is scarce but that you cannot tell which step went wrong. Implicit Rewards [ImplicitRewards2025], AgentPRM [AgentPRM2025], PRL [PRL2026], and Self-Guided PRL [SelfGuidedPRL2025] all push the training signal down to the process level.
Source papers · 4 · [ImplicitRewards2025][AgentPRM2025][PRL2026][SelfGuidedPRL2025]
1 observation · Bot Topic
Evidence (1)
  • topic_report bot-topic · report.positions[1].stance · gpt-5.5

    [Camp B: Process and step rewards] The main issue with sparse rewards is not only scarcity, but inability to identify the wrong step. Implicit Rewards [ImplicitRewards2025], AgentPRM [AgentPRM2025], PRL [PRL2026], and Self-Guided PRL [SelfGuidedPRL2025] all push the training signal to the process level.

contestedc-196c377815
When only a few steps in a trajectory actually decide success or failure, spreading reward uniformly wastes updates. SCAR [SCAR2025], SPA-RL [SPARL2025], CAPO [CAPO2025], Attribution-based CA [AttributionCA2025], InT [InT2026], and Hindsight CA [HindsightCA2026] all try to concentrate updates on the critical segments.
Source papers · 6 · [SCAR2025][SPARL2025][CAPO2025][AttributionCA2025][InT2026][HindsightCA2026]
1 observation · Bot Topic
Evidence (1)
  • topic_report bot-topic · report.positions[2].stance · gpt-5.5

    [Camp C: Attribution and causal credit] When only a few steps determine success or failure, uniformly spreading reward wastes updates. SCAR [SCAR2025], SPA-RL [SPARL2025], CAPO [CAPO2025], Attribution-based CA [AttributionCA2025], InT [InT2026], and Hindsight CA [HindsightCA2026] all aim updates at the decisive segments.

contestedc-ef6e9f61f9
Search-R1 [SearchR12025], ReSearch [ReSearch2025], ToRL [ToRL2025], ReTool [ReTool2025], ToolRL [ToolRL2025], WebRL [WebRL2024], and WebAgent-R1 [WebAgentR12025] show that RL can explore tool and search strategies.
Source papers · 7 · [SearchR12025][ReSearch2025][ToRL2025][ReTool2025][ToolRL2025][WebRL2024][WebAgentR12025]
1 observation · Bot Topic
Evidence (1)
  • topic_report bot-topic · report.positions[3].stance · gpt-5.5

    [Camp D: Agentic RL versus non-RL agent baselines] Search-R1 [SearchR12025], ReSearch [ReSearch2025], ToRL [ToRL2025], ReTool [ReTool2025], ToolRL [ToolRL2025], WebRL [WebRL2024], and WebAgent-R1 [WebAgentR12025] show that RL can explore tool and search strategies.

consensusarchc-cb50ba3c80
GQA/MQA-style attention-efficiency routes are the realistic competing path to MLA-style KV-cache compression, so DeepSeek's attention evolution should be compared against alternative compression defaults, not just conventional MHA.
1 observation · Attention · KV Cache · GQA
Evidence (1)
consensusarchc-c370a3e5e9
RoPE extrapolation appears to follow some scaling law, which means part of long-context failure is a predictable training-length extrapolation limit rather than a reasoning collapse in the generalization sense.
1 observation · Rope · Extrapolation · Cliff
Evidence (1)
contestedtrainingc-df0734ac4d
Search capability can be trained without plugging into a live search engine, lowering cost and environment dependence.
1 observation · Search · Simulation · Agents
Evidence (1)
consensusarchc-3a00b147c9
The adaptive-depth idea traces back to ACT, which places MoR-style per-token loop counts in a longer lineage of conditional computation rather than making them a brand-new scaling knob.
1 observation · Adaptive Depth · History · Routing
Evidence (1)
consensusinferencec-b18258e9fa
The current retrieval strongly suggests that many papers packaged as reasoning gains are really changing how inference compute is spent, not proving that recurrence itself pays off.
1 observation · Compute Budget · Cot · Baselines
Evidence (1)
consensusalignmentc-f3cd05b250
Reasoning ability appears to be constrained by both pretraining quality and supervision-data scale, implying post-training alone cannot explain large reasoning jumps.
1 observation · Reasoning · Scaling · Pretraining
Evidence (1)
consensustrainingc-1f5731c29e
Existing work already fixes lost-in-the-middle with middle-focused positional encodings, and does so without requiring full fine-tuning at the target length.
1 observation · Position Bias · Lost in Middle · Positional Encoding
Evidence (1)
consensusc-7be1598239
Mesh-dimension ordering is dictated by **communication-primitive frequency**, not by model size: TP/EP live in the NVLink 8-GPU domain, PP can cross IB, and DP/FSDP cross pods [Shoeybi2019Megatron, Narayanan2021PTD, Jiang2024MegaScale].
Source papers · 3 · [Shoeybi2019Megatron][Narayanan2021PTD][Jiang2024MegaScale]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.headline_claims · claude-opus-4.7

    The ordering of mesh dimensions is dictated by communication-primitive frequency, not by model size: TP/EP on the 8-GPU NVLink island, PP across IB, DP/FSDP across pods [Shoeybi2019Megatron, Narayanan2021PTD, Jiang2024MegaScale].

consensusc-503b6f0410
Zero-bubble 1F1B [Qi2023ZeroBubble] has replaced interleaved 1F1B [Narayanan2021PTD] as the PP default since 2024; not adopting ZB typically costs 2–5 pp of MFU.
Source papers · 2 · [Qi2023ZeroBubble][Narayanan2021PTD]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.headline_claims · claude-opus-4.7

    Zero-bubble 1F1B [Qi2023ZeroBubble] has replaced interleaved 1F1B [Narayanan2021PTD] as the PP default since 2024; not adopting ZB typically costs 2–5 pp MFU.

consensusc-a0b544bfc1
For long sequences (L ≥ 32K), CP is a mandatory fourth dimension; Ring [Liu2023RingAttn] and Ulysses [Jacobs2023Ulysses] are chosen singly or mixed under USP's [Fang2024USP] topology awareness, and CP should not be treated as a substitute for SP.
Source papers · 4 · [Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP] · arXiv 2310.01889 (arxiv.org)
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 2310.01889 · report.headline_claims · claude-opus-4.7

    For L ≥ 32K, CP is a mandatory fourth dimension; Ring [Liu2023RingAttn] and Ulysses [Jacobs2023Ulysses] are selected or mixed by USP's topology awareness [Fang2024USP] and should not be conflated with SP.

consensusc-b301c4c17a
Auto-parallel [Zheng2022Alpa, Xu2021GSPMD] reaches 90–95% of hand-tuned below 100B, but there is no public 100B+ reproduction on the GPU stack; the TPU stack (GSPMD/pjit) is the sole counter-example.
Source papers · 2 · [Zheng2022Alpa][Xu2021GSPMD]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.headline_claims · claude-opus-4.7

    Auto-parallel [Zheng2022Alpa, Xu2021GSPMD] reaches 90–95% of hand-tuned below 100B, but no 100B+ public reproduction exists on the GPU stack; the TPU stack (GSPMD/pjit) is the sole counter-example.

consensusc-7f76719c6f
The FSDP-only route [Rajbhandari2021ZeROInfinity, Wang2023ZeROpp] hits its memory ceiling near dense 70B; beyond that, TP/PP must be added explicitly or HBM↔NVMe bandwidth drags MFU down.
Source papers · 2 · [Rajbhandari2021ZeROInfinity][Wang2023ZeROpp]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.headline_claims · claude-opus-4.7

    FSDP-only [Rajbhandari2021ZeROInfinity, Wang2023ZeROpp] caps near dense-70B; beyond that, TP/PP must be introduced explicitly or HBM↔NVMe bandwidth collapses MFU.

consensusc-0200b93d41
The sane MFU bands for 2026 are 40–55% for dense [Jiang2024MegaScale] and 25–45% for MoE [DeepSeek2024V3]; landing significantly below these bands usually signals a mesh/schedule problem rather than an algorithmic one.
Source papers · 3 · [Jiang2024MegaScale][DeepSeek2024V3] · arXiv 2402.15627 (arxiv.org)
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 2402.15627 · report.headline_claims · claude-opus-4.7

    The 2026 sane MFU bands are 40–55% for dense [Jiang2024MegaScale] and 25–45% for MoE [DeepSeek2024V3]; significantly below those bands usually signals a mesh/schedule issue rather than an algorithmic one.
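
The check behind that band, using the standard ~6·N FLOPs-per-token approximation for dense training; the cluster numbers are hypothetical:

```python
def mfu(n_params: float, tokens_per_sec: float, n_gpus: int, peak_flops: float) -> float:
    achieved = 6.0 * n_params * tokens_per_sec     # model FLOP/s actually realized
    return achieved / (n_gpus * peak_flops)

# e.g., a 70B dense model at 4M tokens/s on 4096 GPUs of ~1e15 peak FLOP/s each
print(f"{mfu(70e9, 4.0e6, 4096, 1e15):.1%}")       # 41.0% -> inside the dense band
```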

contestedc-4f9e331643
High engineering cost and slow response to heterogeneous new models; mesh parameters require extensive manual sweeping.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.positions[0].counter · claude-opus-4.7

    [counter to Camp A: hand-tuned 4D (Megatron / MegaScale style)] High engineering cost, slow response to heterogeneous new models, and mesh parameters require substantial manual sweeping.

contestedc-88a48ec24f
No public reproduction at 100B+ on the GPU stack; cost models still characterize FP8 and heterogeneous MoE only coarsely [Ko2024DFModel].
Source papers · 1 · [Ko2024DFModel]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.positions[1].counter · claude-opus-4.7

    [counter to Camp B: auto-parallel (Alpa / GSPMD / pjit)] No public GPU reproduction at 100B+; cost models still handle FP8 and heterogeneous MoE only coarsely [Ko2024DFModel].

contestedc-3b026cd843
Beyond 70B, HBM↔NVMe bandwidth becomes the bottleneck, and MoE's all-to-all has no corresponding abstraction. MegaScale [Jiang2024MegaScale] and DeepSeek-V3 [DeepSeek2024V3] show that the frontier requires explicit TP/EP.
Source papers · 3 · [Jiang2024MegaScale][DeepSeek2024V3] · arXiv 2402.15627 (arxiv.org)
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 2402.15627 · report.positions[2].counter · claude-opus-4.7

    [counter to Camp C: FSDP-only is enough] Past 70B, HBM↔NVMe bandwidth becomes the bottleneck; MoE all-to-all has no native abstraction. MegaScale [Jiang2024MegaScale] and DeepSeek-V3 [DeepSeek2024V3] show that explicit TP/EP is required at the frontier.

contestedc-0105ba1bea
At L ≥ 32K, attention dominates memory and a 3D stack OOMs [Korthikanti2022SP, Liu2023RingAttn]; on the MoE route, EP is not optional [GShard2020, DeepSeek2024V3].
Source papers · 4 · [Korthikanti2022SP][Liu2023RingAttn][GShard2020][DeepSeek2024V3]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.positions[3].counter · claude-opus-4.7

    [counter to Camp D: 3D (DP+TP+PP) is enough, SP/EP optional] For L ≥ 32K attention dominates memory and the 3D stack OOMs [Korthikanti2022SP, Liu2023RingAttn]; under MoE, EP is not optional [GShard2020, DeepSeek2024V3].

consensusc-5886ef3ff8
The generalist default is 20% code with a 15–25% range; below 15% reasoning is under-trained, and only above 30% does NL enter the risk zone ([DeepSeekCoder2024][CodeLlama2023][Aghajanyan2023SciMix]).
Source papers · 4 · [DeepSeekCoder2024][CodeLlama2023][Aghajanyan2023SciMix] · arXiv 2401.14196 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2401.14196 · report.headline_claims · claude-opus-4.7

    Generalist default is 20% code, range 15–25%; below 15% reasoning is under-trained, only above 30% does NL enter the risk zone ([DeepSeekCoder2024][CodeLlama2023][Aghajanyan2023SciMix]).

consensusc-a180b2e9a4
The reasonable range for a code specialist is 50–90%, corresponding to a 6–9 pp MMLU drop versus a matched generalist base; don't pretend NL won't regress, recover it via post-training ([DeepSeekCoder2024]).
Source papers · 2 · [DeepSeekCoder2024] · arXiv 2401.14196 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2401.14196 · report.headline_claims · claude-opus-4.7

    The specialist range is 50–90%, with a 6–9 pp MMLU drop versus a matched generalist base; don't pretend NL won't regress, recover it via post-training ([DeepSeekCoder2024]).

consensusc-812e7a4360
Code boosts the 'structured / multi-step' slice of reasoning, not commonsense across the board; the controlled experiments of Petty et al. [Petty2024CodeReasoning] show +2–5 pp on math and no significant effect on commonsense.
Source papers · 1 · [Petty2024CodeReasoning]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · report.headline_claims · claude-opus-4.7

    Code boosts the structured / multi-step slice of reasoning, not commonsense as a whole; the controlled experiment of Petty et al. [Petty2024CodeReasoning] shows +2–5 pp on math and no significant effect on commonsense.

consensusc-186df3fac8
Continual training is not a sufficient condition for catastrophe: Code Llama's [CodeLlama2023] 500B-token code continual run costs <1 pp MMLU; but Ma et al. [Ma2023AtWhichLayer] point out that adding code only in annealing transfers far less than mixing it in throughout.
Source papers · 3 · [CodeLlama2023][Ma2023AtWhichLayer] · arXiv 2308.12950 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2308.12950 · report.headline_claims · claude-opus-4.7

    Continual is not automatically catastrophic: Code Llama [CodeLlama2023] loses <1 pp MMLU after 500B continual code tokens; but Ma et al. [Ma2023AtWhichLayer] shows annealing-only code transfers far less than mixing throughout.

consensusc-e2dfd5f0c3
Repo-level packing [Shi2023InContextPretraining] + FIM rate 0.5–0.9 [Bavarian2022FIM] + structural tokens [Li2023StarCoder] are already the 2026 defaults; skip any one and cross-file completion drops 5–10 pp.
Source papers · 4 · [Shi2023InContextPretraining][Bavarian2022FIM][Li2023StarCoder] · arXiv 2309.16039 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2309.16039 · report.headline_claims · claude-opus-4.7

    Repo-level packing [Shi2023InContextPretraining] + FIM 0.5–0.9 [Bavarian2022FIM] + structural tokens [Li2023StarCoder] are 2026 defaults; skipping any one loses 5–10 pp on cross-file completion.

contestedc-591db01c6d
The phase diagram of Aghajanyan et al. [Aghajanyan2023SciMix] puts >40% code in the interference regime; DeepSeek-Coder's [DeepSeekCoder2024] 6–9 pp MMLU gap at 87% is a real cost; and Petty et al. [Petty2024CodeReasoning] confirm that code's transfer does not cover all of commonsense.
Source papers · 4 · [Aghajanyan2023SciMix][DeepSeekCoder2024][Petty2024CodeReasoning] · arXiv 2301.03728 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2301.03728 · report.positions[0].counter · claude-opus-4.7

    [counter to Camp A: more code is always better, generalists should push …] The phase plot of Aghajanyan et al. [Aghajanyan2023SciMix] puts >40% in the interference regime; DeepSeek-Coder [DeepSeekCoder2024]'s 6–9 pp MMLU gap at 87% is a hard cost, and Petty et al. [Petty2024CodeReasoning] confirm code transfer does not cover commonsense.

contestedc-669c09c030
Olsson et al. [Olsson2022InductionHeads] provide circuit-level evidence: code's repeat-token structure is distinctive fuel for induction heads, not something generic low-entropy data can substitute for; the structured-commonsense gains of Madaan et al. [Madaan2022CodeReason] also cannot be explained by an 'effective LR' story.
Source papers · 2 · [Olsson2022InductionHeads][Madaan2022CodeReason]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · report.positions[1].counter · claude-opus-4.7

    [counter to Camp B: code's benefit is purely regularisation / lower effective LR] Olsson et al. [Olsson2022InductionHeads] offer circuit-level evidence: code's repeat-token structure is specific induction-head fuel, not substitutable by generic low-entropy data.

contestedc-f5fc7112af
Three independent ratio ablations [DeepSeekCoder2024][CodeLlama2023][CodeScalingLaws2023] all show reasoning under-trained below 15%; Llama 3 [Llama3Herd] actually raised the code share late in its 15T run — '<10% to protect NL' runs opposite to actual frontier choices.
Source papers · 4 · [DeepSeekCoder2024][CodeLlama2023][CodeScalingLaws2023] · arXiv 2401.14196 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2401.14196 · report.positions[2].counter · claude-opus-4.7

    [counter to Camp C: keep code below 10% to protect NL] Three independent ratio ablations [DeepSeekCoder2024][CodeLlama2023][CodeScalingLaws2023] show reasoning is under-trained below 15%; Llama 3 [Llama3Herd] actually raises code share late in its 15T run.

contestedc-e08d71a523
Code Llama's [CodeLlama2023] 500B continual run costs <1 pp MMLU — the evidence itself is a direct counter-example; Ma et al. [Ma2023AtWhichLayer] further show that with NL replay and code participating throughout, continual transfer can reach 80%+ of from-scratch.
Source papers · 3 · [CodeLlama2023][Ma2023AtWhichLayer] · arXiv 2308.12950 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2308.12950 · report.positions[3].counter · claude-opus-4.7

    [counter to Camp D: code ability must be trained from scratch; continual…] Code Llama [CodeLlama2023]'s 500B continual loses <1 pp MMLU—a direct counter-example; Ma et al. [Ma2023AtWhichLayer] further show that with NL replay and full-run code participation, continual transfer reaches 80%+ of from-scratch.

consensusc-429c87aafd
'Nominal 128K' models that change only the positional encoding, not the data recipe, collapse to ~32K effective length on RULER [Hsieh2024RULER], consistent with the continual-pretrain ablations of Fu et al. [Fu2024DataEngineering] and Xiong et al. [Xiong2023EffectiveLongCtx].
Source papers · 4 · [Hsieh2024RULER][Fu2024DataEngineering][Xiong2023EffectiveLongCtx] · arXiv 2404.06654 (arxiv.org)
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2404.06654 · report.headline_claims · claude-opus-4.7

    'Nominal 128K' models that only change positional encoding collapse to ~32K effective length on RULER [Hsieh2024RULER], consistent with the continual-pretrain ablations of Fu et al. [Fu2024DataEngineering] and Xiong et al. [Xiong2023EffectiveLongCtx].

consensusc-44632aa071
Under the controlled synthetic experiments of Chan et al. [Chan2022DataDist], ICL emergence is a function of burstiness plus a skewed Zipfian distribution, not of parameter count; this explains why the related-document packing of Shi et al. [Shi2023InContextPretraining] lifts ICL by +8 pp at the same compute.
Source papers · 2 · [Chan2022DataDist][Shi2023InContextPretraining]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.headline_claims · claude-opus-4.7

    ICL emergence, under Chan et al. [Chan2022DataDist]'s controlled synthetic setups, is a function of burstiness + skewed Zipfian, not parameter count; this explains why Shi et al. [Shi2023InContextPretraining]'s related-document packing adds +8 pp ICL at matched compute.

consensusc-8376f2d76a
The training-time phase transition of induction heads [Olsson2022InductionHeads] coincides with the ICL performance phase transition; packing order directly determines whether that circuit transfers across documents.
Source papers · 1 · [Olsson2022InductionHeads]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.headline_claims · claude-opus-4.7

    The induction-head [Olsson2022InductionHeads] phase transition aligns with the ICL phase transition during training; packing order directly controls whether the circuit transfers across documents.

consensusc-1344c41c92
Long-context post-training (Bai et al. [Bai2024LongAlign]) must replicate the pretrain packing structure; otherwise 32K+ capability regresses during SFT.
Source papers · 2 · [Bai2024LongAlign] · arXiv 2401.18058 (arxiv.org)
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2401.18058 · report.headline_claims · claude-opus-4.7

    Long-context post-training (Bai et al. [Bai2024LongAlign]) must replicate pretrain packing structure; otherwise 32K+ capability degrades during SFT.

consensusc-2de92376f9
Retrieval-only long-context evals (NIAH and variants) overstate capability by 20–30 pp; Karpinska et al. [Karpinska2024NoCha] and Goldman et al. [Goldman2024LongCtxTaxonomy] provide honest difficulty calibration.
Source papers · 3 · [Karpinska2024NoCha][Goldman2024LongCtxTaxonomy] · arXiv 2406.16264 (arxiv.org)
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2406.16264 · report.headline_claims · claude-opus-4.7

    Retrieval-only long-context evals (NIAH and variants) overstate capability by 20–30 pp; Karpinska et al. [Karpinska2024NoCha] and Goldman et al. [Goldman2024LongCtxTaxonomy] provide honest difficulty calibration.

contestedc-45a91c3dc2
Hsieh et al. [Hsieh2024RULER] show that effective context for this lineage mostly collapses at 32K; Karpinska et al. [Karpinska2024NoCha] show Gemini-1.5-1M scores only ~50% on book-length reasoning; and the ablations of Fu et al. [Fu2024DataEngineering] show that changing positions without changing data misses the main gains.
Source papers · 4 · [Hsieh2024RULER][Karpinska2024NoCha][Fu2024DataEngineering] · arXiv 2404.06654 (arxiv.org)
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2404.06654 · report.positions[0].counter · claude-opus-4.7

    [counter to Camp A: Positional extrapolation is enough] Hsieh et al. [Hsieh2024RULER] show effective context collapses around 32K for this lineage; Karpinska et al. [Karpinska2024NoCha] show Gemini-1.5-1M scores ~50% on book-length reasoning; Fu et al. [Fu2024DataEngineering] ablations show position-only changes miss the main gains.

contestedc-51ca2c6ff5
Shi et al. [Shi2023InContextPretraining] 与 Staniszewski et al. [Staniszewski2023StructuredPacking] 指出:同样的数据,不同拼接结构下收益差一个数量级——数据配方本身不足以独立解释长上下文能力。
来源论文· 3[Shi2023InContextPretraining][Staniszewski2023StructuredPacking]arXiv 2309.16039arxiv.org
1 观测Context Scaling Pretrain
证据 (1)
  • topic_reportcontext-scaling-pretrainarXiv 2309.16039· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: The data recipe is the main variable] Shi et al. [Shi2023InContextPretraining] and Staniszewski et al. [Staniszewski2023StructuredPacking] note: same data, different packing structures, yield differs by an order of magnitude — the data recipe alone cannot explain long-context capability.

contestedc-55bcc2bb33
Camp A practitioners would say packing has long been done this way internally and is simply under-reported. Camp E evals (RULER [Hsieh2024RULER], NoCha [Karpinska2024NoCha]) show gaps remain even after packing is done; it cannot be treated as a silver bullet.
来源论文· 3[Hsieh2024RULER][Karpinska2024NoCha]arXiv 2404.06654arxiv.org
1 观测Context Scaling Pretrain
证据 (1)
  • topic_reportcontext-scaling-pretrainarXiv 2404.06654· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: Packing engineering is the under-exploited lever] Camp A advocates note packing is done internally and under-reported. Camp E evals (RULER [Hsieh2024RULER], NoCha [Karpinska2024NoCha]) show gaps remain after packing — not a silver bullet.

contestedc-5b64164d84
On compositional tasks such as RULER [Hsieh2024RULER] / NoCha [Karpinska2024NoCha], same-size SSMs and hybrids still trail pure Transformers; induction heads are not a natural circuit in SSMs [Olsson2022InductionHeads].
来源论文· 4[Hsieh2024RULER][Karpinska2024NoCha][Olsson2022InductionHeads]arXiv 2404.06654arxiv.org
1 观测Context Scaling Pretrain
证据 (1)
  • topic_reportcontext-scaling-pretrainarXiv 2404.06654· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: Switch architectures (SSM / linear) to bypass position] On compositional evals (RULER [Hsieh2024RULER], NoCha [Karpinska2024NoCha]), same-size SSMs and hybrids still trail pure Transformers; induction heads are not a natural circuit in SSMs.

consensusc-263a0c5909
Over 70% of key 2022–2026 pretraining architecture decisions are driven by kernel constraints rather than independent algorithmic choice: the KV-head counts of GQA/MLA ([Ainslie2023GQA], [DeepSeek2024V2]) come directly from the bandwidth wall of memory-bound decode.
来源论文· 3[Ainslie2023GQA][DeepSeek2024V2]arXiv 2305.13245arxiv.org
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrainarXiv 2305.13245· report.headline_claimsclaude-opus-4.7

    Over 70% of key 2022–2026 pretrain architecture decisions are driven by kernel constraints rather than independent algorithmic choice: KV-head counts in GQA/MLA ([Ainslie2023GQA], [DeepSeek2024V2]) come directly from the decode memory-bound bandwidth wall.

consensusc-adbdedf1e5
FP8 pretraining at 100B+ scale drifts without per-block scaling: [Fishman2024FP8Scale] exposes the systematic failure of per-tensor scaling at 2T tokens, and [Mishra2025MXFP8Recipes] supplies a reproducible MXFP8 recipe (see the sketch below).
来源论文· 3[Fishman2024FP8Scale][Mishra2025MXFP8Recipes]arXiv 2409.12517arxiv.org
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrainarXiv 2409.12517· report.headline_claimsclaude-opus-4.7

    FP8 pretraining above 100B drifts without per-block scaling — [Fishman2024FP8Scale] exposes systematic per-tensor failure at 2T tokens, and [Mishra2025MXFP8Recipes] supplies the reproducible MXFP8 fix.

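A minimal sketch of why per-block scaling survives outliers that break per-tensor scaling, assuming FP8 E4M3 with a finite max of ~448; the 128x128 block size is an illustrative choice, not the exact recipe of either cited paper:

```python
import torch

FP8_MAX = 448.0  # E4M3 finite max (illustrative of the cited recipes)

def quantize_per_tensor(x: torch.Tensor):
    scale = x.abs().max() / FP8_MAX   # one scale for the whole tensor:
    return x / scale, scale           # a single outlier inflates it and
                                      # crushes everything else to few bits

def quantize_per_block(x: torch.Tensor, block: int = 128):
    r, c = x.shape
    xb = x.reshape(r // block, block, c // block, block)
    amax = xb.abs().amax(dim=(1, 3), keepdim=True)  # one amax per block
    scale = amax / FP8_MAX
    return xb / scale, scale          # an outlier only poisons its own block

x = torch.randn(512, 512)
x[0, 0] = 1e4                         # a single activation outlier
q_t, _ = quantize_per_tensor(x)
q_b, _ = quantize_per_block(x)
# Per-tensor values collapse toward zero on the FP8 grid; per-block values
# keep full dynamic range everywhere except the one contaminated block.
print(q_t.abs().median().item(), q_b.abs().median().item())
```
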
consensusc-ca65e1e8fb
Kernel numerics leak into convergence: [Golden2024FAStability] correlates FlashAttention's blockwise softmax with loss spikes in large-scale training, breaking the illusion that 'kernels only affect speed, not quality'.
来源论文· 2[Golden2024FAStability]arXiv 2405.02803arxiv.org
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrainarXiv 2405.02803· report.headline_claimsclaude-opus-4.7

    Kernel numerics leak into convergence: [Golden2024FAStability] correlates FlashAttention's block-softmax with large-scale loss spikes, breaking the illusion that 'kernels only affect speed, not quality'.

consensusc-b4d6306966
The correct opening move for diagnosing any kernel is the roofline ([Williams2008Roofline]), not fusion: fusing a memory-bound kernel is nearly useless, so classify first, then optimize (see the sketch below).
来源论文· 1[Williams2008Roofline]
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrain· report.headline_claimsclaude-opus-4.7

    The correct opening move for any kernel diagnosis is the roofline ([Williams2008Roofline]), not fusion: fusing a memory-bound kernel is nearly useless; classify first, optimize second.

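A minimal roofline-classification sketch: a kernel is memory-bound when its arithmetic intensity (FLOPs per byte of traffic) sits below the machine balance (peak FLOP/s over peak bandwidth). The peak numbers are illustrative H100-class figures, not vendor-audited values:

```python
PEAK_FLOPS = 989e12   # ~H100 SXM BF16 dense TFLOP/s (illustrative)
PEAK_BW = 3.35e12     # ~HBM3 bytes/s (illustrative)
BALANCE = PEAK_FLOPS / PEAK_BW   # ~295 FLOPs/byte machine balance

def classify(flops: float, bytes_moved: float) -> str:
    """Roofline regime plus attainable throughput for one kernel."""
    intensity = flops / bytes_moved
    attainable = min(PEAK_FLOPS, intensity * PEAK_BW)
    regime = "memory-bound" if intensity < BALANCE else "compute-bound"
    return f"I={intensity:.2f} FLOP/B -> {regime}, {attainable/1e12:.0f} TFLOP/s attainable"

# Elementwise BF16 add: 1 FLOP per 6 bytes (2 reads + 1 write) -> memory-bound.
n = 1 << 30
print(classify(flops=n, bytes_moved=6 * n))
# Large BF16 GEMM (m=n=k=8192): 2*m^3 FLOPs vs ~6*m^2 bytes -> compute-bound.
m = 8192
print(classify(flops=2 * m**3, bytes_moved=2 * 3 * m * m))
```
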
consensusc-84a1ec971e
Fine-grained MoE (DeepSeek-V2/V3, Mixtral, Qwen-MoE) is the downstream result of grouped GEMM maturing ([Gale2022MegaBlocks]) plus Hopper async ([Luo2024HopperDissect]), not an algorithmic inspiration that later found a kernel.
来源论文· 2[Gale2022MegaBlocks][Luo2024HopperDissect]
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrain· report.headline_claimsclaude-opus-4.7

    Fine-grained MoE (DeepSeek-V2/V3, Mixtral, Qwen-MoE) is the downstream of grouped GEMM maturing ([Gale2022MegaBlocks]) combined with Hopper async ([Luo2024HopperDissect]), not an algorithmic inspiration that later found a kernel.

contestedc-76fa91e228
Too costly for teams below frontier scale; small teams can capture 80% of the performance without writing their own kernels.
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrain· report.positions[0].counterclaude-opus-4.7

    [counter to Camp A: kernels and algorithms must be co-designed] Too costly for non-frontier teams; smaller teams can capture 80% without writing kernels themselves.

contestedc-7fc2464e49
[Fishman2024FP8Scale] shows trillion-token scale requires kernel-level numerics; [Golden2024FAStability] shows kernel choice leaks into convergence; MLA ([DeepSeek2024V2]) demands a custom attention implementation.
来源论文· 4[Fishman2024FP8Scale][Golden2024FAStability][DeepSeek2024V2]arXiv 2409.12517arxiv.org
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrainarXiv 2409.12517· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: PyTorch level is enough] [Fishman2024FP8Scale] shows trillion-token scale requires kernel-level numerics; [Golden2024FAStability] shows kernel choice leaks into convergence; MLA ([DeepSeek2024V2]) demands a custom attention implementation.

contestedc-8ffba63e6c
Each hardware generation's new capabilities in fact force algorithmic migration: Hopper async birthed FA3, Blackwell MXFP8 birthed the [Mishra2025MXFP8Recipes] recipes, and FP4 tensor cores push hybrid precision. Algorithms do change, forced by new primitives.
来源论文· 2[Mishra2025MXFP8Recipes]arXiv 2506.08027arxiv.org
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrainarXiv 2506.08027· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: hardware keeps improving, algorithms don't need to adapt] Each hardware generation's new capabilities force algorithmic migration: Hopper async birthed FA3, Blackwell MXFP8 birthed [Mishra2025MXFP8Recipes], FP4 tensor cores push hybrid precision. Algorithms do change, forced by new primitives.

contestedc-47308ca159
The CUTLASS 3 / CuTe-DSL ([CUTLASS3]) and FA3 ([Shah2024FA3]) software moat is still widening; MXFP8 hardware support is currently Blackwell-only, and non-NVIDIA vendors are still chasing per-block FP8. In practice the frontier ecosystem remains clearly NVIDIA-dominated.
来源论文· 2[Shah2024FA3]arXiv 2407.08608arxiv.org
1 观测Cuda Kernel Pretrain
证据 (1)
  • topic_reportcuda-kernel-pretrainarXiv 2407.08608· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: non-NVIDIA hardware will catch up] The CUTLASS 3 / CuTe-DSL ([CUTLASS3]) and FA3 ([Shah2024FA3]) software moat is still widening; MXFP8 hardware support today is Blackwell-only, with non-NV vendors still chasing per-block FP8. The measured frontier ecosystem remains clearly NVIDIA-centric.

consensusc-894079b1de
Proxy-to-frontier rank correlation in data-ablation ladders is task-dependent: on DCLM [DCLM2024], Spearman ≈ 0.78 for MMLU across 412M↔7B but only ≈ 0.41 for HumanEval; code/reasoning decisions cannot be finalized below 1B (see the sketch below).
来源论文· 2[DCLM2024]arXiv 2406.11794arxiv.org
2 观测Data Value Causality
证据 (2)
  • topic_reportdata-value-causalityarXiv 2406.11794· report.headline_claimsclaude-opus-4.7

    Proxy-to-frontier rank correlation in data ablation ladders is task-dependent: on DCLM [DCLM2024], Spearman ≈ 0.78 for MMLU but ≈ 0.41 for HumanEval across 412M↔7B, so code/reasoning calls cannot be made below 1B.

  • topic_reportdata-value-causalityarXiv 2406.11794· report.headline_claimsgpt-5.2

    On DCLM-style ladders, proxy-to-target rank correlation is strongly capability-dependent: Spearman≈0.78 for MMLU (412M↔7B) but only ≈0.41 for HumanEval; therefore code/reasoning data decisions should not be finalized at ≤1B scale.[DCLM2024]

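A minimal sketch of the check this claim implies: score each candidate data recipe at proxy and target scale, then compute the Spearman rank correlation to see whether the proxy ladder preserves the target ranking. The recipe names and scores are made-up placeholders:

```python
from scipy.stats import spearmanr

# One downstream score per candidate data recipe, at two model scales.
proxy_412m = {"recipeA": 31.2, "recipeB": 29.8, "recipeC": 33.0, "recipeD": 30.5}
target_7b  = {"recipeA": 58.1, "recipeB": 55.4, "recipeC": 61.9, "recipeD": 57.7}

recipes = sorted(proxy_412m)
rho, p = spearmanr([proxy_412m[r] for r in recipes],
                   [target_7b[r] for r in recipes])
# Reading it against the claim above: rho ~ 0.78 (MMLU-like) is usable for
# triage, while rho ~ 0.41 (HumanEval-like) means the proxy ladder cannot
# settle the decision and the ablation must be rerun at larger scale.
print(f"Spearman rho = {rho:.2f} (p = {p:.2f})")
```
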
consensusc-2671edc1ea
Influential data drifts with scale: in [AnthropicInfluence2023], the top-influence samples for the same completion overlap <10% between 810M and 52B models; the tacit assumption that 'influential data is model-agnostic' does not hold.
来源论文· 2[AnthropicInfluence2023]arXiv 2308.03296arxiv.org
1 观测Data Value Causality
证据 (1)
  • topic_reportdata-value-causalityarXiv 2308.03296· report.headline_claimsclaude-opus-4.7

    Influential data drifts with scale: in [AnthropicInfluence2023], the top-influence samples for the same completion overlap <10% between 810M and 52B models, invalidating the tacit assumption that influential data is model-agnostic.

consensusc-0977c84866
The classifier family contributes more than threshold tuning: DCLM [DCLM2024] shows 4–6 pp MMLU gaps between fastText, DSIR [DSIR2023], and perplexity filters, while single-classifier threshold sweeps usually move <1 pp. For bulk filtering, picking the classifier matters more than tuning hyperparameters.
来源论文· 3[DCLM2024][DSIR2023]arXiv 2406.11794arxiv.org
1 观测Data Value Causality
证据 (1)
  • topic_reportdata-value-causalityarXiv 2406.11794· report.headline_claimsclaude-opus-4.7

    Classifier family matters more than threshold tuning: DCLM [DCLM2024] shows 4–6 pp MMLU gaps between fastText, DSIR [DSIR2023], and perplexity filters, while single-classifier threshold sweeps usually move <1 pp. Pick the classifier before sweeping thresholds.

consensusc-c021d44800
Synthesis-with-verifier gains concentrate on the distribution the verifier covers: Phi [Textbooks2023] exceeds 50% on HumanEval but degrades markedly on code distributions outside verifier coverage; synthesis is not a universal data-value lever.
来源论文· 2[Textbooks2023]arXiv 2306.11644arxiv.org
1 观测Data Value Causality
证据 (1)
  • topic_reportdata-value-causalityarXiv 2306.11644· report.headline_claimsclaude-opus-4.7

    Synthesis-with-verifier gains concentrate on the verifier's covered distribution: Phi [Textbooks2023] hits 50%+ on HumanEval but degrades on code distributions outside verifier coverage; synthesis is not a universal data-value lever.

contestedc-9f0043b113
On code / multi-step reasoning, DCLM [DCLM2024] itself reports only 0.41 Spearman between 412M and 7B; Physics-of-LMs [PhysicsLMs2024] shows per-document duplication non-linearities that classifiers flatten out. So the 'good enough' boundary covers knowledge and reading, not code and factual memorization.
来源论文· 3[DCLM2024][PhysicsLMs2024]arXiv 2406.11794arxiv.org
1 观测Data Value Causality
证据 (1)
  • topic_reportdata-value-causalityarXiv 2406.11794· report.positions[0].counterclaude-opus-4.7

    [counter to Camp A: Quality classifier plus ablation ladder is enough] DCLM [DCLM2024] itself reports only 0.41 Spearman between 412M and 7B on code; Physics-of-LMs [PhysicsLMs2024] shows per-document duplication non-linearities that classifiers flatten out. 'Good enough' holds for knowledge and reading, not for code and factual memorization.

contestedc-0f6b7dadb2
[AnthropicInfluence2023] itself reports <10% top-influence overlap across scales, which makes influence unreliable as a production data filter; TRAK [TRAK2023] has no independent replication above 10B; Simfluence [Simfluence2023] is validated only at small scale.
来源论文· 4[AnthropicInfluence2023][TRAK2023][Simfluence2023]arXiv 2308.03296arxiv.org
1 观测Data Value Causality
证据 (1)
  • topic_reportdata-value-causalityarXiv 2308.03296· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: Influence functions are the main path] [AnthropicInfluence2023] itself reports <10% top-influence overlap across scales, making influence unreliable for production-level filtering; TRAK [TRAK2023] lacks independent >10B replication; Simfluence [Simfluence2023] is validated only at small scale.

contestedc-6cb1d91960
IV requires the instrument to affect the outcome only through the treatment, but backdoor paths between domain and downstream are hard to rule out in real web data; Skill-it [SkillIt2023]'s mediator assumes skills are enumerable, an assumption that emergent capabilities break.
来源论文· 1[SkillIt2023]
1 观测Data Value Causality
证据 (1)
  • topic_reportdata-value-causality· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: Full causal inference is the future] IV requires the instrument to affect outcome only through treatment, but backdoor paths between domain and downstream are hard to rule out on real web data; Skill-it [SkillIt2023]'s enumerable-skill mediator assumption is broken by emergent capabilities.

contestedc-d7cc4e91f7
DCLM [DCLM2024]'s thousands of ablations show that even experienced teams' intuition deviates from regression predictions by 3–5 pp MMLU on mixture decisions; FineWeb-Edu [FineWeb2024]'s open recipe shows classifier-free baselines trail by 4+ pp. The cost of 'going on intuition' is roughly 5–15% of the downstream ceiling per frontier run.
来源论文· 3[DCLM2024][FineWeb2024]arXiv 2406.11794arxiv.org
1 观测Data Value Causality
证据 (1)
  • topic_reportdata-value-causalityarXiv 2406.11794· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: Skip measurement, rely on intuition and scale] DCLM [DCLM2024]'s thousands of ablations show experienced teams' intuition lags regression predictions by 3–5 pp MMLU on mixture decisions; FineWeb-Edu [FineWeb2024]'s open recipe shows classifier-free baselines trail by 4+ pp; 'intuition' costs roughly 5–15% of the downstream ceiling per frontier run.

consensusc-0346a3ce4c
FA3 [Shah2024FA3] reaches ~75% of BF16 peak on H100 and ~1.2 PFLOPs/s in FP8, more than 2× FA2-BF16 [Dao2023FA2]; attention has crossed outright from memory-bound to compute-bound.
来源论文· 3[Shah2024FA3][Dao2023FA2]arXiv 2407.08608arxiv.org
2 观测Flashattention Kernels
证据 (2)
  • topic_reportflashattention-kernelsarXiv 2407.08608· report.headline_claimsclaude-opus-4.7

    FA3 [Shah2024FA3] hits ~75% BF16 peak and ~1.2 PFLOPs/s FP8 on H100, more than 2× FA2-BF16 [Dao2023FA2] — attention has crossed from memory-bound to compute-bound.

  • topic_reportflashattention-kernelsarXiv 2407.08608· report.headline_claimsgpt-5.2

    On H100, FA3 [Shah2024FA3] pushes attention close to compute-bound, reporting ~75% peak in BF16 and ~1.2 PFLOPs/s in FP8; therefore, “hand-writing even more aggressive kernels” typically looks like 10–20% marginal gains rather than order-of-magnitude wins.

consensusc-cac51444f4
FlexAttention [Dong2024Flex] reaches 85–95% of hand-written FA2 throughput on variants such as ALiBi / SWA / soft masks, cutting the code from ~800 lines of CUDA to ~10 lines of Python; the 2026 default entry point for variants should be FlexAttention, not CUDA (see the sketch below).
来源论文· 2[Dong2024Flex]arXiv 2412.05496arxiv.org
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernelsarXiv 2412.05496· report.headline_claimsclaude-opus-4.7

    FlexAttention [Dong2024Flex] reaches 85–95% of hand-written FA2 throughput on ALiBi / SWA / soft-mask variants with ~10 lines of Python instead of ~800 lines of CUDA; the 2026 default entry point for variants is FlexAttention, not CUDA.

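A sketch of what the '~10 lines of Python' looks like for ALiBi, assuming PyTorch ≥ 2.5's torch.nn.attention.flex_attention API (signature recalled from its docs; verify before use). A CUDA device and torch.compile are assumed for real throughput; shapes are illustrative:

```python
import torch
from torch.nn.attention.flex_attention import flex_attention

B, H, S, D = 2, 8, 1024, 64
q, k, v = (torch.randn(B, H, S, D, device="cuda", dtype=torch.bfloat16)
           for _ in range(3))

def alibi(score, b, h, q_idx, kv_idx):
    # ALiBi: subtract a per-head linear distance penalty; slopes 2^(-8i/H).
    slope = 2.0 ** (-8.0 * (h + 1) / H)
    return score - slope * (q_idx - kv_idx)

# score_mod is traced and fused into one attention kernel under torch.compile.
out = flex_attention(q, k, v, score_mod=alibi)
```
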
consensusc-0891f82aa3
In decode, training-shape kernels fall below 20% SM occupancy; FlashDecoding [Hong2023FlashDec] restores 70%+ by chunking the KV cache into 128/256-token blocks along the sequence, and FlashInfer [Ye2024FlashInfer] is already vLLM's default backend (see the split-KV sketch below).
来源论文· 2[Hong2023FlashDec][Ye2024FlashInfer]
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernels· report.headline_claimsclaude-opus-4.7

    Training-shape kernels hit <20% SM occupancy in decode; FlashDecoding [Hong2023FlashDec] restores 70%+ by chunking KV along seq into 128/256-token blocks, and FlashInfer [Ye2024FlashInfer] is now vLLM's default backend.

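A minimal sketch of the split-KV idea behind FlashDecoding: with a single query token at decode time, parallelism must come from the KV axis, so attention is computed per KV chunk and the partial softmaxes are merged with a running log-sum-exp. In the real kernel each chunk runs on its own SM; here the merge logic is shown sequentially:

```python
import torch

def decode_attention(q, K, V, chunk=256):
    # q: (d,), K/V: (S, d); each chunk would run on a separate SM in the kernel.
    scale = q.shape[0] ** -0.5
    m = torch.tensor(float("-inf"))   # running max of logits
    s = torch.tensor(0.0)             # running softmax denominator
    acc = torch.zeros_like(q)         # running weighted-value accumulator
    for K_c, V_c in zip(K.split(chunk), V.split(chunk)):
        logits = (K_c @ q) * scale
        m_new = torch.maximum(m, logits.max())
        p = torch.exp(logits - m_new)            # chunk-local unnormalized probs
        correction = torch.exp(m - m_new)        # rescale earlier partials
        s = s * correction + p.sum()
        acc = acc * correction + p @ V_c
        m = m_new
    return acc / s

q, K, V = torch.randn(64), torch.randn(4096, 64), torch.randn(4096, 64)
ref = torch.softmax((K @ q) / 64 ** 0.5, dim=0) @ V
assert torch.allclose(decode_attention(q, K, V), ref, atol=1e-5)
```
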
consensusc-8582c13f41
FP8 FA3 shows <0.1% per-step loss deviation on 10B-class LMs, but [Golden2024FAStable] notes that certain rescale orderings can amplify error; production FP8 training must run a 2–5B-step BF16 control.
来源论文· 1[Golden2024FAStable]
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernels· report.headline_claimsclaude-opus-4.7

    FP8 FA3 shows <0.1% per-step loss deviation at 10B scale, but [Golden2024FAStable] shows some rescale orderings amplify error; production FP8 training must run a 2–5B-step BF16 control.

consensusc-78dda4740f
The 'replace attention' camp (RetNet [Sun2023RetNet], Mamba [Waleffe2024Mamba]) has caught up on short-context LM loss but still trails by 2–5 pp on retrieval / ICL; the reliable 2026 win is a hybrid (~1/6 attention layers), not pure SSM.
来源论文· 3[Sun2023RetNet][Waleffe2024Mamba]arXiv 2307.08621arxiv.org
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernelsarXiv 2307.08621· report.headline_claimsclaude-opus-4.7

    The 'replace attention' camp (RetNet [Sun2023RetNet], Mamba [Waleffe2024Mamba]) has matched short-context LM loss but still trails by 2–5 pp on retrieval / ICL; the 2026 reliable win is a hybrid with ~1/6 attention layers, not pure SSM.

contestedc-6856d77478
This ignores the tooling dividend of FlexAttention [Dong2024Flex] and FlashInfer [Ye2024FlashInfer]: fewer people write kernels, many more write variants, and the ecosystem's center of gravity has already shifted.
来源论文· 3[Dong2024Flex][Ye2024FlashInfer]arXiv 2412.05496arxiv.org
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernelsarXiv 2412.05496· report.positions[0].counterclaude-opus-4.7

    [counter to Camp A: the FA series is the endpoint of attention engineering] Ignores the tooling dividend of FlexAttention [Dong2024Flex] and FlashInfer [Ye2024FlashInfer]: fewer kernel authors, many more variant authors, and the center of gravity has already shifted.

contestedc-84c53d7f5e
FlexAttention still trails FA3's FP8 + warp-specialization features by 5–15% [Shah2024FA3]; at the marginal batch of frontier training, that 10% is real money.
来源论文· 2[Shah2024FA3]arXiv 2407.08608arxiv.org
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernelsarXiv 2407.08608· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: Triton / FlexAttention is the real revolution] FlexAttention still trails FA3's FP8 + warp-specialization by 5–15% [Shah2024FA3]; at frontier training's marginal batch that 10% translates into real compute budget.

contestedc-77e493aa83
[Waleffe2024Mamba]'s own 8B experiments show retrieval / ICL still trail by 2–5 pp; the stable endpoint is a hybrid (~1/6 attention layers), not pure SSM.
来源论文· 1[Waleffe2024Mamba]
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernels· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: attention itself should be replaced (SSM / linear RN] [Waleffe2024Mamba]'s own 8B study shows 2–5 pp deficits on retrieval / ICL; the stable endpoint is hybrid (~1/6 attention layers), not pure SSM.

contestedc-d0acb78463
FA1/2 already have engineering ports to AMD MI300 and several domestic accelerators, and Triton's AMD backend matured in 2024; the lock-in concentrates in the FA3 FP8 generation, while upstream FA2 + FlexAttention is clearly more portable.
1 观测Flashattention Kernels
证据 (1)
  • topic_reportflashattention-kernels· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: FA is the embodiment of NVIDIA lock-in] FA1/2 already have maintained ports on AMD MI300 and several domestic accelerators, and Triton's AMD backend matured in 2024; the lock-in concentrates on the FA3 FP8 generation, while upstream FA2 + FlexAttention is clearly more portable.

consensusc-d70b4bc26b
At matched active-parameter budgets, DeepSeekMoE's fine-grained + 1 shared expert beats Mixtral-style coarse (8×7B, top-2) by 1.8–3.4 pp on MMLU [Dai2024DeepSeekMoE][Jiang2024Mixtral]; this was the primary driver of the 2024 template switch.
来源论文· 3[Dai2024DeepSeekMoE][Jiang2024Mixtral]arXiv 2401.06066arxiv.org
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscapearXiv 2401.06066· report.headline_claimsclaude-opus-4.7

    At matched active parameters, DeepSeekMoE's fine-grained + 1 shared beats Mixtral-style coarse (8×7B, top-2) by 1.8–3.4 pp MMLU [Dai2024DeepSeekMoE][Jiang2024Mixtral] — the primary driver behind the 2024 template switch.

consensusc-9307842c95
Aux-loss-free bias-EMA routing is not an 'equivalent replacement for the aux loss' but structurally simpler: dead-expert convergence has a stochastic-approximation proof [Han2025AuxFreeTheory], and published training logs show the token-drop rate can hold at <0.5% [DeepSeek2024V3][Wang2024AuxFree] (see the routing sketch below).
来源论文· 3[Han2025AuxFreeTheory][DeepSeek2024V3][Wang2024AuxFree]
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscape· report.headline_claimsclaude-opus-4.7

    Aux-loss-free bias-EMA isn't an 'aux-loss equivalent' — it's structurally simpler: dead-expert convergence has a stochastic-approximation proof [Han2025AuxFreeTheory], and published logs show token-drop rate stays <0.5% [DeepSeek2024V3][Wang2024AuxFree].

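A minimal sketch of aux-loss-free load balancing: a per-expert bias is added to the routing scores only for top-k selection, then nudged after each batch toward balancing the observed load. DeepSeek-V3 uses a fixed-step sign update and sigmoid-based gating; the softmax gate and all sizes here are illustrative simplifications:

```python
import torch

E, K, GAMMA = 64, 8, 1e-3   # experts, top-k, bias update speed (illustrative)

def route(scores: torch.Tensor, bias: torch.Tensor):
    """scores: (tokens, E) affinities. Returns routing, gates, updated bias."""
    topk = (scores + bias).topk(K, dim=-1).indices     # bias steers selection only
    load = torch.zeros(E).index_add_(0, topk.reshape(-1),
                                     torch.ones(topk.numel()))
    err = load.mean() - load                           # >0 for underloaded experts
    bias = bias + GAMMA * err.sign()                   # raise underloaded, lower overloaded
    gates = torch.softmax(scores.gather(-1, topk), -1) # gates use *unbiased* scores
    return topk, gates, bias

bias = torch.zeros(E)
for _ in range(100):                                   # load flattens out over steps
    topk, gates, bias = route(torch.randn(4096, E), bias)
```
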
consensusc-d4e3fe8aa9
Expert-choice routing [Zhou2022ExpertChoice] still wins on encoder-only and prefill, but cannot preserve causal order in decoder-only autoregressive inference; this hard constraint kept it out of the de-facto default for 2024+ open MoE.
来源论文· 2[Zhou2022ExpertChoice]arXiv 2202.09300arxiv.org
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscapearXiv 2202.09300· report.headline_claimsclaude-opus-4.7

    Expert-choice routing [Zhou2022ExpertChoice] still wins on encoder-only and prefill, but can't preserve causality in decoder-only autoregressive decoding — the hard constraint that kept it out of the 2024+ open-MoE default.

consensusc-49036c1bb3
Dense→MoE upcycling only pays off when the dense checkpoint is already token-rich, with an effective-token factor of 0.4–0.6× [Liew2025Upcycling]; directly upcycling a low-token dense model often yields negative marginal returns, giving the OLMo 3 / OLMo 2 dense path its ROI rationale [OLMo2025Olmo3][Walsh2024OLMo2].
来源论文· 4[Liew2025Upcycling][OLMo2025Olmo3][Walsh2024OLMo2]arXiv 2502.03009arxiv.org
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscapearXiv 2502.03009· report.headline_claimsclaude-opus-4.7

    Dense→MoE upcycling only pays off when the dense checkpoint is already token-rich, with effective-token factor ≈0.4–0.6× [Liew2025Upcycling]; upcycling from under-trained dense frequently yields negative marginal return, giving the OLMo 3 / OLMo 2 dense path its ROI rationale.

consensusc-ce4d390a0a
Router z-loss [Zoph2022STMoE] is orthogonal to the aux loss: it fixes logit overflow, not load collapse, which is why DeepSeek-V3 keeps z-loss = 1e-3 even after removing the aux loss [DeepSeek2024V3] (see the sketch below).
来源论文· 3[Zoph2022STMoE][DeepSeek2024V3]arXiv 2202.09368arxiv.org
2 观测MOE Landscape
证据 (2)
  • topic_reportmoe-landscapearXiv 2202.09368· report.headline_claimsclaude-opus-4.7

    Router z-loss [Zoph2022STMoE] is orthogonal to aux loss — it fixes logit overflow, not load collapse — which is why DeepSeek-V3 keeps z-loss=1e-3 after removing aux loss [DeepSeek2024V3].

  • topic_reportmoe-landscape· report.headline_claimsgpt-5.2

    Router z-loss primarily prevents router-logit blow-ups and numerical overflow, not load collapse; it is orthogonal to “remove aux loss / switch to bias EMA”, which explains why DeepSeek-V3 keeps z-loss around 1e-3 even after removing aux loss.

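A minimal sketch of the ST-MoE router z-loss: it penalizes the squared log-partition of the router logits, pulling their magnitude down so softmax cannot overflow in low precision, while leaving relative expert preferences (and hence load) untouched:

```python
import torch

def router_z_loss(logits: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
    # logits: (tokens, num_experts)
    z = torch.logsumexp(logits, dim=-1)   # log of the softmax partition function
    return coef * (z ** 2).mean()

logits = 30 * torch.randn(4096, 64)       # a drifting, overflow-prone router
print(router_z_loss(logits))              # large penalty pulls |logits| back down
```
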
contestedc-48de7cc311
This ignores serving-side memory footprint and post-training stability constraints; in the 7–13B-active deployment band, dense is still cheaper [OLMo2025Olmo3][Walsh2024OLMo2].
来源论文· 2[OLMo2025Olmo3][Walsh2024OLMo2]
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscape· report.positions[0].counterclaude-opus-4.7

    [counter to Camp A: MoE is the inevitable replacement for dense] Ignores serving-side memory footprint and post-training stability; in the 7–13B active deployment band dense is still cheaper [OLMo2025Olmo3][Walsh2024OLMo2].

contestedc-70a18679dd
No answer to the quality ceiling at frontier scale (≥70B active); DeepSeek-V3 already overtakes many dense 7–70B models at matched FLOPs [DeepSeek2024V3][DeepSeekAI2024V2].
来源论文· 3[DeepSeek2024V3][DeepSeekAI2024V2]arXiv 2412.19437arxiv.org
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscapearXiv 2412.19437· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: dense paths yield better ROI in the end] No response to frontier-scale ≥70B active quality ceiling; DeepSeek-V3 already overtakes many 7–70B dense models at matched FLOPs [DeepSeek2024V3][DeepSeekAI2024V2].

contestedc-0dbe94cb50
Expert choice cannot preserve causal order under an autoregressive decoder [Zhou2022ExpertChoice]; aux-loss-free bias-EMA step sizes still need manual monitoring over the first 2000 steps and are not fully maintenance-free.
来源论文· 2[Zhou2022ExpertChoice]arXiv 2202.09300arxiv.org
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscapearXiv 2202.09300· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: expert-choice / aux-loss-free is the future] Expert-choice breaks causality under autoregressive decoders [Zhou2022ExpertChoice]; aux-loss-free bias-EMA still needs manual monitoring in the first 2000 steps and isn't maintenance-free.

contestedc-d7ca5574a2
DeepSeek-V3 retains the 256-expert structure through post-training and achieves frontier alignment results, showing MoE post-training is not structurally infeasible, merely engineering-heavy [DeepSeek2024V3].
来源论文· 1[DeepSeek2024V3]
1 观测MOE Landscape
证据 (1)
  • topic_reportmoe-landscape· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: MoE matters only for pretrain; post-train should revert to dense] DeepSeek-V3 retains 256 experts through post-training with frontier alignment results, showing MoE post-training isn't structurally impossible — just engineering-heavy [DeepSeek2024V3].

consensusc-8068b98eef
RegMix [Liu2024RegMix] approaches DoReMi [Xie2023DoReMi] at ~30× lower compute; small and mid-sized teams should default to RegMix, reserving DoReMi's proxy overhead for >100B-token, >10-domain settings (see the sketch below).
来源论文· 2[Liu2024RegMix][Xie2023DoReMi]
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.headline_claimsunknown

    RegMix [Liu2024RegMix] approaches DoReMi [Xie2023DoReMi] at ~30× lower compute; mid-sized teams should default to RegMix, reserving DoReMi's proxy overhead for >100B-token, >10-domain settings.

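A minimal sketch of the RegMix-style workflow: train many tiny proxy models on random mixture weights, fit a cheap regressor from weights to proxy loss, then pick the simplex point the regressor likes best for the single large run. RegMix itself uses a LightGBM-style regressor; sklearn's gradient boosting stands in here, and the proxy losses are synthetic placeholders:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_domains, n_proxy_runs = 5, 128

def sample_simplex(n: int) -> np.ndarray:
    """Random mixture weights over domains, each row summing to 1."""
    return rng.dirichlet(np.ones(n_domains), size=n)

W = sample_simplex(n_proxy_runs)
# Stand-in for real proxy-run losses: a linear domain effect plus noise.
proxy_loss = W @ rng.normal(3.0, 0.3, n_domains) + 0.05 * rng.normal(size=n_proxy_runs)

surrogate = GradientBoostingRegressor().fit(W, proxy_loss)

candidates = sample_simplex(100_000)            # search the simplex cheaply
best = candidates[surrogate.predict(candidates).argmin()]
print("predicted-best mixture:", best.round(3))
```
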
consensusc-905b2e4a71
The public recipes of Llama 3 [MetaLlama32024] and MiniCPM [Hu2024MiniCPM] compress the capability-heavy stage into the last 10–30% of compute; this schedule is not aesthetic: upsampling hard domains pays off most when gradient norms fall in the cosine LR tail.
来源论文· 2[MetaLlama32024][Hu2024MiniCPM]
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.headline_claimsunknown

    The public recipes of Llama 3 [MetaLlama32024] and MiniCPM [Hu2024MiniCPM] place the capability-heavy stage in the last 10–30% of compute; this schedule isn't aesthetic—it exploits the fact that upsampling hard domains pays off most when gradient norms fall in the cosine LR tail.

consensusc-680e902723
The filter gains of DCLM [Li2024DCLM] and FineWeb-Edu [Penedo2024FineWeb] are on the same order as mixture optimization (3–6 pp on MMLU), meaning the two compete for the same improvement budget rather than stacking independently.
来源论文· 2[Li2024DCLM][Penedo2024FineWeb]
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.headline_claimsunknown

    The filter gains in DCLM [Li2024DCLM] and FineWeb-Edu [Penedo2024FineWeb] are on the same order as mixture optimization (3–6 pp on MMLU), meaning they compete for the same improvement budget rather than adding independently.

consensusc-72b9dd6ad6
The upsampling ceiling for scarce domains is pinned by the repetition law [Muennighoff2023Repeat]: up to 4 epochs is roughly equivalent to fresh data, and contributions collapse past 16 epochs; this fixes a hard numerical ceiling on how far math/code can be pushed in the mixture (see the sketch below).
来源论文· 1[Muennighoff2023Repeat]
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.headline_claimsunknown

    Upsample ceilings for scarce domains are pinned by the repetition law [Muennighoff2023Repeat]: ≤4 epochs are roughly equivalent to fresh tokens, returns collapse past 16. This places a hard numerical ceiling on how far math/code can be pushed in the mixture.

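A minimal sketch of the diminishing-returns form behind this claim, following the effective-data law fitted in the data-constrained scaling work: D' = U * (1 + R* * (1 - exp(-R / R*))) with R = epochs - 1. The decay constant R* ≈ 15 is the fitted value as reported around [Muennighoff2023Repeat]; treat the exact number as illustrative:

```python
import math

def effective_tokens(unique_tokens: float, epochs: float,
                     r_star: float = 15.4) -> float:
    """Effective fresh-token equivalent of repeating a pool for `epochs`."""
    r = epochs - 1
    return unique_tokens * (1 + r_star * (1 - math.exp(-r / r_star)))

U = 100e9  # 100B unique tokens in the scarce domain
for e in (1, 2, 4, 8, 16, 64):
    eff = effective_tokens(U, e)
    print(f"{e:>2} epochs: {eff/1e9:6.0f}B effective vs {e*U/1e9:5.0f}B naive")
# ~4 epochs keeps >90% of the naive count; past ~16 each extra epoch adds little.
```
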
consensusc-26655fe95a
Online mixing [Albalak2024ODM] can match offline DoReMi at fixed compute, but stability hinges on normalizing the loss signal; in regimes with aggressive upstream quality filters, online mixing easily degenerates into noise-driven reweighting.
来源论文· 1[Albalak2024ODM]
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.headline_claimsunknown

    Online mixing [Albalak2024ODM] matches offline DoReMi at fixed compute, but stability hinges on loss-signal normalization; under aggressive upstream quality filters it can degenerate into noise-driven reweighting.

contestedc-4cfea4a0d0
Transfer evidence beyond 10 domains and 30B scale is still thin; DoReMi's proxy overhead must be paid repeatedly in real pipelines where the domain inventory keeps changing.
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.positions[0].counterunknown

    [counter to Camp A: Formal mixture search (DoReMi / RegMix / Data Mixing] Transfer evidence past 10 domains and 30B scale is still thin; DoReMi's proxy overhead has to be paid repeatedly when the domain inventory changes.

contestedc-a35f44306f
Heuristics cannot answer 'does upsampling another 1 pp help or hurt'; once the domain count exceeds 20, or the team lacks long-term institutional experience, heuristic error compounds quickly.
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.positions[1].counterunknown

    [counter to Camp B: Heuristic + curriculum (Llama 3 / MiniCPM route)] Heuristics can't answer 'what if I upsample another 1 pp'; beyond 20 domains or without long-term institutional experience, heuristic error compounds quickly.

contestedc-8ce61993ab
Absolute loss values are not comparable across domains; poorly done normalization collapses the policy onto a single domain. Open reproductions so far are limited to <10B scale.
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.positions[2].counterunknown

    [counter to Camp C: Online adaptive mixing] Absolute per-domain losses aren't comparable; imperfect normalization collapses the policy onto a single domain. Open reproductions so far are limited to <10B scale.

contestedc-632bfb02ff
Even after strong filtering, Llama 3 [MetaLlama32024] still invests effort in curriculum; filters cannot substitute for explicit upsampling of multilingual / domain-scarce data.
来源论文· 1[MetaLlama32024]
1 观测Data Mixture
证据 (1)
  • topic_reportdata-mixture· report.positions[3].counterunknown

    [counter to Camp D: Ratio doesn't matter, quality does] Even after strong filtering Llama 3 [MetaLlama32024] still invests in curriculum; filters cannot substitute for explicit upsampling of multilingual or domain-scarce data.

consensusc-ba3ad206a2
128K downstream capability is governed by the long-document ratio, not total token count: ≥25% long documents with 5B continual-pretraining tokens already saturates NIAH [Fu2024DE128K].
来源论文· 1[Fu2024DE128K]
1 观测Length Scaling Pretraining
证据 (1)
  • topic_reportlength-scaling-pretraining· report.headline_claimsunknown

    128K capability is governed by long-document ratio, not token count: ≥25% long docs with 5B continual-pretraining tokens already saturates NIAH [Fu2024DE128K].

consensusc-4dada09621
Perplexity correlates only weakly with RULER/LongBench v2 rankings [Gao2024EffectiveLongCtx, LongBenchV2]; selecting long-context models by ppl is no longer credible.
来源论文· 1[Gao2024EffectiveLongCtx]
1 观测Length Scaling Pretraining
证据 (1)
  • topic_reportlength-scaling-pretraining· report.headline_claimsunknown

    Perplexity rankings decouple from RULER/LongBench v2 [Gao2024EffectiveLongCtx, LongBenchV2]; selecting long-context models by ppl is no longer defensible.

contestedc-3f1816b405
Kazemnejad et al. [Kazemnejad2023NoPE] show NoPE reliably beats ALiBi and RoPE on synthetic extrapolation tasks, indicating the causal mask already provides sufficient positional signal.
来源论文· 1[Kazemnejad2023NoPE]
1 观测Length Scaling Pretraining
证据 (1)
  • topic_reportlength-scaling-pretraining· report.positions[0].counterunknown

    [counter to Camp A: Explicit PE is necessary; RoPE interpolation is the ] Kazemnejad et al. [Kazemnejad2023NoPE] show NoPE beats ALiBi and RoPE on synthetic extrapolation, indicating the causal mask already injects enough positional signal.

contestedc-1148119a11
ByteScale [ByteScale2025] measures that on real variable-length corpora the orthogonality assumption causes severe load imbalance; unifying the two into a single partitioning dimension restores throughput to ~55% MFU on 12K+ GPUs.
来源论文· 1[ByteScale2025]
1 观测Length Scaling Pretraining
证据 (1)
  • topic_reportlength-scaling-pretraining· report.positions[1].counterunknown

    [counter to Camp B: SP and DP are orthogonal and can be optimized independently] ByteScale [ByteScale2025] measures that on real variable-length corpora the orthogonal assumption causes severe load imbalance; unifying them as one partitioning dimension restores throughput to ~55% MFU on 12K+ GPUs.

contestedc-4fbf7dd867
DeepSeek-V2 [DeepSeekV2] stretches context from 32K to 128K at the same KV budget, and V3 [DeepSeekV3] pushes MLA to 671B/37B-active without regression; this is a hard rebuttal to GQA.
1 观测Length Scaling Pretraining
证据 (1)
  • topic_reportlength-scaling-pretraining· report.positions[2].counterunknown

    [counter to Camp C: KV compression = GQA/MQA is enough] DeepSeek-V2 [DeepSeekV2] stretches context 32K→128K at equal KV budget, and V3 [DeepSeekV3] sustains MLA at 671B/37B-active without regression—a hard rebuttal to GQA.

contestedc-878e1c945c
Gao et al. [Gao2024EffectiveLongCtx] and LongBench v2 [LongBenchV2] jointly show that ppl rankings decouple from RULER and cross-span reasoning tasks; a model whose ppl drops 5% may even regress on 64K+ tasks.
来源论文· 2[Gao2024EffectiveLongCtx]arXiv 2410.02660arxiv.org
1 观测Length Scaling Pretraining
证据 (1)
  • topic_reportlength-scaling-pretrainingarXiv 2410.02660· report.positions[3].counterunknown

    [counter to Camp D: Perplexity is still a valid long-context metric] Gao et al. [Gao2024EffectiveLongCtx] and LongBench v2 [LongBenchV2] jointly show ppl decouples from RULER and cross-span reasoning; a 5% ppl drop may even regress on 64K+ tasks.

consensusc-5d9837311b
The root cause of RoPE extrapolation failure is low-frequency dimensions going OOD, not a global scale problem; this is why PI's side effects on high-frequency dimensions cannot be erased by 'a few more fine-tuning steps' [Chen2023PI][bloc972023NTK] (see the sketch below).
来源论文· 2[Chen2023PI]arXiv 2306.15595arxiv.org
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntkarXiv 2306.15595· report.headline_claimsunknown

    RoPE extrapolation fails because of low-frequency OOD, not a global scale problem — which is why PI's high-frequency damage cannot be erased by a few more fine-tune steps [Chen2023PI][bloc972023NTK].

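A minimal sketch contrasting position interpolation (PI) with NTK-aware base scaling, to show where the per-dimension damage lands: PI divides every frequency by the extension factor (altering the high-frequency, local-pattern dims), while NTK-style base rescaling leaves high-frequency dims nearly intact and concentrates the stretch in the low-frequency dims. Head dim and extension factor are illustrative:

```python
import numpy as np

d, base, s = 128, 10_000.0, 8.0          # head dim, RoPE base, 4K -> 32K factor
i = np.arange(0, d, 2)                    # paired rotary dims
freq = base ** (-i / d)                   # original per-dim angular frequencies

freq_pi = freq / s                                    # PI: uniform shrink
freq_ntk = (base * s ** (d / (d - 2))) ** (-i / d)    # NTK-aware: rescale base

ratio_pi = freq_pi / freq                 # constant 1/s across all dims
ratio_ntk = freq_ntk / freq               # ~1 at high freq, ~1/s at low freq
print("highest-freq dim: PI x%.3f, NTK x%.3f" % (ratio_pi[0], ratio_ntk[0]))
print("lowest-freq  dim: PI x%.3f, NTK x%.3f" % (ratio_pi[-1], ratio_ntk[-1]))
```
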
consensusc-8ed01cc420
At ≤128K, YaRN + 400 fine-tuning steps and ABF + continual pretraining differ by <3 pp on RULER; retrofit is the higher-ROI choice [Peng2023YaRN][Xiong2023LongLlama][Young2024Yi].
来源论文· 4[Peng2023YaRN][Xiong2023LongLlama][Young2024Yi]arXiv 2309.00071arxiv.org
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntkarXiv 2309.00071· report.headline_claimsunknown

    At ≤128K, YaRN + 400 fine-tune steps and ABF + continual pretrain differ by <3 pp on RULER; retrofit is the higher-ROI choice [Peng2023YaRN][Xiong2023LongLlama][Young2024Yi].

consensusc-d83efd6683
At ≥512K, per-dim non-uniform scales cannot be bypassed; the smooth NTK formula clearly trails LongRoPE's evolutionary search in the 1M regime [Ding2024LongRoPE][GeminiTeam2024].
来源论文· 3[Ding2024LongRoPE][GeminiTeam2024]arXiv 2402.13753arxiv.org
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntkarXiv 2402.13753· report.headline_claimsunknown

    At ≥512K, per-dim non-uniform scales cannot be bypassed; smooth NTK clearly trails LongRoPE's evolutionary search at 1M [Ding2024LongRoPE][GeminiTeam2024].

consensusc-1077daf576
PPL does not measure long-context capability; any long-context paper that reports no RULER / LongBench / multi-hop variant numbers should have its credibility down-weighted [Hsieh2024RULER][Bai2023LongBench][Li2025Haystack].
来源论文· 4[Hsieh2024RULER][Bai2023LongBench][Li2025Haystack]arXiv 2404.06654arxiv.org
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntkarXiv 2404.06654· report.headline_claimsunknown

    PPL does not measure long-context capability; any long-context paper that skips RULER / LongBench / multi-hop variants deserves a lower credibility weight [Hsieh2024RULER][Bai2023LongBench][Li2025Haystack].

consensusc-7b3e9d89eb
Non-RoPE architectures (ALiBi, LM-Infinite, RetNet) approach RoPE on ≤32K needle-style tasks but clearly trail on 128K+ multi-hop real tasks; they are not a 2026 production option [Press2021ALiBi][Han2023LMInfinite][Sun2023RetNet].
来源论文· 4[Press2021ALiBi][Han2023LMInfinite][Sun2023RetNet]arXiv 2108.12409arxiv.org
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntkarXiv 2108.12409· report.headline_claimsunknown

    Non-RoPE architectures (ALiBi, LM-Infinite, RetNet) approach RoPE on ≤32K needle tasks but clearly trail on 128K+ multi-hop real tasks, and are not a 2026 production option [Press2021ALiBi][Han2023LMInfinite][Sun2023RetNet].

contestedc-8b90a3a5a5
Costly: it requires large-scale continual pretraining, and most teams lack the token budget for the full 8K→128K ladder.
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntk· report.positions[0].counterunknown

    [counter to Camp A: pretrain-time ABF is the clean path] Expensive: requires large-scale continual pretraining, and most teams lack the token budget for the full 8K→128K ladder.

contestedc-7458a91928
Above 512K, YaRN's smooth ramp starts to show per-dim mismatch, and RULER's long-tail tasks degrade visibly.
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntk· report.positions[1].counterunknown

    [counter to Camp B: YaRN is the de-facto retrofit tool] Above 512K the smooth ramp begins to show per-dim mismatch; RULER's long-tail tasks degrade visibly.

contestedc-802b6e8f2c
The search is expensive, and below 512K its edge over YaRN is inconsistent.
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntk· report.positions[2].counterunknown

    [counter to Camp C: ≥1M requires LongRoPE] Search is expensive, and below 512K its edge over YaRN is inconsistent.

contestedc-81c230c991
They collectively trail on RULER multi-hop tasks: masking loses mid-range information, and RetNet's retention decays too fast. Zhong et al. [Zhong2024AttnRoPE] provide the mechanistic explanation.
来源论文· 1[Zhong2024AttnRoPE]
1 观测Long Context Rope Ntk
证据 (1)
  • topic_reportlong-context-rope-ntk· report.positions[3].counterunknown

    [counter to Camp D: bypass the whole PI/NTK/YaRN lineage] Collectively trail on RULER multi-hop: masking loses middle-range info; RetNet's retention decays too fast. Zhong et al. [Zhong2024AttnRoPE] provide the mechanical explanation.

contestedc-6d05a62c16
Lingle et al. point out that in complex real-world deployments, basic µP still exhibits non-negligible drift; and for existing SP codebases, the engineering cost of refactoring far outweighs the benefit.
1 观测Mup Hp Transfer
证据 (1)
  • topic_reportmup-hp-transfer· report.positions[0].countergemini-3.1-pro

    [counter to Camp A: µP is the absolute default] Lingle et al. point out that in complex practical deployments, basic µP still exhibits non-negligible drift; and for existing SP codebases, the refactoring cost far outweighs the benefits.

contestedc-6f227d69b8
Gemstones shows this fitting is extremely sensitive to aspect ratio and LR schedule; once the architecture is tweaked, the existing empirical formulas break down.
1 观测Mup Hp Transfer
证据 (1)
  • topic_reportmup-hp-transfer· report.positions[1].countergemini-3.1-pro

    [counter to Camp B: Empirical formulas + a few sweeps suffice] Gemstones proves this fitting is extremely sensitive to aspect ratio and LR schedule; once the architecture is tweaked, existing empirical formulas break down.

contestedc-83040248ac
For dimensions like LR and init scale that already have explicit analytical scaling rules (µP), spending compute on search is pure waste.
1 观测Mup Hp Transfer
证据 (1)
  • topic_reportmup-hp-transfer· report.positions[2].countergemini-3.1-pro

    [counter to Camp C: End-to-end Bayesian search is the endgame] For dimensions like LR and Init Scale that already have explicit analytical scaling rules (µP), spending compute on search is pure waste.

consensusc-ecef73c938
Muon [Jordan2024Muon] cut the NanoGPT speedrun from AdamW's 5 min to 3.3 min (−34%); it is the only candidate that should be prioritized for A/B testing in new ≤30B runs (see the sketch below).
来源论文· 1[Jordan2024Muon]
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscape· report.headline_claimsclaude-opus-4.7

    Muon [Jordan2024Muon] cut NanoGPT speedrun from AdamW's 5 min to 3.3 min (−34%); it is the only A/B candidate worth priority for ≤30B new runs

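A sketch of Muon's core step: replace the raw momentum update for a 2D weight with an approximately orthogonalized version via a quintic Newton-Schulz iteration. The coefficients and iteration count follow the public NanoGPT-speedrun implementation as best recalled; verify against the Muon repository before relying on them:

```python
import torch

def newton_schulz5(G: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize G (push its singular values toward 1)."""
    a, b, c = 3.4445, -4.7750, 2.0315     # quintic coefficients (as recalled)
    X = G / (G.norm() + 1e-7)             # normalize so the iteration converges
    if G.shape[0] > G.shape[1]:
        X = X.T                           # iterate on the short side
    for _ in range(steps):
        A = X @ X.T
        X = a * X + (b * A + c * A @ A) @ X
    return X.T if G.shape[0] > G.shape[1] else X

W = torch.randn(512, 1024)
momentum = torch.randn_like(W)            # in Muon: the gradient momentum buffer
update = newton_schulz5(momentum)
W = W - 0.02 * update                     # learning rate illustrative
print(torch.linalg.svdvals(update)[:5])   # singular values land near ~1
```
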
consensusc-9c69cc911f
SOAP [Vyas2024SOAP] cut Shampoo's extra hyperparameters from 4 to 1 and already matches AdamW wall-clock; the 'second order is too expensive' objection was retired in 2024.
来源论文· 2[Vyas2024SOAP]arXiv 2409.11321arxiv.org
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscapearXiv 2409.11321· report.headline_claimsclaude-opus-4.7

    SOAP [Vyas2024SOAP] cut Shampoo's extra HPs from 4 to 1 and matches AdamW wall-clock — the 'second-order is too expensive' objection was retired in 2024

consensusc-a0fb17101e
Apollo [Zhu2024Apollo] replaces per-parameter second moments with per-tensor scalars, collapsing optimizer state from 2P to ~0 at matched 7B/13B loss; the 2025 default for memory-tight teams.
来源论文· 2[Zhu2024Apollo]arXiv 2412.05270arxiv.org
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscapearXiv 2412.05270· report.headline_claimsclaude-opus-4.7

    Apollo [Zhu2024Apollo] replaces per-param second moments with per-tensor scalars, collapsing optimizer state from 2P to ~0 with matched 7B/13B loss; the 2025 default for memory-tight teams

consensusc-e3fb492489
Once the HP-search budget is controlled (AlgoPerf [Dahl2023AlgoPerf]; Agarwal et al. [Agarwal2020LRConfound]), more than half of the adaptive-vs-SGD and cross-optimizer gap collapses; any A/B without HP control is noise.
来源论文· 3[Dahl2023AlgoPerf][Agarwal2020LRConfound]arXiv 2306.07179arxiv.org
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscapearXiv 2306.07179· report.headline_claimsclaude-opus-4.7

    Controlling HP-search budget (AlgoPerf [Dahl2023AlgoPerf]; Agarwal et al. [Agarwal2020LRConfound]) collapses more than half of the adaptive-vs-SGD and cross-optimizer gap — any A/B without HP control is noise

consensusc-87f14d71b3
AdamW's [Loshchilov2017AdamW] moat is the muP [Lingle2024muP] ecosystem, not the algorithm itself; for Muon/SOAP to enter ≥70B production, second-order muP [Ishikawa2023SecondOrdermuP] must be filled in first.
来源论文· 4[Loshchilov2017AdamW][Lingle2024muP][Ishikawa2023SecondOrdermuP]arXiv 1711.05101arxiv.org
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscapearXiv 1711.05101· report.headline_claimsclaude-opus-4.7

    AdamW [Loshchilov2017AdamW]'s moat is the muP [Lingle2024muP] ecosystem, not the algorithm itself; for Muon/SOAP to enter ≥70B production, the missing piece is second-order muP [Ishikawa2023SecondOrdermuP]

contestedc-6cf8d6c199
Muon still yields −34% wall-clock on NanoGPT under a matched HP budget [Jordan2024Muon]; SOAP's loss is still below AdamW's at 360M–1.3B under matched budgets [Vyas2024SOAP]. 'AdamW is irreplaceable' over-extrapolates.
来源论文· 2[Jordan2024Muon][Vyas2024SOAP]
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscape· report.positions[0].counterclaude-opus-4.7

    [counter to Camp A: AdamW is never retired] Muon still yields −34% wall-clock on NanoGPT under matched HP budget [Jordan2024Muon]; SOAP still beats AdamW's loss at 360M–1.3B under matched budget [Vyas2024SOAP]. 'AdamW is irreplaceable' over-extrapolates.

contestedc-c7ea2fc38c
Public evidence for Muon at ≥30B is still missing (cluster coverage marked sparse); second-order muP [Ishikawa2023SecondOrdermuP] has not been productionized, and Muon's LR transfer from proxy to production still leans on the sqrt(max(d_in, d_out)) heuristic.
来源论文· 1[Ishikawa2023SecondOrdermuP]
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscape· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: Muon is the next default] Public evidence for Muon at ≥30B is still missing (cluster coverage marked sparse); productionized second-order muP [Ishikawa2023SecondOrdermuP] is not yet delivered, and Muon's LR transfer from proxy to production still leans on the sqrt(max(d_in, d_out)) heuristic.

contestedc-7429c9aac2
Public ≥7B Muon-vs-SOAP head-to-heads remain an open question; productionizing second-order muP [Ishikawa2023SecondOrdermuP] is still at the lab stage; the HP-transfer ecosystem as a whole lags AdamW by an order of magnitude.
来源论文· 1[Ishikawa2023SecondOrdermuP]
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscape· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: Shampoo / SOAP is the proper endgame] ≥7B public Muon-vs-SOAP head-to-heads remain open; productionized second-order muP [Ishikawa2023SecondOrdermuP] is still in the lab; the HP-transfer ecosystem overall lags AdamW by an order of magnitude.

contestedc-e351bca4c7
Muon still gives −34% wall-clock under matched budgets [Jordan2024Muon]; SOAP still reaches lower loss under matched budgets [Vyas2024SOAP]; the strong version of 'data is all that matters' overfits the schedule phenomenon.
来源论文· 2[Jordan2024Muon][Vyas2024SOAP]
1 观测Optimizer Landscape
证据 (1)
  • topic_reportoptimizer-landscape· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: optimizers don't matter, data does] Muon still gives −34% wall-clock under matched budget [Jordan2024Muon]; SOAP still reaches lower loss under matched budget [Vyas2024SOAP]; the strong version 'data is enough' overfits the schedule phenomenon.

consensusc-5cf27e9ce9
A short-to-long curriculum (95/5) saves 20–40% wall-clock at equal compute relative to mixed-length training; LLaMA-3 [Llama32024] and Qwen2.5 [Qwen25Tech] both use it as the default.
来源论文· 1[Llama32024]
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-length· report.headline_claimsclaude-opus-4.7

    The 95/5 short-to-long curriculum saves 20–40% wallclock at equal compute vs mixed-length training; LLaMA-3 [Llama32024] and Qwen2.5 [Qwen25Tech] both default to it.

consensusc-7433b70314
In the context-extension stage, the RoPE base should be moved from the default 10K to 500K–1M, paired with 100B+ mid-training tokens; YaRN [YaRN2023] supplies the attention-temperature correction (see the sketch below).
来源论文· 2[YaRN2023]arXiv 2309.00071arxiv.org
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-lengtharXiv 2309.00071· report.headline_claimsclaude-opus-4.7

    In the context-extension stage RoPE base should be moved from a default of 10K to 500K–1M with 100B+ mid-train tokens; YaRN [YaRN2023] supplies the attention-temperature correction.

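A minimal sketch of the base-adjustment rationale: the slowest-rotating RoPE dim has wavelength roughly 2*pi*base^((d-2)/d), and once the training context exceeds it, the lowest-frequency dims wrap around and positions become ambiguous; raising the base to 500K-1M keeps the longest wavelength beyond a 128K window. Head dim is illustrative:

```python
import math

def max_wavelength(base: float, d: int = 128) -> float:
    """Wavelength (in positions) of the slowest-rotating RoPE dim pair."""
    return 2 * math.pi * base ** ((d - 2) / d)

for base in (10_000, 500_000, 1_000_000):
    lam = max_wavelength(base)
    fits = "covers" if lam >= 131_072 else "wraps inside"
    print(f"base={base:>9,}: longest wavelength ~{lam:,.0f} tokens ({fits} a 128K window)")
```
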
consensusc-b29f73a56d
The FIM 'free lunch' claim [FIM2022] has been independently reproduced only on code models; on NL models, Code Llama [CodeLlama2023] observed a small but consistent regression, so it stays off by default.
来源论文· 3[FIM2022][CodeLlama2023]arXiv 2207.14255arxiv.org
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-lengtharXiv 2207.14255· report.headline_claimsclaude-opus-4.7

    The FIM 'free lunch' claim [FIM2022] has been independently reproduced only for code models; on NL models Code Llama [CodeLlama2023] observed a small but consistent regression, so it should be off by default.

consensusc-5df2cfae85
The first-token loss of each packed doc should be dropped; EOS+BOS double markers plus split-then-pack are the default in published industrial recipes [NeMoPacking2024][Llama32024] (see the sketch below).
来源论文· 2[NeMoPacking2024][Llama32024]
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-length· report.headline_claimsclaude-opus-4.7

    The first-token loss of each packed doc should be dropped; EOS+BOS double markers plus split-then-pack are the default in published industrial recipes [NeMoPacking2024][Llama32024].

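A minimal sketch of the split-then-pack recipe with BOS/EOS doc boundaries and a loss mask that zeroes each packed document's first token (whose left context belongs to a different document). Token IDs, special IDs, and the sequence length are illustrative; real recipes also carry the stream remainder into the next pack:

```python
BOS, EOS, SEQ_LEN = 1, 2, 8

def split_then_pack(docs, seq_len=SEQ_LEN):
    stream = []
    for doc in docs:                     # docs are concatenated; long docs get
        stream += [BOS] + doc + [EOS]    # split across packs rather than dropped
    packs, loss_masks = [], []
    for i in range(0, len(stream) - seq_len + 1, seq_len):
        seq = stream[i:i + seq_len]
        # Zero the loss where the target's left context crosses a doc boundary:
        # position 0 of the pack, and any token directly after an EOS.
        mask = [0 if j == 0 or seq[j - 1] == EOS else 1 for j in range(seq_len)]
        packs.append(seq)
        loss_masks.append(mask)
    return packs, loss_masks

docs = [[11, 12, 13], [21, 22, 23, 24, 25], [31, 32, 33]]
for seq, mask in zip(*split_then_pack(docs)):
    print(seq, mask)
```
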
contestedc-786692f506
Critics worry that the 100B mid-training tokens in the tail stage are not enough for attention heads to fully adapt to long context.
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-length· report.positions[0].counterclaude-opus-4.7

    [counter to Camp A: short-to-long + per-doc mask (mainstream)] Critics worry the tail 100B-token mid-train is insufficient for attention heads to fully adapt to long context.

contestedc-0854ad9f10
Attention's O(L²) cost forces short docs to run inside a long window; at equal compute, wall-clock is 20–40% higher, and the LLaMA-3 / Qwen2.5 long-context evals show no short-to-long disadvantage.
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-length· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: uniformly mixed-length training] Attention O(L²) forces short docs into a long window; at equal compute wallclock is 20–40% higher, and LLaMA-3 / Qwen2.5 long-context evals do not show a short-to-long disadvantage.

contestedc-c61f99f880
Krell et al. [Packing2021] quantify a 0.5–2% loss bias; FA2 [FlashAttention2Varlen] / FA3 [FlashAttention32024] cut the cost of per-doc masks to <3% throughput.
来源论文· 2[Packing2021][FlashAttention32024]
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-length· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: naive concat + cross-doc visible] Krell et al. [Packing2021] quantify a 0.5–2% loss bias; FA2 [FlashAttention2Varlen] / FA3 [FlashAttention32024] cut per-doc mask cost to < 3% throughput.

contestedc-a51e3f7522
Code Llama [CodeLlama2023] observed a small but consistent regression on NL benchmarks; StarCoder [StarCoder2023] restricts FIM to code.
来源论文· 3[CodeLlama2023][StarCoder2023]arXiv 2308.12950arxiv.org
1 观测Packing Masking Length
证据 (1)
  • topic_reportpacking-masking-lengtharXiv 2308.12950· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: FIM for everything] Code Llama [CodeLlama2023] reports a small but consistent NL regression; StarCoder [StarCoder2023] restricts FIM to code.

consensusc-da16e96486
Within the same tokenizer, objective, and model family, pretraining loss/PPL remains stable for scale extrapolation; but once it crosses tokenizers, languages, or post-training stages, its explanatory power drops markedly [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Tao2024VocabularyScaling][Isik2024DownstreamScaling].
来源论文· 4[Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Tao2024VocabularyScaling][Isik2024DownstreamScaling]
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.4

    Within the same tokenizer, objective, and model family, pretraining loss/PPL remains stable for scale extrapolation; once we cross tokenizers, languages, or post-training stages, its explanatory power drops materially [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Tao2024VocabularyScaling][Isik2024DownstreamScaling].

consensusc-4fee02b76e
The same, or nearly the same, pretraining loss does not guarantee the same downstream capability; optimization path, implicit bias, compression method, and behavioral distribution shift can all rewrite task performance while barely moving PPL [HongLiu2022SameLossBetterDownstream][KhanalCapone2024CompressionTasks].
来源论文· 1[KhanalCapone2024CompressionTasks]
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.4

    The same or nearly the same pretraining loss does not guarantee the same downstream capability; optimization path, implicit bias, compression method, and behavioral distribution shift can all rewrite task performance while leaving PPL nearly unchanged [HongLiu2022SameLossBetterDownstream][KhanalCapone2024CompressionTasks].

consensusc-07aa7889ed
For multiple-choice, reasoning, and pass/fail tasks, continuous PPL improvements are often mapped into threshold-like jumps by the discrete evaluation; predicting task inflection points from a single PPL scalar is therefore structurally unstable [Schaeffer2023Mirage][Schaeffer2024WhyElusive] (see the worked example below).
来源论文· 2[Schaeffer2023Mirage][Schaeffer2024WhyElusive]
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.4

    For multiple-choice, reasoning, and pass/fail tasks, continuous PPL improvements are often mapped into threshold-like jumps by discrete evaluation; using a single PPL scalar to predict task inflection points is therefore structurally unstable [Schaeffer2023Mirage][Schaeffer2024WhyElusive].

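A worked example of the metric-nonlinearity point: if per-token accuracy improves smoothly, exact match over an L-token answer scores roughly p^L, which sits near zero for most of the range and then 'emerges' under the discrete metric even though the continuous quantity improved steadily. All values are synthetic:

```python
import numpy as np

compute = np.logspace(0, 6, 7)                   # arbitrary compute axis
per_token = 1 - 0.35 / (1 + np.log10(compute))   # smooth, saturating improvement
L = 50                                           # answer length in tokens

for c, p in zip(compute, per_token):
    print(f"C=1e{int(np.log10(c))}: per-token acc {p:.3f} -> exact match {p**L:.2e}")
# per-token accuracy climbs smoothly; exact match looks like a sudden jump.
```
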
consensusc-81f704096e
A more robust decision workflow is two-stage: inside the training loop, fit and monitor with PPL/BPB; outside the loop, model task scaling laws directly and pair them with a standardized eval panel. This is more actionable than treating PPL as the final proxy [Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders][Krajewski2025DownstreamMetricsScaling][Gu2024OLMES].
来源论文· 2[Bhagia2024TaskScalingLadders][Gu2024OLMES]
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.4

    A more robust decision workflow is two-stage: use PPL/BPB for fitting and monitoring inside the training loop, then model task scaling directly and pair it with a standardized evaluation panel outside the loop; this is more actionable than treating PPL as the final proxy [Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders][Krajewski2025DownstreamMetricsScaling][Gu2024OLMES].

contestedc-e8dce9030d
The problem is that this conclusion is stable only inside the training loop. Hong Liu et al. [HongLiu2022SameLossBetterDownstream], Khanal and Capone [KhanalCapone2024CompressionTasks], and Tao et al. [Tao2024VocabularyScaling] all show that once you cross tokenizers, cross post-training stages, or enter compression, PPL's unified explanation fails.
来源论文· 2[KhanalCapone2024CompressionTasks][Tao2024VocabularyScaling]
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.positions[0].countergpt-5.4

    [counter to Camp A: PPL remains the most reliable primary variable] The problem is that this conclusion is stable mainly inside the training loop. Hong Liu et al. [HongLiu2022SameLossBetterDownstream], Khanal and Capone [KhanalCapone2024CompressionTasks], and Tao et al. [Tao2024VocabularyScaling] all show that crossing tokenizers, crossing post-training stages, or entering compression breaks PPL's unified explanation.

contestedc-a036e69c6f
The difficulty is that task scaling laws are more fragile than loss scaling laws, being more sensitive to benchmark choice, evaluation protocol, and task discreteness. Lourie et al. [Lourie2025RealityCheck] question the reliability of extrapolating small-scale results to large-scale downstream performance.
来源论文· 1[Lourie2025RealityCheck]
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.positions[1].countergpt-5.4

    [counter to Camp B: PPL is only an intermediate state inside task scaling] The difficulty is that task scaling laws are more fragile than loss scaling laws, being more sensitive to benchmark choice, evaluation protocol, and task discreteness. Lourie et al. [Lourie2025RealityCheck] question the reliability of extrapolating small-scale results to large-scale downstream performance.

contestedc-18609a817b
The cost is a more complex system that is harder to maintain organizationally; without a unified aggregation rule, a multi-panel setup can also degrade into 'too many metrics, everyone picks the result they like'.
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.positions[2].countergpt-5.4

    [counter to Camp C: Stop searching for one scalar; use multi-panel diagnostics] The cost is greater system complexity and harder organizational maintenance; without a clear aggregation rule, a multi-panel setup can degrade into 'too many metrics, everyone picks the result they like'.

contestedc-0142c741c3
Pushed to the extreme, this argument understates PPL's value in training engineering. Many training decisions still need a cheap, continuous, low-noise monitoring quantity, and PPL/BPB currently has no truly equivalent substitute.
1 观测Perplexity Downstream Performanc
证据 (1)
  • topic_reportperplexity-downstream-performance· report.positions[3].countergpt-5.4

    [counter to Camp D: The problem with PPL is ontological, not merely predictive] Taken to the extreme, this view can understate the engineering value of PPL. Many training decisions still need a cheap, continuous, low-noise monitor, and PPL/BPB currently has no truly equivalent substitute.

consensusc-e64001ac0c
'Uniform repetition for ≤4 epochs is basically free' holds only when the exposure distribution stays close to uniform; repeating a ~2% hot subset up to 100× causes measurable loss degradation and leaves interpretable repetition fingerprints in induction heads [Hernandez2022RepeatedData].
来源论文· 1[Hernandez2022RepeatedData]
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetition· report.headline_claimsgpt-5.2

    “Uniform repetition ≤4 epochs is basically free” only holds when the exposure distribution stays close to uniform; repeating ~2% of a hot subset up to 100× yields measurable loss degradation and leaves interpretable repetition fingerprints in induction heads [Hernandez2022RepeatedData].

consensusc-4991ca7182
For web-crawled corpora, the near-duplicate share can reach double-digit percentages (e.g., up to ~13.6%), and substring/MinHash dedup improves PPL while reducing memorization and train-test leakage [Lee2021Dedup]; scrape-level passive repetition should therefore default to strong dedup, rather than being 'averaged out' by training-time shuffling (see the sketch below).
来源论文· 2[Lee2021Dedup]arXiv 2107.06499arxiv.org
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetitionarXiv 2107.06499· report.headline_claimsgpt-5.2

    In web corpora, near-duplication can reach double-digit percentages (e.g., up to ~13.6%), and substring/MinHash dedup improves perplexity while reducing memorization and train-test leakage [Lee2021Dedup]; therefore scrape-level passive repetition should default to strong dedup rather than being 'averaged out' by training-time shuffling.

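A minimal MinHash near-duplicate sketch: shingle documents into character 5-grams, take k min-hashes per document, and estimate Jaccard similarity from signature agreement. Production systems (e.g., the cited dedup pipelines) add LSH banding on top to avoid all-pairs comparison; shingle size and k here are illustrative:

```python
import hashlib

def signature(text: str, k: int = 64, shingle: int = 5):
    grams = {text[i:i + shingle] for i in range(len(text) - shingle + 1)}
    def h(g: str, seed: int) -> int:
        # Salted BLAKE2b stands in for a family of k independent hash functions.
        d = hashlib.blake2b(g.encode(), digest_size=8,
                            salt=seed.to_bytes(2, "big"))
        return int(d.hexdigest(), 16)
    return [min(h(g, seed) for g in grams) for seed in range(k)]

def est_jaccard(a, b) -> float:
    return sum(x == y for x, y in zip(a, b)) / len(a)

doc1 = "the quick brown fox jumps over the lazy dog near the river bank"
doc2 = "the quick brown fox jumps over the lazy dog near the river bend"
doc3 = "completely unrelated text about kernel fusion and rooflines today"
s1, s2, s3 = signature(doc1), signature(doc2), signature(doc3)
print(est_jaccard(s1, s2), est_jaccard(s1, s3))  # near-dup pair scores high
```
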
consensusc-a7acbae1a8
Semantic near-duplicates are MinHash's blind spot: embedding-level semantic dedup/diversification (e.g., SemDeDup, D4) can reach similar or better results with fewer samples at fixed compute, effectively raising the 'effective token count' [Abbas2023SemDeDup][Tirumala2023D4].
来源论文· 2[Abbas2023SemDeDup][Tirumala2023D4]
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetition· report.headline_claimsgpt-5.2

    Semantic near-duplicates are a blind spot for MinHash: embedding-based semantic dedup/diversification (e.g., SemDeDup, D4) can match or exceed performance with fewer samples at fixed compute, effectively increasing the 'effective token count' [Abbas2023SemDeDup][Tirumala2023D4].

consensusc-5448e5a690
Benchmark contamination is hard to eliminate fully with 'broad filtering' alone; the more robust engineering strategy treats benchmark/copyright/PII data as zero-repeat (or 0–1 exposure) subsets with separate provenance and prefix/suffix dedup, rather than mixing them into the main pool and multiplying by epochs [Deng2023BenchmarkContamination][Carlini2022Memorization].
来源论文· 1[Carlini2022Memorization]
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetition· report.headline_claimsgpt-5.2

    Benchmark contamination is hard to eliminate via “broad filtering” alone; a safer engineering policy is to treat benchmarks/copyright/PII as zero-repeat (or 0–1 exposure) subsets with separate provenance and prefix/suffix dedup, rather than mixed into the main pool and multiplied by epochs [Deng2023BenchmarkContamination][Carlini2022Memorization].

consensusc-6a169a2fe0
'Less data is better' (large-scale pruning) is not stable: at some scales, aggressive pruning yields limited gains or even reversals; rather than betting on a single pruning score, the more controllable route is to dedup / semantically dedup to remove redundancy first, then use mixture/reweighting to direct the budget toward the target distribution [Marion2023DataPruning][Xie2023DoReMi].
来源论文· 2[Marion2023DataPruning][Xie2023DoReMi]
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetition· report.headline_claimsgpt-5.2

    “Less data is more” (large-scale pruning) is not consistently reliable: at some scales aggressive pruning yields limited gains or reversals; a more controllable path is to dedup/semantic-dedup to remove redundancy, then use mixture/reweighting to steer the budget toward the target distribution [Marion2023DataPruning][Xie2023DoReMi].

contestedc-4eaa3651aa
Counterexamples come from data-constrained training: once you are already data-limited on a finite high-quality pool, further expanding the post-dedup unique-token count is unrealistic, and uniform multi-epoch training is close to fresh tokens for the first ~4 epochs [Muennighoff2023DataConstrained]. Moreover, over-aggressive semantic dedup may mistakenly delete 'same-topic but complementary' samples, and systematic negative results and boundary conditions are still lacking [Abbas2023SemDeDup].
来源论文· 2[Muennighoff2023DataConstrained][Abbas2023SemDeDup]
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetition· report.positions[0].countergpt-5.2

    [counter to Camp A: Dedup as aggressively as possible] Counterexamples come from data-constrained training: once you are data-limited on a finite high-quality pool, expanding unique tokens after dedup may be infeasible, and uniform multi-epoch training is close to fresh tokens for the first ~4 epochs [Muennighoff2023DataConstrained]. Over-aggressive semantic dedup may also delete same-topic but complementary samples, and systematic negative results and boundary conditions are still missing [Abbas2023SemDeDup].

contestedc-cfab0370df
This heuristic is highly sensitive to 'uniform exposure': Hernandez et al. [Hernandez2022RepeatedData] show that hot-subset over-exposure degrades performance and leaves induction-head fingerprints; meanwhile, repeated exposure of sensitive/copyright/benchmark data drives up memorization and contamination risk, so safety cannot be argued from an 'average epoch' count [Carlini2022Memorization][Deng2023BenchmarkContamination].
来源论文· 3[Hernandez2022RepeatedData][Carlini2022Memorization]arXiv 2205.10487arxiv.org
2 观测Pretrain Data Repetition
证据 (2)
  • topic_reportpretrain-data-repetitionarXiv 2205.10487· report.positions[1].countergpt-5.2

    [counter to Camp B: Uniform repetition ≤4 epochs is (almost) free] The heuristic is highly sensitive to uniform exposure: Hernandez et al. [Hernandez2022RepeatedData] shows hot-subset over-exposure degrades and fingerprints induction heads; repeated exposure of sensitive/copyright/benchmark data also raises memorization and contamination risk, so safety cannot be argued from an 'average epoch' count [Carlini2022Memorization][Deng2023BenchmarkContamination].

  • topic_reportpretrain-data-repetitionarXiv 2205.10487· report.positions[1].countergpt-5.4

    [counter to Camp B: Uniform repetition up to about 4 epochs is close to free] The problem is that the word “uniform” is often dropped in practice. Hernandez et al. [Hernandez2022RepeatedData] show that hot-subset over-exposure causes extra degradation and leaves repetition fingerprints in induction heads.

contestedc-db0364de44
The evidence is still one-sided: for text LMs there are few negative results showing semantic dedup deleting useful diversity and creating capability gaps, and its interactions with mixture/reweighting and epoch count lack same-corpus, same-budget controls [Xie2023DoReMi][Muennighoff2023DataConstrained]. Moreover, cross-corpus duplication and provenance are not problems semantic dedup can solve alone [Elazar2023WhatsInMyBigData].
来源论文· 3[Xie2023DoReMi][Muennighoff2023DataConstrained][Elazar2023WhatsInMyBigData]
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetition· report.positions[2].countergpt-5.2

    [counter to Camp C: Semantic dedup is the real battleground] Evidence is still one-sided: negative results where semantic dedup deletes useful diversity and creates capability gaps are scarce for text LMs, and interactions with mixture/reweighting and epoch count lack same-corpus, same-budget controls [Xie2023DoReMi][Muennighoff2023DataConstrained]. Cross-corpus duplication and provenance are also not problems semantic dedup can solve alone [Elazar2023WhatsInMyBigData].

contestedc-45889dd38a
The practical constraint: much risky data cannot be perfectly identified, and cross-corpus reuse makes 'zero repetition' operationally unfalsifiable; the goal should therefore shift from 'absolute zero' to 'auditable exposure caps + reproducible filtering evidence', decoupled from the main pool's epoch policy [Elazar2023WhatsInMyBigData][Deng2023BenchmarkContamination].
来源论文· 1[Elazar2023WhatsInMyBigData]
1 观测Pretrain Data Repetition
证据 (1)
  • topic_reportpretrain-data-repetition· report.positions[3].countergpt-5.2

    [counter to Camp D: Zero repetition for sensitive/eval/copyright data] The practical constraint is that many risky items cannot be perfectly identified, and cross-corpus reuse makes “zero-repeat” hard to falsify operationally. The goal should shift from 'absolute zero' to auditable exposure caps plus reproducible filtering evidence, decoupled from the main pool's epoch policy [Elazar2023WhatsInMyBigData][Deng2023BenchmarkContamination].

consensusc-7dc93baeb8
Kaplan 2020's compute-optimal bias toward large models was a by-product of its LR-schedule and training-horizon assumptions; Chinchilla [Hoffmann2022Chinchilla] corrected the ratio from ≈ 1.7 to ≈ 20 tokens/param (see the sketch below).
来源论文· 2[Hoffmann2022Chinchilla]arXiv 2203.15556arxiv.org
1 观测Scaling Laws LLM
证据 (1)
  • topic_reportscaling-laws-llmarXiv 2203.15556· report.headline_claimsclaude-opus-4.7

    Kaplan 2020's large-model optimum was a side effect of a fixed LR schedule and short training horizons — Chinchilla [Hoffmann2022Chinchilla] corrected the ratio from ≈ 1.7 to ≈ 20 tokens/param.

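A minimal sketch of what '≈ 20 tokens/param' implies for allocation, assuming the standard C ≈ 6·N·D FLOPs approximation: with D = 20·N, the compute-optimal model size is N = sqrt(C/120). The ratio, not this closed form, is the paper's claim; the closed form just makes it operational:

```python
import math

def chinchilla_alloc(C: float, tokens_per_param: float = 20.0):
    """Split a FLOPs budget C into (params N, tokens D) under C = 6*N*D."""
    N = math.sqrt(C / (6 * tokens_per_param))
    return N, tokens_per_param * N

for C in (1e21, 1e22, 5.76e23):          # the last is roughly Chinchilla's budget
    N, D = chinchilla_alloc(C)
    print(f"C={C:.2e}: N ~ {N/1e9:.1f}B params, D ~ {D/1e9:.0f}B tokens")
```
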
consensusc-b7ded43bb0
Independent open-source re-fits (DeepSeek LLM [DeepSeek2024LLM]) place the compute-optimal ratio in a wide 5–100 tokens/param band; there is no copyable 'correct slope'.
来源论文· 2[DeepSeek2024LLM]arXiv 2401.02954arxiv.org
1 观测Scaling Laws LLM
证据 (1)
  • topic_reportscaling-laws-llmarXiv 2401.02954· report.headline_claimsclaude-opus-4.7

    Open-source independent re-fits (DeepSeek LLM [DeepSeek2024LLM]) place compute-optimal in a wide 5–100 tokens/param band — there is no copyable universal slope.

consensusc-ab69ee7e61
Data mixture is an independent scaling axis: at matched N and D budgets, DCLM [Li2024DCLM]'s filtering recipes open ≥ 7 pp downstream gaps, and phi-1 [Gunasekar2023Textbooks] beats 10× larger models on code tasks with 1.3B parameters.
来源论文· 2[Li2024DCLM][Gunasekar2023Textbooks]
1 观测Scaling Laws LLM
证据 (1)
  • topic_reportscaling-laws-llm· report.headline_claimsclaude-opus-4.7

    Data mixture is an independent scaling axis: at matched N and D budgets, DCLM [Li2024DCLM] recipes open ≥ 7 pp downstream gaps, and phi-1 [Gunasekar2023Textbooks] beats 10× larger models on code with 1.3B parameters.

consensusc-00b770e0e1
Per-task scores do not extrapolate along the loss power law: Gadre et al. [Gadre2024OverTraining] and Bhagia et al. [Bhagia2024TaskLadders] show a two-step regression (loss → task perplexity → accuracy) is needed to reach ≈ 1.9% prediction error (see the sketch below).
来源论文· 3[Gadre2024OverTraining][Bhagia2024TaskLadders]arXiv 2403.08540arxiv.org
1 观测Scaling Laws LLM
证据 (1)
  • topic_reportscaling-laws-llmarXiv 2403.08540· report.headline_claimsclaude-opus-4.7

    Per-task scores do not follow the loss power law — Gadre et al. [Gadre2024OverTraining] and Bhagia et al. [Bhagia2024TaskLadders] show a two-step regression (loss → task perplexity → accuracy) is required to reach ≈ 1.9% prediction error.

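A minimal sketch of the two-step fit under synthetic data: step 1 fits a power law from compute to loss, step 2 fits a sigmoid from loss to task accuracy, and the prediction chains the two instead of forcing accuracy onto a single power law. All constants below are placeholders, not the cited papers' fits:

```python
import numpy as np
from scipy.optimize import curve_fit

rng = np.random.default_rng(0)
C = np.logspace(19, 24, 12)                         # compute budgets of the ladder
loss = 1.7 + 25.0 * C ** -0.05 + 0.01 * rng.normal(size=C.size)
acc = 1 / (1 + np.exp((loss - 3.5) / 0.12))         # synthetic "true" task curve

def step1(c, a, b, e):                               # compute -> loss (power law)
    return a + b * c ** -e

def step2(l, mid, width):                            # loss -> accuracy (sigmoid)
    return 1 / (1 + np.exp((l - mid) / width))

p1, _ = curve_fit(step1, C, loss, p0=(1.5, 20.0, 0.05), maxfev=20000)
p2, _ = curve_fit(step2, loss, acc, p0=(3.4, 0.1))

# Extrapolate through the chain to an unseen budget.
print("predicted accuracy at C=1e25:", step2(step1(1e25, *p1), *p2))
```
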
consensusc-7d3c13f794
Most reported 'emergent abilities' disappear once Schaeffer et al. [Schaeffer2023Mirage] switch to continuous metrics; emergence reads more like a nonlinearity in the evaluation metric than an internal phase transition in the model.
来源论文· 1[Schaeffer2023Mirage]
1 观测Scaling Laws LLM
证据 (1)
  • topic_reportscaling-laws-llm· report.headline_claimsclaude-opus-4.7

    Most reported 'emergent abilities' disappear once Schaeffer et al. [Schaeffer2023Mirage] swap to continuous metrics — emergence is better read as a nonlinearity in the evaluation metric than as an internal phase transition.

consensusc-5a7066e9d2
≈ 4 epochs is the safe boundary for data reuse: Muennighoff et al. [Muennighoff2023DataConstrained] give the average curve, and Hernandez et al. [Hernandez2022Repeated] give the micro-mechanism: beyond that boundary, induction-head capacity is consumed by memorization.
Source papers· 3[Muennighoff2023DataConstrained][Hernandez2022Repeated]arXiv 2305.16264arxiv.org
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llmarXiv 2305.16264· report.headline_claimsclaude-opus-4.7

    ≈ 4 epochs is the safe boundary for token repetition: Muennighoff et al. [Muennighoff2023DataConstrained] give the average curve, Hernandez et al. [Hernandez2022Repeated] give the mechanism — beyond it induction-head capacity is spent on memorization.

contestedc-34e7f96191
Hoffmann et al. [Hoffmann2022Chinchilla] obtain tokens/param ≈ 20 consistently from three independent fits, and Chinchilla-70B / 1.4T beats Gopher-280B and GPT-3-175B, directly falsifying Kaplan's bigger-model recipe.
Source papers· 2[Hoffmann2022Chinchilla]arXiv 2203.15556arxiv.org
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llmarXiv 2203.15556· report.positions[0].counterclaude-opus-4.7

    [counter to Camp A: Kaplan — bigger model, fewer tokens] Hoffmann et al. [Hoffmann2022Chinchilla] show, via three independent fits, tokens/param ≈ 20. Chinchilla-70B / 1.4T beats Gopher-280B and GPT-3-175B, empirically falsifying Kaplan's bigger-model, fewer-tokens recipe.

contestedc-2c212b6b07
DeepSeek LLM [DeepSeek2024LLM] sweeps the ratio across 5–100 under independent data and batch schedules; LLaMA [Touvron2023LLaMA] itself over-trains 7B to a ratio of 143; Muennighoff et al. [Muennighoff2023DataConstrained] derive a very different slope in the data-constrained regime. ≈ 20 is not a constant; it is a local tangent.
Source papers· 4[DeepSeek2024LLM][Touvron2023LLaMA][Muennighoff2023DataConstrained]arXiv 2401.02954arxiv.org
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llmarXiv 2401.02954· report.positions[1].counterclaude-opus-4.7

    [counter to Camp B: Chinchilla — balance N and D under compute] DeepSeek LLM [DeepSeek2024LLM] finds a 5–100 band under independent data and batch schedules; LLaMA [Touvron2023LLaMA] itself over-trains 7B to ratio 143; Muennighoff et al. [Muennighoff2023DataConstrained] report a different slope in the data-constrained regime; ≈ 20 is a local tangent, not a constant.

contestedc-400d2435d6
phi-1's [Gunasekar2023Textbooks] "textbook-quality" edge is not robust on general tasks, and RefinedWeb's [Penedo2023RefinedWeb] web-only conclusion breaks down in specialized domains such as code and math. Data-quality gains are strongly domain-dependent and should not be treated as a universal slope.
Source papers· 2[Gunasekar2023Textbooks][Penedo2023RefinedWeb]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[2].counterclaude-opus-4.7

    [counter to Camp C: Data-mixture pragmatists — data is the first axis] phi-1's [Gunasekar2023Textbooks] 'textbook-quality' edge does not generalize to broad tasks; RefinedWeb's [Penedo2023RefinedWeb] web-only claim weakens in code/math specialization; data-quality gains are strongly domain-dependent.

contestedc-04afbf57d6
The GPT-4 technical report [OpenAI2023GPT4] shows stepwise jumps where some capabilities are near zero in small models yet usable in large ones; Yuan et al. [Yuan2023Math] likewise report clear threshold effects on reasoning tasks that cannot all be attributed to the metric.
Source papers· 3[OpenAI2023GPT4][Yuan2023Math]arXiv 2303.08774arxiv.org
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llmarXiv 2303.08774· report.positions[3].counterclaude-opus-4.7

    [counter to Camp D: Against emergence-as-magic] The GPT-4 technical report [OpenAI2023GPT4] shows stepwise jumps where small models score near zero and large ones become usable; Yuan et al. [Yuan2023Math] report threshold-like behavior on reasoning tasks that cannot be fully attributed to metric choice.

contestedc-645291940e
Fixed-dimensional hidden states mathematically cannot losslessly compress long sequences containing many low-frequency entities, imposing hard physical ceilings on induction-head and associative-recall tasks.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[0].countergemini-3.1-pro

    [counter to Camp A: Pure SSMs will eventually replace Transformers entir] Fixed-dimensional hidden states mathematically cannot losslessly compress long sequences containing numerous low-frequency entities, leading to hard physical limits on induction heads and associative recall.

contestedc-1023170297
Mamba's core innovation is input-dependent gating, which lets its decay matrix vary dynamically with the input, whereas traditional linear attention typically uses a fixed positional decay (sketch below).
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[1].countergemini-3.1-pro

    [counter to Camp B: Linear Attention and SSMs are fundamentally the same] Mamba's core innovation lies in "input-dependent gating," allowing its decay matrix to change dynamically with the input, whereas traditional linear attention typically uses a fixed positional decay.
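A schematic contrast between the two recurrences (shapes and the sigmoid gate are illustrative simplifications; real Mamba parameterizes its dynamics through the SSM discretization):

    import torch

    def fixed_decay_scan(x, A, B):
        # Linear-attention-style recurrence: decay A is constant over time.
        h = torch.zeros(x.shape[-1])
        for t in range(x.shape[0]):
            h = A * h + B * x[t]
        return h

    def input_gated_scan(x, W_a, B):
        # Mamba-style recurrence: the decay is computed from the input,
        # so the state can selectively forget or retain per token.
        h = torch.zeros(x.shape[-1])
        for t in range(x.shape[0]):
            a_t = torch.sigmoid(x[t] @ W_a)  # input-dependent gate in (0, 1)
            h = a_t * h + B * x[t]
        return h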

contestedc-64b4f04f14
Pretraining from scratch is prohibitively expensive, and Transformers have already learned high-quality feature representations and induction heads; that knowledge can be distilled into SSMs via matrix alignment.
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[2].countergemini-3.1-pro

    [counter to Camp C: Subquadratic models must be pretrained from scratch] Pretraining from scratch is prohibitively expensive, and Transformers have already learned high-quality feature representations and induction heads, which can be distilled into SSMs via matrix alignment.

consensusc-9325ab30e6
HumanEval alone is insufficient for comparing SWE: it mainly tests single-file function synthesis, while repo context, infilling, private libraries, and issue fixing systematically reorder model rankings [Chen2021Codex] [Austin2021MBPP] [RepoBench2023] [InCoder2022].
Source papers· 5[Chen2021Codex][Austin2021MBPP][RepoBench2023][InCoder2022]arXiv 2107.03374arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2107.03374· report.headline_claimsgpt-5.4

    HumanEval alone is insufficient for SWE comparison: it mainly measures single-file function synthesis, while repo context, infilling, private libraries, and issue fixing systematically reorder model rankings [Chen2021Codex] [Austin2021MBPP] [RepoBench2023] [InCoder2022].

consensusc-5ca29ea083
SWE-bench Verified is the anchor at the SFT/RL stage, but not the only truth: it reduces flaky-test noise yet still mainly covers Python repositories, and relative rankings change once evaluation extends to 7 languages [SWEbench2023] [SWEbenchVerified2024] [MultiSWEBench2025].
Source papers· 4[SWEbench2023][SWEbenchVerified2024][MultiSWEBench2025]arXiv 2310.06770arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2310.06770· report.headline_claimsgpt-5.4

    SWE-bench Verified is the anchor at the SFT/RL stage, but not the only truth: it reduces flaky-test noise yet still concentrates on Python repositories; once extended to 7 languages, relative rankings can change [SWEbench2023] [SWEbenchVerified2024] [MultiSWEBench2025].

consensusc-9a069320aa
Rolling evaluation with freshness control is a necessary condition for SFT-ready comparison; without a time window, high scores on code benchmarks may partly reflect contamination rather than generalization [LiveCodeBench2024] [EvalPlus2023] [ODEX2022].
Source papers· 4[LiveCodeBench2024][EvalPlus2023][ODEX2022]arXiv 2403.07974arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2403.07974· report.headline_claimsgpt-5.4

    A contamination-controlled rolling benchmark is a required condition for SFT-ready comparison; without a time window, high code-benchmark scores may partly reflect contamination rather than generalization [LiveCodeBench2024] [EvalPlus2023] [ODEX2022].

consensusc-b5194c18f0
Looking only at pass@1 at deployment understates differences between agent systems; retry@k, test-execution rate, unnecessary file reads, and tokens-per-issue are often closer to user-perceived quality [SelfDebug2023] [CodeT2022] [ClaudeCode2025].
Source papers· 4[SelfDebug2023][CodeT2022][ClaudeCode2025]arXiv 2304.05128arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2304.05128· report.headline_claimsgpt-5.4

    If deployment evaluation looks only at pass@1, it understates system differences; retry@k, test-execution rate, unnecessary file reads, and tokens per issue are often closer to user-perceived quality [SelfDebug2023] [CodeT2022] [ClaudeCode2025].

contestedc-0e5da3aa2c
The counter-evidence is already concentrated: EvalPlus [EvalPlus2023] shows the original tests are too loose; CoderEval [CoderEval2023], RepoBench [RepoBench2023], and BigCodeBench [BigCodeBench2024] show that real workloads involve library calls, cross-file context, and complex instructions, none of which sit on HumanEval's task surface.
Source papers· 5[EvalPlus2023][CoderEval2023][RepoBench2023][BigCodeBench2024]arXiv 2305.01210arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2305.01210· report.positions[0].countergpt-5.4

    [counter to Camp A: HumanEval is enough] The pushback is now concentrated: EvalPlus [EvalPlus2023] shows that original tests are too loose; CoderEval [CoderEval2023], RepoBench [RepoBench2023], and BigCodeBench [BigCodeBench2024] show that real workloads involve library calls, cross-file context, and complex instructions, none of which HumanEval covers.

contestedc-07d87154fb
The problem is not that Verified is bad but that its coverage is incomplete. Multi-SWE-Bench [MultiSWEBench2025] shows cross-language reshuffling; SWE-Gym [SWEGym2024] and Claude Code [ClaudeCode2025] show that environment, trajectory, and harness also change conclusions.
Source papers· 4[MultiSWEBench2025][SWEGym2024][ClaudeCode2025]arXiv 2504.02605arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2504.02605· report.positions[1].countergpt-5.4

    [counter to Camp B: SWE-bench Verified is the only truth] The issue is not that Verified is poor, but that it is incomplete. Multi-SWE-Bench [MultiSWEBench2025] shows cross-language reordering; SWE-Gym [SWEGym2024] and Claude Code [ClaudeCode2025] show that environment, trajectory, and harness also shift conclusions.

contestedc-84c01f630d
The existing literature offers no sufficiently direct, systematic correlation from patch-PPL / code BPB / CrossCodeEval to SWE-bench Verified. CRUXEval [CRUXEval2024] only shows that execution semantics is a useful proxy, and RepoBench [RepoBench2023] only shows that repo realism matters; neither supports the strong conclusion that likelihood can replace downstream benchmarks.
Source papers· 2[CRUXEval2024][RepoBench2023]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[2].countergpt-5.4

    [counter to Camp C: trajectory-level PPL is the most real pretrain-stage] The current literature does not provide sufficiently direct evidence that patch-PPL, code BPB, or CrossCodeEval systematically predict SWE-bench Verified. CRUXEval [CRUXEval2024] only shows execution semantics is a useful proxy, and RepoBench [RepoBench2023] only shows repo realism matters; neither supports replacing downstream benchmarks with likelihood.

contestedc-e1f5138355
Looking only at UX metrics is also insufficient: without a success-rate anchor, a system may simply be more conservative, more expensive, or buying stability with extra budget. SWE-bench Verified [SWEbenchVerified2024] still needs to be retained as the hard outcome-level constraint.
Source papers· 1[SWEbenchVerified2024]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[3].countergpt-5.4

    [counter to Camp D: agent UX metrics matter more than pass@1] Looking only at UX metrics is also insufficient, because without a success anchor a system may simply be more conservative, more expensive, or buying stability with more budget.

contestedc-b93955c99f
The problem is that this line is easily misread as "synthetic can replace the real world." The Phi series itself does not prove that; its public recipes still rely on real seeds, filtered web data, and a strong teacher. Without these anchors, recursive distillation converges onto teacher preferences faster and drops the long tail. [MAD2023][CollapseInevitable2024]
Source papers· 2[MAD2023][CollapseInevitable2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[0].countergpt-5.4

    [counter to Camp A: synthetic data-first (the Phi line)] The problem is that this line is easy to overread as “synthetic can replace the real world.” The Phi papers do not establish that. Their public recipes still rely on real seeds, filtered web data, and a strong teacher; without those anchors, recursive distillation drifts toward teacher preferences and loses the long tail. [MAD2023][CollapseInevitable2024]

contestedc-e7300e4582
The cost is higher engineering complexity: a separate mid-train schedule, more data pipelines, and more verifier/filter components; public reports also usually withhold the full mixture ratios, so the bar for reproduction is high.
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[1].countergpt-5.4

    [counter to Camp B: web-heavy backbone + synthetic mid-train (the LLaMA ] The cost is higher engineering complexity: a separate mid-train schedule, more data pipelines, and more verifier/filter components. Public reports also rarely disclose full mixture ratios, raising the bar for reproduction.

contestedc-781d479243
The problem is that pure web scale is too inefficient for scarce capabilities. WRAP shows the same content carries higher token density after rephrasing; DeepSeekMath shows that without verifier-driven synthesis, math cannot amplify its correctness signal to the same level. [WRAP2024][DeepSeekMath2024]
Source papers· 2[WRAP2024][DeepSeekMath2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[2].countergpt-5.4

    [counter to Camp C: pure web-scale, avoid synthesis as much as possible] The issue is that pure web-scale is inefficient for scarce capabilities. WRAP shows that the same content becomes denser after rephrasing, and DeepSeekMath shows that without verifier-driven synthesis, math cannot amplify its correctness signal to the same level. [WRAP2024][DeepSeekMath2024]

contestedc-67ee0da4b6
The public evidence does not support "unlimited scaling." MAD and Gerstgrasser et al. [CollapseInevitable2024] have already spelled out the risk of replacing real data; moreover, only domains where the verifier is stronger than the teacher, such as code and math, show longer scaling regimes. Natural-language reasoning and agentic trajectories carry no guarantee of the same grade. [MAD2023][CollapseInevitable2024][DeepSeekMath2024]
Source papers· 3[CollapseInevitable2024][MAD2023]arXiv 2404.01413arxiv.org
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrainarXiv 2404.01413· report.positions[3].countergpt-5.4

    [counter to Camp D: unlimited synthetic scaling] Current public evidence does not support “unlimited scaling.” MAD and Gerstgrasser et al. [CollapseInevitable2024] already clarify the risk of replacing real data; moreover, only domains where the verifier is stronger than the teacher, such as code and math, show longer scaling regimes.

contestedc-897c44d0c3
Scaffolding is only a multiplier; pretrained capability is the multiplicand. If the base model never saw cross-file dependencies or commit formats during pretraining, an external framework's prompts cannot conjure correct patches out of thin air.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretraining· report.positions[0].countergemini-3.1-pro

    [counter to Camp A: Scaffolding and Test-Time Compute are Everything] Scaffolding is merely a multiplier; pretraining capability is the multiplicand. If the base model hasn't seen cross-file dependencies or commit formats during pretraining, an external framework's prompts cannot conjure correct patches out of thin air.

contestedc-24f25e3e3c
RL success depends heavily on the initial policy that pretraining provides. Experiments by Wei et al. [2502.18449] show that without a high-quality pretrained base, execution feedback in the RL stage fails because the exploration space is too large.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretrainingarXiv 2502.18449· report.positions[1].countergemini-3.1-pro

    [counter to Camp B: RL and Verifiers are the True Drivers] RL success is highly dependent on the initial policy provided by pretraining. Experiments by Wei et al. [2502.18449] show that without a high-quality pretraining base, execution feedback in the RL stage fails because the exploration space is too large.

contestedc-d2e6952634
Simply adding more single-file code tokens runs into diminishing returns. Without changing how the data is organized (e.g., introducing topological packing), the model cannot learn cross-file reasoning.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretraining· report.positions[2].countergemini-3.1-pro

    [counter to Camp C: Just Mix More Code] Simply increasing the token count of single-file code yields diminishing returns. Without changing the data organization format (e.g., introducing topological packing), the model cannot learn cross-file reasoning.

contestedc-a4fa94c482
Arora et al. [Arora2024LinearAttention] and Merrill et al. [Merrill2024IllusionState] prove that fixed state capacity cannot losslessly compress long sequences, causing drops on MQAR.
Source papers· 2[Arora2024LinearAttention][Merrill2024IllusionState]
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[0].countergemini-3.1-pro

    [counter to Camp A: Pure SSM is the ultimate solution for long context] Arora et al. [Arora2024LinearAttention] and Merrill et al. [Merrill2024IllusionState] prove fixed state capacity cannot losslessly compress long sequences, causing MQAR degradation.

contestedc-dee36b986e
Hybrid architectures introduce heterogeneous operators, raising the engineering complexity of KV-cache management.
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[1].countergemini-3.1-pro

    [counter to Camp B: Hybrid is the pragmatic production path] Hybrid architectures introduce heterogeneous operators, increasing the engineering complexity of KV cache management.

contestedc-55fa0cb36a
At 256K scale, even with GQA, the Transformer's KV-cache memory footprint and prefill latency remain hard blockers (back-of-envelope sketch below).
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[2].countergemini-3.1-pro

    [counter to Camp C: Transformer + long-context extensions suffice] At 256K scale, even with GQA, Transformer's KV cache memory footprint and prefill latency remain prohibitive.
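A back-of-envelope sketch of the claim (shapes are hypothetical 70B-class values with an fp16 cache):

    def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, batch=1, dtype_bytes=2):
        # Keys + values, stored for every layer, KV head, and position.
        return 2 * layers * kv_heads * head_dim * seq_len * batch * dtype_bytes

    mha = kv_cache_bytes(80, 64, 128, 256_000)  # full multi-head attention
    gqa = kv_cache_bytes(80, 8, 128, 256_000)   # GQA with h/8 KV heads
    print(f"MHA ≈ {mha/2**30:.0f} GiB, GQA ≈ {gqa/2**30:.0f} GiB per sequence")
    # MHA ≈ 625 GiB, GQA ≈ 78 GiB: an 8× cut that is still large at 256K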

contestedc-ddf51fbd72
Chen et al. [Chen2024StuffedMamba] point out that RWKV faces the same state-interference problem arising from fixed state capacity.
Source papers· 1[Chen2024StuffedMamba]
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_reportssm-hybrid-architectures· report.positions[3].countergemini-3.1-pro

    [counter to Camp D: RWKV is the correct RNN revival path] Chen et al. [Chen2024StuffedMamba] point out RWKV faces the same state interference issues due to fixed state capacity.

consensusc-675cdbdc8d
In fixed-compute controlled pretraining, tokenizer choice alone induces 0.6–5.1 pp downstream differences, on par with common data-mixture changes [Ali2024TokenizerChoice].
Source papers· 2[Ali2024TokenizerChoice]arXiv 2402.01035arxiv.org
3 observations · Tokenizer Scaling
Evidence (3)
  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsgpt-5.2

    In fixed-compute controlled pretraining, tokenizer choice alone induces 0.6–5.1 pp downstream variance, comparable to common data-mixture changes [Ali2024TokenizerChoice].

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsgpt-5.2

    Under fixed-compute controlled pretraining, changing only the tokenizer induces 0.6–5.1 pp downstream variance, and higher compression alone does not explain the ranking [Ali2024TokenizerChoice].

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsgpt-5.2

    At fixed compute and the same 2.6B model size, swapping the tokenizer alone induces 0.6–5.1 pp downstream variance; treating the tokenizer as constant misattributes this variance to data or training recipe [Ali2024TokenizerChoice].

consensusc-cb9c0c9e58
Scaling the vocab from 32K to 128K can deliver roughly 0.02–0.04 nats lower training loss while introducing almost no extra compute bottleneck at large-model inference; the gains concentrate in better bytes/token for non-English and code [Dubey2024Llama3].
Source papers· 1[Dubey2024Llama3]
3 observations · Tokenizer Scaling
Evidence (3)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Scaling vocab from 32K to 128K yields ~0.02–0.04 nats lower training loss with near-zero inference bottleneck at large-model serving; gains concentrate in better bytes/token for non-English and code [Dubey2024Llama3].

  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Scaling vocab 32K→128K adds little inference overhead beyond embedding/softmax, yet at 15.6T training tokens it yields 0.02–0.04 nats lower loss and improves bytes/token for non-English and code [Dubey2024Llama3].

  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Scaling vocab to 128K in industrial training corresponds to 0.02–0.04 nats lower loss [Dubey2024Llama3] and shorter non-English/code sequences via lower bytes/token, shifting inference cost pressure more toward KV cache and attention compute.

consensusc-754f10db79
Digit tokenization (single-digit vs merged BPE) creates a stable 10–20 pp gap on 3–5 digit arithmetic, with no sign of "closing with scale" from 1B→7B up to frontier models [Spiess2023Digits][Singh2024TokenizationCounts] (sketch below).
Source papers· 2[Spiess2023Digits][Singh2024TokenizationCounts]
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Digit tokenization (single-digit vs merged BPE) causes stable 10–20 pp gaps on 3–5 digit arithmetic, with no sign of “closing with scale” from 1B→7B up to frontier models [Spiess2023Digits][Singh2024TokenizationCounts].

  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Digit-tokenization boundary choices create 10–20 pp gaps on 3–5 digit arithmetic, and the gap does not automatically converge from 1B→7B nor in frontier measurements [Spiess2023Digits][Singh2024TokenizationCounts].
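A minimal sketch of forcing single-digit tokens by pre-splitting digit runs before BPE (an illustrative pretokenizer rule; production tokenizers express the same idea inside their pretokenization regex):

    import re

    def pretokenize(text: str):
        # Digits become their own pre-tokens, so BPE can never merge "12" etc.
        return re.findall(r"\d|[^\d\s]+|\s+", text)

    print(pretokenize("12345 + 678"))
    # ['1', '2', '3', '4', '5', ' ', '+', ' ', '6', '7', '8']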

consensusc-33abd8498f
Released models contain 3k–10k+ detectable under-trained tokens; these tokens are more likely to emit garbage or meaningless fragments at generation time, and Magikarp's embedding-gap plus probes can localize the risk before deployment [LandBartolo2024Magikarp] (sketch below).
Source papers· 2[LandBartolo2024Magikarp]arXiv 2405.05417arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2405.05417· report.headline_claimsgpt-5.2

    Trained LMs contain 3k–10k+ detectable under-trained tokens that are more likely to emit garbage strings; Magikarp’s embedding-gap plus probes can localize these risks pre-deployment [LandBartolo2024Magikarp].
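A minimal sketch of the embedding-gap idea (the norm-outlier indicator below is a simplification; Magikarp combines several embedding-space signals with generation probes):

    import torch

    def flag_undertrained(embedding: torch.Tensor, z_thresh: float = -3.0):
        # Tokens whose embedding norm is an extreme low outlier were likely
        # (almost) never updated during training.
        norms = embedding.norm(dim=-1)
        z = (norms - norms.mean()) / norms.std()
        return torch.nonzero(z < z_thresh).flatten()

    # usage on a hypothetical HF-style model:
    # ids = flag_undertrained(model.get_input_embeddings().weight.detach())
    # then probe each id: can the model repeat the token back verbatim?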

consensusc-603e6b4074
Cross-tokenizer comparison should be primarily byte-normalized (BPB / byte-normalized loss); per-token PPL systematically favors higher-compression tokenizers even when downstream is no better [Gao2020Pile][Schmidt2024TokenizationMoreThanCompression] (worked example below).
Source papers· 2[Gao2020Pile]arXiv 2101.00027arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2101.00027· report.headline_claimsgpt-5.2

    Cross-tokenizer evaluation should be byte-normalized (BPB/byte-normalized loss); per-token perplexity systematically favors more compressed tokenizers even when downstream is not better [Gao2020Pile][Schmidt2024TokenizationMoreThanCompression].
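A worked example of byte normalization, assuming mean per-token cross-entropy measured in nats (numbers are illustrative):

    import math

    def bits_per_byte(mean_token_loss_nats, n_tokens, n_bytes):
        # Total nats over the text, renormalized per byte and converted to
        # bits, so different tokenizers compare on the same byte stream.
        return (mean_token_loss_nats * n_tokens) / (n_bytes * math.log(2))

    # Same 760-byte text under two tokenizers: the coarser tokenizer has a
    # higher per-token loss yet a lower (better) BPB.
    print(bits_per_byte(2.10, n_tokens=180, n_bytes=760))  # ≈ 0.718
    print(bits_per_byte(2.45, n_tokens=150, n_bytes=760))  # ≈ 0.698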

contestedc-2f8e74ffe3
Controlled experiments already show the tokenizer is not a "second-order detail": the 0.6–5.1 pp spread at fixed compute cannot be explained by "coverage is already fine" [Ali2024TokenizerChoice]; moreover, under-trained tokens are widespread across models, a form of training debt that accrues when teams simply inherit defaults [LandBartolo2024Magikarp].
Source papers· 2[Ali2024TokenizerChoice][LandBartolo2024Magikarp]
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_reporttokenizer-scaling· report.positions[0].countergpt-5.2

    [counter to Camp A: tokenizers are frozen preprocessing; coverage is eno] Controlled experiments show tokenizers are not a second-order detail: 0.6–5.1 pp variance at fixed compute is not explained by “coverage is already fine” [Ali2024TokenizerChoice]; under-trained tokens are widespread across models, a debt that accrues from inheriting defaults [LandBartolo2024Magikarp].

  • topic_reporttokenizer-scaling· report.positions[0].countergpt-5.2

    [counter to Camp A: tokenizers are frozen preprocessing; coverage is eno] Fixed-compute controlled experiments directly refute the engineering assumption that tokenizer effects are negligible: changing only the tokenizer yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice].

contestedc-65e2428842
Existing public evidence mainly covers vocabs up to 128K, not 256K+; larger vocabs also amplify long-tail and under-trained-token risks [LandBartolo2024Magikarp], and compression by itself does not predict downstream performance [Schmidt2024TokenizationMoreThanCompression]. What is missing is a systematic fixed-compute comparison across 32K/64K/128K/256K (including latency, embedding bandwidth, and under-trained-token rates).
Source papers· 2[LandBartolo2024Magikarp]arXiv 2405.05417arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2405.05417· report.positions[1].countergpt-5.2

    [counter to Camp B: bigger vocab is always better; push to 256K+] Public evidence largely covers up to 128K, not 256K+. Larger vocabs can amplify long-tail and under-trained risks [LandBartolo2024Magikarp], and compression alone does not predict downstream performance [Schmidt2024TokenizationMoreThanCompression].

contestedc-05b5d73780
BLT has pushed quality to BPE parity but explicitly reports a 2–3× inference cost [Pagnoni2024BLT]; for throughput-sensitive products, that tax is often "harder" than the 0.6–5.1 pp opportunity cost of tokenizer choice [Ali2024TokenizerChoice]. Moreover, many production problems (digits, whitespace, long-tail tokens) can be fixed to acceptable levels with far smaller engineering changes within the BPE route [Spiess2023Digits][LandBartolo2024Magikarp].
Source papers· 5[Pagnoni2024BLT][Ali2024TokenizerChoice][Spiess2023Digits][LandBartolo2024Magikarp]arXiv 2412.09871arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2412.09871· report.positions[2].countergpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE] BLT pushes quality to BPE parity but also reports a 2–3× inference cost [Pagnoni2024BLT]. For throughput-sensitive products, this tax is often harder than the opportunity cost of tokenizer choice [Ali2024TokenizerChoice]; many production issues are fixable with smaller engineering changes within BPE [Spiess2023Digits][LandBartolo2024Magikarp].

contestedc-225d41d32b
The main challenge is not theory but organization and tooling: tokenizer sweeps, BPB reporting, and Magikarp scan/repair all need to enter the training pipeline, while multilingual vocab allocation and code-whitespace policy must be co-designed with the data mixture [Limisiewicz2023VocabAllocation][Kocetkov2022TheStack].
Source papers· 2[Limisiewicz2023VocabAllocation][Kocetkov2022TheStack]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].countergpt-5.2

    [counter to Camp D: tokenizer is a pretrain product spec—make BPE right ] The main challenge is not theory but org/tooling: tokenizer sweeps, BPB reporting, and Magikarp scan/repair must enter the training pipeline; multilingual vocab allocation and code-whitespace policy must be co-designed with the data mixture [Limisiewicz2023VocabAllocation][Kocetkov2022TheStack].

consensusc-72bd8013e5
For dense decoder-only LLMs below 70B, prefer GQA over MQA: it keeps most of the quality while cutting KV heads from h to groupings like h/8, capturing nearly all of MQA's inference-memory gain without MQA's usual quality loss [Ainslie2023GQA][Shazeer2019MQA].
Source papers· 3[Ainslie2023GQA][Shazeer2019MQA]arXiv 2305.13245arxiv.org
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvementsarXiv 2305.13245· report.headline_claimsgpt-5.4

    For dense decoder-only LLMs below 70B, prefer GQA over MQA: it keeps most of the quality while reducing KV heads to grouped settings such as h/8, capturing much of MQA’s inference-memory gain without the usual quality drop [Ainslie2023GQA][Shazeer2019MQA].

consensusc-fabc2eeabb
Norm/FFN-side defaults have largely converged on pre-RMSNorm + SwiGLU; QK-Norm's main value is suppressing attention-logit spikes in large-scale training, not delivering standalone downstream gains at every scale [RMSNorm2019][Shazeer2020GLU][Wortsman2023Instabilities] (sketch below).
Source papers· 4[RMSNorm2019][Shazeer2020GLU][Wortsman2023Instabilities]arXiv 1910.07467arxiv.org
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvementsarXiv 1910.07467· report.headline_claimsgpt-5.4

    On the Norm/FFN side, defaults have mostly converged to pre-RMSNorm + SwiGLU; the main value of QK-Norm is suppressing attention-logit spikes in large-scale training, not delivering standalone downstream gains at every scale [RMSNorm2019][Shazeer2020GLU][Wortsman2023Instabilities].
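A minimal sketch of the QK-Norm idea (unit-normalizing queries/keys so logits are bounded; production variants use a learnable RMSNorm with a gain rather than plain normalization):

    import torch
    import torch.nn.functional as F

    def qk_norm_attention(q, k, v, scale):
        # Normalizing q and k caps |q·k|, so attention logits cannot spike
        # no matter how large the pre-norm activations grow.
        q = F.normalize(q, dim=-1)
        k = F.normalize(k, dim=-1)
        logits = (q @ k.transpose(-2, -1)) * scale
        return torch.softmax(logits, dim=-1) @ v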

consensusc-f12e918aa0
When a stable base already exists and the target size stays below roughly 30B, depth up-scaling / block expansion is often more cost-effective than training from scratch; SOLAR [Kim2023Solar] grows a 7B model to 10.7B with 200B tokens and beats same-size retrained controls (sketch below).
Source papers· 2[Kim2023Solar]arXiv 2312.15166arxiv.org
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvementsarXiv 2312.15166· report.headline_claimsgpt-5.4

    When a stable base already exists and the target size is roughly below 30B, depth-up-scaling or block expansion is often more cost-effective than from-scratch scaling; SOLAR [Kim2023Solar] grows a 7B model to 10.7B with 200B tokens and outperforms same-scale retrained baselines.
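A minimal sketch of SOLAR-style depth up-scaling (the 32→48-layer recipe with an 8-layer overlap follows the paper; the module path in the usage comment is hypothetical HF-style):

    import copy

    def depth_up_scale(layers, overlap=8):
        # Duplicate the stack and drop the overlapping middle: on a 32-layer
        # base with overlap=8 this yields layers [0..23] + [8..31] = 48.
        n = len(layers)
        return [copy.deepcopy(l) for l in layers[: n - overlap]] + \
               [copy.deepcopy(l) for l in layers[overlap:]]

    # usage (hypothetical): model.model.layers = nn.ModuleList(depth_up_scale(model.model.layers))
    # then continue pretraining (SOLAR uses ~200B tokens) to heal the seam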

contestedc-37c937f89c
The counterexamples come from deployment constraints, not pure training loss. Ainslie et al. [Ainslie2023GQA], Jiang et al. [Mistral2023], Google DeepMind [Gemma3Report], and DeepSeek-AI [DeepSeekV2] all show that once context and model scale grow, KV-cache cost alone is enough to change default architecture choices.
Source papers· 3[Ainslie2023GQA][Mistral2023]arXiv 2305.13245arxiv.org
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvementsarXiv 2305.13245· report.positions[0].countergpt-5.4

    [counter to Camp A: the architecture is mostly done; just keep scaling f] The counterargument comes from deployment constraints rather than pure training loss. Ainslie et al. [Ainslie2023GQA], Jiang et al. [Mistral2023], Google DeepMind [Gemma3Report], and DeepSeek-AI [DeepSeekV2] show that once context and model scale grow, KV-cache cost changes default architecture choices.

contestedc-83026f4ce7
The issue is that public LM evidence is still not matched enough on budget, task, and serving constraints. By contrast, Gemma 3 [Gemma3Report], DeepSeek-V2 [DeepSeekV2], and Mistral 7B [Mistral2023] already deliver deployable Transformer modifications, and Lost in the Middle [LostMiddle2023] reminds us that long-context backbones cannot be judged by supported length alone.
Source papers· 2[Mistral2023][LostMiddle2023]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvements· report.positions[1].countergpt-5.4

    [counter to Camp B: the next backbone should move to SSM / RetNet / Mamb] The issue is that public LM evidence is still not matched enough on budget, task, and serving constraints. By contrast, Gemma 3 [Gemma3Report], DeepSeek-V2 [DeepSeekV2], and Mistral 7B [Mistral2023] already ship deployable Transformer modifications, and Lost in the Middle [LostMiddle2023] warns against judging long-context backbones by supported length alone.

contestedc-be8366ac8e
The opposing reading, represented by Kaplan et al. [Kaplan2020ScalingLaws] and modern from-scratch reports [Llama3Report]: if the goal is an all-new large model, a new data distribution, and a long-term roadmap, retraining is cleaner and easier to plan against scaling laws.
Source papers· 2[Kaplan2020ScalingLaws]arXiv 2001.08361arxiv.org
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvementsarXiv 2001.08361· report.positions[2].countergpt-5.4

    [counter to Camp C: the second scaling path should be default—grow first] The counter-view, represented by Kaplan et al. [Kaplan2020ScalingLaws] and modern from-scratch reports [Llama3Report], is that if the goal is a new large model, new data distribution, and long-term roadmap, retraining from scratch is cleaner and easier to plan with scaling laws.

contestedc-a1b55bd677
Wortsman et al. [Wortsman2023Instabilities] already give more specific mechanistic evidence, and Gemma 3 [Gemma3Report] puts QK-Norm in a public default recipe. The piece genuinely short on evidence is sandwich norm, not QK-Norm itself.
Source papers· 2[Wortsman2023Instabilities]arXiv 2309.14322arxiv.org
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvementsarXiv 2309.14322· report.positions[3].countergpt-5.4

    [counter to Camp D: QK-Norm / sandwich norm are optional details] Wortsman et al. [Wortsman2023Instabilities] already provide more specific mechanistic evidence, and Gemma 3 [Gemma3Report] also includes QK-Norm in a public default recipe. The genuinely under-evidenced piece is sandwich norm, not QK-Norm itself.

consensusc-22e57e5e00
Public frontier recipes have shifted from fixed ratios to curricula: training is web-heavy early, and the tail upsamples code/math/reasoning by roughly 3–5× [MetaLlama32024][MiniCPM2024] (sketch below).
Source papers· 2[MetaLlama32024][MiniCPM2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.headline_claimsgpt-5.4

    Public frontier recipes have shifted from fixed mixtures to curricula: early training is web-heavy, while the tail upsamples code, math, and reasoning by roughly 3–5× [MetaLlama32024][MiniCPM2024].
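A minimal sketch of a two-phase curriculum sampler (weights are illustrative, not any published recipe's exact numbers):

    import random

    PHASES = {
        "early": {"web": 0.80, "code": 0.08, "math": 0.04, "reasoning": 0.08},
        "tail":  {"web": 0.36, "code": 0.28, "math": 0.16, "reasoning": 0.20},
    }

    def sample_domain(progress: float, tail_start: float = 0.9) -> str:
        # Switch to the upsampled mixture for the last ~10% of training.
        weights = PHASES["tail"] if progress >= tail_start else PHASES["early"]
        domains, probs = zip(*weights.items())
        return random.choices(domains, weights=probs, k=1)[0]

    # In the tail phase, code/math/reasoning are drawn ~2.5–4× more often.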

consensusc-9b589f233f
Upsampling scarce domains has repetition ceilings: roughly ≤4 epochs is often close to free, while >16 epochs usually calls for fresh or synthetic data rather than more repetition [Muennighoff2023Repeat][Textbooks2023][Cosmopedia2024].
Source papers· 3[Muennighoff2023Repeat][Textbooks2023][Cosmopedia2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.headline_claimsgpt-5.4

    Upsampling scarce domains has repetition ceilings: roughly ≤4 epochs is often nearly free, while >16 epochs usually calls for fresh or synthetic data rather than more repetition [Muennighoff2023Repeat][Textbooks2023][Cosmopedia2024].

contestedc-403c8a5035
The counterargument: formal search adds substantial extra cost, and the optimal weights often depend on dedup, filtering, and domain partitioning; if those prerequisites are unstable, fine-grained search may be optimizing noise [D42023][DataCompLM2024].
Source papers· 1[DataCompLM2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[0].countergpt-5.4

    [counter to Camp A: Formal search first] The counterargument is that formal search adds substantial cost, and the resulting weights often depend on dedup, filtering, and domain partitioning; if those prerequisites are unstable, fine-grained search may just be optimizing noise [D42023][DataCompLM2024].

contestedc-669fad1d59
The counterargument: such recipes depend on the specific model, token budget, and data pipeline and easily distort under transfer; without explicit search, teams often do not know how far they sit from the optimum [DoReMi2023][RegMix2024].
Source papers· 2[DoReMi2023][RegMix2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[1].countergpt-5.4

    [counter to Camp B: Heuristics plus curriculum are more robust] The counterargument is that such recipes depend on specific models, token budgets, and data pipelines, and may distort under transfer; without explicit search, teams often do not know how far they sit from the optimum [DoReMi2023][RegMix2024].

contestedc-550db96fc5
The counterargument: online methods inject a control problem into the main training loop, with noisier feedback and higher systems complexity; if the sampler, logging, or domain labels are unstable, the gains are easily eaten by implementation error [DoReMi2023][OLMo2024].
Source papers· 2[DoReMi2023][OLMo2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[2].countergpt-5.4

    [counter to Camp C: Online adaptation beats one-shot offline search] The counterargument is that online methods inject a control problem into the main training loop, with noisier feedback and higher systems complexity; if the sampler, logging, or domain labels are unstable, gains are easily consumed by implementation error [DoReMi2023][OLMo2024].

contestedc-ee715afd0f
The counterargument: quality filtering mainly compresses noise inside the web domain; once web quality improves, the remaining optimization room still lies in the ratios of code, math, multilingual, and synthetic data, so the ratio question does not disappear [DoReMi2023][MetaLlama32024][MiniCPM2024].
Source papers· 3[DoReMi2023][MetaLlama32024][MiniCPM2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[3].countergpt-5.4

    [counter to Camp D: Ratio is secondary; quality is first-order] The counterargument is that quality mainly compresses noise inside the web domain; once web quality improves, the remaining optimization room still lies in the ratios of code, math, multilingual, and synthetic data [DoReMi2023][MetaLlama32024][MiniCPM2024].

consensusc-021da49dbd
Mid-train has shifted from an optional trick to a standard stage of frontier pretraining; public recipes often place roughly 10–30% of total training compute on post-backbone distribution pullback. [Phi3Report][DeepSeekMath2024][Llama3Herd]
Source papers· 2[DeepSeekMath2024]arXiv 2402.03300arxiv.org
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrainarXiv 2402.03300· report.headline_claimsgpt-5.4

    Mid-train has shifted from an optional trick to a standard stage in frontier pretraining, with public recipes often allocating roughly 10–30% of total training compute to post-backbone distribution pullback.[Phi3Report][DeepSeekMath2024][Llama3Herd]

contestedc-a5952420d8
The opposing question is about extrapolation boundaries. The successes of Phi-1 [Phi1Textbooks] and the Phi-3 Technical Report [Phi3Report] concentrate in small models, controlled distributions, and high-quality seeds; once those conditions are left behind, teacher-style leakage, templating, and missing tail distributions become more visible. [MAD2023][DataCompLM2024]
Source papers· 2[MAD2023][DataCompLM2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[0].countergpt-5.4

    [counter to Camp A: synthetic-first can be the main route] The counterpoint is about extrapolation limits. The successes of Phi-1 [Phi1Textbooks] and Phi-3 Technical Report [Phi3Report] are concentrated in small models, controlled distributions, and high-quality seeds; outside those conditions, teacher-style leakage, templating, and missing tails grow more visible. [MAD2023][DataCompLM2024]

contestedc-6aede28c70
The counterargument calls this route conservative: it still depends on expensive real-data collection and filtering, leaving synthetic data's cost advantage unexploited. [Phi1Textbooks][Cosmopedia2024] Yet in the public evidence, most recipes that are genuinely reusable, scalable, and risk-controlled fall on this route.
Source papers· 1[Cosmopedia2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[1].countergpt-5.4

    [counter to Camp B: a web-heavy backbone plus synthetic mid-train is the] The counterargument is that this route is conservative and still depends on expensive real-data collection and filtering, leaving some of the cost advantage of synthetic data unexploited; yet most reusable, scalable, risk-controlled public recipes fall on this route. [Phi1Textbooks][Cosmopedia2024]

contestedc-1da7d58c52
The problem is that filtering can only select from an existing candidate pool; unlike rephrasing or verifier loops, it cannot actively expand high-value distributions. For scarce distributions such as math, code, and long-context explanation, filtering alone is often not enough. [DeepSeekMath2024][WRAP2024]
Source papers· 2[DeepSeekMath2024][WRAP2024]
2 observations · Synthetic Data Midtrain
Evidence (2)
  • topic_reportsynthetic-data-midtrain· report.positions[2].countergpt-5.4

    [counter to Camp C: avoid synthetic as much as possible; stronger filter] The problem is that filtering can only select from an existing candidate pool; it cannot actively expand high-value distributions the way rephrasing or verifier loops can; for scarce distributions, filtering alone often falls short. [DeepSeekMath2024][WRAP2024]

  • topic_reportsynthetic-data-midtrain· report.positions[2].countergpt-5.4

    [counter to Camp C: avoid synthetic as much as possible; stronger filter] The limitation is that filtering can only reorder an existing distribution; it cannot reliably create scarce task shapes. In high-value but naturally low-frequency regions, filtering cannot manufacture the missing task shapes.

contestedc-8768256855
The public evidence does not support this unbounded extrapolation. MAD [MAD2023] shows that replacing real data loses the tail first; CollapseInevitable [CollapseInevitable2024] only proves that degradation is not inevitable when real data accumulates, not that "unlimited synthetic is fine." More critically, natural-language reasoning and agentic trajectories lack the hard verifiers that code and math enjoy. [MAD2023][CollapseInevitable2024]
Source papers· 2[MAD2023][CollapseInevitable2024]
2 observations · Synthetic Data Midtrain
Evidence (2)
  • topic_reportsynthetic-data-midtrain· report.positions[3].countergpt-5.4

    [counter to Camp D: synthetic can scale almost without bound; collapse i] The public evidence does not support this unbounded extrapolation. MAD [MAD2023] shows that replace-real dynamics lose tail support first; CollapseInevitable [CollapseInevitable2024] only shows degradation is avoidable when real data accumulates, not that unlimited synthetic is safe.

  • topic_reportsynthetic-data-midtrain· report.positions[3].countergpt-5.4

    [counter to Camp D: synthetic can scale almost without bound; collapse i] Counterexamples still matter. MAD [MAD2023] shows that once real data is replaced, tail modes collapse first; and the strongest verifier-backed successes are concentrated in code and math, where verifiers are stronger than the teacher.

consensusc-6d180e9232
Writing the mixture as a curriculum is more robust than a fixed vector: among public recipes, Llama 3 and MiniCPM both use the staged strategy of "web-heavy early + ~3–5× code/math/reasoning upsampling in the tail," mechanistically "suppress variance first, shape capability later" [MetaLlama32024][MiniCPM2024].
Source papers· 2[MetaLlama32024][MiniCPM2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.headline_claimsgpt-5.2

    Curriculum mixtures are more robust than a fixed vector: public recipes from Llama 3 and MiniCPM both follow “web-heavy early + ~3–5× upweighting of code/math/reasoning late,” matching a variance-first then capability-shaping mechanism [MetaLlama32024][MiniCPM2024].

contestedc-d1026c5006
The main risk is transfer: the optimum drifts with model scale and training stage, so a one-shot offline search can fail at target scale; offline methods also assume domain buckets are already stable and accountable, otherwise they fit data-pipeline bias.
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[0].countergpt-5.2

    [counter to Camp A: Formal search first (laws/regression/robust optimiza] The main risk is transfer: optima drift with model scale and training stage, so one-shot offline search can fail at target scale; offline methods also assume stable, accountable domain buckets; otherwise they fit pipeline bias.

contestedc-385691cc98
The counterexample: when the target is strict compute-optimality or worst-case constraints, pure heuristics can miss sizable gains; without systematic search, teams also struggle to explain "why this ratio."
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[1].countergpt-5.2

    [counter to Camp B: Heuristics + curriculum are more robust (few ablatio] Counterpoint: for compute-optimal or worst-case constrained goals, pure heuristics can miss meaningful gains; lack of systematic search also makes it harder to justify why a given mixture was chosen.

contestedc-8b8a5a22b3
Evidence is still sparse: equal-compute head-to-head comparisons are few, and online methods are extremely sensitive to domain accounting, samplers, and loss noise, easily mistaking systematic error for domain difficulty [D42023][OrganizeWeb2025].
Source papers· 1[OrganizeWeb2025]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[2].countergpt-5.2

    [counter to Camp C: Online adaptation beats one-shot offline search] Evidence is still sparse: few equal-compute head-to-head comparisons; online methods are highly sensitive to domain accounting, samplers, and loss noise, and can treat systematic error as domain difficulty [D42023][OrganizeWeb2025].

contestedc-d25023a2f1
Quality-first does not make ratios useless: within a filtered pool, the remaining degrees of freedom concentrate in scarce domains (code/math/multilingual) and stage control (curriculum), and "stronger filtering" cannot necessarily substitute for them [MetaLlama32024][MiniCPM2024].
Source papers· 2[MetaLlama32024][MiniCPM2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[3].countergpt-5.2

    [counter to Camp D: Ratios are second-order; quality is first-order] Quality-first does not mean ratios are useless: on filtered pools, remaining degrees of freedom concentrate in scarce domains (code/math/multilingual) and stage control (curriculum) [MetaLlama32024][MiniCPM2024].

consensusc-e71b3a1a14
With total compute fixed, the main value of RegMix-style / law-style offline methods is not "finding the global optimum" but extrapolating 2–3 rounds of ablations into a reusable local response surface; scale transfer is not a free lunch, though: the optimal mixture drifts with scale, so scale must enter as a conditioning variable or be recalibrated [RegMix2024][BiMix2024][AutoScale2024][Shukor2025OptimalMixtures] (sketch below).
Source papers· 4[RegMix2024][BiMix2024][AutoScale2024][Shukor2025OptimalMixtures]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.headline_claimsgpt-5.2

    Under a fixed total compute budget, the main value of RegMix/law-style offline methods is not “finding a global optimum” but turning 2–3 ablation rounds into a reusable local response surface; however, scale transfer is not free—optimal mixtures drift with scale, so scale must be a conditioning variable or get a second calibration [RegMix2024][BiMix2024][AutoScale2024][Shukor2025OptimalMixtures].
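A minimal sketch of the response-surface idea (a linear model in log-weights and the proxy numbers are illustrative, not RegMix's exact estimator):

    import numpy as np

    # Proxy runs: mixture weights (web, code, math) -> validation loss.
    W = np.array([[0.8, 0.1, 0.1], [0.6, 0.3, 0.1], [0.6, 0.1, 0.3],
                  [0.4, 0.3, 0.3], [0.7, 0.2, 0.1]])
    loss = np.array([2.31, 2.27, 2.29, 2.26, 2.28])

    # Fit a local response surface: loss ~ a + b · log(weights)
    X = np.hstack([np.ones((len(W), 1)), np.log(W)])
    coef, *_ = np.linalg.lstsq(X, loss, rcond=None)

    # Score many candidate mixtures on the surface; pick the predicted best.
    cands = np.random.dirichlet(np.ones(3), size=10_000)
    pred = np.hstack([np.ones((len(cands), 1)), np.log(cands)]) @ coef
    print("predicted-best mixture (web, code, math):", cands[pred.argmin()].round(2))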

consensusc-ceb5c339a5
Writing ratios as a curriculum (a function of time) matches public large-scale recipes better than a static vector: staged web-heavy training plus ~3–5× tail upsampling of code/math/reasoning amounts to raising the share of scarce gradient signal near the end of training, rather than sacrificing coverage throughout [MetaLlama32024].
Source papers· 1[MetaLlama32024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.headline_claimsgpt-5.2

    Expressing ratios as a curriculum (a time function) matches public large-scale recipes better than a static vector: phased web-heavy training plus ~3–5× late upweighting of code/math/reasoning effectively increases scarce gradient signal near the end of training instead of sacrificing coverage throughout [MetaLlama32024].

contestedc-4a494c410c
Formal methods are fragile to "coordinate drift": change the dedup, filtering, or bucketing and both the regression target and the meaning of the weights change; worse, scale transfer is unstable, and the proxy optimum can flip at larger scale [AutoScale2024]. With total compute fixed, heavy search also crowds out the main training budget, yielding "a better ratio, but a smaller model or shorter run."
Source papers· 1[AutoScale2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[0].countergpt-5.2

    [counter to Camp A: Formal search first (laws/regression/robust optimiza] The fragility is coordinate drift: change dedup/filtering/buckets and the regression target and weight semantics change. Scale transfer is also unstable—proxy optima can flip at larger scale [AutoScale2024]; under fixed total compute, heavy search also crowds out the main run.

contestedc-531e32741d
The ceiling for heuristics lies in interaction terms and budget planning: as domains multiply and objectives grow complex (multilingual + code + long context), a handful of ablations may miss the critical combinations; heuristics also lack a transferable account of "why this ratio," which hinders reuse across scales and data pools [Shukor2025OptimalMixtures].
Source papers· 1[Shukor2025OptimalMixtures]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[1].countergpt-5.2

    [counter to Camp B: Heuristics + curricula are more robust (a few ablati] The ceiling is interactions and budget planning: as domains and objectives grow (multilingual+code+long context), a few ablations may miss critical combinations. Heuristics also lack a transferable account of why a ratio works, hindering reuse across scales and pools [Shukor2025OptimalMixtures].

contestedc-de879538bb
The main risk of online methods is not "the algorithm is wrong" but engineering accounting and stability: they need low-latency statistics, interpretable domain signals, and non-divergence under distribution/bucket drift. Existing evidence does not directly settle whether they still pay off once proxy-stage cost, online overhead, and instability are all priced in [AutoScale2024].
Source papers· 1[AutoScale2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[2].countergpt-5.2

    [counter to Camp C: Online adaptation beats one-shot offline search] The main risk is not the algorithm but engineering accounting and stability: low-latency statistics, interpretable domain signals, and non-divergence under distribution/bucket drift; whether they still pay off once proxy cost, online overhead, and instability are priced in remains under-evidenced [AutoScale2024].

contestedc-83996e69bc
Demoting ratios to a "second-order effect" ignores scarce domains and capability targeting: filtering usually removes volume first, making code/math/multilingual even scarcer; at that point the remaining degrees of freedom sit mainly in ratios, repetition ceilings, and curriculum, not in ever-stronger filtering [MetaLlama32024][Petty2024CodePretrainingEffects][NemotronCC2024].
Source papers· 2[MetaLlama32024][NemotronCC2024]
3 observations · Data Mixture
Evidence (3)
  • topic_reportdata-mixture· report.positions[3].countergpt-5.2

    [counter to Camp D: Ratios are second-order; quality/selection is first-] Demoting ratios to “second-order” can ignore scarce domains and capability targeting: filtering often removes volume first, making code/math/multilingual even scarcer; the remaining levers are ratios, repetition ceilings, and curriculum [MetaLlama32024][Petty2024CodePretrainingEffects][NemotronCC2024].

  • topic_reportdata-mixture· report.positions[3].countergpt-5.2

    [counter to Camp D: Ratios are second-order; quality/selection is first-] The counterargument is that “cleaner” is not monotonic: over-filtering removes long-tail diversity useful for generalization; and for scarce domains (code/math/reasoning), the remaining levers are ratios and repetition caps, not stronger filtering.

  • topic_reportdata-mixture· report.positions[3].countergpt-5.2

    [counter to Camp D: Ratios are second-order; quality/selection is first-] Treating “cleaner” as a single objective ignores scarce domains and capability targeting: filtering reduces volume first, making code/math/multilingual domains scarcer still; ratios, repetition ceilings, and curriculum carry the remaining optimization room.

contestedc-4ab34cb780
The problem is that most of this evidence concentrates in highly constrained settings such as small models, code, children's stories, and textbook-style corpora; once the setting moves to general-purpose large models, real-data anchoring has still not been replaced [Phi3Report][Llama3Herd].
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[0].countergpt-5.4

    [counter to Camp A: synthetic-first can be the main route] The limitation is that most of this evidence comes from highly constrained settings such as small models, code, children's stories, and textbook-like corpora. Once the setting moves to general-purpose large models, real-data anchoring remains unreplaced [Phi3Report][Llama3Herd].

contestedc-ec450a766c
The counterargument: this may only mean industry has not yet pushed synthetic data hard enough, or that public reporting is conservative and does not expose internal recipes with higher synthetic ratios [Phi1Textbooks2023][Nemotron42024].
Source papers· 2[Phi1Textbooks2023][Nemotron42024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[1].countergpt-5.4

    [counter to Camp B: a web-heavy backbone plus synthetic mid-train is the] The counterargument is that this may only reflect incomplete synthetic pipelines in current industry practice, or conservative public reporting that does not reveal higher-synthetic-ratio internal recipes [Phi1Textbooks2023][Nemotron42024].

consensusc-df5e2e2006
RoPE extrapolation fails mainly in the low-frequency dimensions: when the base is too small, low-frequency phase coverage is insufficient at target lengths, making far-apart positions inseparable; this yields an upper-bound account of "base ↔ learnable effective length" [Xu2024RoPEBaseBounds].
Source papers· 1[Xu2024RoPEBaseBounds]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    RoPE extrapolation fails primarily in low-frequency dims: with too small a base, low-frequency phase coverage is insufficient at target lengths, making far positions inseparable; this yields a base↔learnable-effective-length bound [Xu2024RoPEBaseBounds].

consensusc-793089dcfd
PI's uniform compression pulls rotation angles back into the trained range, but it compresses the high frequencies too, which amounts to lowering local positional resolution; NTK-aware makes the point explicit: leave the high frequencies alone and rescale only the low ones [Chen2023PI][bloc972023NTK] (sketch below).
Source papers· 1[Chen2023PI]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    PI’s uniform rescaling brings rotation angles back into the trained range, but it also compresses high-frequency dims, effectively reducing local positional resolution; NTK-aware makes the mechanism explicit: keep high-freq nearly fixed and rescale only the low-frequency dims [Chen2023PI][bloc972023NTK].
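A minimal sketch of the mechanical difference between PI and NTK-aware scaling (standard RoPE frequency layout; scale=4 means a 4× window extension):

    import numpy as np

    def rope_inv_freq(dim, base=10_000.0):
        # One inverse frequency per channel pair, from high to low frequency.
        return base ** (-np.arange(0, dim, 2) / dim)

    def pi_inv_freq(dim, scale, base=10_000.0):
        # Position Interpolation: compress ALL dims uniformly, so the
        # high-frequency (local-resolution) dims get blurred too.
        return rope_inv_freq(dim, base) / scale

    def ntk_aware_inv_freq(dim, scale, base=10_000.0):
        # NTK-aware: raise the base instead, so the highest frequency is
        # untouched while the lowest stretches by exactly 1/scale.
        return rope_inv_freq(dim, base * scale ** (dim / (dim - 2)))

    print(pi_inv_freq(128, 4)[0], ntk_aware_inv_freq(128, 4)[0])
    # 0.25 vs 1.0: PI blurs the highest-frequency dim, NTK leaves it intact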

consensusc-eff9aeff55
In the 32K–128K retrofit range, YaRN's per-dim ramp + attention temperature is the more robust default: it avoids high-frequency damage while correcting long-range logit entropy collapse, and it converges within ~400 fine-tuning steps [Peng2023YaRN].
Source papers· 1[Peng2023YaRN]
3 observations · Long Context Rope Ntk
Evidence (3)
  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    For 32K–128K retrofitting, YaRN’s per-dim ramp plus attention temperature is a more robust default: it avoids high-frequency damage while correcting long-range logit entropy collapse, and typically converges within ~400 fine-tune steps [Peng2023YaRN].

  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    For 32K–128K retrofits, YaRN is the more reliable default: per-dim ramp avoids globally compressing high-frequency dimensions, and attention temperature prevents entropy collapse of long-range attention logits; with similar fine-tuning budgets it is the safer choice [Peng2023YaRN].

  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    For 32K–128K retrofits, YaRN is the more reliable default: per-dim ramp interpolates mostly low-frequency dimensions while preserving high-frequency ones, and attention temperature stabilizes long-range logit entropy; PI’s global interpolation lacks both properties [Peng2023YaRN].

consensusc-1674a289e6
Once the target window is ≥512K, global scaling (the smooth PI/NTK/YaRN formulas) shows per-dim mismatch; LongRoPE searches a non-uniform per-dimension scale pattern, turning "choose one s" into "learn a vector," and gives a reproducible 2M recipe [Ding2024LongRoPE].
Source papers· 2[Ding2024LongRoPE]arXiv 2402.13753arxiv.org
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntkarXiv 2402.13753· report.headline_claimsgpt-5.2

    At ≥512K, global scaling (smooth PI/NTK/YaRN formulas) shows per-dim mismatch; LongRoPE turns “choose one s” into “learn a scale vector” via per-dim evolutionary search, with a reproducible 2M recipe [Ding2024LongRoPE].

consensusc-84f7eab808
For newly trained models, ABF (base ≈ 500000 + a short-to-long curriculum) causes less irreversible damage than "pretrain at base=10000, then retrofit": production reports disclose a 6-stage length curriculum and long-context benchmark curves, consistent with the base upper-bound analysis [Dubey2024Llama3][Xiong2023LongLlama][Xu2024RoPEBaseBounds].
Source papers· 3[Dubey2024Llama3][Xiong2023LongLlama][Xu2024RoPEBaseBounds]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    For new pretraining, ABF (base≈500000 + short-to-long curriculum) incurs less irreversible damage than “pretrain at base=10000 then retrofit”: production reports disclose a 6-stage length curriculum and long-context benchmark curves, consistent with the base upper-bound analysis [Dubey2024Llama3][Xiong2023LongLlama][Xu2024RoPEBaseBounds].

consensusc-a264229e00
Long-context evaluation must be task-centric: RULER/LongBench/L-Eval/LV-Eval can separate the "advertised window" from the "effective window," whereas PPL correlates poorly with retrieval, aggregation, and multi-hop tracing [Hsieh2024RULER][Bai2023LongBench][An2023LEval][Yuan2024LVEval].
Source papers· 5[Hsieh2024RULER][Bai2023LongBench][An2023LEval][Yuan2024LVEval]arXiv 2404.06654arxiv.org
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntkarXiv 2404.06654· report.headline_claimsgpt-5.2

    Long-context evaluation must be task-centric: RULER/LongBench/L-Eval/LV-Eval separate advertised from effective context, while perplexity correlates poorly with retrieval/aggregation/multi-hop tracing [Hsieh2024RULER][Bai2023LongBench][An2023LEval][Yuan2024LVEval].

contestedc-407de2fa9a
The counterargument: retrofitting is cheaper and more flexible, and YaRN/LongLoRA can already pull 4K to 128K within ~400 steps; for most teams, retraining is unrealistic [Peng2023YaRN][Chen2023LongLoRA].
Source papers· 2[Peng2023YaRN][Chen2023LongLoRA]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[0].countergpt-5.2

    [counter to Camp A: pretrain-time ABF + curriculum is the cleaner long-c] The counterargument: retrofitting is cheaper and more flexible, and YaRN/LongLoRA can extend 4K→128K within ~400 steps; full pretraining is unrealistic for many teams [Peng2023YaRN][Chen2023LongLoRA].

contestedc-e5c875079d
The counterargument: if all that is needed is a zero-shot window increase of ≤2×, NTK-aware's lightweight scaling is less work, and PI is the simplest to implement, carrying the lowest engineering risk [bloc972023NTK][Chen2023PI].
Source papers· 1[Chen2023PI]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[1].countergpt-5.2

    [counter to Camp B: YaRN is the de-facto standard for 32K–128K retrofitt] The counterargument: for zero-shot window increases ≤2×, NTK-aware scaling is simpler; PI is the simplest to implement and may carry lower engineering risk [bloc972023NTK][Chen2023PI].

contestedc-8e7396ea81
The counterargument: rather than running complex searches over RoPE, use systems and architecture to sidestep million-length attention costs (Ring Attention, sparse attention, external memory), delegating "long history" to retrieval or compression modules [Liu2023RingAttention][Mohtashami2023LandmarkAttention][Zhang2024ActivationBeacon].
Source papers· 3[Liu2023RingAttention][Mohtashami2023LandmarkAttention][Zhang2024ActivationBeacon]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[2].countergpt-5.2

    [counter to Camp C: ≥512K requires per-dim search (LongRoPE-like)] The counterargument: instead of complex RoPE searches, use systems/architectural approaches to avoid million-length attention costs (Ring Attention, sparse attention, external memory), delegating long history to retrieval or compression modules [Liu2023RingAttention][Mohtashami2023LandmarkAttention][Zhang2024ActivationBeacon].

contestedc-f3d252c861
The counterargument: these methods often conflate "can run longer" with "can use longer"; on RULER/LongBench-style tasks that require precise recall and cross-span reasoning, the capability trade-offs get amplified, and effective length still has to be made real [Hsieh2024RULER][Bai2023LongBench].
Source papers· 2[Hsieh2024RULER][Bai2023LongBench]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[3].countergpt-5.2

    [counter to Camp D: bypass the PI/NTK/YaRN lineage (ALiBi/RetNet/Mamba/m] The counterargument: these methods often conflate “can run longer” with “can use longer”. On RULER/LongBench-style tasks requiring precise recall and cross-span reasoning, capability trade-offs are amplified, and effective length must still be made real [Hsieh2024RULER][Bai2023LongBench].

consensusc-c3ff875285
3k–10k+ under-trained tokens are detectable across multiple released models; they are more likely to emit garbage fragments at deployment and can be repaired via Magikarp scanning plus a short continued pretrain [LandBartolo2024Magikarp].
Source papers· 2[LandBartolo2024Magikarp]arXiv 2405.05417arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2405.05417· report.headline_claimsgpt-5.2

    Across multiple released models, 3k–10k+ under-trained tokens can be detected; they are more likely to emit garbage at deployment and can be repaired by Magikarp scanning plus short continued pretraining [LandBartolo2024Magikarp].

consensusc-adb464d8c4
Per-token PPL is not comparable across tokenizers: shorter tokens mechanically lower PPL; a more robust comparison requires byte-normalized metrics (BPB / byte-level loss) [Gao2020Pile].
Source papers· 2[Gao2020Pile]arXiv 2101.00027arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2101.00027· report.headline_claimsgpt-5.2

    Per-token perplexity is not comparable across tokenizers: shorter tokens mechanically lower PPL; robust comparisons require byte-normalized metrics (BPB/byte-level loss) [Gao2020Pile].

contestedc-c373262af7
Existing public evidence covers feasibility and gains only up to 128K [Dubey2024Llama3]; for 256K+ there is no direct comparison under fixed compute with BPB normalization that also reports throughput/latency, which is the biggest current evidence gap. Meanwhile, under-trained tokens are a systemic risk that grows with long-tail sparsity: 3k–10k+ problem tokens in a model are not rare [LandBartolo2024Magikarp], and a larger vocab may amplify that risk. On top of this, merge lists can be reverse-engineered externally to infer the training mixture.
Source papers· 2[Dubey2024Llama3][LandBartolo2024Magikarp]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].countergpt-5.2

    [counter to Camp B: bigger vocab is always better; push to 256K+ by defa] Public evidence covers feasibility and gains up to 128K [Dubey2024Llama3], but lacks direct 256K+ comparisons under fixed compute with BPB normalization and measured throughput/latency; larger vocabs may also amplify under-trained-token risk [LandBartolo2024Magikarp].

contestedc-343cbe4554
Tokenizer-free "modeling-side" arguments are strengthening, but the default production route is still stuck on system cost: longer sequences push throughput, latency, and memory onto different curves, and much of the work emphasizes trainability rather than end-to-end throughput parity [Xue2021ByT5][Yu2023Megabyte][Pagnoni2024BLT]. Meanwhile, BPE's debts are not unmanageable: sampling bias can be corrected at inference time [Phan2024TokenizationBias], and under-trained tokens can be scanned with Magikarp and repaired with a short continued pretrain.
Source papers· 4[Xue2021ByT5][Yu2023Megabyte][Pagnoni2024BLT][Phan2024TokenizationBias]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].countergpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE as soon a] Tokenizer-free evidence on the modeling side is strengthening, but the default production path is still constrained by system cost: longer sequences shift throughput/latency/memory onto different curves; meanwhile BPE's debts stay manageable via inference-time bias correction and scan-plus-repair [Xue2021ByT5][Yu2023Megabyte][Pagnoni2024BLT][Phan2024TokenizationBias].

contestedc-48cd33b722
The main risk is rising process complexity: tokenizer versioning, BPB regression tests, and end-of-run repair add engineering load to the training pipeline; meanwhile, tokenizer-free structural innovations may reach throughput parity sooner in certain hardware or long-context settings [Yu2023Megabyte][Pagnoni2024BLT].
Source papers· 2[Yu2023Megabyte][Pagnoni2024BLT]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].countergpt-5.2

    [counter to Camp D: tokenizer is a pretrain product spec—make BPE right ] The main risk is pipeline complexity: tokenizer versioning, BPB regressions, and end-of-run repair add engineering overhead; and tokenizer-free architectural innovations may reach throughput parity sooner in some hardware/long-context settings [Yu2023Megabyte][Pagnoni2024BLT].

consensusc-84dffa7377
Compression (bytes/token) correlates with performance but is not sufficient: sizable differences can remain at equal compression, so evaluation must report BPB/byte-level loss together with task-level failure modes instead of just optimizing for "shorter tokens" [Goldman2024UnpackingTokenization][Gao2020Pile].
Source papers· 1[Gao2020Pile]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Compression (bytes/token) correlates with performance but is insufficient: sizable differences can remain at similar compression, so evaluation must report BPB/byte-level loss plus task-specific failure modes, not just “shorter tokens” [Goldman2024UnpackingTokenization][Gao2020Pile].

consensusc-155bf45208
A 128K vocab brings 0.02–0.04 nats lower loss in industrial-scale training and shortens sequences for non-English and code; on the inference side, the benefit shows up mainly as lower KV-cache pressure rather than extra FLOPs [Dubey2024Llama3].
Source papers· 1[Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    A 128K vocab yields 0.02–0.04 nats lower loss at industrial scale and reduces sequence length for non-English and code; the inference-side benefit is primarily lower KV-cache pressure rather than extra FLOPs [Dubey2024Llama3].

consensusc-d89fe32507
Local merge rules for digits/dates create reproducible 10–20 pp gaps (3–5 digit arithmetic, date reasoning) that persist in frontier evaluations; the claim that "bigger models automatically wash out tokenizer bias" lacks support [Singh2024TokenizationCounts][Bhatia2025DateFragments].
Source papers· 2[Singh2024TokenizationCounts][Bhatia2025DateFragments]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Local digit/date merge rules create reproducible 10–20 pp gaps (3–5 digit arithmetic, date reasoning) that persist in frontier evaluations; the claim that scale automatically washes out tokenizer bias lacks support [Singh2024TokenizationCounts][Bhatia2025DateFragments].

consensusc-f12f629031
Deployment risk comes not only from OOV but from under-trained tokens: 3k–10k+ problem tokens are detectable across models and repairable via scan + short continued pretrain; this should become a release gate for tokenizer/vocab changes [LandBartolo2024Magikarp].
Source papers· 1[LandBartolo2024Magikarp]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Deployment risk is not only OOV but under-trained tokens: 3k–10k+ problematic tokens are detectable across models and repairable via scan + short continued pretraining; this should be a release gate for tokenizer/vocab changes [LandBartolo2024Magikarp].

consensusc-edc614031f
Non-unique encodings let the same string map to multiple token sequences, so RL treats identical trajectories as different sequences, directly hurting reasoning; "shrinking the vocab / splitting long tokens" can therefore be an engineering option for alignment stability rather than a step backward [LiuEllis2026SayAnythingButThis][ClaudeMythos2025DynamicVocabPruning] (sketch below).
Source papers· 2[LiuEllis2026SayAnythingButThis][ClaudeMythos2025DynamicVocabPruning]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Non-unique encodings map the same string to multiple token sequences, causing RL to treat identical trajectories as different sequences and harming reasoning; thus vocab shrinkage / splitting long tokens can be an alignment-stability option rather than a regression [LiuEllis2026SayAnythingButThis][ClaudeMythos2025DynamicVocabPruning].
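A minimal sketch of the failure mode and the usual canonicalization check (an HF-style tokenizer API is assumed; special-token handling is simplified):

    def is_canonical(tokenizer, ids) -> bool:
        # True iff ids is exactly what the tokenizer itself would emit for
        # the decoded string, i.e. the canonical encoding. Non-canonical
        # sequences decode to the same text but look like distinct
        # trajectories to a token-level RL objective.
        text = tokenizer.decode(ids)
        return tokenizer.encode(text, add_special_tokens=False) == list(ids)

    # usage (hypothetical): drop or re-encode non-canonical sampled
    # trajectories before computing sequence-level rewards.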

contestedc-acce7b51cc
Fixed-compute tokenizer-only ablations directly rebut "the impact is negligible": the same 2.6B model and budget show 0.6–5.1 pp variance [Ali2024TokenizerChoice]. Compression does not explain the full difference either [Goldman2024UnpackingTokenization], which means the tokenizer's effect is not just I/O; it changes the learnable inductive biases and how training signal is allocated to the long tail.
Source papers· 3[Ali2024TokenizerChoice][Vieira2024Characters]arXiv 2402.01035arxiv.org
8 observations · Tokenizer Scaling
Evidence (8)
  • topic_reporttokenizer-scalingarXiv 2402.01035· report.positions[0].countergpt-5.2

    [counter to Camp A: tokenizer is frozen preprocessing; coverage is enoug] Fixed-compute tokenizer-only ablations directly contradict “negligible impact”: at the same 2.6B and budget, variance is 0.6–5.1 pp [Ali2024TokenizerChoice]. Compression alone does not explain the ranking [Goldman2024UnpackingTokenization].

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.positions[0].countergpt-5.4

    [counter to Camp A: the tokenizer is frozen preprocessing; coverage is e] Ali et al. [Ali2024TokenizerChoice] directly show 0.6–5.1 pp differences under fixed compute, which rules out “negligible impact.” Petrov et al. [Petrov2023Unfairness] further document tokenizer-driven unfairness across languages.

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.positions[0].counterep-20260214160829-csjmc

    [counter to Camp A: Tokenizer is frozen preprocessing; coverage is enoug] Ali et al. [Ali2024TokenizerChoice] found 0.6–5.1 pp downstream differences across 24 tokenizers with coverage above 99.9% under fixed 2.6B model and compute budgets, directly refuting the premise that coverage suffices.

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.positions[0].countergpt-5.2

    [counter to Camp A: tokenizer is frozen preprocessing; coverage is enoug] Ali et al. [Ali2024TokenizerChoice] show 0.6–5.1 pp variance under fixed compute, contradicting the “it averages out” premise; Vieira et al. [Vieira2024Characters] further show that token-level and character-level likelihoods can diverge.

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.positions[0].countergpt-5.2

    [counter to Camp A: tokenizer is frozen preprocessing; coverage is enoug] Fixed-compute evidence contradicts “negligible impact”: swapping only the tokenizer yields 0.6–5.1 pp deltas [Ali2024TokenizerChoice]. Also, per-token PPL is non-comparable across tokenizers [Gao2020Pile].

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.positions[0].countergpt-5.2

    [counter to Camp A: tokenizer is frozen preprocessing; coverage is enoug] Fixed-compute evidence directly contradicts “negligible”: swapping only the tokenizer yields 0.6–5.1 pp variance [Ali2024TokenizerChoice]. Compression does not fully explain the differences [Goldman2024UnpackingTokenization].

  • + 2 more
contestedc-c386e77d83
Two kinds of debt grow with vocab size: first, under-trained tokens (thousands to 10k+) bringing deployment garbage risk [LandBartolo2024Magikarp]; second, tail-token instability and non-unique encoding problems on the alignment/RL side [ClaudeMythos2025DynamicVocabPruning][LiuEllis2026SayAnythingButThis]. The public literature also still lacks a fixed-compute comparison across 32K/64K/128K/256K.
Source papers· 4[LandBartolo2024Magikarp][ClaudeMythos2025DynamicVocabPruning][LiuEllis2026SayAnythingButThis]arXiv 2405.05417arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2405.05417· report.positions[1].countergpt-5.2

    [counter to Camp B: bigger vocab is always better; default to 256K+] Two debts scale up with vocab size: (1) under-trained tokens (thousands to 10k+) causing deployment garbage risks [LandBartolo2024Magikarp]; (2) alignment/RL instability from tail tokens and non-unique encodings [ClaudeMythos2025DynamicVocabPruning][LiuEllis2026SayAnythingButThis].

contestedc-11161627d9
The system tax is the main source of resistance: significantly longer sequences amplify KV-cache and attention costs, so throughput/latency must be measured against the actual deployment stack; meanwhile, within the BPE route the tokenizer remains a regression-testable lever worth 0.6–5.1 pp at fixed compute [Ali2024TokenizerChoice]. The steadier engineering path is usually "get BPE right first (metrics, rules, debt repair), then introduce tokenizer-free as a scenario-specific option."
Source papers· 1[Ali2024TokenizerChoice]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].countergpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] The main blocker is system tax: much longer sequences amplify KV-cache and attention costs, so throughput/latency must be aligned via deployment-level measurement; meanwhile, within BPE the tokenizer stays a regression-testable lever worth 0.6–5.1 pp at fixed compute [Ali2024TokenizerChoice].

contestedc-8bfe3fd192
Public evidence is still mostly "mechanism/theory + observation," lacking controlled experiments that apply vocab shrink/prune to the same model during alignment (quantifying stability metrics, reward hacking, final capability, and system tax). Shrinking the vocab may also worsen sequence-length pressure for multilingual and code, so it has to be settled jointly with a Llama-3-style bytes/token ledger [Dubey2024Llama3].
Source papers· 1[Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].countergpt-5.2

    [counter to Camp E: counter-trend—shrink/prune vocab to buy alignment & ] Public evidence is still mostly “mechanism/theory + observation”, lacking controlled experiments that apply vocab shrink/prune during alignment on the same model (stability metrics, reward hacking, final capability, system tax); vocab shrinkage must also be settled against a bytes/token ledger [Dubey2024Llama3].

consensusc-0ba541d177
With model and training budget fixed, swapping only the tokenizer yields 0.6–5.1 pp downstream differences; treating the tokenizer as a constant misses a slice of variance on the same order as data-mixture choices [Ali2024TokenizerChoice].
Source papers· 2[Ali2024TokenizerChoice]arXiv 2402.01035arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsgpt-5.4

    Under fixed model size and training budget, swapping the tokenizer alone yields 0.6–5.1 pp downstream variance; treating it as constant leaves out a source of variation comparable to other major recipe choices [Ali2024TokenizerChoice].

consensusc-6bb9739569
At industrial scale, a 128K vocab can deliver roughly 0.02–0.04 nats lower loss, with the main systems gain coming from shorter sequences and lower KV-cache pressure rather than the extra embedding-matmul cost [Dubey2024Llama3][FlashAttention22023].
Source papers· 2[Dubey2024Llama3][FlashAttention22023]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.4

    At industrial scale, a 128K vocabulary can deliver about 0.02–0.04 nats lower loss, and the main systems gain comes from shorter sequences and lower KV-cache pressure rather than extra embedding-matmul cost [Dubey2024Llama3][FlashAttention2

consensusc-e7b7f62525
"Bigger vocab is always better" does not hold: local digit/date merges create reproducible 10–20 pp gaps, showing that vocabulary structure, before vocabulary size itself, determines reasoning distortion [Singh2024TokenizationCounts][Bhatia2025DateFragments][Nogueira2021Arithmetic].
Source papers · 4 · [Singh2024TokenizationCounts][Bhatia2025DateFragments][Nogueira2021Arithmetic] · arXiv 2402.14903 · arxiv.org
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_reporttokenizer-scalingarXiv 2402.14903· report.headline_claimsgpt-5.4

    "Bigger vocab is always better" is false: local digit/date merges create reproducible 10–20 pp gaps, showing that vocabulary structure can dominate vocabulary size in reasoning failures [Singh2024TokenizationCounts][Bhatia2025DateFragments]

  • topic_reporttokenizer-scalingarXiv 2402.14903· report.headline_claimsgpt-5.2

    “Bigger vocab is better” is not monotonic: local digit/date merges can create 10–20 pp gaps on 3–5 digit carry-sensitive arithmetic and date reasoning [Singh2024TokenizationCounts][Bhatia2025DateFragments], making these tasks high-sensitivi

consensusc-f246a5fd2a
Per-token PPL cannot compare tokenizers: shorter or longer tokens directly rewrite the denominator, so at minimum report BPB or character-string likelihood, otherwise tokenization changes get misread as model improvements [Gao2020Pile][Vieira2024Characters]. (A minimal numeric sketch follows this entry.)
Source papers · 3 · [Gao2020Pile][Vieira2024Characters] · arXiv 2101.00027 · arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2101.00027· report.headline_claimsgpt-5.4

    Per-token perplexity cannot compare tokenizers; changing token length changes the denominator, so BPB or character-string likelihood is the minimum requirement, otherwise tokenizer changes get misread as model improvements [Gao2020Pile][Vie
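To make the denominator point concrete, a minimal numeric sketch (toy NLL and token counts, assumed purely for illustration):

```python
# Two tokenizers over the same text and the same total NLL: per-token PPL
# diverges because its denominator is tokenizer-dependent; bits-per-byte does not.
import math

def per_token_ppl(total_nll_nats: float, n_tokens: int) -> float:
    # PPL = exp(NLL / #tokens): the denominator depends on the tokenizer.
    return math.exp(total_nll_nats / n_tokens)

def bits_per_byte(total_nll_nats: float, n_bytes: int) -> float:
    # BPB = NLL / (ln 2 * #bytes): the denominator is tokenizer-independent.
    return total_nll_nats / (math.log(2) * n_bytes)

text_bytes = 1000      # same evaluation text for both tokenizers
total_nll = 520.0      # suppose both models assign the same total NLL (nats)

coarse = per_token_ppl(total_nll, n_tokens=230)   # bigger vocab, fewer tokens
fine   = per_token_ppl(total_nll, n_tokens=410)   # smaller vocab, more tokens
print(coarse, fine)                               # wildly different "PPL"
print(bits_per_byte(total_nll, text_bytes))       # identical BPB for both
```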

consensusc-8fe1110b0c
Vocabulary design is not a one-shot decision: post-deployment scans routinely surface 3k–10k+ under-trained tokens, and the RL stage adds further stability debt through tail tokens and non-unique encodings [LandBartolo2024Magikarp][LiuEllis2026SayAnythingButThis][ClaudeMythos2025DynamicVocabPruning]. (A scanning sketch follows this entry.)
Source papers · 4 · [LandBartolo2024Magikarp][LiuEllis2026SayAnythingButThis][ClaudeMythos2025DynamicVocabPruning] · arXiv 2405.05417 · arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2405.05417· report.headline_claimsgpt-5.4

    Vocabulary design is not a one-shot decision: deployment often reveals 3k–10k+ under-trained tokens, and RL adds extra stability debt through tail tokens and non-unique encodings [LandBartolo2024Magikarp][LiuEllis2026SayAnythingButThis][Cla
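A hedged sketch of what a deployment-time scan can look like; the embedding-norm heuristic and the 0.3 threshold are assumptions in the spirit of [LandBartolo2024Magikarp], not the paper's exact detector:

```python
# Flag candidate under-trained tokens whose input-embedding norm sits far
# below the vocabulary median; candidates then go to prompting tests / repair.
import numpy as np

def scan_undertrained(embedding: np.ndarray, rel_threshold: float = 0.3):
    """embedding: [vocab_size, d_model] input-embedding matrix."""
    norms = np.linalg.norm(embedding, axis=1)
    cutoff = rel_threshold * np.median(norms)
    return np.nonzero(norms < cutoff)[0]  # candidate token ids to verify

rng = np.random.default_rng(0)
emb = rng.normal(size=(32_000, 512))
emb[100:140] *= 0.05                      # simulate never-updated tail tokens
print(scan_undertrained(emb)[:10])        # ids worth a short continued-pretrain fix
```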

contestedc-1a091dd2e4
Singh et al. [Singh2024TokenizationCounts], Bhatia et al. [Bhatia2025DateFragments], and Liu and Ellis [LiuEllis2026SayAnythingButThis] show that local merges, date fragments, and non-unique encodings make larger vocabularies pay a price in reasoning and RL. Land and Bartolo [LandBartolo2024Magikarp] …
Source papers · 4 · [Singh2024TokenizationCounts][Bhatia2025DateFragments][LiuEllis2026SayAnythingButThis] · arXiv 2402.14903 · arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2402.14903· report.positions[1].countergpt-5.4

    [counter to Camp B: bigger vocab is always better; default to 256K+] Singh et al. [Singh2024TokenizationCounts], Bhatia et al. [Bhatia2025DateFragments], and Liu and Ellis [LiuEllis2026SayAnythingButThis] show that local merges, date fragme

contestedc-6bf4cd1d71
The counter is not that these routes are impossible, but that they have not yet become a full replacement under the same compute and deployment constraints. Dubey et al. [Dubey2024Llama3] show that within the existing industrial Transformer stack, a 128K vocab already buys systems gains at very low migration cost; the byte route still needs stronger architectural support [Wang2024MambaByte][Pagnoni2024BLT].
Source papers · 3 · [Dubey2024Llama3][Wang2024MambaByte][Pagnoni2024BLT]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].countergpt-5.4

    [counter to Camp C: tokenizer-free is the endgame; BPE should be abandon] The counterargument is not that these routes are impossible, but that they are not yet a full replacement under the same compute and deployment constraints. Dubey et

contestedc-312f165f31
The counter points out that shrinking the vocabulary lengthens sequences, increases KV-cache pressure, and may forfeit multilingual and code compression gains [Dubey2024Llama3]. Without an explicit RL or alignment objective, premature tail pruning may simply push system costs upward.
Source papers · 1 · [Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].countergpt-5.4

    [counter to Camp E: shrink or prune the vocabulary to buy alignment and ] The counterargument is that shrinking the vocabulary lengthens sequences, increases KV-cache pressure, and may give up multilingual and code compression gains [Dubey2

consensusc-ea876e12f6
Under a fixed compute budget, downstream performance differences across vocabularies reach 0.6–5.1 pp, exceeding the gains from most architecture adjustments at the same budget [Ali2024TokenizerChoice].
Source papers · 2 · [Ali2024TokenizerChoice] · arXiv 2402.01035 · arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsep-20260214160829-csjmc

    Under fixed compute budgets, downstream performance gaps across different tokenizers reach 0.6–5.1 pp, exceeding gains from most architecture adjustments at the same budget [Ali2024TokenizerChoice]

consensusc-aeb586cbc2
A 128K vocabulary lowers pretraining loss by 0.02–0.04 nats versus a 32K vocabulary, while shortening non-English and code sequences by 15%–22% on average, reducing inference-time KV-cache footprint and latency [Dubey2024Llama3]. (A back-of-envelope calculator follows this entry.)
Source papers · 1 · [Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsep-20260214160829-csjmc

    A 128K vocabulary reduces pretraining loss by 0.02–0.04 nats compared to a 32K vocabulary, while shortening non-English and code sequences by 15%–22% on average, reducing inference KV cache footprint and latency [Dubey2024Llama3]
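A back-of-envelope sketch of the trade this claim describes; the model shapes and request length are assumed for illustration, not taken from [Dubey2024Llama3]:

```python
# Extra embedding parameters of 32K -> 128K vs. KV bytes saved per request
# when the same text tokenizes 15-22% shorter.
D_MODEL, LAYERS, KV_HEADS, HEAD_DIM = 4096, 32, 8, 128
BYTES = 2  # bf16

def extra_embedding_params(v_old=32_000, v_new=128_000, tied=True):
    per_matrix = (v_new - v_old) * D_MODEL
    return per_matrix if tied else 2 * per_matrix  # untied: input + output matrices

def kv_bytes(tokens: int) -> int:
    return tokens * 2 * LAYERS * KV_HEADS * HEAD_DIM * BYTES  # K and V caches

old_tokens = 100_000  # one long request tokenized under the 32K vocab
for shrink in (0.15, 0.22):
    saved = kv_bytes(old_tokens) - kv_bytes(int(old_tokens * (1 - shrink)))
    print(f"{shrink:.0%} shorter sequence -> {saved / 2**30:.2f} GiB KV saved")
print(f"extra embedding params: {extra_embedding_params() / 1e6:.0f}M (tied)")
```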

consensusc-dcb01df6a1
Ill-chosen BPE merges for numbers and dates degrade arithmetic and temporal reasoning by 10–20 pp, an effect that persists from 7B to 405B parameters [Singh2024TokenizationCounts][Bhatia2025DateFragments]. (A toy segmentation example follows this entry.)
Source papers · 3 · [Singh2024TokenizationCounts][Bhatia2025DateFragments] · arXiv 2402.14903 · arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2402.14903· report.headline_claimsep-20260214160829-csjmc

    Unreasonable BPE merges for numbers and dates reduce arithmetic and temporal reasoning performance by 10–20 pp, an effect consistent across 7B to 405B parameter scales [Singh2024TokenizationCounts][Bhatia2025DateFragments]
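A toy example of the merge-direction issue; the chunk size and example number are illustrative, not the papers' experimental setup:

```python
# Left-to-right vs right-to-left 3-digit chunking puts place values on
# different token boundaries, so carries in addition cross token boundaries
# differently - the structural failure mode behind digit-merge gaps.
def chunk_digits(number: str, right_to_left: bool, size: int = 3):
    if right_to_left:
        rev = number[::-1]
        return [rev[i:i+size][::-1] for i in range(0, len(rev), size)][::-1]
    return [number[i:i+size] for i in range(0, len(number), size)]

n = "1234567"
print(chunk_digits(n, right_to_left=False))  # ['123', '456', '7']
print(chunk_digits(n, right_to_left=True))   # ['1', '234', '567'] - aligned
                                             # with thousands groups and carries
```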

consensusc-ac62c43290
Mainstream open- and closed-source LLMs commonly carry 3k–10k+ under-trained tail tokens; 10B–50B tokens of continued pretraining can repair over 90% of the abnormal-output issues [LandBartolo2024Magikarp].
Source papers · 2 · [LandBartolo2024Magikarp] · arXiv 2405.05417 · arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2405.05417· report.headline_claimsep-20260214160829-csjmc

    Mainstream open-source and closed-source LLMs universally have 3k–10k+ under-trained tail tokens, and over 90% of abnormal output issues can be repaired via 10B–50B tokens of continued pretraining [LandBartolo2024Magikarp]

contestedc-e30b1029d3
Moving from a 128K to a 256K vocabulary, the marginal pretraining-loss improvement shrinks to under 0.01 nats, while embedding-layer parameters double, inference KV-cache access latency rises 2%–5%, and the count of under-trained tail tokens doubles to 11k+ [LandBartolo2024Magikarp][Dubey2024Llama3].
Source papers · 2 · [LandBartolo2024Magikarp][Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].counterep-20260214160829-csjmc

    [counter to Camp B: Bigger vocab is always better; default to 256K+] Compared to 128K vocabularies, 256K vocabularies have diminishing pretraining loss improvements below 0.01 nats, while embedding layer parameters double, inference KV cach

contestedc-5bacd3e4a4
Under the same compute budget, current byte-level models show 0.12–0.18 nats higher pretraining loss, 8–12 pp lower downstream performance, and over 3× higher inference latency than 128K BPE-vocabulary models [Vieira2024Characters][Gao2020Pile].
Source papers · 2 · [Vieira2024Characters][Gao2020Pile]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[2].counterep-20260214160829-csjmc

    [counter to Camp C: Tokenizer-free is the endgame; BPE should be abandon] Current byte-level models have 0.12–0.18 nats higher pretraining loss, 8–12 pp lower downstream performance, and 3x higher inference latency than 128K BPE tokenizer m

contestedc-b55448b039
Indiscriminate vocabulary pruning raises output error rates on multilingual and niche-domain terms by 3–7 pp while improving RL stability by only 2%–3%, a worse trade than repairing tail tokens via continued pretraining [LandBartolo2024Magikarp][Hayase2024DataMixtureInference].
Source papers · 2 · [LandBartolo2024Magikarp][Hayase2024DataMixtureInference]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].counterep-20260214160829-csjmc

    [counter to Camp D: Shrink or prune the vocabulary to buy alignment and ] Indiscriminate vocabulary pruning increases output error rates of multilingual and niche domain terms by 3–7 pp, and only improves RL stability by 2%–3%, with lower b

consensusc-580047c83c
Under fixed compute, the tokenizer is not a second-order factor: with the same 2.6B model and budget, swapping among 24 tokenizers yields 0.6–5.1 pp downstream differences [Ali2024TokenizerChoice], so "coverage is enough" cannot serve as the default engineering assumption.
Source papers · 2 · [Ali2024TokenizerChoice] · arXiv 2402.01035 · arxiv.org
5 observations · Tokenizer Scaling
Evidence (5)
  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsgpt-5.2

    Under fixed compute, the tokenizer is not second-order: with the same 2.6B model and budget, swapping among 24 tokenizers yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice], so “coverage is enough” is not a safe default assumpti

  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Under fixed compute, tokenization is not a minor “compression tweak”: at 2.6B scale with the same budget, swapping only the tokenizer yields 0.6–5.1 pp downstream variance, not explained by coverage or average token length alone [Ali2024Tok

  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Under fixed compute, swapping only the tokenizer yields 0.6–5.1 pp downstream variance (2.6B model, 24 tokenizers), comparable to typical data-mixture changes; therefore ‘tokenization is negligible’ is an experimentally falsifiable default

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsgpt-5.2

    Under fixed-compute controlled pretraining, with the same 2.6B model and budget, swapping only the tokenizer yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice]; treating tokenization as a “second-order detail” misses regression

  • topic_reporttokenizer-scalingarXiv 2402.01035· report.headline_claimsgpt-5.2

    Under fixed compute, tokenization is not second-order: with the same 2.6B model and budget, swapping only the tokenizer yields 0.6–5.1 pp downstream variance [Ali2024TokenizerChoice], comparable to typical data-mixture changes.

consensusc-5975ebc20c
Deployment-time vocabulary problems are governable engineering debt: multiple models carry 3k–10k+ under-trained tokens that can be located by scanning and repaired with a short continued-pretraining run [LandBartolo2024Magikarp]; tokenizers therefore need a post-training process rather than a one-time sign-off.
Source papers · 2 · [LandBartolo2024Magikarp] · arXiv 2405.05417 · arxiv.org
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_reporttokenizer-scalingarXiv 2405.05417· report.headline_claimsgpt-5.2

    Deployment-time vocab issues are governable engineering debt: multiple models contain 3k–10k+ under-trained tokens, detectable via scanning and repairable with short continued pretraining [LandBartolo2024Magikarp]; tokenizers therefore need

  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Tokenizers create post-training debt: multiple models contain 3k–10k+ under-trained tail tokens that can be detected via scanning and repaired with short continued pretraining; vocabularies need post-deployment governance [LandBartolo2024Ma

consensusc-7ba57e8465
Per-token PPL is not comparable across tokenizers: the model is a distribution over token strings, while application semantics live over character strings; changing the tokenizer changes both the denominator and the reachable string set, so BPB or character-string likelihood should be the primary metric [Vieira2024Characters][Gastaldi2024FoundationsTokenization].
Source papers · 2 · [Vieira2024Characters] · arXiv 2412.03719 · arxiv.org
3 observations · Tokenizer Scaling
Evidence (3)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Per-token PPL is not comparable across tokenizers: models define distributions over token strings while application semantics live over character strings; changing tokenizers changes the denominator and reachable string set, so BPB or chara

  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Per-token PPL is mathematically non-comparable across tokenizers: LMs define distributions over token strings, and tokenizers change the denominator and reachable sequence set; more stable aligned metrics are BPB, character-string likelihoo

  • topic_reporttokenizer-scalingarXiv 2412.03719· report.headline_claimsgpt-5.2

    Per-token PPL is not comparable across tokenizers because models define distributions over token strings while application semantics live on character strings [Vieira2024Characters]; BPB or exact byte-level probabilities align denominators

contestedc-080f3f5769
Current evidence mainly covers up to 128K and provides no fixed-compute monotonic curves for 256K+; meanwhile non-monotonic risk is reproducible on digits/dates, where local merges create 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments]. Larger vocabs also amplify the governance cost of the under-trained tail [LandBartolo2024Magikarp] and may amplify policy divergence in RL [ClaudeMythos2025DynamicVocabPruning].
Source papers · 3 · [Singh2024TokenizationCounts][Bhatia2025DateFragments][LandBartolo2024Magikarp]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].countergpt-5.2

    [counter to Camp B: bigger vocab is always better; default to 256K+] Evidence is strong up to 128K but does not yet provide monotonic fixed-compute curves for 256K+. Non-monotonic risks are reproducible for digits/dates: local merges can cr

contestedc-60af290a45
Tokenizer-free's main cost is substantially longer sequences, shifting the system bottleneck from vocabulary matrices to long-sequence modeling and throughput [Wang2024MambaByte]; in today's industrial Transformer serving stacks this usually means higher KV-cache and attention costs. Meanwhile, not every subword problem requires "abandoning it outright": bad tail tokens can be scanned and repaired [LandBartolo2024Magikarp], and the RL stage can be mitigated with safe vocabularies/pruning.
Source papers · 3 · [Wang2024MambaByte][LandBartolo2024Magikarp] · arXiv 2401.13660 · arxiv.org
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2401.13660· report.positions[2].countergpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Tokenizer-free’s main cost is much longer sequences, shifting bottlenecks from vocab matrices to long-sequence modeling and throughput [Wang2024MambaByte]; in today’s Tran

contestedc-b0511fa14e
Tail pruning carries expressivity and compatibility costs: multilingual and code data often depend on long-tail symbols and rare fragments, and over-pruning can lengthen sequences or push information back down to the byte level, offsetting the systems gain [Dubey2024Llama3]. The safer approach treats pruning as a staged post-training policy, monitoring for regressions with character-level metrics and multilingual/code bytes-per-token [Vieira2024Characters][Dubey2024Llama3].
Source papers · 2 · [Dubey2024Llama3][Vieira2024Characters]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].countergpt-5.2

    [counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Pruning has expressivity and compatibility costs: multilingual and code often rely on long-tail symbols and rare fragments; over-pruning can lengthen sequences or pus

consensusc-15b763ce96
In decode, once KV-cache reads/writes dominate per-token generation time (typical for long context / large batch), cutting KV heads from H to 8/4/1 (GQA/MQA) or re-parameterizing KV as a low-rank latent (MLA) shrinks bytes/token roughly proportionally; these structural changes hit the bottleneck more directly than a faster attention kernel. [Shazeer2019MQA][Ainslie2023GQA][DeepSeek2024V2] (A bytes/token sketch follows this entry.)
Source papers · 3 · [Shazeer2019MQA][Ainslie2023GQA][DeepSeek2024V2]
2 observations · Cuda Kernel Pretrain
Evidence (2)
  • topic_reportcuda-kernel-pretrain· report.headline_claimsgpt-5.2

    In decode, once KV-cache traffic dominates token time (typical for long context / large batch), reducing KV heads from H to 8/4/1 (GQA/MQA) or re-parameterizing KV as a low-rank latent (MLA) shrinks bytes/token roughly proportionally; these

  • topic_reportcuda-kernel-pretrain· report.headline_claimsgpt-5.2

    When KV-cache traffic dominates decode time, reducing KV heads from H to 1/4/8 (MQA/GQA) or re-parameterizing KV as low-rank latents (MLA) approximately scales down bytes/token proportionally; these structural moves hit memory-bound bottlen
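A minimal bytes/token sketch under assumed Llama-like shapes (the layer count, head_dim, and MLA latent width are illustrative guesses, not any model card):

```python
# Decode-time KV bytes per token under MHA vs GQA vs MQA, plus an MLA-style
# low-rank latent: bytes/token scales roughly with the number of KV heads.
LAYERS, HEAD_DIM, BYTES = 32, 128, 2  # bf16 K and V

def kv_bytes_per_token(kv_heads: int) -> int:
    return 2 * LAYERS * kv_heads * HEAD_DIM * BYTES  # K and V caches

def mla_bytes_per_token(latent_dim: int = 512) -> int:
    return LAYERS * latent_dim * BYTES  # one compressed latent replaces K/V

for h in (32, 8, 4, 1):  # MHA -> GQA(8) -> GQA(4) -> MQA
    print(f"kv_heads={h:2d}: {kv_bytes_per_token(h):7d} B/token")
print(f"MLA latent : {mla_bytes_per_token():7d} B/token")
```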

consensusc-0e7fe932f7
FlashAttention rewrites attention's dominant cost from "HBM round-trips of an N×N intermediate" into "SMEM tiling + online softmax", lengthening usable context and raising throughput at the same precision; but numerical deviations in fused attention can leak across steps into loss spikes, making kernel choice a training-stability variable. [Dao2022FlashAttention][Golden2024FAStability] (An online-softmax sketch follows this entry.)
Source papers · 2 · [Dao2022FlashAttention][Golden2024FAStability]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.headline_claimsgpt-5.2

    FlashAttention rewrites attention cost from “HBM round-trips of an N×N intermediate” to “SMEM tiling + online softmax,” enabling longer usable context and higher throughput at the same precision; however, numeric deviations in fused attenti
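A minimal numpy sketch of the online-softmax recurrence that lets the N×N score matrix stay un-materialized (single query vector, tile size illustrative):

```python
# Process score tiles one at a time, keeping running (max, normalizer,
# weighted-V) statistics; the result matches full-softmax attention exactly.
import numpy as np

def attention_online(q, K, V, tile=64):
    m, l, acc = -np.inf, 0.0, np.zeros(V.shape[1])
    for i in range(0, len(K), tile):
        s = K[i:i+tile] @ q                      # scores for this tile only
        m_new = max(m, s.max())
        scale = np.exp(m - m_new)                # rescale old statistics
        p = np.exp(s - m_new)
        l = l * scale + p.sum()
        acc = acc * scale + p @ V[i:i+tile]
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
q, K, V = rng.normal(size=8), rng.normal(size=(256, 8)), rng.normal(size=(256, 8))
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(attention_online(q, K, V), ref)
```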

consensusc-e1cae97f92
FP8 training stability behaves more like a systems problem of "scale-update rules + outlier distribution": drift that only surfaces at trillion-token scale shows small-scale ablations are insufficient; MXFP8's per-block scaling, interacting with FP32 masters/optimizer state, collapses the problem into a reproducible recipe. [Fishman2024FP8Scale][Mishra2025MXFP8Recipes][Sun2024MassiveActivations] (A per-block scaling sketch follows this entry.)
Source papers · 3 · [Fishman2024FP8Scale][Mishra2025MXFP8Recipes][Sun2024MassiveActivations]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.headline_claimsgpt-5.2

    FP8 stability behaves like a system problem of “scale update rules + outlier distribution”: drift that only appears at trillion-token scale implies small-scale ablations are insufficient; MXFP8 per-block scaling plus FP32 masters/optimizer
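A hedged sketch of per-block scaling in the MXFP8 style; the block size 32 and the E4M3 range are the commonly cited values, while everything else here is illustrative rather than any production recipe:

```python
# One power-of-two scale per 32-element block keeps a single massive-activation
# outlier from wiping out the quantization resolution of the whole tensor.
import numpy as np

FP8_MAX = 448.0        # e4m3 max normal magnitude
BLOCK = 32             # MX block size

def quantize_blocks(x: np.ndarray):
    """x: flat fp32 tensor, length a multiple of BLOCK."""
    blocks = x.reshape(-1, BLOCK)
    scales = 2.0 ** np.ceil(np.log2(np.abs(blocks).max(axis=1) / FP8_MAX + 1e-30))
    q = np.clip(blocks / scales[:, None], -FP8_MAX, FP8_MAX)  # cast to fp8 here
    return q, scales

x = np.random.default_rng(0).normal(size=4096).astype(np.float32)
x[7] = 3000.0                      # a massive-activation outlier
q, s = quantize_blocks(x)
print(s.min(), s.max())            # only the outlier's block gets a large scale
```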

consensusc-d81222447b
Turning kernel optimization from "stack more fusion" into "roofline first" reduces wasted iterations: use arithmetic intensity to classify a kernel as memory-, compute-, or latency-bound, then pick tiling, parallel decomposition, or a structural rewrite; Hopper microbenchmark dissections provide realistic calibration of the roofline parameters. [Williams2008Roofline][Luo2024HopperDissect] (A roofline classifier sketch follows this entry.)
Source papers · 2 · [Williams2008Roofline][Luo2024HopperDissect]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.headline_claimsgpt-5.2

    Replacing “fusion-first” with “roofline-first” reduces wasted iterations: use arithmetic intensity to classify memory/compute/latency bound, then choose tiling, parallel decomposition, or structural rewrites; Hopper microbenchmark dissectio
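A roofline-first triage sketch; the peak numbers below are assumed round H100-like values for illustration, not vendor-calibrated figures:

```python
# Classify a kernel by arithmetic intensity relative to the roofline ridge.
PEAK_FLOPS = 989e12   # ~bf16 tensor-core peak, FLOP/s (assumption)
PEAK_BW    = 3.35e12  # ~HBM bandwidth, B/s (assumption)
RIDGE      = PEAK_FLOPS / PEAK_BW  # FLOP/B where the roofline bends (~295)

def classify(flops: float, bytes_moved: float) -> str:
    ai = flops / bytes_moved  # arithmetic intensity
    return f"AI={ai:8.1f} FLOP/B -> " + (
        "memory-bound (tile / fuse / shrink bytes)" if ai < RIDGE
        else "compute-bound (feed the MMA pipes)")

# GEMM: 2MNK FLOPs vs (MK + KN + MN) * 2 bytes in bf16
M = N = K = 8192
print(classify(2 * M * N * K, 2 * (M * K + K * N + M * N)))
# One attention decode step: very few FLOPs per byte of KV read
print(classify(2 * 4096 * 128, 4096 * 128 * 2))
```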

contestedc-9aa9e5033b
The counter says: frameworks/compilers can already lower operator graphs into efficient kernels, so algorithm teams need not go low-level; moreover, plenty of effective architecture changes (e.g., position interpolation) do not depend on kernel rewrites.
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[0].countergpt-5.2

    [counter to Camp A: kernels and algorithms must be co-designed (decide b] The counterargument is that frameworks/compilers can lower graphs into efficient kernels so algorithm teams need not go low-level; and some effective architecture cha

contestedc-98497c43d1
Two facts push back: first, the most critical throughput wins often come from specialized kernels that change dataflow/scheduling (FlashAttention-3), not from generic lowering; second, kernel numerics affect training stability and cannot be hidden behind graph-level abstractions. [Shah2024FA3][Golden2024FAStability]
Source papers · 2 · [Shah2024FA3][Golden2024FAStability]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[1].countergpt-5.2

    [counter to Camp B: PyTorch/graph-compiler level is enough; handwritten ] Two facts push back: (1) the biggest throughput wins often come from specialized kernels that change dataflow/scheduling (FlashAttention-3), not generic lowering; (2)

contestedc-87f87c3c3f
The FP8 evidence points the other way: drift that only appears at trillion-token scale shows "short-horizon stability" does not extrapolate; and MXFP8 recipes codify per-block scaling and optimizer interactions into a spec, implying the numerical contract must be adapted explicitly. [Fishman2024FP8Scale][Mishra2025MXFP8Recipes]
Source papers · 2 · [Fishman2024FP8Scale][Mishra2025MXFP8Recipes]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[2].countergpt-5.2

    [counter to Camp C: hardware will get faster; algorithms need not adapt ] FP8 evidence points the other way: drift that appears only at trillion-token scale shows “short-horizon stability” does not extrapolate; MXFP8 recipes codify per-bloc

contestedc-cd948f4593
Current evidence is more "formats/specs emerging" than "end-to-end critical training/inference kernels matching throughput and stability". For example, FA3 bakes Hopper features into the scheduling layer, and MXFP8 recipes are tightly coupled to Blackwell implementation details; in the near term, portability usually costs throughput or stability. [Shah2024FA3][Mishra2025MXFP8Recipes]
Source papers · 2 · [Shah2024FA3][Mishra2025MXFP8Recipes]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[3].countergpt-5.2

    [counter to Camp D: non-NVIDIA hardware will catch up; CUDA ecosystem wi] Current evidence is more about “formats/specs emerging” than “end-to-end critical kernels matching throughput and stability.” FA3 bakes Hopper features into schedulin

consensusc-7171322504
Within an 8-GPU NVLink domain, TP's per-layer all-reduces are frequent enough to dominate end-to-end throughput; stretching TP across IB usually loses more than "using less PP", so the more robust engineering default is TP≤8 pinned to NVLink, PP across IB, and DP/FSDP across pods [Shoeybi2019Megatron][Narayanan2021PTD][Jiang2024MegaScale].
Source papers · 3 · [Shoeybi2019Megatron][Narayanan2021PTD][Jiang2024MegaScale]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.headline_claimsgpt-5.2

    Within an 8-GPU NVLink island, TP’s per-layer all-reduce is frequent enough to dominate end-to-end throughput; placing TP across IB is often worse than “using less PP”. A more robust default is TP≤8 on NVLink, PP across IB, DP/FSDP across p

consensusc-20f14524fe
When the PP stage count P is large and the microbatch count M is memory-limited, the bubble is not a second-order term: zero-bubble 1F1B splits the backward pass (input-grad / weight-grad) to drive the theoretical bubble to 0 (ZB-H2), typically recovering about 2–5 MFU points on the same mesh [Qi2023ZeroBubble][Narayanan2021PTD]. (A bubble-fraction sketch follows this entry.)
Source papers · 2 · [Qi2023ZeroBubble][Narayanan2021PTD]
2 observations · 4d Parallelism Megatron
Evidence (2)
  • topic_report4d-parallelism-megatron· report.headline_claimsgpt-5.2

    With large PP stage count P and memory-limited microbatches M, bubbles are not second-order: zero-bubble 1F1B splits backward into input-grad/weight-grad and drives theoretical bubble to 0 (ZB-H2), typically recovering ~2–5 MFU points under

  • topic_report4d-parallelism-megatron· report.headline_claimsgpt-5.2

    Zero-bubble 1F1B turns bubbles from a heuristic into a constructive condition: ZB-H2 splits backward (input-grad/weight-grad) to drive theoretical bubble to 0 [Qi2023ZeroBubble], often translating to ~2–5 MFU points recovered under the same
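A small sketch of the textbook 1F1B bubble fraction (P-1)/(M+P-1), to make "not second-order at large P, small M" concrete:

```python
# Pipeline bubble fraction for 1F1B scheduling; ZB-H2 [Qi2023ZeroBubble]
# reschedules the split backward so the theoretical bubble goes to ~0
# under its memory assumptions, instead of the fractions printed here.
def bubble_fraction_1f1b(P: int, M: int) -> float:
    return (P - 1) / (M + P - 1)

for P, M in [(8, 32), (16, 32), (16, 8)]:
    print(f"P={P:2d} M={M:2d} -> bubble {bubble_fraction_1f1b(P, M):5.1%}")
# P= 8 M=32 -> ~17.9%; P=16 M=32 -> ~31.9%; P=16 M= 8 -> ~65.2%
```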

consensusc-878d98e60c
The long-context breakpoint is closer to "attention share" than to a fixed L: once attention exceeds ~30% of FLOPs/time, SP is usually the lowest-risk first step; above ~50%, CP (Ring/Ulysses/USP) turns from optional into required, or the 4BL²H·L_layer memory and communication will eat up all of PP/DP's optimization headroom [Korthikanti2022SP][Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP]. (An attention-share sketch follows this entry.)
Source papers · 4 · [Korthikanti2022SP][Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.headline_claimsgpt-5.2

    The long-context breakpoint is closer to “attention share” than a fixed L: once attention exceeds ~30% of FLOPs/time, SP is usually the lowest-risk first step; above ~50%, CP (Ring/Ulysses/USP) becomes required, otherwise 4BL²HL_layer memor
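A hedged sketch of the attention-share heuristic; the per-layer FLOP formulas assume a dense Transformer with 4× MLP expansion, and the 30%/50% cutoffs are the claim's own thresholds:

```python
# Attention FLOP share per layer vs sequence length L, for a fixed d_model.
def attention_share(L: int, d: int = 4096) -> float:
    attn = 4 * L**2 * d            # QK^T and PV: two matmuls, 2 FLOPs per MAC
    rest = 24 * L * d**2           # QKVO (8*L*d^2) + 4x-expansion MLP (16*L*d^2)
    return attn / (attn + rest)

for L in (8_192, 16_384, 32_768, 131_072):
    s = attention_share(L)
    action = "dense ok" if s < 0.30 else ("add SP" if s < 0.50 else "need CP")
    print(f"L={L:>7,}: attention share {s:5.1%} -> {action}")
```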

consensusc-1ff5ef3c1f
At the level of public evidence, reproducible MFU at >10K GPUs and 100B+ parameters still comes mainly from hand-tuned 4D with explicit topology mapping (dense 175B at 55.2% MFU) [Jiang2024MegaScale]; auto-parallel has paper-backed closeness below 100B, but without 100B+ matched-scale comparisons it cannot be shown to reliably replace hand-built meshes [Zheng2022Alpa][Chowdhery2022PaLM].
Source papers · 4 · [Jiang2024MegaScale][Zheng2022Alpa][Chowdhery2022PaLM] · arXiv 2402.15627 · arxiv.org
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatronarXiv 2402.15627· report.headline_claimsgpt-5.2

    In public evidence, reproducible MFU at >10K GPUs and 100B+ still mainly comes from hand-tuned 4D with explicit topology mapping (dense 175B MFU 55.2%) [Jiang2024MegaScale]. Auto-parallel has paper-backed closeness below 100B, but lacks pub

consensusc-f9d52149a2
"FSDP-only is enough" is more a choice about org/code shape than a throughput optimum: modern systems can make FSDP more flexible and faster [Wang2026veScaleFSDP], but public materials still lack a quantified dense-pretraining scale ceiling and dominant bottleneck, so it is hard to treat FSDP as the only parallel dimension at 70B+ or long context [Smith2022MTNLG530B][Wang2026veScaleFSDP].
Source papers · 3 · [Wang2026veScaleFSDP][Smith2022MTNLG530B] · arXiv 2602.22437 · arxiv.org
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatronarXiv 2602.22437· report.headline_claimsgpt-5.2

    “FSDP-only is enough” is more a choice about org/code shape than throughput optimality: modern systems can make FSDP more flexible and faster [Wang2026veScaleFSDP], but public materials still lack a quantitative dense-pretraining ceiling an

contestedc-b906757eb8
Organizational cost is high: it requires strong control over topology, kernels, scheduling, and fault tolerance, and model-structure changes (MoE, long context) demand continuous manual adaptation.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[0].countergpt-5.2

    [counter to Camp A: hand-tuned 4D (Megatron / MegaScale style)] High organizational cost: requires strong control over topology, kernels, scheduling, and fault tolerance; model changes (MoE, long context) demand continuous manual adaptation

contestedc-67db579a3b
There are no public matched-scale 100B+ comparisons or topology details; once failure domains, retries, and congestion control enter the constraint set, automatic plans become hard to explain and to reproduce stably, leaving a gap between "it runs" and "MFU in a sane band" [Jiang2024MegaScale].
Source papers · 1 · [Jiang2024MegaScale]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[1].countergpt-5.2

    [counter to Camp B: auto-parallel (Alpa / GSPMD / pjit)] Lacks public matched-scale 100B+ comparisons and topology details; once failure domains, retries, and congestion control become constraints, automatic plans can be hard to explain and

contestedc-140b1e1fed
Public evidence still lacks a dense-pretraining scale ceiling and dominant-bottleneck quantification; once TP/PP/CP become necessary dimensions, FSDP-only tends to push the bottleneck onto cross-node synchronization and fragmentation overhead, and model parallelism ends up being introduced anyway [Smith2022MTNLG530B].
Source papers · 1 · [Smith2022MTNLG530B]
2 observations · 4d Parallelism Megatron
Evidence (2)
  • topic_report4d-parallelism-megatron· report.positions[2].countergpt-5.2

    [counter to Camp C: FSDP-only is enough (low intrusion first)] Public evidence still lacks a quantitative dense-pretraining ceiling and dominant bottleneck; when TP/PP/CP become necessary, FSDP-only often shifts bottlenecks to cross-node sy

  • topic_report4d-parallelism-megatron· report.positions[2].countergpt-5.2

    [counter to Camp C: FSDP-only is enough (low intrusion first)] Public evidence still lacks a dense-pretraining ceiling regime: when model state, activation state, and long-context attention all grow, TP/PP/CP often shift from optional to ma

contestedc-28f6cfba4e
As L rises, attention's 4BL²H·L_layer turns "extreme cases" into the norm; 3D alone often burns bandwidth at the wrong level (e.g., misrouting CP pressure onto PP/DP), pushing MFU out of the sane band [Korthikanti2022SP][Liu2023RingAttn][Fang2024USP].
Source papers · 3 · [Korthikanti2022SP][Liu2023RingAttn][Fang2024USP]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[3].countergpt-5.2

    [counter to Camp D: 3D (DP+TP+PP) is enough; SP/CP optional] As L rises, attention’s 4BL²HL_layer turns “extreme cases” into the norm; relying on 3D alone often burns bandwidth at the wrong layer (e.g., shifting CP pressure onto PP/DP), pus

consensusc-dd7b8bf472
Mixed-data phase plots give a transferable threshold: below ~10–15% share, a data type more often sits in the synergy regime (high marginal gains), while pushing toward ~40% more easily enters the interference regime (other capabilities start getting squeezed); treating code as a "data modality", this explains the 15–25% engineering sweet spot and the >30–40% NL risk zone [Aghajanyan2023SciMix].
Source papers · 1 · [Aghajanyan2023SciMix]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.headline_claimsgpt-5.2

    Mixture phase plots provide a transferable threshold: below ~10–15% a data type tends to sit in a synergy regime (high marginal gains), while pushing toward ~40% more often enters an interference regime (capability displacement). Treating c

consensusc-d02dff057b
At comparable training loss, the code domain shows lower downstream variance than web (the paper reports a difference on the order of ~30%), consistent with an optimization account of "lower code-token entropy, hence less gradient noise, hence an effectively lower effective LR"; code's benefit therefore need not rest entirely on "semantic transfer": optimization stability alone explains part of the phenomenon [Gadre2023Overtraining].
Source papers · 1 · [Gadre2023Overtraining]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.headline_claimsgpt-5.2

    At comparable training loss, the code domain shows lower downstream variance than web (reported on the order of ~30%), consistent with an optimization account: lower token entropy → lower gradient noise → lower effective LR. Hence, part of

consensusc-29e0d034b5
Organizing code tokens into "longer usable context" is a surer bet than simply raising the share: repo-level packing explicitly places cross-file dependencies into the same sequence, improving cross-file completion and long-context understanding; these gains come from signal organization, not from pushing the code ratio higher [Shi2023InContextPretraining][DeepSeekCoder2024].
Source papers · 2 · [Shi2023InContextPretraining][DeepSeekCoder2024]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.headline_claimsgpt-5.2

    Organizing code tokens into “usable context” is more reliable than merely increasing ratios: repo-level packing places cross-file dependencies into the same sequence and improves cross-file completion and long-context understanding; these g

contestedc-bfa5809243
Existing public evidence cannot support "monotonically better". Phase plots rather predict that >40% enters the interference regime and squeezes other distributions [Aghajanyan2023SciMix]; published specialist cost figures show heavy code opens a 6–9 pp general-capability gap that post-training cannot always repair losslessly [DeepSeekCoder2024]. More critically, there is still no single ratio-ablation suite covering <10%, 15–25%, and >40% (same compute, jointly reporting reasoning …
Source papers · 2 · [Aghajanyan2023SciMix][DeepSeekCoder2024]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[0].countergpt-5.2

    [counter to Camp A: generalists should push code past >40%; more is alwa] Public evidence does not support monotonicity. Phase plots predict that >40% is more likely to enter an interference regime and displace other distributions [Aghajany

contestedc-2d3017ee84
The optimization account covers part of the phenomenon but cannot explain differences in "structured ICL signal": induction-head formation depends on repeatable patterns with explicit boundaries, and code is denser in exactly these statistics [Olsson2022InductionHeads]; repo-level packing further amplifies this structural signal, improving cross-file capability [Shi2023InContextPretraining]. Fully replacing code with "low-entropy non-code data" would need to reproduce these effects after controlling for entropy/compression rate …
Source papers · 2 · [Olsson2022InductionHeads][Shi2023InContextPretraining]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrain· report.positions[1].countergpt-5.2

    [counter to Camp B: code helps mostly via optimization/regularization (l] The optimization account covers part of the story but does not fully explain differences in structured ICL signals: induction head formation depends on repeatable pat

contestedc-0353ea9eee
Public continual-training results directly weaken the strong form of "more code necessarily hurts NL": Code Llama shows a <1 pp MMLU drop after code-heavy continued training [CodeLlama2023]. The finer mechanistic story is that at low shares code sits in the synergy regime and may even patch ICL/structured-reasoning gaps [Aghajanyan2023SciMix][Olsson2022InductionHeads]. What actually needs avoiding is pushing the share into the interference-dominated band, not cutting code across the board.
Source papers · 4 · [CodeLlama2023][Aghajanyan2023SciMix][Olsson2022InductionHeads] · arXiv 2308.12950 · arxiv.org
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrainarXiv 2308.12950· report.positions[2].countergpt-5.2

    [counter to Camp C: keep code <10% to protect NL; generalists should not] Public continual results weaken the strong form of “more code necessarily harms NL”: Code Llama shows <1 pp MMLU drop after code-heavy continuation [CodeLlama2023]. M

contestedc-951a42fe00
The generalized claim "continual training inevitably collapses" conflicts with public evidence: Code Llama preserves NL metrics under heavy-code continued training [CodeLlama2023]; more generally, BLOOM+1 maintains usable zero-shot capability while continually adding language support [BLOOMPlus1_2022]. The more reasonable worry is that continual and from-scratch differ in mixture path, LR schedule, and data order, so one failure case cannot sink the whole route; but there is indeed still no "same-budget, same-mixture" …
Source papers · 2 · [CodeLlama2023] · arXiv 2308.12950 · arxiv.org
1 observation · Code Density Pretrain
Evidence (1)
  • topic_reportcode-density-pretrainarXiv 2308.12950· report.positions[3].countergpt-5.2

    [counter to Camp D: code ability must be trained from scratch; continual] The generalized claim “continual inevitably fails” conflicts with public evidence: Code Llama preserves NL metrics under code-heavy continuation [CodeLlama2023]; more

contestedc-47345db041
Controlled ablations by Fu et al. [Fu2024DataEngineering] and Xiong et al. [Xiong2023EffectiveLongCtx] show that models changing only positions but not data score over 35% lower on the RULER benchmark than models that also optimize data, with an effective-context ceiling of only 32K, insufficient for compositional reasoning tasks.
Source papers · 3 · [Fu2024DataEngineering][Xiong2023EffectiveLongCtx] · arXiv 2309.16039 · arxiv.org
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrainarXiv 2309.16039· report.positions[0].counterep-20260214160829-csjmc

    [counter to Camp A: Positional extrapolation is enough to achieve long c] Controlled ablations by Fu et al. [Fu2024DataEngineering] and Xiong et al. [Xiong2023EffectiveLongCtx] show that models with only positional modification without data

contestedc-e46389fca8
Pure data optimization without a matching positional-encoding extension cannot break the position-length limit set at pretraining time; the nominal context cannot exceed the training sequence length.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[1].counterep-20260214160829-csjmc

    [counter to Camp B: The data recipe is the main variable] Pure data optimization without matching positional encoding extension cannot break through the position length limit during pretraining, and nominal context cannot exceed the trainin

contestedc-75d7f6acd4
If the data itself lacks burstiness and the positional encoding is not adapted to long sequences, packing optimization caps out at 10–15 pp of gain and cannot deliver 128K+ effective context on its own.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[2].counterep-20260214160829-csjmc

    [counter to Camp C: Packing engineering is the under-exploited lever] If the data itself has insufficient burstiness and the positional encoding is not adapted to long sequences, the gain ceiling of packing optimization is only 10–15 pp, an

contestedc-f171668c15
SSM architectures' long-range reasoning quality has not yet surpassed same-scale Transformers with optimized data and packing; their advantage is limited to sequence-processing efficiency. Evaluations in [Hsieh2024RULER] and [Karpinska2024NoCha] show same-parameter SSM models with 20–30% lower effective context than optimized Transformers.
Source papers · 2 · [Hsieh2024RULER][Karpinska2024NoCha]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[3].counterep-20260214160829-csjmc

    [counter to Camp D: Switch architectures (SSM / linear) to bypass positi] Currently, the long-range reasoning quality of SSM architectures has not exceeded Transformer models of the same scale with optimized data and packing, and only has a

consensusc-2d46276b40
In regimes dominated by low-quality web data, quality filtering/selection is usually more reliable than fine-tuning the mixture; but filtering is not monotonically better, and over-cleaning hurts generalization and long-tail coverage [DataCompLM2024][FineWeb2024][Gao2021QualityFiltering].
Source papers · 3 · [DataCompLM2024][FineWeb2024][Gao2021QualityFiltering]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.headline_claimsgpt-5.2

    In regimes dominated by low-quality web, quality filtering/selection is usually more reliable than fine mixture tuning; but filtering is not monotonic—over-cleaning can hurt generalization and long-tail coverage [DataCompLM2024][FineWeb2024

consensusc-075681824a
Public training recipes look like staged control rather than constant ratios: Llama 3 reports ~3–5× late-stage upsampling of code/math/reasoning, consistent with a "late-phase acquisition of scarce capabilities" mechanism [MetaLlama32024].
Source papers · 1 · [MetaLlama32024]
3 observations · Data Mixture
Evidence (3)
  • topic_reportdata-mixture· report.headline_claimsgpt-5.2

    Public recipes look like staged control rather than constant ratios: Llama 3 reports ~3–5× late-stage upweighting of code/math/reasoning, consistent with a tail capability acquisition mechanism [MetaLlama32024].

  • topic_reportdata-mixture· report.headline_claimsgpt-5.2

    Public large-model recipes are shifting ratios from a static vector to a curriculum: Llama 3 reports ~3–5× late upweighting of code/math/reasoning, consistent with tail capability acquisition rather than a constant mixture [MetaLlama32024].

  • topic_reportdata-mixture· report.headline_claimsgpt-5.2

    Public recipes look like phase-wise trajectories rather than constant ratios: Llama 3 reports ~3–5× late upweighting of code/math/reasoning, consistent with tail capability acquisition; encoding ratios as a time function matches training dy

contestedc-0af7bf3580
The objection is that transferability is routinely overestimated: fitted conclusions are sensitive to dataset choice, bucket definitions, and dedup/filter versions, and the target function drifts; treating a law as a "replacement for ablations" keeps tripping over version churn [Besiroglu2024ChinchillaReplication][Porian2024ResolvingDiscrepancies][Aioli2024].
Source papers · 1 · [Aioli2024]
2 observations · Data Mixture
Evidence (2)
  • topic_reportdata-mixture· report.positions[0].countergpt-5.2

    [counter to Camp A: Formal search first (laws/regression/robust optimiza] The counterargument is that transferability is often overestimated: conclusions are sensitive to dataset choice, bucket definitions, and dedup/filter versions, causin

  • topic_reportdata-mixture· report.positions[0].countergpt-5.2

    [counter to Camp A: Formal search first (regression/laws/robust optimiza] Formal methods are highly sensitive to bucket definitions and dataset versions: coarse buckets average out signals, and version drift shifts regression targets; in un

contestedc-3779dc7a90
The objection is that heuristics lack transferable budget planning: as bucket counts grow and the objective shifts from generalist to specialist, pure ablations' sample complexity rises quickly; laws/regressions are needed to supply structured priors and interpretable sensitivities [Aioli2024][DataMixingLaws2024].
Source papers · 2 · [Aioli2024][DataMixingLaws2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[1].countergpt-5.2

    [counter to Camp B: Heuristics + curricula are more robust (a few ablati] The counterargument is that heuristics lack transferable budgeting: as bucket count grows and objectives shift from generalist to specialist, pure ablations’ sample c

contestedc-133e821cf8
The objection is that online methods' net gains are often swallowed by extra phases, system complexity, and unstable bookkeeping: if bucket definitions, dedup/filter versions, and sampler state cannot be reproduced stably, online reweighting is hard to audit and roll back, and cross-scale transfer is even harder to verify [DoReMi2023][Aioli2024].
Source papers · 2 · [DoReMi2023][Aioli2024]
1 observation · Data Mixture
Evidence (1)
  • topic_reportdata-mixture· report.positions[2].countergpt-5.2

    [counter to Camp C: Online/adaptive mixing beats one-shot offline search] The counterargument is that net gains are often absorbed by extra phases, system complexity, and unstable accounting: if bucket definitions, dedup/filter versions, an

contestedc-21028d67cd
The objection: ladders distort on key capabilities (low HumanEval correlation), and they only answer "is this decision better under this recipe", without explaining mechanisms or localizing responsible samples; when a safety/compliance incident hits, a ladder alone is not fast enough. [DCLMLadder2024][DetectPretrainingData2023]
Source papers · 2 · [DCLMLadder2024][DetectPretrainingData2023]
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[0].countergpt-5.2

    [counter to Camp A: quality classifiers + ablation ladders are sufficien] The pushback: ladders can distort on key capabilities (low HumanEval correlation), and they only answer “is this decision better under this recipe”, not mechanisms or

contestedc-0e68b9ba8f
The main problems are transferability and actionability: influence top-k drifts with scale (<10% overlap), making a "frozen high-value sample library" infeasible; and attribution yields a correlational ranking, not counterfactual gains, so one still returns to the ladder to verify whether edits actually improve the target capability. [AnthropicInfluence2023][DCLM2024]
Source papers · 2 · [AnthropicInfluence2023][DCLM2024]
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[1].countergpt-5.2

    [counter to Camp B: influence/attribution is the main path (infer data r] The main issues are transferability and actionability: top-k influence drifts with scale (<10% overlap), making “freezing a high-value example library” infeasible; at

contestedc-de3f0823ba
Identification assumptions are a hard cost: IV validity is often unverifiable, and an invalid IV yields directionally wrong answers; causal methods also require finer variable definitions and a stronger evaluation matrix, or the question of "which capability is being estimated" gets buried under average loss. [CausalLL2024][DCLMLadder2024]
Source papers · 2 · [CausalLL2024][DCLMLadder2024]
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[2].countergpt-5.2

    [counter to Camp C: full causal inference is the future (solve confoundi] Identification assumptions are a hard cost: IV validity is often untestable, and invalid IV can be directionally wrong. Causal methods also require sharper variable d

contestedc-b00ca2a5e3
The counterpoint: scale does not decide for you "which tokens are worth paying to train on". Controlled results from DCLM and FineWeb-Edu show data recipes produce multi-pp downstream differences at fixed compute; and capability-specific sensitivities let "average-loss optimal" diverge from "key-capability optimal", so planning by scale alone misallocates the budget. [DCLM2024][FineWeb2024][DCLMLadder2024]
Source papers · 3 · [DCLM2024][FineWeb2024][DCLMLadder2024]
1 observation · Data Value Causality
Evidence (1)
  • topic_reportdata-value-causality· report.positions[3].countergpt-5.2

    [counter to Camp D: skip measurement, rely on intuition and scale (scali] The counterpoint: scale does not decide which tokens are worth paying for. Controlled results in DCLM and FineWeb-Edu show multi-pp downstream differences under fixed

consensusc-65b535d87f
On H100, FA3 [Shah2024FA3] uses warp specialization + TMA to turn attention's critical path from "waiting on HBM" into "keeping the MMAs fed"; the paper reports ~75% of H100 peak in BF16 and ~1.2 PFLOPs/s in FP8. On Hopper, dense attention is thus closer to compute-bound rather than the memory-bound regime of the FA1 era.
Source papers · 2 · [Shah2024FA3] · arXiv 2407.08608 · arxiv.org
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernelsarXiv 2407.08608· report.headline_claimsgpt-5.2

    On H100, FA3 [Shah2024FA3] uses warp specialization + TMA to turn attention’s critical path from “waiting on HBM” into “keeping MMA fed”, reporting ~75% of H100 BF16 peak and ~1.2 PFLOPs/s in FP8. This places dense attention closer to compu

consensusc-c35b79cef1
With the same exact-attention math, FA2 [Dao2023FA2] gets ~2× throughput over FA1 [Dao2022FA1] on A100 purely through work partitioning and parallel-granularity changes; "same algorithm, different scheduling" is therefore a first-order lever for attention kernels.
Source papers · 3 · [Dao2023FA2][Dao2022FA1] · arXiv 2307.08691 · arxiv.org
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernelsarXiv 2307.08691· report.headline_claimsgpt-5.2

    With the same exact-attention math, FA2 [Dao2023FA2] achieves ~2× throughput over FA1 [Dao2022FA1] on A100 purely via work partitioning and parallel granularity. Hence, “same algorithm, different scheduling” is a first-order lever for atten

consensusc-f5f26cb8bf
In decode serving (small batch, long context), training-time FA kernels lack parallel dimensions and SM occupancy drops sharply; FlashDecoding++ [Hong2023FlashDec] restores parallelism by chunking along the sequence, showing that serving's core optimization target is often not "a faster training-time attention kernel". (A chunk-merge sketch follows this entry.)
Source papers · 1 · [Hong2023FlashDec]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.headline_claimsgpt-5.2

    In decode serving (tiny batch, long context), training-time FA kernels underutilize SMs due to insufficient parallel dimensions; FlashDecoding++ [Hong2023FlashDec] restores parallelism by chunking along sequence length, indicating serving o
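A sketch of the split-KV idea used by flash-decoding-style kernels: each sequence chunk computes independent softmax statistics, and a cheap log-sum-exp merge recombines them exactly (shapes illustrative):

```python
# Per-chunk (max, normalizer, partial output) statistics plus an exact merge:
# this is what lets a single decode query fan out across many SMs.
import numpy as np

def partial(q, K, V):
    s = K @ q
    m = s.max()
    p = np.exp(s - m)
    return m, p.sum(), p @ V          # per-chunk statistics

def merge(parts):
    m = max(p[0] for p in parts)      # global max for numerical stability
    l = sum(p[1] * np.exp(p[0] - m) for p in parts)
    acc = sum(p[2] * np.exp(p[0] - m) for p in parts)
    return acc / l

rng = np.random.default_rng(1)
q, K, V = rng.normal(size=64), rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
parts = [partial(q, K[i:i+256], V[i:i+256]) for i in range(0, 1024, 256)]
s = K @ q
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ V
assert np.allclose(merge(parts), ref)
```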

consensusc-5753cada4b
Once the KV cache becomes the main bandwidth/capacity bottleneck in long-context inference, KV quantization/compression (e.g., KIVI [Liu2024KIVI]) can move end-to-end throughput more than any single attention-kernel speedup; this shifts optimization priority from kernels toward KV representation and system policy.
Source papers · 1 · [Liu2024KIVI]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.headline_claimsgpt-5.2

    When KV cache becomes the dominant bandwidth/capacity bottleneck in long-context inference, KV quantization/compression (e.g., KIVI [Liu2024KIVI]) can affect end-to-end throughput more than single-kernel speedups, shifting priority from ker

consensusc-20e2d607f5
FlexAttention [Dong2024Flex] lifts "mask/score semantics" into a compilable interface, letting many variants (ALiBi/SWA/soft masks) approach fused-kernel performance without writing CUDA; but work like FlashMask [Wang2024FlashMask] also shows that when mask semantics change the tiling/memory-access structure, specialized implementations can still win on speed or memory.
Source papers · 3 · [Dong2024Flex][Wang2024FlashMask] · arXiv 2412.05496 · arxiv.org
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernelsarXiv 2412.05496· report.headline_claimsgpt-5.2

    FlexAttention [Dong2024Flex] elevates mask/score semantics into a compilable interface, letting many variants (ALiBi/SWA/soft masks) reach near fused-kernel performance without CUDA; yet works like FlashMask [Wang2024FlashMask] show that wh

contestedc-57067419cc
Counterexamples come from two needs: first, complex masks/attention variants may require specialized implementations to avoid extra IO or to realize particular semantics; second, under cross-platform portability and supply-chain constraints, kernels that depend on Hopper features may not be an acceptable default. [Wang2024FlashMask][Luo2024HopperDissect]
Source papers · 2 · [Wang2024FlashMask][Luo2024HopperDissect]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.positions[0].countergpt-5.2

    [counter to Camp A: FA is largely the endpoint of attention engineering ] Counterexamples come from (1) complex masks/variants that may need specialized implementations to avoid extra IO or to realize specific semantics, and (2) portability

contestedc-af4c899071
The counter-camp's worries are twofold: first, compiler-generated kernels may not reach hand-written ceilings in extreme shapes (dynamic sparsity, cross-block state, special layouts); second, the compile stack's portability and debuggability are not necessarily better than hand-written CUDA, especially with no public comparison data on non-Hopper or non-NVIDIA platforms. [Wang2024FlashMask][Luo2024HopperDissect]
Source papers · 2 · [Wang2024FlashMask][Luo2024HopperDissect]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.positions[1].countergpt-5.2

    [counter to Camp B: the main line is Triton/FlexAttention—move optimizat] Concerns are (1) generated kernels may not match hand-tuned ceilings for extreme cases (dynamic sparsity, cross-block state, special layouts), and (2) portability/deb

contestedc-1c6d87749d
The main pushback is quality and task coverage: on retrieval/ICL-heavy evaluations, replacement architectures can trail and need hybrids to close the gap reliably; meanwhile, many product bottlenecks sit in the systems layer (serial decode, KV IO, scheduling), so swapping the architecture may land slower than a mature engine plus KV policies. [Waleffe2024MambaStudy][Ye2024FlashInfer][Leviathan2022SpecDec]
Source papers · 3 · [Waleffe2024MambaStudy][Ye2024FlashInfer][Leviathan2022SpecDec]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.positions[2].countergpt-5.2

    [counter to Camp C: attention itself should be replaced (SSM / sparse / ] The main pushback is quality and task coverage: on retrieval/ICL-heavy evaluations, replacements can trail and often need hybrids to close gaps. Also, many product bo

contestedc-9556f77000
The counter holds that lock-in costs are often overstated: NVIDIA remains the dominant platform in real training/serving fleets, and the predictable performance and engineering stability of mature libraries (FA3/FlashInfer) are frequently worth more than "theoretical portability". Absent cross-platform benchmarks, sacrificing performance and maturity early for portability may itself be the risk. [Ye2024FlashInfer][Ye2025FlashInferEngine]
Source papers · 2 · [Ye2024FlashInfer][Ye2025FlashInferEngine]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_reportflashattention-kernels· report.positions[3].countergpt-5.2

    [counter to Camp D: FA3 embodies NVIDIA lock-in; avoid binding critical ] The counter-argument is that lock-in cost is often overstated: NVIDIA remains the dominant platform in real training/serving fleets; mature libraries (FA3/FlashInfer)

consensusc-e6d0e1c63e
In practice, perplexity frequently decouples from long-context task gains: using perplexity as the primary metric for 128K+ training pushes optimization pressure toward "piling on tokens" rather than "getting the model to use mid-context evidence on RULER/LongBench/RepoQA" [Gao2024EffectiveLongCtx][RULER2024][LostInTheMiddle2023][RepoQA2024].
Source papers · 4 · [Gao2024EffectiveLongCtx][RULER2024][LostInTheMiddle2023][RepoQA2024]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.headline_claimsgpt-5.2

    Perplexity often decouples from long-context task gains in practice: using perplexity as the primary metric for 128K+ training misallocates effort toward “more tokens” rather than “using mid-context evidence on RULER/LongBench/RepoQA” [Gao2

consensusc-b303e2c788
At the 128K level, data distribution matters more than total tokens: upsampling long documents to ≥25% per domain can saturate 128K NIAH with roughly 5B tokens [Fu2024DE128K]; with too low a long-doc ratio, piling on tokens tends to improve only "reciting short spans", not long-dependency tasks [Gao2024EffectiveLongCtx][RULER2024]. (An upsampling-weights sketch follows this entry.)
Source papers · 3 · [Fu2024DE128K][Gao2024EffectiveLongCtx][RULER2024]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.headline_claimsgpt-5.2

    At ~128K, data distribution dominates total tokens: upsampling long documents to ≥25% per domain can saturate 128K NIAH with ~5B tokens [Fu2024DE128K]. With a low long-doc ratio, adding tokens often improves short-span memorization rather t
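A sketch of per-domain upsampling toward a ≥25% long-doc token share in the spirit of [Fu2024DE128K]; the corpus statistics are invented for illustration:

```python
# Solve w*s / (w*s + (1-s)) = target for the long-doc multiplier w, per domain.
def upsample_weight(long_share: float, target: float = 0.25) -> float:
    """Multiplier on long documents so they reach `target` share of tokens."""
    if long_share >= target:
        return 1.0
    return target * (1 - long_share) / ((1 - target) * long_share)

corpus = {"web": 0.03, "books": 0.40, "code": 0.10}  # current long-doc share
for domain, share in corpus.items():
    print(f"{domain:5s}: upsample long docs x{upsample_weight(share):.1f}")
# web: x10.8, books: x1.0 (already above target), code: x3.0
```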

consensusc-077d370d49
The engineering mainline of RoPE extrapolation evolved from "linear interpolation" to "joint frequency/temperature calibration" to "non-uniform search": PI→YaRN→LongRoPE map to stability needs at 32K, 128K, and 2M+ respectively [PI2023][YaRN2023][LongRoPE2024].
Source papers · 4 · [PI2023][YaRN2023][LongRoPE2024] · arXiv 2309.00071 · arxiv.org
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretrainingarXiv 2309.00071· report.headline_claimsgpt-5.2

    The practical RoPE extrapolation line evolved from linear interpolation to joint frequency/temperature calibration to non-uniform search: PI→YaRN→LongRoPE map to stability needs at ~32K, ~128K, and 2M+ [PI2023][YaRN2023][LongRoPE2024].

contestedc-86807a1079
The counterexample is "extrapolates but goes unused": RULER shows needle tests overestimate capability [RULER2024], Lost in the Middle shows mid-context evidence underuse is structural [LostInTheMiddle2023], and Gao et al. further demonstrate that perplexity decouples from long tasks, requiring task-based evaluation and post-training for alignment [Gao2024EffectiveLongCtx].
Source papers · 4 · [RULER2024][LostInTheMiddle2023][Gao2024EffectiveLongCtx] · arXiv 2404.06654 · arxiv.org
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretrainingarXiv 2404.06654· report.positions[0].countergpt-5.2

    [counter to Camp A: explicit positional encoding is required; RoPE extra] Counterevidence comes from “extrapolates but not used”: RULER shows needles overestimate capability [RULER2024]; Lost in the Middle shows mid-context underuse is stru

contestedc-9261417a77
The counterexample is "the system runs but tasks don't move": Fu et al. note that with an insufficient long-doc ratio, continued training yields no effective 128K gains [Fu2024DE128K]; Gao et al. show that even as perplexity falls, RULER/downstream can stay flat [Gao2024EffectiveLongCtx]. In other words, systems parallelism solves throughput but cannot substitute for data and evaluation alignment.
Source papers · 2 · [Fu2024DE128K][Gao2024EffectiveLongCtx]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[1].countergpt-5.2

    [counter to Camp B: SP and DP are orthogonal; scaling to million tokens ] Counterevidence is “systems run but tasks don’t improve”: Fu et al. show that without sufficient long-doc ratio, continued training does not yield effective 128K gain

contestedc-bba47f7a28
RULER's central conclusion is precisely that NIAH overestimates truly usable context [RULER2024]; Gao et al. explicitly show perplexity decoupling from long tasks and provide a more reliable evaluation protocol [Gao2024EffectiveLongCtx]; Lost in the Middle and RepoQA supply reproducible failure modes and production-adjacent stress tests [LostInTheMiddle2023][RepoQA2024].
Source papers · 5 · [RULER2024][Gao2024EffectiveLongCtx][LostInTheMiddle2023][RepoQA2024] · arXiv 2404.06654 · arxiv.org
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretrainingarXiv 2404.06654· report.positions[2].countergpt-5.2

    [counter to Camp C: NIAH/perplexity is sufficient; task benchmarks are t] RULER’s central result is that NIAH overestimates real usable context [RULER2024]. Gao et al. explicitly show perplexity decouples from long tasks and provide a more

contestedc-5d2cbd7303
The counterexample is end-to-end engineering cost: even if alternative architectures win on complexity, evaluation and recipes still decide whether context is "usable". RULER/Lost in the Middle point to evidence-use and task-alignment problems, not mere operator complexity [RULER2024][LostInTheMiddle2023]; meanwhile RoPE extrapolation + kernels + SP have already made 128K→4M a reproducible pipeline [YaRN2023][FA2][Ulysses2023][Xu2025UltraLong].
Source papers · 5 · [RULER2024][LostInTheMiddle2023][YaRN2023][Ulysses2023][Xu2025UltraLong]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_reportlength-scaling-pretraining· report.positions[3].countergpt-5.2

    [counter to Camp D: alternative architectures (sparse/SSM/linear attenti] Counterpoint is end-to-end engineering cost: even with better asymptotics, evaluation and recipes still determine whether context is used. RULER/Lost in the Middle po

consensusc-b5f79b3682
On load balancing, the aux-loss-free bias EMA strips the "balancing signal" out of the main-loss gradients; the main payoff is lower sensitivity of outcomes to implementation details (micro-batch statistics, DP synchronization, detach choices), which on the aux-loss route can cause larger usage-CV / expert-differentiation swings than tuning the aux coefficient itself [Wang2024AuxFree][Qiu2025DemonsLBL].
Source papers · 2 · [Wang2024AuxFree][Qiu2025DemonsLBL]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.headline_claimsgpt-5.2

    For load balancing, aux-loss-free bias EMA removes balancing signals from the main-loss gradients; the primary win is reduced implementation sensitivity (micro-batch stats, DP sync, detach choices), which in aux-loss setups can dominate out

consensusc-82b7c4438e
On expert structure, fine-grained (≥64) experts + 1 shared expert looks like explicitly factoring the "common component" out of a mixture model, reducing identifiability conflicts among routed experts; this explains why, at equal active parameters, DeepSeekMoE's reported zero-shot gains land consistently in the 1.8–3.4 pp range [Dai2024DeepSeekMoE][Nguyen2025SharedExperts].
Source papers · 2 · [Dai2024DeepSeekMoE][Nguyen2025SharedExperts]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.headline_claimsgpt-5.2

    On expert structure, fine-grained (≥64) + one shared expert acts like explicitly factoring out a “shared component” in a mixture model, reducing identifiability conflicts among routed experts; this aligns with DeepSeekMoE’s reported 1.8–3.4

consensusc-16c30784b8
The generational turnover in routing is not "smarter routing" replacing "dumber routing", but the removal of congestion from the dominant training-stability variables: expert-choice internalizes capacity constraints into the routing mechanism, reducing the damage token drops do to the training curve [Zhou2022ExpertChoice][Lepikhin2020GShard].
Source papers · 2 · [Zhou2022ExpertChoice][Lepikhin2020GShard]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.headline_claimsgpt-5.2

    The routing turnover is less about “smarter routing” and more about removing congestion as the dominant stability variable: expert-choice internalizes capacity constraints into the routing mechanism, reducing token-drop-driven training disr

consensusc-13e37d029e
The "DeepSeek template" became the de facto default in 2025–2026 because it standardized the monitoring metrics: bias norm, usage CV, token drop, dead experts; this turns MoE from a "tuning art" into an "observable control system" that reproduction platforms can replicate reliably [DeepSeekAI2024V3][Kang2025FLAMEMoE].
Source papers · 2 · [DeepSeekAI2024V3][Kang2025FLAMEMoE]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.headline_claimsgpt-5.2

    The DeepSeek template became the de-facto default in 2025–2026 by standardizing operational metrics—bias norm, usage CV, token drop, dead experts—turning MoE from ‘tuning art’ into an observable control system that reproduction platforms ca

contestedc-fd876b9d34
This camp often treats training FLOPs as the only ledger, but ROI also includes all-to-all traffic, memory residency, inference cache hit rates, and post-training transfer costs; along these dimensions, dense models or stage decoupling can be the better deal. [Pan2024DenseTrainSparseInfer][Liew2025Upcycling][Hui2024UpcyclingSFT]
Source papers · 2 · [Liew2025Upcycling][Hui2024UpcyclingSFT]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[0].countergpt-5.2

    [counter to Camp A: MoE becomes the default backbone (dense remains for ] This camp often treats training FLOPs as the only ledger, but ROI also includes all-to-all, memory residency, inference cache hit rates, and post-training transfer co

contestedc-fcaf30ab3a
This camp's evidence mostly concentrates on the "reuse" path, while the DeepSeek template's main battlefield is "from-scratch training + templated monitoring + systems realization". When an organization truly does frontier-scale from-scratch pretraining, MoE's active/total structure still opens up capacity, and reproduction platforms have lowered implementation uncertainty. [DeepSeekAI2024V2][DeepSeekAI2024V3][Kang2025FLAMEMoE]
Source papers · 3 · [DeepSeekAI2024V2][DeepSeekAI2024V3][Kang2025FLAMEMoE]
2 observations · MOE Landscape
Evidence (2)
  • topic_reportmoe-landscape· report.positions[1].countergpt-5.2

    [counter to Camp B: dense wins on full-lifecycle ROI (especially under u] Most evidence here targets the reuse/upcycling path, while the DeepSeek template’s main regime is scratch training with templated monitoring and systems realization.

  • topic_reportmoe-landscape· report.positions[1].countergpt-5.2

    [counter to Camp B: dense wins on full-lifecycle ROI (especially under u] The counter is that “training MoE from scratch” and “upcycling” are different problems: DeepSeek-V3’s templated monitoring and bias EMA primarily address from-scratch

contestedc-c1f645b871
Such conclusions often depend on particular scales and evaluation framings; at LLM scale, routing/balancing's main value may show up less as "higher validation scores" and more as avoiding training failure modes (dead experts, token drop, representation collapse). DeepSeek-V3 turns monitoring and correction of these failure modes into the default process: stability engineering rather than score-chasing. [DeepSeekAI2024V3][Wang2024AuxFree][Qiu2025DemonsLBL]
Source papers · 3 · [DeepSeekAI2024V3][Wang2024AuxFree][Qiu2025DemonsLBL]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[2].countergpt-5.2

    [counter to Camp C: learned routing/balancing is overrated (random/froze] These results can be regime-dependent. At LLM scale, routing/balancing value may show less as higher validation scores and more as avoiding failure modes (dead expert

contestedc-9b74cbab53
Current evidence is more "reasonable engineering inference" than LLM-scale matched comparisons. DeepSeek-V3's monitoring and bias EMA make sparse training more controllable, but whether that suffices to keep post-training sparse as well still lacks public end-to-end experiments. [DeepSeekAI2024V3]
Source papers · 1 · [DeepSeekAI2024V3]
1 observation · MOE Landscape
Evidence (1)
  • topic_reportmoe-landscape· report.positions[3].countergpt-5.2

    [counter to Camp D: MoE is mainly for pretraining; post-training should ] Evidence is currently more ‘reasonable engineering inference’ than LLM-scale matched comparisons. DeepSeek-V3’s monitoring and bias EMA make sparse training more cont

consensusc-c1697b56a7
In the 32K–128K retrofit range, replacing global interpolation (PI) with a per-dim ramp (YaRN) demotes the risk of "high-frequency dims getting compressed and local patterns degrading" from a structural problem to a tunable hyperparameter; the corresponding engineering budget is ~400 steps of long-sequence fine-tuning plus inference-stack support for attention temperature [Peng2023YaRN][Chen2023PI][bloc972023NTK]. (A per-dim ramp sketch follows this entry.)
Source papers · 2 · [Peng2023YaRN][Chen2023PI]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    In 32K–128K retrofitting, replacing global interpolation (PI) with a per-dim ramp (YaRN) turns “high-frequency compression causing local-pattern degradation” from a structural failure mode into a tunable hyperparameter; practically this map
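A numpy sketch contrasting uniform interpolation (PI) with a YaRN-style per-dim ramp; the ramp boundaries (32 and 1 rotations) are illustrative, not the paper's exact alpha/beta:

```python
# PI divides every RoPE frequency by the same factor s; the per-dim ramp
# interpolates only the low-frequency dims and leaves the high-frequency
# (local-pattern) dims untouched.
import numpy as np

def rope_freqs(d=128, base=10_000.0):
    return base ** (-np.arange(0, d, 2) / d)       # per-dim rotation frequencies

def scale_pi(freqs, s):
    return freqs / s                               # uniform interpolation

def scale_yarn_like(freqs, s, L_train=4096, lo=32, hi=1):
    wavelen = 2 * np.pi / freqs
    r = L_train / wavelen                          # rotations within train length
    ramp = np.clip((lo - r) / (lo - hi), 0.0, 1.0) # 0: keep, 1: fully interpolate
    return freqs * ((1 - ramp) + ramp / s)

f = rope_freqs()
print((scale_pi(f, 8) == f).sum(), "dims untouched by PI")          # 0
print((scale_yarn_like(f, 8) == f).sum(), "dims untouched by ramp") # high-freq kept
```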

consensusc-b058f831ac
For newly trained / continually pretrained models, the RoPE base is not an "inference-time trick" but an upper-bound hyperparameter on low-frequency phase coverage: too small a base makes far-apart positions indistinguishable [Xu2024RoPEBaseBounds]; production recipes set the base to ~500000 and use a 6-stage short-to-long curriculum (~5% of tokens per stage) to align the training distribution [Dubey2024Llama3]. (A wavelength check follows this entry.)
Source papers · 2 · [Xu2024RoPEBaseBounds][Dubey2024Llama3]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.headline_claimsgpt-5.2

    For new/continual pretraining, RoPE base is not an inference-time trick but an upper-bound hyperparameter limiting low-frequency phase coverage: too small a base makes far positions indistinguishable [Xu2024RoPEBaseBounds]. Production recip
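A tiny check in the spirit of [Xu2024RoPEBaseBounds]; the "slowest wavelength must cover the target context" criterion is a simplification of the paper's bound:

```python
# Wavelength of the lowest-frequency RoPE dim: if it is shorter than the
# target context, the most distant positions alias onto nearby phases.
import math

def slowest_wavelength(base: float, d: int = 128) -> float:
    lowest_freq = base ** (-(d - 2) / d)
    return 2 * math.pi / lowest_freq

for base in (10_000, 500_000):
    wl = slowest_wavelength(base)
    print(f"base={base:>7,}: slowest wavelength ~{wl:,.0f} positions")
# base=10,000 covers only ~54K positions; base=500,000 covers ~2.6M.
```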

consensusc-fd3ae5fdc5
Once target lengths enter 512K–2M, the error of uniform scaling formulas comes mainly from per-dim mismatch; LongRoPE learns non-uniform per-dim scales via evolutionary search and pairs them with longer fine-tuning to push the window to 2M [Ding2024LongRoPE]. At these lengths, "fit per dimension" is more reliable than "fixed formula + a few steps".
Source papers · 2 · [Ding2024LongRoPE] · arXiv 2402.13753 · arxiv.org
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntkarXiv 2402.13753· report.headline_claimsgpt-5.2

    At 512K–2M, errors of global scaling formulas are dominated by per-dim mismatch; LongRoPE learns non-uniform per-dim scales via evolutionary search and uses longer fine-tuning to reach 2M [Ding2024LongRoPE], making per-dim fitting more reli

consensusc-fd4cdc9e7b
"Advertised window" and "effective context" often differ by 2–4× on task benchmarks: RULER's 13 task types find many models advertising 128K but effective at only ~32K [Hsieh2024RULER]; PPL/needle therefore cannot serve as the primary metric for long-context ability, and multi-task, multi-length suites like LongBench/LV-Eval come closer to engineering needs [Bai2023LongBench][Yuan2024LVEval].
Source papers · 4 · [Hsieh2024RULER][Bai2023LongBench][Yuan2024LVEval] · arXiv 2404.06654 · arxiv.org
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntkarXiv 2404.06654· report.headline_claimsgpt-5.2

    Advertised windows and effective context often differ by 2–4× on task benchmarks: RULER’s 13 task types show many 128K-claimed models are effectively ~32K [Hsieh2024RULER]. Hence PPL/needle tests should not be primary metrics; multi-task an

contestedc-0960d16510
The counter says: most teams lack a continual-pretraining budget, and 32K–128K needs can be met with YaRN in a few hundred fine-tuning steps; moving the problem forward into pretraining adds data-engineering and training complexity and is not always the best value [Peng2023YaRN][Giraffe2023].
Source papers · 2 · [Peng2023YaRN][Giraffe2023]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[0].countergpt-5.2

    [counter to Camp A: pretraining-time ABF + curriculum is the clean long-] The counterargument is that many teams lack continual-pretraining budget, and 32K–128K needs can be met with YaRN in a few hundred fine-tune steps; shifting the probl

contestedc-4ea1c3f4ce
The counter says: PI is the simplest to implement with minimal inference-stack changes, and is good enough on some retrieval-heavy evaluations; YaRN needs inference-side implementations of the per-dim ramp and temperature, a higher engineering bar [Chen2023PI][LongBench2023].
Source papers · 2 · [Chen2023PI][LongBench2023]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[1].countergpt-5.2

    [counter to Camp B: YaRN is the de-facto standard for 32K–128K retrofitt] The counterargument is that PI is simplest and requires minimal inference-stack changes, and can be good enough on some retrieval-heavy evaluations; YaRN needs infere

contestedc-fa20321c4c
The counter says: at million-token context the bottlenecks are mostly systems-side (KV cache, bandwidth, approximate attention), and the quality gains from per-dim search may be swallowed by inference approximations; moreover, frontier reports' task constructions may skew retrieval-style and not represent real long-document reasoning [Gemini2024][Lee2025InfiniteHiP].
Source papers · 2 · [Gemini2024][Lee2025InfiniteHiP]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[2].countergpt-5.2

    [counter to Camp C: ≥512K needs LongRoPE-style per-dim search; global fo] The counterargument is that at million-token scale, bottlenecks are mostly systems (KV cache, bandwidth, approximate attention), and quality gains from per-dim search

contestedc-59e4854232
The counter says: an efficiency scheme is not a quality scheme. Task benchmarks like RULER/LongBench show many models failing on recall-heavy and multi-hop tracing well short of the advertised window; without head-to-head task evidence, bypass routes easily mistake "can process longer inputs" for "can reason over longer inputs" [Hsieh2024RULER][Bai2023LongBench].
Source papers · 2 · [Hsieh2024RULER][Bai2023LongBench]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_reportlong-context-rope-ntk· report.positions[3].countergpt-5.2

    [counter to Camp D: bypass RoPE (Mamba / length-extrapolatable Transform] The counterargument is that efficiency is not quality. Task benchmarks like RULER/LongBench show many models fail on recall-heavy and multi-hop tracing far before the advertised window; without head-to-head task evidence, bypass routes conflate “can process longer inputs” with “can reason over longer inputs” [Hsieh2024RULER][Bai2023LongBench].

consensusc-bec8191ecd
Within a training loop that holds tokenizer, objective, and model family fixed, power-law fits of validation loss/PPL usually have small enough error to support early stopping and compute allocation; extrapolating the same scalar to "ship or not / better at tasks or not" introduces two classes of systematic error: objective mismatch and metric nonlinearity [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Schaeffer2024WhyElusive].
Source papers · 3 [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Schaeffer2024WhyElusive]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.2

    Within the training loop (same tokenizer/objective/model family), validation loss/PPL power-law fits are typically accurate enough for early stopping and compute allocation; using the same scalar to decide “ship readiness” introduces systematic errors from objective mismatch and metric nonlinearity [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Schaeffer2024WhyElusive].

consensusc-eac101c03d
Raw PPL is not a common unit across tokenizers or languages: vocabulary size and segmentation granularity change token counts and the conditional-probability factorization, rewriting what the number means; cross-lingual reporting should at least add BPB or a language-balanced task panel [BigScience2022BLOOM][Yong2022BLOOMPlus1][Ustun2024AyaModel].
Source papers · 3 [BigScience2022BLOOM][Yong2022BLOOMPlus1][Ustun2024AyaModel]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.2

    Raw PPL is not a common unit across tokenizers or languages: vocabulary size and segmentation change token counts and the probability factorization, altering what the number means; cross-lingual reporting should at least add BPB or a language-balanced task panel [BigScience2022BLOOM][Yong2022BLOOMPlus1][Ustun2024AyaModel].
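
A minimal sketch of the BPB conversion this recommends (token counts and losses hypothetical):

```python
import math

# Bits-per-byte normalizes token-level cross-entropy by the byte count of the
# underlying text, making the number tokenizer-independent.
def bits_per_byte(mean_loss_nats: float, n_tokens: int, n_bytes: int) -> float:
    return (mean_loss_nats * n_tokens) / (math.log(2) * n_bytes)

# Hypothetical: two tokenizers over the same 1000-byte text. The coarse
# tokenizer shows much higher PPL (e^2.5 ≈ 12.2 vs e^2.0 ≈ 7.4), yet both
# compress the text identically in BPB.
print(bits_per_byte(2.0, n_tokens=250, n_bytes=1000))  # ≈ 0.721
print(bits_per_byte(2.5, n_tokens=200, n_bytes=1000))  # ≈ 0.721
```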

consensusc-7790b11b55
"PPL barely moved" is not sufficient sign-off for compression/sparsification: compression can preserve average loss while changing the output distribution and error modes, dropping task scores; behavioral distribution distances (e.g., JS divergence) plus multi-task panels track deployment risk more closely [KhanalCapone2024CompressionTasks][HongLiu2022SameLossBetterDownstream].
Source papers · 1 [KhanalCapone2024CompressionTasks]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.2

    “PPL barely moved” is insufficient for compression/sparsification sign-off: compression can preserve average loss while changing output distributions and failure modes, reducing task scores; behavior-distribution distances (e.g., JS divergence) and multi-task panels track deployment risk more closely [KhanalCapone2024CompressionTasks][HongLiu2022SameLossBetterDownstream].

consensusc-88cdc71a5b
Alignment and instruction tuning shift "capability" from next-token prediction to instruction following and preference satisfaction, weakening base PPL as an explanatory variable for in-use quality; aligned models therefore need their own task/behavior evaluation loop [Chung2022ScalingInstructionFinetuned][Rafailov2023DPO][Touvron2023Llama2].
Source papers · 2 [Rafailov2023DPO][Touvron2023Llama2]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.2

    Alignment and instruction tuning shift “quality” from next-token prediction to instruction following and preference satisfaction, weakening base PPL as an explanatory variable for product quality; post-alignment models should be evaluated in a separate task/behavior loop [Chung2022ScalingInstructionFinetuned][Rafailov2023DPO][Touvron2023Llama2].

consensusc-d6afda52ad
The more actionable replacement is not "find another single scalar" but a two-stage pipeline: stage 1 uses PPL/BPB for training monitoring and data/compute tuning; stage 2 uses per-task scaling laws (including overtraining and model ladders) for continued-training, model-selection, and release decisions [Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders][Isik2024DownstreamScalingLaws].
Source papers · 1 [Bhagia2024TaskScalingLadders]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.headline_claimsgpt-5.2

    The practical replacement is not “find another scalar,” but a two-stage pipeline: stage 1 uses PPL/BPB for training monitoring and data/compute tuning; stage 2 uses per-task scaling laws (including overtraining and model ladders) for continued-training, selection, and release decisions [Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders][Isik2024DownstreamScalingLaws].
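
A minimal sketch of the two-stage shape (all numbers and functional forms are illustrative assumptions, not the cited papers' exact fits):

```python
import numpy as np
from scipy.optimize import curve_fit

# Stage 1: fit loss vs. compute on ladder runs (toy data). Stage 2: map the
# predicted loss to a task score through a saturating link.
C = np.array([1e18, 3e18, 1e19, 3e19, 1e20])     # FLOPs of ladder runs
loss = np.array([3.20, 3.00, 2.80, 2.65, 2.50])  # val loss (monitoring scalar)
acc = np.array([0.28, 0.33, 0.42, 0.51, 0.62])   # task accuracy per run

stage1 = lambda c, a, b, e: a * (c / 1e18) ** (-b) + e           # L(C) = a*C^-b + E
p1, _ = curve_fit(stage1, C, loss, p0=(1.2, 0.2, 2.0), maxfev=20000)

stage2 = lambda L, k, L0, f: f + (1 - f) / (1 + np.exp(k * (L - L0)))  # acc(L)
p2, _ = curve_fit(stage2, loss, acc, p0=(3.0, 2.7, 0.25), maxfev=20000)

L_next = stage1(1e21, *p1)   # extrapolate the next rung of the ladder
print(f"predicted loss {L_next:.2f} -> predicted accuracy {stage2(L_next, *p2):.2f}")
```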

contestedc-aa6dadd74a
The counterexamples cluster in out-of-training-loop decisions: the same loss can still yield different downstream results [HongLiu2022SameLossBetterDownstream]; compression/sparsification can leave PPL nearly unchanged while task scores drop [KhanalCapone2024CompressionTasks]; threshold-style tasks make extrapolating discrete scores from a smooth loss unstable [Schaeffer2024WhyElusive].
Source papers · 2 [KhanalCapone2024CompressionTasks][Schaeffer2024WhyElusive]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[0].countergpt-5.2

    [counter to Camp A: PPL remains the most reliable primary variable (at l] Counterexamples concentrate in out-of-loop decisions: same loss can yield different downstream outcomes [HongLiu2022SameLossBetterDownstream]; compression/sparsification can keep PPL nearly flat while task scores drop [KhanalCapone2024CompressionTasks]; threshold tasks make extrapolation from smooth loss to discrete scores unstable [Schaeffer2024WhyElusive].

contestedc-b2896a7b5b
The weakness is that the evidence still skews toward research settings: per-task scaling laws need stable eval protocols and enough checkpoints/model ladders; for teams that change tokenizer or recipe frequently, the adoption cost may exceed simply keeping PPL as a coarse filter.
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[1].countergpt-5.2

    [counter to Camp B: PPL is only stage-1 inside a task-scaling pipeline] The weakness is that evidence is still somewhat “research-setting”: per-task scaling needs stable eval protocols and enough checkpoints/model ladders; for teams frequently changing tokenizers or recipes, adoption cost may exceed keeping PPL as a coarse filter.

contestedc-5de9ec860d
The risk is panel-driven overfitting: if the panel mismatches the product distribution, it can pull optimization off course; panels are also expensive and cannot replace the high-frequency in-loop training signal.
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[2].countergpt-5.2

    [counter to Camp C: stop searching for a scalar; define quality via stan] The risk is panel-driven overfitting: if panels mismatch product distributions, optimization can drift; panels are also expensive and cannot replace high-frequency in-loop signals.

contestedc-1e9d4a5356
Evidence here is still sparse: even if token weighting is more principled, it must answer how it stays compatible with engineering controllability and how it connects to task scaling laws. Without a shared protocol, swapping the training objective outright adds uncontrolled variables.
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_reportperplexity-downstream-performance· report.positions[3].countergpt-5.2

    [counter to Camp D: the issue is ontological—next-token loss is not the ] Evidence is still sparse: even if token weighting is more principled, it must answer how it integrates with engineering controllability and task scaling. Without standard protocols, replacing the training objective outright adds uncontrolled variables.

consensusc-44ca506a87
With modern Transformer components present (e.g., QK-Norm, tied embeddings, residual scaling), original µP's zero-shot LR transfer across width is often broken by module-scale mismatches; Complete-P's core value is bringing these modules into the scaling rules and collapsing transfer sign-off onto one hard criterion: the coord-check table/plot [Cerebras2024CompleteP][Yang2022muP].
Source papers · 2 [Cerebras2024CompleteP][Yang2022muP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.headline_claimsgpt-5.2

    With modern Transformer components (e.g., QK-Norm, tied embeddings, residual scaling), original µP’s width-wise zero-shot LR transfer is often broken by module-scale mismatches; Complete-P’s practical value is to bring those modules into scaling rules and reduce transfer sign-off to one hard criterion: the coord-check table/plot [Cerebras2024CompleteP][Yang2022muP].
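
A minimal coord-check sketch (a toy MLP stands in for a real model; this shows the diagnostic itself, not the µP/Complete-P scaling rules):

```python
import torch, torch.nn as nn

# Coord check: after a training step, per-layer activation RMS should stay
# O(1) across widths; RMS that trends with width signals a broken
# parameterization. Toy SP model, so the trend here is the thing to watch.
def coord_check(width: int, lr: float = 1e-2) -> float:
    torch.manual_seed(0)
    model = nn.Sequential(nn.Linear(64, width), nn.ReLU(), nn.Linear(width, 64))
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    x = torch.randn(32, 64)
    loss = model(x).pow(2).mean()
    opt.zero_grad(); loss.backward(); opt.step()
    with torch.no_grad():
        h = model[1](model[0](x))             # post-activation of the wide layer
        return h.pow(2).mean().sqrt().item()  # activation RMS after 1 step

for w in (256, 1024, 4096):
    print(f"width={w:5d}  activation RMS after 1 step: {coord_check(w):.3f}")
# µP/Complete-P rules are designed to make these curves flat across widths.
```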

consensusc-278dc89166
Extending transfer from "width only" to "width + depth" requires a new scaling limit: depth-µP's rules do not come from naively reapplying 1/√width-style heuristics to layer count, but from residual dynamics and a distinct infinite-depth feature-learning limit [Bordelon2023DepthwiseTransfer][Yang2023TPVI].
Source papers · 2 [Bordelon2023DepthwiseTransfer][Yang2023TPVI]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.headline_claimsgpt-5.2

    Extending transfer from “width-only” to “width+depth” requires a new scaling limit: depth-µP is not a heuristic reuse of 1/√width-like rules over layers, but follows from residual dynamics and a distinct infinite-depth feature-learning limit [Bordelon2023DepthwiseTransfer][Yang2023TPVI].

consensusc-e16bad96d1
fp8 turns "is the same LR still stable" into a numeric-range problem: what u-µP/unit scaling transfers is not just the LR but also RMS constraints on activations/gradients/weights; otherwise bf16→fp8 transfer error can be dominated by overflow/underflow rather than optimization theory [Blake2024UMUP][Micikevicius2022FP8].
Source papers · 2 [Blake2024UMUP][Micikevicius2022FP8]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.headline_claimsgpt-5.2

    fp8 turns “is the same LR stable” into a numerical-range problem: u-µP/unit scaling transfers not only LR but also RMS constraints on activations/gradients/weights; otherwise bf16→fp8 transfer error can be dominated by overflow/underflow rather than optimization theory [Blake2024UMUP][Micikevicius2022FP8].
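
A minimal sketch of the numeric-range audit this implies (fp8 e4m3 has max normal ≈ 448 and smallest subnormal ≈ 2⁻⁹; the RMS values below are illustrative):

```python
import torch

# Check whether a tensor's dynamic range fits fp8 e4m3 before casting:
# unit-scaled tensors (RMS ≈ 1) sit comfortably in-range; large-RMS tensors
# start clipping, which is exactly what unit-scaling rules try to prevent.
def fp8_e4m3_report(t: torch.Tensor) -> dict:
    a = t.abs()
    return dict(
        rms=t.pow(2).mean().sqrt().item(),
        overflow=(a > 448.0).float().mean().item(),       # above max normal
        underflow=((a > 0) & (a < 2.0 ** -9)).float().mean().item(),
    )

torch.manual_seed(0)
for rms in (1.0, 150.0):
    print(rms, fp8_e4m3_report(torch.randn(1_000_000) * rms))
```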

consensusc-bdb9b37ba4
For teams with an already-stable SP recipe, force-migrating to µP often has negative ROI: empirical formulas / joint scaling laws give LR and batch starting points at near-zero engineering cost; but their transferability assumes model shape and recipe held fixed, and once the aspect ratio or HP combination changes, the optimum can drift by up to ~3× [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][McLeish2025Gemstones].
Source papers · 3 [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][McLeish2025Gemstones]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.headline_claimsgpt-5.2

    For teams with stable SP recipes, force-migrating to µP often has negative ROI: empirical formulas/joint scaling laws provide near-zero-engineering starting points for LR and batch; but transferability depends on keeping model shape and recipe fixed, and once aspect ratio or HP combinations change, optima can drift by ~3× [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][McLeish2025Gemstones].

consensusc-314781e824
Weight decay and β₂ are often the dominant error sources outside µP's transfer closure: under AdamW, wd and LR are not interchangeable, and changing wd moves both the stability boundary and the optimal LR; the pragmatic move is to hand wd/β₂ to cost-aware local BO (e.g., CARBS) and close the gap with 10–20 trials [Loshchilov2017AdamW][Cohen2022EdgeStability][Kosson2025WDMoreThanMuP][CARBS2024].
Source papers · 4 [Loshchilov2017AdamW][Cohen2022EdgeStability][Kosson2025WDMoreThanMuP][CARBS2024]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.headline_claimsgpt-5.2

    Weight decay and β₂ are often dominant error sources outside µP’s transfer closure: under AdamW, wd is not equivalent to L2 and changing wd shifts stability boundaries and the optimal LR; a more pragmatic approach is to hand wd/β₂ to cost-aware local BO (e.g., CARBS) and close the gap in 10–20 trials [Loshchilov2017AdamW][Cohen2022EdgeStability][Kosson2025WDMoreThanMuP][CARBS2024].

contestedc-486c7cf724
The pushback makes two points: first, migrating a mature SP stack to µP is costly in engineering, and empirical formulas are already near-optimal under a fixed recipe [Dey2023CerebrasGPT][Bi2024DeepSeekLLM]; second, variables like wd/β₂ dominate LR-transfer error, so µP alone cannot close the loop [Kosson2025WDMoreThanMuP][Loshchilov2017AdamW].
Source papers · 5 [Dey2023CerebrasGPT][Bi2024DeepSeekLLM][Kosson2025WDMoreThanMuP][Loshchilov2017AdamW] · arXiv 2304.03208 · arxiv.org
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transferarXiv 2304.03208· report.positions[0].countergpt-5.2

    [counter to Camp A: µP (upgraded to Complete-P) should be the default; f] Opponents argue (1) migrating mature SP stacks to µP is costly, and formulas are already close enough under a fixed recipe [Dey2023CerebrasGPT][Bi2024DeepSeekLLM]; (2) wd/β₂ dominate LR-transfer error, so µP alone cannot close the loop [Kosson2025WDMoreThanMuP][Loshchilov2017AdamW].

contestedc-f2d34371c0
The pushback is that formula stability depends on holding shape and HP combinations fixed: Gemstones shows aspect-ratio/HP changes can shift the optimum by ~3×; in addition, modern modules and fp8 create a need for scale diagnostics, and formulas alone cannot localize the cause of a failure [McLeish2025Gemstones][Cerebras2024CompleteP][Blake2024UMUP].
Source papers · 3 [McLeish2025Gemstones][Cerebras2024CompleteP][Blake2024UMUP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[1].countergpt-5.2

    [counter to Camp B: empirical formulas + small sweeps are enough; µP is ] The counterargument is that formula stability depends on keeping shape and HP combinations fixed: Gemstones shows aspect ratio/HP changes can shift optima by ~3×; additionally, modern modules and fp8 require scale diagnostics, and formulas alone cannot localize failures [McLeish2025Gemstones][Cerebras2024CompleteP][Blake2024UMUP].

contestedc-d99ec8e9c1
The pushback is that the binding constraints at pretraining scale are trial budget and reproducibility: end-to-end search needs many trials or a high-fidelity proxy task, whereas µP/formulas shrink the search space and peel the structural variables out of the search [Yang2022muP][Dey2023CerebrasGPT].
Source papers · 2 [Yang2022muP][Dey2023CerebrasGPT]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[2].countergpt-5.2

    [counter to Camp C: end-to-end Bayesian/automatic optimization will repl] Opponents argue that pretraining-scale constraints are trial budget and reproducibility: end-to-end search needs many trials or high-fidelity proxies; µP/formulas reduce the search space by factoring structural variables out of the search [Yang2022muP][Dey2023CerebrasGPT].

contestedc-b2f04a8bbd
The pushback is that jointly modeling every variable explodes engineering complexity and trial budget; the more controllable recipe is to first align structural variables with Complete-P/µP, then hand the non-transferable ones to local BO, rather than dragging the transferable part back into the search [Cerebras2024CompleteP][Yang2022muP][CARBS2024].
Source papers · 3 [Cerebras2024CompleteP][Yang2022muP][CARBS2024]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_reportmup-hp-transfer· report.positions[3].countergpt-5.2

    [counter to Camp D: transfer is dominated by non-transferable HPs; wd/β₂] The counterargument is that jointly modeling everything explodes engineering complexity and trial budget; a more controllable approach is to align structural variables with Complete-P/µP first, then hand non-transferable ones to local BO, instead of dragging the transferable part back into the search [Cerebras2024CompleteP][Yang2022muP][CARBS2024].

consensusc-d283326e35
For scraped web corpora, strong exact/near-exact dedup at the substring/MinHash level usually lowers perplexity and train-test leakage at the same time; Lee et al. [Lee2021Dedup] report near-duplicate rates reaching double-digit percentages across multiple corpora, and deduped models are less prone to verbatim regurgitation of training snippets.
Source papers · 2 [Lee2021Dedup] · arXiv 2107.06499 · arxiv.org
2 observations · Pretrain Data Repetition
Evidence (2)
  • topic_reportpretrain-data-repetitionarXiv 2107.06499· report.headline_claimsgpt-5.2

    For scraped web corpora, strong exact/near-exact dedup at the substring/MinHash level often lowers perplexity while reducing train-test leakage; Lee et al. [Lee2021Dedup] report near-dup rates reaching double-digit percentages in multiple corpora, and deduped models regurgitate training snippets less often.

  • topic_reportpretrain-data-repetitionarXiv 2107.06499· report.headline_claimsgpt-5.4

    For scraped web corpora, exact/near-exact dedup at the substring + MinHash level often lowers perplexity, train-test leakage, and extractable memorization at once; document-level exact hashing alone is not enough [Lee2021Dedup].
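
A minimal bottom-k MinHash sketch of the near-dup detection described above (toy shingling and threshold; production pipelines add sharding and LSH banding):

```python
import hashlib

# Estimate Jaccard similarity of two documents' shingle sets from the k
# smallest hashes, then flag pairs above a near-dup threshold.
def bottom_k(text: str, n: int = 5, k: int = 64) -> set[int]:
    w = text.split()
    shingles = {" ".join(w[i:i + n]) for i in range(len(w) - n + 1)}
    return set(sorted(int(hashlib.md5(s.encode()).hexdigest(), 16)
                      for s in shingles)[:k])

def est_jaccard(a: set[int], b: set[int]) -> float:
    k = min(len(a), len(b))
    bottom_union = set(sorted(a | b)[:k])  # k smallest hashes of the union
    return len(bottom_union & a & b) / k

doc1 = " ".join(str(i) for i in range(200))    # toy "document"
doc2 = doc1.replace(" 100 ", " one hundred ")  # near-duplicate, one small edit
sim = est_jaccard(bottom_k(doc1), bottom_k(doc2))
print(f"estimated Jaccard ≈ {sim:.2f}")        # ~0.9: flag as near-duplicate
```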

consensusc-56364a74b5
"Within-corpus dedup" is not enough to control repeated exposure: the same document recurring across multiple public pretraining corpora is common, so cross-corpus fingerprinting and global audits are needed to actually count exposures [Elazar2023WhatsInMyBigData].
Source papers · 2 [Elazar2023WhatsInMyBigData] · arXiv 2310.20707 · arxiv.org
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetitionarXiv 2310.20707· report.headline_claimsgpt-5.2

    Per-corpus dedup is insufficient to control repeated exposure: the same documents recur across public pretraining corpora, so cross-corpus fingerprinting and global audits are required to account for exposures [Elazar2023WhatsInMyBigData].

consensusc-460da41bf2
On a finite, "bleached" high-quality pool, uniform repetition comes close to the marginal value of fresh tokens for roughly 2–4 epochs; beyond that, the "fresh-token equivalence" rate decays quickly [Muennighoff2023DataConstrained].
Source papers · 2 [Muennighoff2023DataConstrained] · arXiv 2305.16264 · arxiv.org
2 observations · Pretrain Data Repetition
Evidence (2)
  • topic_reportpretrain-data-repetitionarXiv 2305.16264· report.headline_claimsgpt-5.2

    On a finite, bleached high-quality pool, uniform repetition is close to fresh-token marginal gains up to ~2–4 epochs; beyond that, the “fresh-token equivalence” drops quickly [Muennighoff2023DataConstrained].

  • topic_reportpretrain-data-repetition· report.headline_claimsgpt-5.4

    Uniform repetition on a finite high-quality pool is close to fresh-token equivalent for roughly 2–4 epochs, after which marginal returns fall quickly; this heuristic does not apply to hot-subset over-exposure [Muennighoff2023DataConstrained].
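
A minimal sketch of the saturating "effective data" form behind this heuristic (the functional shape follows [Muennighoff2023DataConstrained]; the decay constant here is an assumed value of the same order as the paper's fits):

```python
import math

# Effective tokens under repetition: U unique tokens, R = epochs - 1 repeats,
# r_star = fitted decay constant (order ~15 in the paper; exact value here is
# an assumption of this sketch).
def effective_tokens(U: float, epochs: float, r_star: float = 15.0) -> float:
    R = max(epochs - 1.0, 0.0)
    return U + U * r_star * (1.0 - math.exp(-R / r_star))

U = 100e9  # 100B unique tokens
for ep in (1, 2, 4, 10, 40):
    frac = effective_tokens(U, ep) / (U * ep)
    print(f"{ep:3d} epochs: {frac:.2f} of fresh-token value")
# ~0.98 at 2 epochs, ~0.93 at 4, then falling fast — matching the
# "≤4 epochs is nearly free" heuristic.
```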

consensusc-5890a10022
The main risk of repetition is not epoch count but skew in the exposure distribution: even a small hot subset repeated at high multiplicity causes measurable degradation and leaves an interpretable fingerprint in induction heads [Hernandez2022RepeatedData].
Source papers · 2 [Hernandez2022RepeatedData] · arXiv 2205.10487 · arxiv.org
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetitionarXiv 2205.10487· report.headline_claimsgpt-5.2

    The main risk is not epoch count but exposure skew: even a small hot subset repeated many times causes measurable degradation and leaves interpretable induction-head fingerprints [Hernandez2022RepeatedData].

consensusc-5197e95008
For sensitive/copyright/eval data, budgeting by exposure count is more robust than "dedup as hard as you can": default to 0–1 exposures, backed by provenance plus prefix/suffix dedup; repeated exposure drives up recoverable-memorization risk as model scale and context length grow [Carlini2022Memorization][Deng2023BenchmarkContamination].
Source papers · 1 [Carlini2022Memorization]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.headline_claimsgpt-5.2

    For sensitive/copyright/eval data, exposure accounting is safer than “best-effort dedup”: default to 0–1 exposure with provenance and prefix/suffix dedup; repeated exposure increases extractable-memorization risk, which grows with model size and longer contexts [Carlini2022Memorization][Deng2023BenchmarkContamination].

contestedc-f680ffa182
For a bleached, finite high-quality pool, repetition is not equivalent to noise: under data constraints, uniform multi-epoch training spends the compute budget fully and stays close to fresh-token value for the first few epochs [Muennighoff2023DataConstrained]. Treating all repetition as something to delete mistakes a compute-allocation problem for a data-hygiene problem.
Source papers · 1 [Muennighoff2023DataConstrained]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[0].countergpt-5.2

    [counter to Camp A: Dedup as much as possible (treat repetition as noise] For a bleached finite high-quality pool, repetition is not equivalent to noise: under data constraints, uniform multi-epoch can spend compute effectively and behaves close to fresh tokens for the first few epochs [Muennighoff2023DataConstrained]; treating all repetition as deletable mistakes a compute-allocation problem for a data-hygiene one.

contestedc-1ed069626a
"≤4 epochs" is not a universal talisman: once the sampling distribution creates hot subsets, non-uniform repetition causes measurable degradation and leaves mechanistic fingerprints in induction heads [Hernandez2022RepeatedData]; moreover, for sensitive/eval data, repeated exposure raises recoverable-memorization risk with scale and longer contexts [Carlini2022Memorization][Deng2023BenchmarkContamination].
Source papers · 2 [Hernandez2022RepeatedData][Carlini2022Memorization]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[1].countergpt-5.2

    [counter to Camp B: Uniform repetition ≤4 epochs is almost free (treat r] “≤4 epochs” is not a universal shield: once sampling creates hot subsets, non-uniform repetition causes measurable degradation and leaves induction-head fingerprints [Hernandez2022RepeatedData]; and for sensitive/eval data, repeated exposure raises extractable-memorization risk with scale and context length [Carlini2022Memorization][Deng2023BenchmarkContamination].

contestedc-7cd72b9987
Semantic dedup has a higher false-kill cost: the embedding choice and thresholds change the retained distribution, and temporal/domain drift can invalidate a fixed threshold; absent strong audits, semantic dedup may delete rare-but-useful long-tail content and reduce coverage. The safer starting point is still substring/MinHash plus cross-corpus audits [Lee2021Dedup][Elazar2023WhatsInMyBigData], with semantic dedup as an add-on when redundancy stays high.
Source papers · 2 [Lee2021Dedup][Elazar2023WhatsInMyBigData]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetition· report.positions[2].countergpt-5.2

    [counter to Camp C: Semantic dedup is the real battleground (exact dedup] Semantic dedup has higher false-positive costs: embedding choice and thresholds change the retained distribution, and temporal/domain drift can invalidate a fixed threshold; without strong audits it may delete rare long-tail content. Start from substring/MinHash + cross-corpus audits, adding semantic dedup only when redundancy stays high [Lee2021Dedup][Elazar2023WhatsInMyBigData].

contestedc-3721f798bf
"Absolute zero" is hard to prove in practice: cross-corpus reuse and template rewrites turn "never seen" into an unfalsifiable slogan [Elazar2023WhatsInMyBigData]. The more auditable practice is exposure accounting: default to 0–1 exposures, keep provenance and a fingerprint ledger, and allow necessary cleaning and retrospection under controlled conditions [Deng2023BenchmarkContamination].
Source papers · 2 [Elazar2023WhatsInMyBigData] · arXiv 2310.20707 · arxiv.org
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_reportpretrain-data-repetitionarXiv 2310.20707· report.positions[3].countergpt-5.2

    [counter to Camp D: Zero repetition for sensitive/eval/copyright data (t] Absolute zero is hard to prove operationally: cross-corpus reuse and template rewrites make “never seen” an unfalsifiable slogan [Elazar2023WhatsInMyBigData]. A more auditable practice is exposure accounting: default 0–1 exposures with provenance and a fingerprint ledger, allowing controlled cleaning and retrospection [Deng2023BenchmarkContamination].

consensusc-7792cb7f35
Under a fixed HP-search budget (e.g., ≤N trials within a fixed schedule family), the gap between optimizers usually shrinks to roughly half of what unconstrained reports claim; otherwise an A/B is mostly comparing tuning effort rather than the algorithms themselves [Dahl2023AlgoPerf][Agarwal2020LRConfound][Zhao2024DeconstructOptimizerLM].
Source papers · 2 [Dahl2023AlgoPerf][Agarwal2020LRConfound]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.headline_claimsgpt-5.2

    Under a fixed HP-search budget (e.g., ≤N trials within a fixed schedule family), optimizer gaps often shrink to roughly half of what unconstrained reports claim; otherwise A/Bs mostly compare tuning effort rather than algorithms [Dahl2023AlgoPerf][Agarwal2020LRConfound][Zhao2024DeconstructOptimizerLM].

consensusc-66430c7e09
AdamW remains the common default at ≥70B not because it wins every setting, but because LR transfer under µP compresses cross-width/cross-family LR search from "start over each time" to "light calibration" [Lingle2024muPTransfer], and its failure modes are more predictable [Das2024PreconditioningAdam][Schlotthauer2025Budget].
Source papers · 4 [Lingle2024muPTransfer][Das2024PreconditioningAdam][Schlotthauer2025Budget] · arXiv 2404.05728 · arxiv.org
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscapearXiv 2404.05728· report.headline_claimsgpt-5.2

    AdamW stays a common default at ≥70B not because it wins every setting, but because μP learning-rate transfer compresses cross-width/cross-family LR search from “start over” to “light calibration”[Lingle2024muPTransfer], with more predictable failure modes [Das2024PreconditioningAdam][Schlotthauer2025Budget].

consensusc-349150eb31
Muon's deployability comes from parameter partitioning: apply approximate second-order (orthogonalized) updates only to the hidden 2D weights, and keep the sensitive parameters (embeddings/norms/head) on AdamW, concentrating the risk in a controllable subset; public evidence still skews toward speedruns and small scale, so stability and the benefit envelope at ≥7B remain to be filled in [Jordan2024Muon][Wortsman2023ProxyInstability].
Source papers · 2 [Jordan2024Muon][Wortsman2023ProxyInstability]
2 observations · Optimizer Landscape
Evidence (2)
  • topic_reportoptimizer-landscape· report.headline_claimsgpt-5.2

    Muon’s deployability comes from parameter partitioning: apply approximate second-order (orthogonalized) updates only to hidden 2D weights while keeping sensitive parameters (embeddings/norms/heads) on AdamW, concentrating risk into a controllable subset; public evidence skews toward speedruns/small scale, leaving ≥7B stability and benefit bounds open [Jordan2024Muon][Wortsman2023ProxyInstability].

  • topic_reportoptimizer-landscape· report.headline_claimsgpt-5.2

    Muon’s deployability comes from parameter partitioning: Newton–Schulz orthogonalized updates only on hidden 2D weights, while embeddings/norms/heads stay on AdamW; this isolates near-second-order instability and tuning risk to a controllable subset [Jordan2024Muon][Wortsman2023ProxyInstability].
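
A minimal sketch of the partitioning step itself (the Muon update is elided; the name filters and the toy model are assumptions, not the reference implementation):

```python
import torch, torch.nn as nn

# Route hidden 2D weight matrices to an orthogonalized-update optimizer
# (placeholder here) and keep embeddings/norms/heads on AdamW.
def partition_params(model: nn.Module):
    hidden_2d, fallback = [], []
    for name, p in model.named_parameters():
        is_hidden_matrix = (
            p.ndim == 2
            and "embed" not in name      # assumed naming convention
            and "lm_head" not in name
            and "norm" not in name
        )
        (hidden_2d if is_hidden_matrix else fallback).append(p)
    return hidden_2d, fallback

layer = nn.TransformerEncoderLayer(256, 8, batch_first=True)
model = nn.TransformerEncoder(layer, num_layers=2)
hidden, rest = partition_params(model)
opt_rest = torch.optim.AdamW(rest, lr=3e-4, weight_decay=0.1)
# hidden would go to the Muon-style optimizer; rest stays on AdamW.
print(f"{len(hidden)} matrices on the Muon-style path, {len(rest)} tensors on AdamW")
```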

consensusc-0d05fd2e76
SOAP compresses Shampoo's extra hyperparameters from 4 to 1 and matches AdamW wall-clock at 360M–1.3B [Vyas2024SOAP]; but the evidence chain for "second-order can LR-transfer like µP" is still incomplete, and theory plus parameterization constraints suggest it does not hold automatically [Ishikawa2023SecondOrderParam].
Source papers · 3 [Vyas2024SOAP][Ishikawa2023SecondOrderParam] · arXiv 2409.11321 · arxiv.org
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscapearXiv 2409.11321· report.headline_claimsgpt-5.2

    SOAP reduces Shampoo’s extra HPs from 4 to 1 and reaches near AdamW wall-clock at 360M–1.3B[Vyas2024SOAP]; but evidence for “second-order LR transfer like μP” is still incomplete, and theory/parameterization constraints suggest it won’t come for free [Ishikawa2023SecondOrderParam].

consensusc-1dcfa87be5
Under memory constraints, low-state methods that leave the training loop unchanged are often more robust than methods that change the optimization geometry: Apollo replaces per-param second moments with per-tensor scalars, pushing optimizer state toward SGD levels while approaching AdamW loss at 7B/13B [Zhu2024Apollo]; GaLore, by contrast, saves memory via low-rank projection but adds hyperparameters and changes the dynamics, so its transfer risk is higher [Zhao2024GaLore][GurAri2018TinySubspace].
Source papers · 3 [Zhu2024Apollo][Zhao2024GaLore][GurAri2018TinySubspace]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.headline_claimsgpt-5.2

    Under VRAM constraints, low-state methods that don’t change the training loop are often more reliable than methods that change optimization geometry: Apollo replaces per-param second moments with per-tensor scalars, pushing state toward SGD levels while approaching AdamW loss at 7B/13B [Zhu2024Apollo]; GaLore saves memory via low-rank projection but adds HPs and changes dynamics, with higher transfer risk [Zhao2024GaLore][GurAri2018TinySubspace].

contestedc-dbda73db9f
The pushback: in controlled comparisons, much of "AdamW's advantage" comes from schedules and tuning inertia rather than the algorithm itself; given equal budgets, second-order / near-second-order methods may be the better deal on both wall-clock and tokens [Vyas2024SOAP][Zhao2024DeconstructOptimizerLM].
Source papers · 1 [Vyas2024SOAP]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.positions[0].countergpt-5.2

    [counter to Camp A: AdamW won’t be retired (highest default priority)] Critics argue many “AdamW advantages” come from schedule and tuning inertia rather than the algorithm; with equal budgets, second-order or near-second-order methods may win on both wall-clock and tokens [Vyas2024SOAP][Zhao2024DeconstructOptimizerLM].

contestedc-2b176373b1
The pushback: public evidence skews toward speedruns and small scale, and small-scale proxies can systematically miss large-scale instabilities; until controlled ≥7B/≥30B head-to-heads exist, making Muon the default shifts the risk onto production runs [Wortsman2023ProxyInstability][Dahl2023AlgoPerf].
Source papers · 2 [Wortsman2023ProxyInstability][Dahl2023AlgoPerf]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.positions[1].countergpt-5.2

    [counter to Camp B: Muon is the next default (but only as a hybrid)] Critics emphasize public evidence is skewed toward speedruns/small-scale, and small proxies can systematically miss large-scale instabilities; without controlled ≥7B/≥30B head-to-heads, defaulting to Muon shifts risk onto production [Wortsman2023ProxyInstability][Dahl2023AlgoPerf].

contestedc-cb20dafd76
The pushback: the production barrier for second-order methods is not only compute but "transferable tuning". Theory and experiments both suggest second-order methods may need specific parameterizations and LR-scaling rules to survive width scaling; otherwise every scale has to be re-tuned [Ishikawa2023SecondOrderParam][Lingle2024muPTransfer].
Source papers · 2 [Ishikawa2023SecondOrderParam][Lingle2024muPTransfer]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_reportoptimizer-landscape· report.positions[2].countergpt-5.2

    [counter to Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Critics note production barriers are not only compute, but transferable tuning. Theory and evidence suggest second-order methods may require specific parameterizations and LR-scaling rules under width scaling, or every scale must be re-tuned [Ishikawa2023SecondOrderParam][Lingle2024muPTransfer].

contestedc-8a0328a518
The pushback: even under controlled budgets, second-order / near-second-order methods can still deliver lower loss at the same wall-clock, especially once the systems implementations mature (distributed Shampoo, SOAP's hyperparameter simplification) [Shi2023DistShampoo][Vyas2024SOAP].
Source papers · 2 [Shi2023DistShampoo][Vyas2024SOAP]
3 observations · Optimizer Landscape
Evidence (3)
  • topic_reportoptimizer-landscape· report.positions[3].countergpt-5.2

    [counter to Camp D: optimizers matter less; many gains are evaluation ar] Opponents argue that even under controlled budgets, second-order/near-second-order can still yield lower loss at the same wall-clock, especially as systems implementations mature (distributed Shampoo, SOAP’s HP simplification) [Shi2023DistShampoo][Vyas2024SOAP].

  • topic_reportoptimizer-landscape· report.positions[3].countergpt-5.2

    [counter to Camp D: optimizers matter less; many gains are evaluation ar] Even if gaps shrink, second-order/hybrid routing can still yield stable wall-clock wins under specific tensor and system constraints; attributing everything to artifacts overstates the case [Shi2023DistShampoo][Vyas2024SOAP].

  • topic_reportoptimizer-landscape· report.positions[3].countergpt-5.2

    [counter to Camp D: optimizers matter less; many gains are evaluation ar] Revision to c-8a0328a518: even under controlled budgets, second-/near-second-order can still yield lower loss at similar wall-clock, especially with mature systems (distributed Shampoo, SOAP) [Shi2023DistShampoo][Vyas2024SOAP].

consensusc-37f5821f1a
Once FA2 varlen attention takes document boundaries as a cu_seqlens input, the systems cost of a per-doc causal mask drops from "materialize a large explicit mask tensor" to "an API parameter", letting near-100% packing ratios and strict cross-doc isolation hold at the same time [FlashAttention2Varlen][FlashAttention32024].
Source papers · 1 [FlashAttention32024]
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.headline_claimsgpt-5.2

    Once FA2 varlen attention takes document boundaries via cu_seqlens, the systems cost of per-doc causal masking shifts from explicit large mask tensors to lightweight API parameters, making near-100% packing ratios compatible with strict cross-doc isolation [FlashAttention2Varlen][FlashAttention32024].
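
A minimal sketch of building that boundary input (the cumulative-lengths convention matches FA2's varlen interface; the actual kernel call, e.g. flash_attn's flash_attn_varlen_func, is elided):

```python
import torch

# cu_seqlens: cumulative doc lengths for a packed row; doc i occupies
# [cu_seqlens[i], cu_seqlens[i+1]). Passed with causal=True, the varlen kernel
# gives each token a causal mask restricted to its own document — no N×N
# mask tensor is ever materialized.
doc_lens = torch.tensor([1024, 300, 2770], dtype=torch.int32)  # packed docs
cu_seqlens = torch.zeros(len(doc_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.cumsum(doc_lens, dim=0)
max_seqlen = int(doc_lens.max())

print(cu_seqlens)   # tensor([   0, 1024, 1324, 4094], dtype=torch.int32)
print(max_seqlen)   # 2770 — also required by the varlen interface
```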

consensusc-91029ab837
Naive concat with cross-doc visibility systematically underestimates training loss, because later tokens can attend to tokens from unrelated documents; the first value of a per-doc mask is "making the objective match evaluation", and only secondarily "leak prevention" [Krell2021Packing].
Source papers · 1 [Krell2021Packing]
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.headline_claimsgpt-5.2

    Naive concat with cross-doc visibility systematically underestimates training loss because later tokens can attend to unrelated-document tokens; the primary value of per-doc masking is objective–evaluation alignment, not just leak prevention [Krell2021Packing].

consensusc-6d5c209f43
The engineering meaning of short-to-long is turning long-window training from a whole-run cost into a tail-stage budget: LLaMA-3 discloses ~95% of compute at 8K with a final stage extended to 128K [Llama32024]; Qwen2.5 gives a reproducible configuration that moves the RoPE base from 10K to 1M [Qwen25Tech].
Source papers · 1 [Llama32024]
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.headline_claimsgpt-5.2

    The engineering meaning of short-to-long is to turn long-window training from a full-run cost into a tail-budget decision: LLaMA-3 reports ~95% compute at 8K with a final 128K stage [Llama32024], while Qwen2.5 provides a reproducible RoPE-base recipe from 10K to 1M [Qwen25Tech].

consensusc-8b1a262a08
There is direct evidence that "fewer truncations beats more aggressive concat-chunk": preserving document integrity and reducing token drops improves LM outcomes [Ding2024FewerTruncations], so the default for over-length documents should be split-then-pack rather than truncate-and-drop.
Source papers · 2 [Ding2024FewerTruncations] · arXiv 2404.10830 · arxiv.org
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-lengtharXiv 2404.10830· report.headline_claimsgpt-5.2

    There is direct evidence that “fewer truncations is more robust than aggressive concat-chunk”: preserving document integrity and avoiding token drops improves LM outcomes [Ding2024FewerTruncations], so the default for over-length docs should be split-then-pack rather than truncate-and-drop.
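
A minimal split-then-pack sketch (first-fit-decreasing bin packing over document lengths; purely illustrative, not the cited paper's packer):

```python
# Over-length docs are split into window-sized chunks instead of truncated,
# then chunks are greedily packed into fixed-length rows.
def split_then_pack(doc_lens: list[int], window: int) -> list[list[int]]:
    chunks = []
    for L in doc_lens:
        while L > window:          # split, never drop, over-length docs
            chunks.append(window)
            L -= window
        if L:
            chunks.append(L)
    rows: list[list[int]] = []
    for c in sorted(chunks, reverse=True):  # first-fit decreasing
        for row in rows:
            if sum(row) + c <= window:
                row.append(c)
                break
        else:
            rows.append([c])
    return rows

for row in split_then_pack([9000, 3000, 2500, 1500, 800, 200], window=4096):
    print(row, f"fill={sum(row)/4096:.0%}")
```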

consensusc-d67a1bf7f6
FIM has a stable recipe on code (StarCoder's 50% rate) [StarCoder2023][FIM2022], but making FIM the default objective for NL lacks a clean NL-only rate sweep; UL2's mixture-of-denoisers supplies motivation but also splits the objective-function accounting [UL22022].
Source papers · 4 [StarCoder2023][FIM2022][UL22022] · arXiv 2305.06161 · arxiv.org
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-lengtharXiv 2305.06161· report.headline_claimsgpt-5.2

    FIM has a stable code recipe (StarCoder at 50% rate) [StarCoder2023][FIM2022], but making FIM the default objective for NL lacks clean NL-only rate sweeps; UL2’s mixture-of-denoisers provides motivation but also fragments objective definitions [UL22022].

contestedc-9ee164c199
The pushback: real usage often crosses documents (prompting / retrieval concatenation), and hard boundary isolation can leave the model under-trained on cross-segment modeling; short-to-long may also introduce length-based overfitting or be unfriendly to some tasks.
3 observations · Packing Masking Length
Evidence (3)
  • topic_reportpacking-masking-length· report.positions[0].countergpt-5.2

    [counter to Camp A: short-to-long + per-doc masking (default engineering] Opponents argue that real usage often crosses documents (prompting/retrieval concatenation), so hard boundary isolation may under-train cross-segment modeling; and short-to-long may cause length overfitting or hurt some tasks.

  • topic_reportpacking-masking-length· report.positions[0].countergpt-5.2

    [counter to Camp A: per-doc masking + short-to-long (default engineering] Common objections: hard boundary isolation mismatches real usage (prompting/retrieval crosses documents), and length curricula may cause length overfitting or insufficient exposure to long sequences.

  • topic_reportpacking-masking-length· report.positions[0].counterep-20260214160829-csjmc

    [counter to Camp A: per-doc masking + short-to-long (default engineering] Opponents argue that hard isolation of document boundaries reduces cross-segment modeling capability, short-to-long may introduce length overfitting, and full-cycle mixed-length training remains a credible alternative.

contestedc-8cfcabf9ea
The public evidence lacks direct equal-compute, equal-throughput comparisons showing that full-run mixing is better; meanwhile, long-window training carries higher systems cost and data-construction complexity, which easily injects engineering noise into the conclusion.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[1].countergpt-5.2

    [counter to Camp B: uniformly mixed-length training (anti-curriculum, al] Public evidence still lacks direct equal-compute, matched-throughput comparisons proving always-mixed is better; long-window training also has higher systems cost and data complexity, injecting engineering noise into conclusions.

contestedc-f1de830f4d
Naive concat conflates "cross-doc tasks" with "throughput-optimization side effects": loss underestimation and evaluation mismatch are the expected outcome [Krell2021Packing]; the more controlled approach is to structure cross-doc links the way LinkBERT does, rather than leaving visibility random.
Source papers · 2 [Krell2021Packing] · arXiv 2107.02027 · arxiv.org
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-lengtharXiv 2107.02027· report.positions[2].countergpt-5.2

    [counter to Camp C: naive concat + cross-doc visible (cross-boundary by ] Naive concat conflates “cross-doc tasks” with “throughput side effects”: loss underestimation and evaluation mismatch are expected [Krell2021Packing]. A more controlled approach is LinkBERT-style structured cross-doc linking rather than random visibility.

contestedc-600af45ce9
Code-side evidence is solid, but NL lacks a clean rate sweep; mixture-of-denoisers also changes the objective's accounting, making perplexity/downstream comparisons more dependent on eval-set choice. Making it the default would spread that uncertainty across every training run.
1 observation · Packing Masking Length
Evidence (1)
  • topic_reportpacking-masking-length· report.positions[3].countergpt-5.2

    [counter to Camp D: FIM for everything (infilling as a universal default] Code-side evidence is strong, but NL lacks clean rate sweeps; mixture-of-denoisers also changes the objective definition, making perplexity/downstream comparisons more dependent on eval-set choice; defaulting to it spreads that uncertainty across all runs.

consensusc-aa89f2610c
Compute-optimal tokens/param is not a constant: public refits land in a wide 5–100 band and are sensitive to data mixture and batch-size schedule [DeepSeek2024LLM]; directly reusing Kaplan ≈ 1.7 [Kaplan2020ScalingLaws] or Chinchilla ≈ 20 [Hoffmann2022Chinchilla] can inflate budget-allocation error to the scale of "same compute, different optimum".
Source papers · 4 [DeepSeek2024LLM][Kaplan2020ScalingLaws][Hoffmann2022Chinchilla] · arXiv 2401.02954 · arxiv.org
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llmarXiv 2401.02954· report.headline_claimsgpt-5.2

    Compute-optimal tokens/param is not a constant: public refits place it in a wide 5–100 band and show sensitivity to data mixture and batch-size schedules [DeepSeek2024LLM]. Copying Kaplan≈1.7 [Kaplan2020ScalingLaws] or Chinchilla≈20 [Hoffmann2022Chinchilla] inflates budget-allocation error to “same compute, different optimum” magnitude.
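
A minimal worked example of how much the assumed ratio moves the N/D split under the standard C ≈ 6·N·D approximation (budget chosen for illustration):

```python
# Given a FLOPs budget C and an assumed tokens/param ratio rho, solve
# C = 6 * N * D with D = rho * N  =>  N = sqrt(C / (6 * rho)).
def allocate(C_flops: float, tokens_per_param: float):
    N = (C_flops / (6 * tokens_per_param)) ** 0.5  # params
    D = tokens_per_param * N                       # tokens
    return N, D

C = 1e24  # illustrative compute budget
for rho in (1.7, 20, 100):
    N, D = allocate(C, rho)
    print(f"tokens/param={rho:5.1f}  ->  N={N/1e9:6.1f}B params, D={D/1e12:5.2f}T tokens")
# The same budget maps to anything from a ~313B-param / 0.53T-token run
# (Kaplan-like ratio) to a ~41B-param / 4.1T-token run (ratio 100).
```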

consensusc-719fa1d538
In data-constrained regimes, repetition has an empirical safe window: up to ≤4 epochs, repeats are approximately equivalent to fresh tokens, after which the benefit decays at a fittable rate [Muennighoff2023DataConstrained]; treating D as a one-dimensional token count systematically overestimates "just train longer".
Source papers · 2 [Muennighoff2023DataConstrained] · arXiv 2305.16264 · arxiv.org
2 observations · Scaling Laws LLM
Evidence (2)
  • topic_reportscaling-laws-llmarXiv 2305.16264· report.headline_claimsgpt-5.2

    In data-constrained regimes, repetition has an empirical safe window: repeating up to ≤4 epochs is approximately equivalent to fresh tokens, after which gains decay at a fit-able rate [Muennighoff2023DataConstrained]. Treating D as one-dimensional tokens systematically overestimates “just train longer”.

  • topic_reportscaling-laws-llm· report.headline_claimsgpt-5.2

    Treating D as a 1D total-token count systematically overestimates “just train longer”: in data-constrained regimes, repetition up to ≤4 epochs is roughly comparable to fresh tokens, after which marginal gains decay at a fit-able rate [Muennighoff2023DataConstrained].

consensusc-4789f5cd2d
Loss scaling extrapolates stably into over-training, but any single downstream benchmark's score fluctuates widely during training and stabilizes only once the loss drops near a threshold [Gadre2024OverTraining]; "predicting a specific task score directly from loss" is therefore fluctuation-dominated early in training.
Source papers · 2 [Gadre2024OverTraining] · arXiv 2403.08540 · arxiv.org
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llmarXiv 2403.08540· report.headline_claimsgpt-5.2

    Loss scaling can extrapolate reliably into over-training, but individual downstream benchmark scores can vary widely during training until a loss threshold is reached [Gadre2024OverTraining]. Thus, “predict a specific task score from loss” is dominated by fluctuation early in training.

consensusc-0f382b6448
Task-level scaling needs its own model: a "model ladder + two-stage regression" pushes average prediction error down to 1.9% across 7 multiple-choice tasks [Bhagia2024TaskLadders], clearly beating approaches that force every task onto one loss power law.
Source papers · 2 [Bhagia2024TaskLadders] · arXiv 2412.04403 · arxiv.org
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llmarXiv 2412.04403· report.headline_claimsgpt-5.2

    Task-level scaling needs its own model: “model ladders + two-stage regression” achieves 1.9% average prediction error on 7 multiple-choice tasks [Bhagia2024TaskLadders], outperforming approaches that force all tasks onto a single loss power law.

consensusc-884e576e9d
Data mixture is an independent scaling axis: with model and token budgets fixed, changing only data filtering/recipe can produce ≥7 pp downstream gaps [Li2024DCLM]; gaps of that magnitude are usually larger than the second-order gains from moving tokens/param from 20 to 30.
Source papers · 1 [Li2024DCLM]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.headline_claimsgpt-5.2

    Data mixture is an independent scaling axis: at fixed model and token budgets, changing filtering/mixture alone can yield ≥7 pp downstream gaps [Li2024DCLM], often larger than second-order gains from tweaking tokens/param (e.g., 20→30).

contestedc-947c3e3004
The counterexamples come from more thorough training and closer-to-IsoFLOP controls: when training is more complete, data scarcity becomes the bottleneck earlier and the compute-optimum moves toward more tokens [Hoffmann2022Chinchilla]; and under different data recipes and schedules, the optimal tokens/param slides across 5–100, making the exponents hard to reuse across teams [DeepSeek2024LLM].
Source papers · 2 [Hoffmann2022Chinchilla][DeepSeek2024LLM]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[0].countergpt-5.2

    [counter to Camp A: Kaplan-style—portable exponents, compute-optimal fav] Counterevidence comes from more thorough training and IsoFLOP-style controls: when training is closer to completion, data scarcity becomes the bottleneck earlier and the compute optimum shifts toward more tokens [Hoffmann2022Chinchilla]; under different recipes/schedules, optimal tokens/param slides across 5–100, making exponents hard to reuse across teams [DeepSeek2024LLM].

contestedc-d0c5d1c4ac
Both directions of pushback point to exponent instability: under data constraints, repeated tokens are equivalent to fresh ones only up to ≤4 epochs, after which decaying returns move the optimum [Muennighoff2023DataConstrained]; and under different data recipes and schedules, the optimal tokens/param can slide across 5–100 [DeepSeek2024LLM]. Moreover, data mixture alone can create ≥7 pp gaps at a fixed budget, which may make token/param-only optimization a second-order problem [Li2024DCLM].
Source papers · 2 [Muennighoff2023DataConstrained][DeepSeek2024LLM]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[1].countergpt-5.2

    [counter to Camp B: Chinchilla-style—balance N and D under fixed compute] Two lines of counterevidence point to exponent instability: under data constraints, repeated tokens are only equivalent to fresh tokens up to ≤4 epochs, after which decaying returns move the optimum [Muennighoff2023DataConstrained]; optimal tokens/param slides across 5–100 under different recipes/schedules [DeepSeek2024LLM]; and mixture alone can create ≥7 pp gaps at fixed budget [Li2024DCLM].

contestedc-fad8c0ff55
The pushback is that "data as the first axis" does not automatically solve budget allocation: even with the mixture right, the compute-optimal N:D still has to be re-fit under the specific schedule and degree of training completeness [Hoffmann2022Chinchilla][DeepSeek2024LLM]; and in data-constrained settings, decaying repetition returns turn "more tokens" into low-yield spend [Muennighoff2023DataConstrained].
Source papers · 3 [Hoffmann2022Chinchilla][DeepSeek2024LLM][Muennighoff2023DataConstrained]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[2].countergpt-5.2

    [counter to Camp C: Data-mixture pragmatists—data is the first axis; get] The pushback is that “data-first” does not automatically solve budget allocation: even with a good mixture, the compute-optimal N:D must still be re-fit under the specific schedule and training completeness [Hoffmann2022Chinchilla][DeepSeek2024LLM]; in data-constrained settings, decaying repetition returns make “more tokens” a low-yield spend [Muennighoff2023DataConstrained].

contestedc-0c81450b6e
The main pushback: even with more continuous metrics, some capabilities may still accelerate sharply at particular data/training phases; moreover, production often cares about exactly the 0-1 metrics, so "metric artifact" does not mean "safe to ignore in engineering". Systematic controls addressing these objections remain thin in the public literature (see open questions).
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_reportscaling-laws-llm· report.positions[3].countergpt-5.2

    [counter to Camp D: Against emergence-as-magic—many “emergent” effects c] The main pushback is: even with more continuous metrics, some capabilities may still accelerate in specific data/training phases; and production often cares about 0-1 metrics, so “metric artifact” does not mean “ignorable in engineering”; systematic controls are still lacking publicly.

consensusc-4db589bba8
On repo-level tasks, cross-file context is a structural variable, not a detail: RepoBench [RepoBench2023] and RepoCoder [RepoCoder2023] both expose single-file setups as systematic overestimates, so reporting only HumanEval-style single-file scores at the pretraining stage is not enough to predict issue resolving.
Source papers · 3 [RepoBench2023][RepoCoder2023] · arXiv 2306.03091 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2306.03091· report.headline_claimsgpt-5.2

    For repo-level tasks, cross-file context is a structural variable rather than a detail: both RepoBench[RepoBench2023] and RepoCoder[RepoCoder2023] show single-file setups systematically overestimate capability, so reporting only HumanEval-style single-file scores is insufficient to predict issue resolving.

consensusc-5715790a36
Execution semantics and function synthesis are not the same bottleneck: CRUXEval [CRUXEval2024] isolates "program-state tracking" via I/O prediction; a code-BPB/patch-PPL gain without a CRUXEval gain should therefore be read first as distribution fitting rather than executable understanding.
Source papers · 2 [CRUXEval2024] · arXiv 2401.03065 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2401.03065· report.headline_claimsgpt-5.2

    Execution semantics and function synthesis are different bottlenecks: CRUXEval[CRUXEval2024] isolates program-state tracking via I/O prediction; thus BPB/patch-PPL gains without CRUXEval gains should be read as distribution fitting rather than executable understanding.

consensusc-a792f4bd7a
Reporting HumanEval alone at the SFT-ready stage counts false positives as capability: EvalPlus [EvalPlus2023] shows that inadequate original test suites systematically inflate scores, so "HumanEval-only" should not be an acceptable comparison protocol.
Source papers · 2 [EvalPlus2023] · arXiv 2305.01210 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2305.01210· report.headline_claimsgpt-5.2

    At the SFT-ready stage, reporting HumanEval alone turns false positives into “capability”: EvalPlus[EvalPlus2023] shows inadequate tests systematically inflate scores, so “HumanEval-only” should not be an acceptable comparison protocol.

consensusc-3ca84444a5
Freshness is an evaluation parameter, not a footnote: LiveCodeBench [LiveCodeBench2024] mechanizes time windows; when training cutoff and evaluation window go undisclosed, cross-model differences are more likely contamination than methodological improvement.
Source papers · 2 [LiveCodeBench2024] · arXiv 2403.07974 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2403.07974· report.headline_claimsgpt-5.2

    Freshness is an evaluation parameter, not a footnote: LiveCodeBench[LiveCodeBench2024] operationalizes time windows; without disclosing training cutoff and eval window, cross-model differences are more likely contamination than method improvements.

consensusc-507b35782c
The agent loop's cost structure can change conclusions: self-debug/self-repair work shows that "did it finally pass" hides differences in retry counts and feedback quality [SelfDebug2023][SelfRepair2023]; deployment evaluation must separately report retry@k, test-execution rate, and tokens/issue.
Source papers · 3 [SelfDebug2023][SelfRepair2023] · arXiv 2304.05128 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2304.05128· report.headline_claimsgpt-5.2

    Agent-loop cost structure can flip conclusions: self-debug/self-repair work shows final pass/fail hides differences in retries and feedback quality[SelfDebug2023][SelfRepair2023]; deployment evaluation must separately report retry@k, test-exec rate, and tokens/issue.

contestedc-65ba228a57
RepoBench [RepoBench2023] and RepoCoder [RepoCoder2023] directly show single-file setups miss cross-file dependencies and project structure; EvalPlus [EvalPlus2023] shows weak tests inflate scores; a high HumanEval score can therefore come from templates and test holes, and may be orthogonal to the bottlenecks of issue resolving.
Source papers · 4 [RepoBench2023][RepoCoder2023][EvalPlus2023] · arXiv 2306.03091 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2306.03091· report.positions[0].countergpt-5.2

    [counter to Camp A: HumanEval/MBPP is sufficient to represent coding abi] RepoBench[RepoBench2023] and RepoCoder[RepoCoder2023] show single-file setups miss cross-file dependencies and project structure; EvalPlus[EvalPlus2023] shows weak tests inflate scores; high HumanEval can thus come from templates and test holes, orthogonal to issue-resolving bottlenecks.

contestedc-7438d38d40
Multi-SWE-Bench [MultiSWEBench2025] shows that language/ecosystem shifts reorder rankings, so a single Verified point does not extrapolate; SWERebench2025 stresses decontamination and continuous task collection, without which long-run comparisons are distorted by data drift and leakage; and when the harness goes undisclosed, 0.x-level deltas look more like policy differences than capability differences.
Source papers · 2 [MultiSWEBench2025] · arXiv 2504.02605 · arxiv.org
2 observations · Swe Agent Evaluation
Evidence (2)
  • topic_reportswe-agent-evaluationarXiv 2504.02605· report.positions[1].countergpt-5.2

    [counter to Camp B: SWE-bench Verified is the only trustworthy ground tr] Multi-SWE-Bench[MultiSWEBench2025] shows language/ecosystem shifts can reorder rankings, so a single Verified score does not extrapolate; SWERebench2025 emphasizes decontamination and continuous task collection, without which long-run comparisons are distorted by drift and leakage; with undisclosed harnesses, 0.x deltas look like policy, not capability.

  • topic_reportswe-agent-evaluation· report.positions[1].countergpt-5.2

    [counter to Camp B: SWE-bench Verified is the only trustworthy ground tr] Counter to c-e8339ebcb6 / c-33451ad4c4: Verified reduces noise but is Python-heavy; Multi-SWE-Bench shows rankings can reshuffle across 7 languages, so a single Verified score does not extrapolate across ecosystems.

contestedc-9368974e3e
Existing public evidence mostly supports "edit modeling matches the task form" [InCoder2022] but gives no systematic correlation between patch-PPL and Verified; meanwhile CRUXEval [CRUXEval2024] shows execution semantics can be measured independently, and repo-level evals [RepoBench2023] show cross-file context is structural, so likelihood alone may conflate "executable understanding" with "distribution fitting".
Source papers · 4 [InCoder2022][CRUXEval2024][RepoBench2023] · arXiv 2204.05999 · arxiv.org
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluationarXiv 2204.05999· report.positions[2].countergpt-5.2

    [counter to Camp C: trajectory-level PPL (e.g., patch-PPL) is the most r] Public evidence mostly supports “editing modeling matches task form”[InCoder2022], but does not systematically establish correlation between patch-PPL and Verified; meanwhile execution semantics [CRUXEval2024] and cross-file structure [RepoBench2023] are measurable separately, so likelihood alone conflates executable understanding with distribution fitting.

contestedc-3bbc985ffb
The weakness is the lack of unified, reproducible deployment-level benchmarks and public logging conventions: most work stays in research settings and is hard to align across teams; and without a task anchor (such as Verified), UX metrics can be gamed by "do less" or "stop early" policies.
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[3].countergpt-5.2

    [counter to Camp D: deployment UX metrics reflect SWE-agent value better] The weakness is the lack of standardized, reproducible deployment-level benchmarks and public logging protocols: many results remain lab-specific and hard to align across teams; without a task anchor (e.g., Verified), UX metrics can be gamed by do-less or stop-early policies.

consensusc-f7eb71a460
On evaluations requiring exact copying/addressing, the dominant failure mode of pure SSM / attention-free models is not "context not long enough" but "state compression is not invertible": Zoology's recall metrics and controlled ICL comparisons both show the deficit is systematic, not a single implementation detail [Zoology2023][Grazzi2024IsMambaICL][Lee2023IsAttentionRequired].
Source papers · 3 [Zoology2023][Grazzi2024IsMambaICL][Lee2023IsAttentionRequired]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.headline_claimsgpt-5.2

    On evaluations requiring exact copying/addressing, the dominant failure mode of pure SSMs/attention-free models is not “insufficient context length” but “non-invertible state compression”: recall-centric metrics and ICL comparisons show a systematic deficit, not a single implementation detail [Zoology2023][Grazzi2024IsMambaICL][Lee2023IsAttentionRequired].

consensusc-98d4ce8820
"Attention and SSMs can be unified" should be read as: causal attention contains a subclass expressible as structured recurrences, not "all attention can be losslessly replaced". SSD formalizes the inter-translatable part and explains why a few attention layers are still needed to carry exact addressing [Dao2024TransformersAreSSMs].
Source papers · 1 [Dao2024TransformersAreSSMs]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.headline_claimsgpt-5.2

    “Attention and SSMs can be unified” should be read as: causal attention contains a subclass representable by structured recurrences, not that all attention can be losslessly replaced. SSD formalizes the translatable subset and explains why a few attention layers are still needed for exact addressing [Dao2024TransformersAreSSMs].

consensusc-cc72e2b32a
Hybrid-architecture gains come from division of labor, not naive stacking: attention layers own the interpretable copy/retrieval algorithms (compatible with induction-head mechanisms), while SSM layers own long-range routing and compression; Jamba/Griffin's designs and experiments both match this split [Lieber2024Jamba][De2024Griffin][Olsson2022InductionHeads].
Source papers · 3 [Lieber2024Jamba][De2024Griffin][Olsson2022InductionHeads]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.headline_claimsgpt-5.2

    Hybrid gains come from role separation, not naive stacking: attention layers implement copy/retrieval algorithms compatible with induction-head mechanisms, while SSM layers handle long-range routing and compression; Jamba/Griffin designs and results match this split [Lieber2024Jamba][De2024Griffin][Olsson2022InductionHeads].

consensusc-c9225892e2
"Subquadratic models must be pretrained from scratch" is under-supported after 2024: distillation and conversion work shows that migrating interaction patterns from a trained Transformer into SSM / linear-attention backbones is a viable path, and one that better fits engineering cost constraints [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Zhang2024HedgehogPorcupine].
Source papers · 3 [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Zhang2024HedgehogPorcupine]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.headline_claimsgpt-5.2

    The claim “subquadratic models must be pretrained from scratch” is weak after 2024: distillation and conversion show feasible paths to transfer interaction patterns from pretrained Transformers into SSM/linear-attention backbones, matching engineering cost constraints better than from-scratch pretraining [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Zhang2024HedgehogPorcupine].

contestedc-91b6b71d6d
Counterexamples cluster on "exact addressing" tasks: under unified training controls, the recall/ICL drop of attention-free models is reproducible, and it conflicts with interpretable copy mechanisms like induction heads [Zoology2023][Grazzi2024IsMambaICL][Olsson2022InductionHeads]. SSD also signals that not all attention can be covered by structured recurrences [Dao2024TransformersAreSSMs].
Source papers · 4 [Zoology2023][Grazzi2024IsMambaICL][Olsson2022InductionHeads][Dao2024TransformersAreSSMs]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[0].countergpt-5.2

    [counter to Camp A: Pure SSMs will eventually replace Transformers] Counterevidence concentrates on “precise addressing” tasks: under controlled training, recall/ICL deficits of attention-free models are reproducible and conflict with mechanistic copy circuits such as induction heads [Zoology2023][Grazzi2024IsMambaICL][Olsson2022InductionHeads]; SSD also indicates structured recurrences do not cover all attention [Dao2024TransformersAreSSMs].

contestedc-c5b45c4e2e
Same family does not mean equivalent: linear attention often maintains a matrix state (or a low-rank approximation), which is closer to attention for "multi-slot memory / addressable indexing", whereas many SSMs choose diagonal or otherwise restricted structure, with different optimization and bandwidth characteristics, so their copy/ICL failure modes differ too [Zoology2023][Grazzi2024IsMambaICL]. In addition, the approximation error and stability of softmax mimicry are engineering risks unique to linear attention [Choromanski2020Performer][Zhang2024HedgehogPorcupine].
Source papers · 3 [Zoology2023][Grazzi2024IsMambaICL][Choromanski2020Performer]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[1].countergpt-5.2

    [counter to Camp B: Linear attention and SSMs are fundamentally the same] Same family does not mean equivalent: linear attention often maintains matrix states (or low-rank approximations), closer to attention for “multi-slot memory/addressable indexing”, while many SSMs use diagonal or restricted structure with different optimization and bandwidth profiles, hence different copy/ICL failure modes [Zoology2023][Grazzi2024IsMambaICL]; softmax-mimicry approximation error and stability are linear-attention-specific risks [Choromanski2020Performer][Zhang2024HedgehogPorcupine].

contestedc-4ef459286c
Post-2024 evidence leans toward "controllable transfer": frameworks for distilling attention structure into SSMs now give explicit alignment objectives and staged strategies [Bick2024TransformersToSSMs]; hybridizing/replacing from a trained Transformer while preserving quality has empirical support [Wang2024MambaInTheLlama]; and the linear-attention side likewise emphasizes finetune/conversion paths [Zhang2024HedgehogPorcupine].
Source papers · 3 [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Zhang2024HedgehogPorcupine]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[2].countergpt-5.2

    [counter to Camp C: Subquadratic models must be pretrained from scratch;] Post-2024 evidence leans toward “controllable transfer”: explicit objectives and staged strategies for distilling attention structure into SSMs exist [Bick2024TransformersToSSMs]; quality-preserving hybridization from trained Transformers has empirical support [Wang2024MambaInTheLlama]; linear attention likewise emphasizes finetune/conversion paths [Zhang2024HedgehogPorcupine].

contestedc-1af343c8ee
The main pushback comes in two parts: first, there is no systematic curve for "how little attention is enough, and where to place it"; second, hybrids add implementation complexity and kernel-maintenance cost, and on short sequences they are not necessarily faster (the bottleneck can shift to bandwidth / launch overhead). These points are currently engineering experience more than paper-grade quantitative results [Liu2023LEval][Zoology2023].
Source papers · 2 [Liu2023LEval][Zoology2023]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_reportssm-mamba-rwkv· report.positions[3].countergpt-5.2

    [counter to Camp D: The engineering optimum is hybrid; minimize attentio] The main pushback is twofold: (1) we lack systematic curves for the minimal number and placement of attention layers; (2) hybrids add implementation complexity and kernel-maintenance cost, and may not be faster at short lengths (bandwidth/launch overhead dominates); these remain engineering lore rather than paper-grade quantitative results [Liu2023LEval][Zoology2023].

consensusc-e565207da9
For decode, the key KPI is not attention FLOPs but bytes/token: MQA changes K/V writes from "one copy per head" to a shared copy, directly targeting the KV-cache bandwidth bottleneck [Shazeer2019MQA]; GQA/MLA continue the same objective, turning "architecture changes" into a measurable HBM-traffic budget [Ainslie2023GQA][DeepSeek2024V2].
Source papers · 4 [Shazeer2019MQA][Ainslie2023GQA][DeepSeek2024V2] · arXiv 1911.02150 · arxiv.org
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrainarXiv 1911.02150· report.headline_claimsgpt-5.2

    For decoding, the primary KPI is bytes/token rather than attention FLOPs: MQA shares K/V across heads to directly target KV-cache bandwidth [Shazeer2019MQA]; GQA/MLA continue the same objective, turning architecture changes into measurable HBM-traffic budgets [Ainslie2023GQA][DeepSeek2024V2].
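
A minimal bytes/token calculator for the KV-cache write path (shapes are illustrative, roughly Llama-2-70B-like: 80 layers, head_dim 128, fp16):

```python
# Per generated token (per sequence), each layer writes one K and one V
# vector per kv head: bytes = 2 * layers * kv_heads * head_dim * dtype_bytes.
def kv_bytes_per_token(n_layers: int, n_kv_heads: int, head_dim: int,
                       dtype_bytes: int = 2) -> int:
    return 2 * n_layers * n_kv_heads * head_dim * dtype_bytes

for name, kv in {"MHA (64 kv heads)": 64, "GQA (8 kv heads)": 8,
                 "MQA (1 kv head)": 1}.items():
    b = kv_bytes_per_token(n_layers=80, n_kv_heads=kv, head_dim=128)
    print(f"{name:20s} {b/1024:8.1f} KiB written per token")
# Decode must also re-read the whole cache each step, so cutting kv heads 8x
# (GQA) cuts steady-state KV traffic roughly 8x — the bytes/token lever.
```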

consensusc-bd09665ee8
FlashAttention's core is not "faster attention" but moving the N×N intermediate matrix out of HBM: SMEM tiling plus online softmax pushes attention's ceiling from purely memory-bound into a regime tunable via tiling and parallelism [Dao2022FlashAttention]; on Hopper/Blackwell, that regime is further dominated by async/TMA and synchronization boundaries [Shah2024FlashAttention3].
Source papers · 3 [Dao2022FlashAttention][Shah2024FlashAttention3] · arXiv 2205.14135 · arxiv.org
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrainarXiv 2205.14135· report.headline_claimsgpt-5.2

    FlashAttention is not just “faster attention”; it removes the N×N intermediate from HBM: SMEM tiling + online softmax shifts the ceiling from purely memory-bound to a regime tunable via tiling and parallelism [Dao2022FlashAttention]; on Hopper/Blackwell this regime is further dominated by async/TMA and synchronization boundaries [Shah2024FlashAttention3].
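
A minimal sketch of the online-softmax numerics the tiling relies on (1D toy with scalar values; the real kernel does the same bookkeeping per tile in SMEM):

```python
import numpy as np

# Process score blocks one at a time, carrying (running max, running
# denominator, running weighted sum); the full score row is never stored.
def online_softmax_attn(score_blocks, value_blocks):
    m, d, acc = -np.inf, 0.0, 0.0
    for s, v in zip(score_blocks, value_blocks):
        m_new = max(m, s.max())
        scale = np.exp(m - m_new) if np.isfinite(m) else 0.0  # rescale old state
        d = d * scale + np.exp(s - m_new).sum()
        acc = acc * scale + (np.exp(s - m_new) * v).sum()
        m = m_new
    return acc / d

rng = np.random.default_rng(0)
s, v = rng.normal(size=256), rng.normal(size=256)
ref = (np.exp(s - s.max()) / np.exp(s - s.max()).sum()) @ v   # full softmax
out = online_softmax_attn(np.split(s, 8), np.split(v, 8))     # blockwise
print(np.allclose(out, ref))  # True: same result, no full-row materialization
```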

consensusc-243d064b2d
"Same algorithm, different kernels are equivalent by default" is falsifiable in training: numeric deviations in fused attention can leak out through optimizer state and trigger loss spikes [Golden2024FAStability], and FP8 drift can be invisible in short runs yet surface only at trillion-token scale [Fishman2024FP8Scale].
Source papers · 3 [Golden2024FAStability][Fishman2024FP8Scale] · arXiv 2405.02803 · arxiv.org
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrainarXiv 2405.02803· report.headline_claimsgpt-5.2

    “Same algorithm, different kernel is equivalent” is falsifiable in training: numerical deviations in fused attention can leak through optimizer state and trigger loss spikes [Golden2024FAStability], while FP8 drift may be invisible in short runs and only appear at trillion-token scale [Fishman2024FP8Scale].

consensusc-b8b1ed2ace
Low-precision training is more like a "numerics contract" than a point trick: the MXFP8/MXFP4 recipes write the coupling among per-block scaling, master weights, and the optimizer update into a reproducible spec, meaning the kernel's scale/accumulation path is part of the algorithm [Mishra2025MXFP8Recipe][Tseng2025MXFP4Training].
Source papers · 2 [Mishra2025MXFP8Recipe][Tseng2025MXFP4Training]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.headline_claimsgpt-5.2

    Low-precision training behaves like a “numeric contract,” not a one-off trick: MXFP8/MXFP4 recipes codify coupling among per-block scaling, master weights, and optimizer updates, implying kernel scale/accumulation paths are part of the algorithm [Mishra2025MXFP8Recipe][Tseng2025MXFP4Training].

consensusc-05d07a7e6e
"Kernels decide everything" does not hold: long-context extension can be done mainly via positional/training strategy (PI, YaRN, continual pretraining), reaching the 32k class without kernel rewrites [Chen2023PI][Peng2023YaRN][Xiong2023LongContextScaling]; the more stable boundary is "once cost/throughput becomes the primary goal, you end up back at kernel physics".
Source papers · 4 [Chen2023PI][Peng2023YaRN][Xiong2023LongContextScaling] · arXiv 2306.15595 · arxiv.org
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrainarXiv 2306.15595· report.headline_claimsgpt-5.2

    “Kernels decide everything” does not hold: long-context can be extended mainly via positional/training strategies (PI, YaRN, continual pretraining), reaching ~32k without kernel rewrites [Chen2023PI][Peng2023YaRN][Xiong2023LongContextScaling]; the stabler boundary is that once cost/throughput becomes the primary goal, kernel physics returns.

contestedc-a6d291a9f0
The pushback: many extensions (long context, distributed training) can be done at the framework / training-strategy level without dragging the team into CUDA details, and compiler stacks are maturing at absorbing hardware specialization [Chen2023PI][Zhao2023FSDP][Lattner2020MLIR].
Source papers · 3 [Chen2023PI][Zhao2023FSDP][Lattner2020MLIR]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[0].countergpt-5.2

    [counter to Camp A: algorithms and kernels must be co-designed (bytes/FL] Opponents argue that many extensions (long context, distributed training) can be done at framework/training-strategy level without dragging teams into CUDA details; and compiler stacks are maturing at absorbing hardware specialization [Chen2023PI][Zhao2023FSDP][Lattner2020MLIR].

contestedc-bd9ae26209
The pushback: the ceilings of attention and low-precision paths are usually set by hardware features (SMEM/TMA/warp specialization, FP8/MXFP8 scale semantics), which compilers struggle to auto-generate at the same level on short timescales; more importantly, kernel numerics affect training stability and cannot be judged by IR equivalence alone [Shah2024FlashAttention3][Golden2024FAStability][Mishra2025MXFP8Recipe].
Source papers · 3 [Shah2024FlashAttention3][Golden2024FAStability][Mishra2025MXFP8Recipe]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[1].countergpt-5.2

    [counter to Camp B: PyTorch/graph-compiler level is sufficient; handwrit] The counterpoint is that attention and low-precision ceilings are often dictated by hardware features (SMEM/TMA/warp specialization, FP8/MXFP8 scale semantics), which compilers cannot quickly match automatically; more importantly, kernel numerics affect training stability and cannot be judged by IR equivalence alone [Shah2024FlashAttention3][Golden2024FAStability][Mishra2025MXFP8Recipe].

contestedc-7b4c6c8ef8
The counter-evidence has two parts: first, FP8/MXFP8 stability depends on scale semantics and per-block mechanisms and cannot simply be extrapolated from BF16 [Micikevicius2022FP8Formats][Mishra2025MXFP8Recipe]; second, many bottlenecks are bytes/token and IO rather than raw compute, so faster hardware does not automatically lift the KV-cache bandwidth and attention-IO ceilings [Shazeer2019MQA][Dao2022FlashAttention].
Source papers · 4 [Micikevicius2022FP8Formats][Mishra2025MXFP8Recipe][Shazeer2019MQA][Dao2022FlashAttention]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[2].countergpt-5.2

    [counter to Camp C: hardware will get faster; algorithms need not adapt ] The counter-evidence has two parts: (1) FP8/MXFP8 stability depends on scale semantics and per-block mechanisms, so BF16 extrapolation breaks [Micikevicius2022FP8Formats][Mishra2025MXFP8Recipe]; (2) many bottlenecks are bytes/token and IO, not compute, so faster chips don't lift KV-bandwidth or attention-IO ceilings automatically [Shazeer2019MQA][Dao2022FlashAttention].

contestedc-732388d5a9
The pushback: the critical paths of frontier LLMs often depend on specific hardware capabilities and a mature library ecosystem (e.g., FlashAttention-3 explicitly encodes Hopper features) [Shah2024FlashAttention3]; low-precision recipes and kernel semantics also need end-to-end consistency, so porting cost lies not only in kernel code but in numerics regression and the toolchain.
Source papers · 1 [Shah2024FlashAttention3]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_reportcuda-kernel-pretrain· report.positions[3].countergpt-5.2

    [counter to Camp D: non-NVIDIA hardware will catch up; CUDA will stop be] The counterpoint is that frontier LLM critical paths often depend on specific hardware capabilities and mature libraries (e.g., FlashAttention-3 explicitly encodes Hopper features) [Shah2024FlashAttention3]; low-precision recipes and kernel semantics need end-to-end consistency, so porting cost includes numerics regression and toolchain, not just kernel code.

consensusc-0b742f8fa8
Placing TP/EP across nodes (over IB) typically amplifies per-layer collective latency into a "layer count × per step" tax; hence in public >10K-GPU recipes, TP=8 pinned to the NVLink domain is the default, and PP is the first choice for spanning IB. [Shoeybi2019Megatron][Jiang2024MegaScale]
Source papers · 2 [Shoeybi2019Megatron][Jiang2024MegaScale]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.headline_claimsgpt-5.2

    Placing TP/EP across nodes (IB) amplifies per-layer collective latency into a “layers × per-step” tax; accordingly, public >10K GPU recipes default to TP=8 within NVLink islands, and use PP as the first dimension to span IB. [Shoeybi2019Megatron][Jiang2024MegaScale]

consensusc-f919ac34af
When PP is unavoidable, the bubble is not a minor patch: zero-bubble constructions push the bubble toward 0 under synchronous semantics, which in engineering practice often corresponds to 2–5 reclaimable MFU points. [Qi2023ZeroBubble]
Source papers · 1 [Qi2023ZeroBubble]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.headline_claimsgpt-5.2

    When PP is unavoidable, bubbles are not a minor tweak: zero-bubble constructions drive bubbles near 0 under synchronous semantics, often corresponding to ~2–5 recoverable MFU points in practice.[Qi2023ZeroBubble]
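
A minimal worked example of why this matters, using the classic synchronous 1F1B bubble fraction (p stages, m microbatches; the formula is the standard pipeline-bubble estimate, not from the cited paper):

```python
# Synchronous 1F1B pipeline bubble: fraction of step time spent idle,
# bubble = (p - 1) / (m + p - 1). Small m/p ratios are where zero-bubble
# schedules have the most to reclaim.
def bubble_fraction(p: int, m: int) -> float:
    return (p - 1) / (m + p - 1)

for p, m in [(8, 8), (8, 32), (8, 128), (16, 32)]:
    print(f"stages={p:2d} microbatches={m:3d}  bubble={bubble_fraction(p, m):.1%}")
# 8 stages / 32 microbatches still idles ~18% of the step; zero-bubble
# scheduling reclaims this without raising m (and activation memory with it).
```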

consensusc-4d5bec70d9
For long context, SP is usually the first lever: it cuts activation memory to roughly 1/5, compresses recompute overhead from ~36% to ~2%, and does not increase TP's communication volume. [Korthikanti2022SP]
Source papers · 2 [Korthikanti2022SP] · arXiv 2205.05198 · arxiv.org
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatronarXiv 2205.05198· report.headline_claimsgpt-5.2

    For long context, SP is often the first lever: it reduces activation memory by ~5×, cuts recompute overhead from ~36% to ~2%, and does not increase TP communication volume.[Korthikanti2022SP]

consensusc-9694fce84d
Once attention exceeds ~50% of step time or L≥32K, CP shifts from "faster attention" to a required mesh dimension: Ring distributes KV along a ring [Liu2023RingAttn]; Ulysses reshuffles heads via all-to-all [Jacobs2023Ulysses]; USP chooses between the two using the NVLink/IB bandwidth ratio and the head count. [Fang2024USP][Liu2023RingAttn][Jacobs2023Ulysses]
Source papers · 4 [Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP] · arXiv 2310.01889 · arxiv.org
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatronarXiv 2310.01889· report.headline_claimsgpt-5.2

    Once attention exceeds ~50% of step time or L≥32K, CP shifts from “faster attention” to a required mesh dimension: Ring distributes KV along a ring [Liu2023RingAttn]; Ulysses reshapes heads via all-to-all [Jacobs2023Ulysses]; USP selects between them using the NVLink/IB bandwidth ratio and head count [Fang2024USP].

consensusc-4fd5f5918b
The public auto-parallel evidence reads more like "encode manual experience into a compiler" than "automatically find the ceiling plan": Alpa's ILP search and GSPMD's compiled sharding reduce annotation, but without matched-scale comparisons at 100B+ and >1K GPUs (with MFU and topology detail) they cannot replace the hand-tuned 4D reproduction path. [Zheng2022Alpa][Xu2021GSPMD][Jiang2024MegaScale]
Source papers · 3 [Zheng2022Alpa][Xu2021GSPMD][Jiang2024MegaScale]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.headline_claimsgpt-5.2

    Public auto-parallel evidence reads more like “encode manual heuristics into a compiler” than “automatically find the ceiling plan”: Alpa’s ILP search and GSPMD’s compiler sharding reduce annotations, but without matched-scale 100B+ and >1K GPU comparisons (MFU, topology details) they cannot replace the hand-tuned 4D reproduction path. [Zheng2022Alpa][Xu2021GSPMD][Jiang2024MegaScale]

contestedc-853ebc9dd9
Hand-tuned plans depend heavily on team experience and need re-tuning whenever the model structure shifts (MoE, long context, heterogeneous layers); and "reproducible" is not "optimal": they may yet be displaced by better compilers / automatic search. [Zheng2022Alpa][Xu2021GSPMD]
Source papers · 2 [Zheng2022Alpa][Xu2021GSPMD]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[0].countergpt-5.2

    [counter to Camp A: hand-tuned 4D with topology-aware mapping (Megatron/] Manual plans are experience-heavy and require retuning when architectures shift (MoE, long context, heterogeneous layers); and “reproducible” is not “optimal,” potentially displaced by better compilers/auto-search. [Zheng2022Alpa][Xu2021GSPMD]

contestedc-6c78a2526b
The gap is "ceiling evidence", not feasibility: public materials lack matched-scale comparisons at 100B+ and >1K GPUs (same model, same topology, reporting MFU and the NVLink/IB/cross-pod mapping details). Without those, auto-parallel looks more like freezing manual policies into defaults than proving it can find better topology mappings at scale. [Jiang2024MegaScale]
Source papers · 1 [Jiang2024MegaScale]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatron· report.positions[1].countergpt-5.2

    [counter to Camp B: auto-parallel (Alpa / GSPMD / pjit line)] The gap is ceiling evidence, not feasibility: public materials lack matched-scale 100B+ and >1K GPU comparisons (same model, same topology, reporting MFU and NVLink/IB/cross-pod mappings); without them, auto-parallel reads as freezing manual policy into defaults rather than proving better mappings at scale. [Jiang2024MegaScale]

contestedc-5c762abdbf
Public long-context systems work has already pushed SP/CP into commonly used dimensions: SP cuts memory without raising TP communication volume [Korthikanti2022SP]; at L=32K–128K, even sequence-level PP exhibits new bottleneck shapes [Sun2024Seq1F1B]; and at larger L, CP schemes like Ring/Ulysses/USP directly decide whether scaling is possible at all. [Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP]
Source papers · 6 [Korthikanti2022SP][Sun2024Seq1F1B][Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP] · arXiv 2205.05198 · arxiv.org
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report4d-parallelism-megatronarXiv 2205.05198· report.positions[3].countergpt-5.2

    [counter to Camp D: 3D (DP+TP+PP) is enough; SP/CP are optional optimiza] Public long-context systems already push SP/CP into commonly used dimensions: SP reduces memory without increasing TP comm volume [Korthikanti2022SP]; at L=32K–128K, sequence-level PP exposes new bottleneck shapes [Sun2024Seq1F1B]; at larger L, Ring/Ulysses/USP decide whether scaling is possible at all [Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP].

consensusc-7152431c72
On controlled compositional tasks, a nominal 128K "long context" often behaves like ~32K of effective context; NIAH-style retrieval hit rates alone systematically overestimate utilization, and [Hsieh2024RULER] gives a reproducible decomposition protocol.
Source papers · 2 [Hsieh2024RULER] · arXiv 2404.06654 · arxiv.org
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrainarXiv 2404.06654· report.headline_claimsgpt-5.2

    On controlled compositional tasks, nominal 128K often behaves like ~32K effective context; NIAH-style retrieval hit rates systematically overestimate utilization, and [Hsieh2024RULER] provides a reproducible decomposition protocol.

consensusc-628a26538b
"Mid-position degradation" is a stable failure mode of long context: evidence placed mid-sequence can score 20+ pp below evidence at the ends; [Liu2023LostInMiddle] shows effective context is not a monotonic function of window size.
Source papers · 2 [Liu2023LostInMiddle] · arXiv 2307.03172 · arxiv.org
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrainarXiv 2307.03172· report.headline_claimsgpt-5.2

    “Lost-in-the-middle” is a stable degradation mode: evidence in the middle can be 20+ pp worse than at the ends, and [Liu2023LostInMiddle] shows effective context is not a monotonic function of window size.

consensusc-7f1e3883a6
Positional extrapolation alone (PI/ALiBi/YaRN) mainly solves "can run longer" but is not enough to guarantee stable gains on non-retrieval long-context tasks; [Kazemnejad2023PEImpact]'s systematic comparison reads as "PE matters but is not a single-point solution".
Source papers · 2 [Kazemnejad2023PEImpact] · arXiv 2305.19466 · arxiv.org
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrainarXiv 2305.19466· report.headline_claimsgpt-5.2

    Positional extrapolation alone (PI/ALiBi/YaRN) mainly makes longer runs possible but does not guarantee stable gains on non-retrieval long-context tasks; [Kazemnejad2023PEImpact] supports “PE matters but is not a single-point solution”.

consensusc-53ccbfbab5
On the same base model, continued pretraining with long-document upsampling while preserving the domain distribution lifts long-text task performance more directly than changing only the positional encoding; [Fu2024DataEngineering] and [Xiong2023EffectiveLongCtx] provide reproducible ablation recipes (see the sampling sketch below).
Source papers · 3 [Fu2024DataEngineering][Xiong2023EffectiveLongCtx] · arXiv 2309.16039 (arxiv.org)
3 observations · Context Scaling Pretrain
Evidence (3)
  • topic_report context-scaling-pretrain · arXiv 2309.16039 · report.headline_claims · gpt-5.2

    On the same base model, continued pretraining with long-document upsampling plus domain distribution preservation more directly lifts long-text tasks than PE-only changes; [Fu2024DataEngineering] and [Xiong2023EffectiveLongCtx] provide reproducible ablation recipes.

  • topic_report context-scaling-pretrain · arXiv 2309.16039 · report.headline_claims · gpt-5.2

    Long-doc upsampling plus domain distribution preservation yields ~3–7 pp gains on long-text tasks on the same base model, and tracks effective context more directly than PE-only changes. [Fu2024DataEngineering][Xiong2023EffectiveLongCtx]

  • topic_report context-scaling-pretrain · report.headline_claims · gpt-5.4

    PE-only often moves a model to longer position ranges, but on the same base model long-document upsampling and domain preservation restore long-task performance more directly than PE-only changes [Fu2024DataEngineering][Xiong2023EffectiveLongCtx].
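    A sketch of the recipe shape (the 4× long-doc boost, cutoff, and domain weights are placeholders, not values from [Fu2024DataEngineering] or [Xiong2023EffectiveLongCtx]): upsample by length within each domain, then renormalize so per-domain sampling mass matches the original mixture.

    ```python
    # Sketch: length-based upsampling that preserves the domain mixture.
    from collections import defaultdict

    docs = [  # (domain, n_tokens)
        ("web", 2_000), ("web", 90_000), ("code", 4_000), ("code", 120_000),
    ]
    LONG = 32_768          # "long document" cutoff (assumption)
    BOOST = 4.0            # relative upsampling for long docs (assumption)
    domain_target = {"web": 0.7, "code": 0.3}  # mixture to preserve

    raw = [(d, BOOST if n >= LONG else 1.0) for d, n in docs]
    per_domain = defaultdict(float)
    for d, w in raw:
        per_domain[d] += w

    # Renormalize so each domain's total mass matches its target share.
    weights = [w * domain_target[d] / per_domain[d] for d, w in raw]
    total = sum(weights)
    probs = [w / total for w in weights]
    print(probs)  # long docs over-sampled within-domain; mixture unchanged
    ```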

consensusc-002525ec9c
Related-document concatenation turns cross-span reference/repetition into within-sequence statistics, explicitly strengthening ICL and cross-document reasoning; the mechanism is motivated by the finding that burstiness drives ICL [Chan2022DataDist], with engineering evidence from [Shi2023InContextPretraining].
Source papers · 2 [Chan2022DataDist][Shi2023InContextPretraining]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.headline_claims · gpt-5.2

    Related-document concatenation turns cross-span reference/repetition into within-sequence statistics, strengthening ICL and cross-document reasoning; the mechanism is motivated by burstiness-driven ICL [Chan2022DataDist] and evidenced by [Shi2023InContextPretraining].

contestedc-c81715de7f
Controlled evaluations and ablations read more like "PE lets the model run long, but it does not automatically learn to use length." [Hsieh2024RULER] shows effective context collapsing on compositional tasks; in same-base-model ablations, [Fu2024DataEngineering] finds PE-only changes fail to recover long-context tasks; and [Kazemnejad2023PEImpact] adds systematic evidence that PE matters but is not a single-point solution.
Source papers · 4 [Hsieh2024RULER][Fu2024DataEngineering][Kazemnejad2023PEImpact] · arXiv 2404.06654 (arxiv.org)
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2404.06654 · report.positions[0].counter · gpt-5.2

    [counter to Camp A: positional extrapolation is enough; long context is …] Controlled evaluation and ablations suggest "PE makes longer runs possible, but doesn't automatically teach long utilization". [Hsieh2024RULER] shows effective-context collapse on compositional tasks; [Fu2024DataEngineering]'s same-base-model ablations show PE-only fails to recover long-context tasks; [Kazemnejad2023PEImpact] adds systematic evidence.

contestedc-64b32b98bb
Data alone still hits operator/systems bottlenecks: without efficient attention kernels, long-sequence training cost limits iteration speed, and [Dao2023FlashAttention2] shows systems engineering is a precondition for the data route; moreover, evaluating only NIAH can misjudge gains, so [Hsieh2024RULER] is needed as a gate.
Source papers · 3 [Dao2023FlashAttention2][Hsieh2024RULER] · arXiv 2307.08691 (arxiv.org)
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2307.08691 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: the data recipe is the main variable; long-doc upsam…] Data alone can hit operator/systems bottlenecks: without efficient attention kernels, long-sequence training cost limits iteration speed; [Dao2023FlashAttention2] shows systems engineering is a precondition for the data route, and NIAH-only evaluation can misjudge gains, so [Hsieh2024RULER] is needed as a gate.

contestedc-f7cca1237d
The evidence chain is still sparse: beyond direct evidence like [Shi2023InContextPretraining], many packing details in industry remain engineering lore without controlled ablations on the same footing as long-document upsampling; evaluation also needs aggregation/variable-tracking tasks to reveal the differences. [Hsieh2024RULER][Laban2024SummaryHaystack]
Source papers · 4 [Shi2023InContextPretraining][Hsieh2024RULER][Laban2024SummaryHaystack] · arXiv 2309.16039 (arxiv.org)
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2309.16039 · report.positions[2].counter · gpt-5.2

    [counter to Camp C: packing/concatenation is under-exploited; sequence c…] The evidence chain is still sparse: beyond direct evidence like [Shi2023InContextPretraining], many packing details remain engineering lore without controlled ablations against long-document upsampling; evaluation needs aggregation/variable-tracking tasks to show differences [Hsieh2024RULER][Laban2024SummaryHaystack].

contestedc-e4c588d1f3
Architecture bypasses often push metrics toward "can process longer" without necessarily winning on "can reason compositionally"; without alignment on controlled compositional tasks, the result is easily just stronger retrieval at lower cost [Hsieh2024RULER]. Meanwhile, systems optimization (e.g., [Dao2023FlashAttention2]) already makes Transformers iterable in the 32K–128K range, so near-term marginal gains from data/packing are more controllable.
Source papers · 2 [Hsieh2024RULER][Dao2023FlashAttention2]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.positions[3].counter · gpt-5.2

    [counter to Camp D: switch architectures (sparse / memory / recurrent) t…] Architecture bypasses often optimize for "processable length" but may not win on "compositional reasoning"; without controlled compositional alignment, they can become merely stronger retrieval at lower cost [Hsieh2024RULER]; systems work (e.g., [Dao2023FlashAttention2]) already makes 32K–128K Transformers iterable, so data/packing gains are more controllable near-term.

consensusc-d59a18ca9e
For generalists, a 15–25% code share behaves like a steady-state default: public continual-training results show that after adding ~500B code tokens, MMLU drops by <1 pp while coding improves substantially [CodeLlama2023]; treating >30% as an experimental zone gated on NL retention is the safer stance.
Source papers · 2 [CodeLlama2023] · arXiv 2308.12950 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2308.12950 · report.headline_claims · gpt-5.2

    For generalists, 15–25% code behaves like a stable default: public continual results show that after adding ~500B code tokens, MMLU drops by <1 pp while coding improves substantially [CodeLlama2023]; treating >30% as an experiment zone requires an NL-retention gate.

consensusc-5594a82efb
Mixture gain curves are not linear: synergy/interference phase plots predict "a little helps" at low fractions and more mutual suppression near >~40% [Aghajanyan2023SciMix], so ">40% is monotonically better for generalists" lacks both a mechanism and matched-compute evidence.
Source papers · 2 [Aghajanyan2023SciMix] · arXiv 2301.03728 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2301.03728 · report.headline_claims · gpt-5.2

    Mixture gains are not linear: synergy/interference phase plots predict steep gains at low fractions and more mutual suppression near >~40% [Aghajanyan2023SciMix], so ">40% is monotonically better for generalists" lacks both a mechanism and matched-compute evidence.

consensusc-d0fa51007d
Code tokens tend to be "more stable" but also crowd the budget: at similar training loss, downstream variance in the code domain is lower than web (a difference on the order of ~30%) [Gadre2023Overtraining]; under fixed compute, any code increment is equivalent to replacing other tokens [Hoffmann2022Chinchilla].
Source papers · 3 [Gadre2023Overtraining][Hoffmann2022Chinchilla] · arXiv 2312.00752 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2312.00752 · report.headline_claims · gpt-5.2

    Code tokens are often "more stable" but also displace budget: at similar training loss, code-domain downstream variance is lower than web (reported ~30% scale difference) [Gadre2023Overtraining]; under fixed compute, any increase in code tokens replaces other tokens [Hoffmann2022Chinchilla].

consensusc-70f1f088eb
Before debating ratios, get token organization and the objective right: repo-level packing puts real dependency structure into a single sequence [Shi2023InContextPretraining], and FIM/infilling brings edit-style tasks closer to the training distribution with small side effects [Bavarian2022FIM]; both are usually more controllable than pushing the code share from 20% to 40%.
Source papers · 3 [Shi2023InContextPretraining][Bavarian2022FIM] · arXiv 2309.16039 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2309.16039 · report.headline_claims · gpt-5.2

    Before debating ratios, fix token organization and objectives: repo-level packing injects real dependency structure into a single sequence [Shi2023InContextPretraining], and FIM/infilling aligns training with edit-style usage with small side effects [Bavarian2022FIM], usually more controllable than pushing code share from 20% to 40%.

contestedc-7b227d56f0
Existing public evidence reads more like "specialists are feasible" than "generalists are monotonically better." The generality cost of high code share is quantified for from-scratch specialists (MMLU gap ~6–9 pp) [DeepSeekCoder2024]; the mixture phase plot also predicts interference near high fractions [Aghajanyan2023SciMix]. More critically, work supporting high code share mostly targets code tasks as the primary objective and lacks matched-compute NL/reasoning comparisons.
Source papers · 2 [DeepSeekCoder2024][Aghajanyan2023SciMix]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · report.positions[0].counter · gpt-5.2

    [counter to Camp A: generalists should push code past >40%; more is alwa…] Public evidence reads more like "specialists are feasible" than "generalists are monotonically better." High-code generality cost is quantified for from-scratch specialists (MMLU gap ~6–9 pp) [DeepSeekCoder2024]; phase plots predict interference at high fractions [Aghajanyan2023SciMix].

contestedc-06ca721f88
Optimization effects explain the "more stable" part, but not "why code rather than any low-entropy corpus." NL-PL joint pretraining shows structured transfer across many tasks [Feng2020CodeBERT][Ahmad2021PLBART], and engineering changes like FIM/packing improve cross-file and edit-style ability without raising the code share [Shi2023InContextPretraining][Bavarian2022FIM], which looks more like exploiting code's structure than its mere compressibility.
Source papers · 4 [Feng2020CodeBERT][Ahmad2021PLBART][Shi2023InContextPretraining][Bavarian2022FIM]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · report.positions[1].counter · gpt-5.2

    [counter to Camp B: code helps mostly via optimization/regularization (l…)] Optimization effects explain "stability," but not "why code rather than any low-entropy corpus." NL-PL joint pretraining shows structured transfer across tasks [Feng2020CodeBERT][Ahmad2021PLBART], and FIM/packing improves cross-file and edit-style ability without raising code share [Shi2023InContextPretraining][Bavarian2022FIM].

contestedc-463b6bc260
Public evidence does not support a hard "<10% is the only safe zone" threshold: Code Llama's continual results show even large code-token additions can keep the MMLU drop under 1 pp [CodeLlama2023]. The more reasonable worry is "budget replacement," not "inevitable harm": under fixed compute, code increments do replace other tokens [Hoffmann2022Chinchilla], so the opportunity cost should be quantified with matched-compute curves rather than treating <10% as an empirical no-go line.
Source papers · 3 [CodeLlama2023][Hoffmann2022Chinchilla] · arXiv 2308.12950 (arxiv.org)
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · arXiv 2308.12950 · report.positions[2].counter · gpt-5.2

    [counter to Camp C: keep code <10% to protect NL in generalists] Public evidence does not support a hard "<10% is the only safe" threshold: Code Llama shows that even large code-token additions can keep the MMLU drop under 1 pp [CodeLlama2023]; the right worry is budget replacement under fixed compute [Hoffmann2022Chinchilla], quantified with matched-compute curves rather than a <10% rule.

contestedc-3215495e13
"Will eventually fail" needs public failure curves or reproducible experiments; what stands out instead is a counterexample: Code Llama's continual training keeps the MMLU drop under 1 pp even with large-scale code additions [CodeLlama2023]. The from-scratch advantage looks more like "purer objective, more controllable recipe" than "continual necessarily collapses." The steadier engineering practice is to treat continual as a viable option while adding stronger regression tests and data annealing late in training, rather than ruling the route out wholesale.
Source papers · 1 [CodeLlama2023]
1 observation · Code Density Pretrain
Evidence (1)
  • topic_report code-density-pretrain · report.positions[3].counter · gpt-5.2

    [counter to Camp D: code ability must be trained from scratch; continual…] "Will eventually fail" needs public failure curves or reproducible experiments, but the more salient public datapoint is a counterexample: Code Llama's continual keeps the MMLU drop under 1 pp under large code additions [CodeLlama2023]; from-scratch wins look like purer objectives, not inevitable continual collapse.

consensusc-71a1b4bbb6
In new projects, FA1 [Dao2022FA1] is better treated as a historical baseline than a default implementation; for the same exact attention algorithm, FA2 [Dao2023FA2] shows that changes to parallel granularity and work partitioning alone can deliver close to 2× measured throughput.
Source papers · 3 [Dao2022FA1][Dao2023FA2] · arXiv 2205.14135 (arxiv.org)
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2205.14135 · report.headline_claims · gpt-5.4

    For new projects, FA1 [Dao2022FA1] is better kept as a historical baseline than a default implementation; for the same exact attention algorithm, FA2 [Dao2023FA2] shows that parallel granularity and work partitioning alone can produce throughput differences approaching 2×.

consensusc-0577b56572
On H100/H200, BF16/FP16 training should default to FA3 [Shah2024FA3]; on A100/H20 the default remains FA2 [Dao2023FA2], because FA3's main gains depend on Hopper's TMA and warp specialization rather than tricks that port losslessly.
Source papers · 3 [Shah2024FA3][Dao2023FA2] · arXiv 2407.08608 (arxiv.org)
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2407.08608 · report.headline_claims · gpt-5.4

    On H100/H200, BF16/FP16 training should default to FA3 [Shah2024FA3]; on A100/H20, FA2 [Dao2023FA2] remains the default because FA3's main gains depend on Hopper TMA and warp specialization rather than universally portable tricks.

consensusc-4584e13a1b
On the serving side, the main ROI of attention has shifted from "write yet another faster kernel" to decode parallelism and KV strategy; Flash-Decoding [Dao2023FlashDecoding] and FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] show that single-query decode leaves SM utilization systematically low unless the work is chunked in parallel along the sequence dimension (see the split-KV sketch below).
Source papers · 3 [Dao2023FlashDecoding][Ye2024FlashInfer][Ye2025FlashInferEngine]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.headline_claims · gpt-5.4

    On the serving side, the main ROI has shifted from writing yet another faster kernel to decode parallelism and KV strategy; Flash-Decoding [Dao2023FlashDecoding] and FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] show that single-query decode leaves SMs systematically underutilized unless chunked along the sequence dimension.
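    The core of split-KV decode is numerically merging partial softmax results from sequence chunks; a numpy sketch of the log-sum-exp merge (the real kernels parallelize these chunks across SMs):

    ```python
    # Sketch: Flash-Decoding-style split-KV attention for a single query.
    # Each chunk computes a (max, sum-exp, weighted-value) triple; merging
    # with log-sum-exp recovers exactly the full-softmax result.
    import numpy as np

    def partial_attn(q, K, V):
        s = K @ q                      # (chunk,) scores
        m = s.max()
        p = np.exp(s - m)
        return m, p.sum(), p @ V       # chunk max, denominator, numerator

    def split_kv_attention(q, K, V, n_chunks=4):
        parts = [partial_attn(q, Kc, Vc)
                 for Kc, Vc in zip(np.array_split(K, n_chunks),
                                   np.array_split(V, n_chunks))]
        m_all = max(m for m, _, _ in parts)
        denom = sum(se * np.exp(m - m_all) for m, se, _ in parts)
        numer = sum(nv * np.exp(m - m_all) for m, _, nv in parts)
        return numer / denom

    rng = np.random.default_rng(0)
    q = rng.normal(size=64)
    K, V = rng.normal(size=(1024, 64)), rng.normal(size=(1024, 64))
    ref = np.exp(K @ q - (K @ q).max()); ref /= ref.sum()
    assert np.allclose(split_kv_attention(q, K, V), ref @ V)
    ```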

consensusc-2adf1858a8
FlexAttention [Dong2024Flex] should be the default entry point for attention variants; specialized implementations are justified only when mask/score semantics directly change tiling, memory access, or cross-device blockwise execution, which is exactly the boundary marked by FlashMask [Wang2024FlashMask] and RingAttention [Liu2024RingAttention] (see the score_mod sketch below).
Source papers · 4 [Dong2024Flex][Wang2024FlashMask][Liu2024RingAttention] · arXiv 2412.05496 (arxiv.org)
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2412.05496 · report.headline_claims · gpt-5.4

    FlexAttention [Dong2024Flex] should be the default entry point for attention variants; specialized implementations become justified only when mask/score semantics directly alter tiling, memory access, or distributed blockwise execution, which is exactly where FlashMask [Wang2024FlashMask] and RingAttention [Liu2024RingAttention] sit.
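    A minimal sketch against PyTorch's FlexAttention interface (present in recent PyTorch releases, 2.5+; exact availability is an assumption here): variant semantics are expressed as score_mod/mask_mod callables and the same fused kernel is reused.

    ```python
    # Sketch: a causal + ALiBi-style-bias variant expressed via
    # FlexAttention's score_mod/mask_mod instead of a hand-written kernel.
    import torch
    from torch.nn.attention.flex_attention import (
        flex_attention, create_block_mask)

    B, H, S, D = 2, 4, 256, 64
    q, k, v = (torch.randn(B, H, S, D) for _ in range(3))

    def causal(b, h, q_idx, kv_idx):
        return q_idx >= kv_idx                    # mask_mod: allowed pairs

    def alibi_bias(score, b, h, q_idx, kv_idx):
        # score_mod: per-head linear distance penalty (slope is illustrative)
        return score - (h + 1) * 0.05 * (q_idx - kv_idx)

    block_mask = create_block_mask(causal, B, H, S, S)
    out = flex_attention(q, k, v, score_mod=alibi_bias, block_mask=block_mask)
    print(out.shape)  # torch.Size([2, 4, 256, 64])
    ```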

consensusc-d162489cda
FP8 attention training is not a toggle but an error-path management problem: without a BF16 control run, rescale-order checks, and outlier-token monitoring, FA3's FP8 path [Shah2024FA3] should not be treated as a production default [Golden2024FAStable][Micikevicius2022FP8] (a minimal monitoring sketch follows).
Source papers · 4 [Shah2024FA3][Golden2024FAStable][Micikevicius2022FP8] · arXiv 2407.08608 (arxiv.org)
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2407.08608 · report.headline_claims · gpt-5.4

    FP8 attention training is not a toggle but an error-path management problem; without a BF16 control, rescaling-order checks, and outlier-token monitoring, the FP8 path in FA3 [Shah2024FA3] should not be treated as a production default [Golden2024FAStable][Micikevicius2022FP8].
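    A sketch of the kind of cheap monitoring this claim presumes (thresholds are placeholder assumptions, not values from [Shah2024FA3] or [Golden2024FAStable]): run a BF16 control on the same batch and alarm on relative error and outlier-token amplitude rather than loss alone.

    ```python
    # Sketch: error-path monitoring for an FP8 attention path against a
    # BF16 control computed on the same inputs.
    import torch

    def fp8_health(out_fp8: torch.Tensor, out_bf16: torch.Tensor,
                   rel_err_max=2e-2, outlier_rms_mult=8.0):
        diff = out_fp8.float() - out_bf16.float()
        rel_err = diff.norm() / out_bf16.float().norm().clamp_min(1e-12)
        token_rms = out_bf16.float().pow(2).mean(dim=-1).sqrt()  # per token
        outliers = (token_rms > outlier_rms_mult * token_rms.mean()).sum()
        return {"rel_err": rel_err.item(),
                "outlier_tokens": int(outliers),
                "ok": rel_err.item() < rel_err_max}

    bf16 = torch.randn(4, 1024, 128)
    fp8_sim = bf16 + 0.01 * torch.randn_like(bf16)  # stand-in for FP8 output
    print(fp8_health(fp8_sim, bf16))
    ```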

contestedc-1bf48c7652
The counterargument is that richer masks, cross-device blockwise execution, and new hardware features will keep generating demand for specialized kernels; FlashMask [Wang2024FlashMask] and RingAttention [Liu2024RingAttention] are cases in point.
Source papers · 2 [Wang2024FlashMask][Liu2024RingAttention]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.positions[0].counter · gpt-5.4

    [counter to Camp A: FlashAttention is largely the endpoint of attention …] The counterargument is that richer masks, distributed blockwise execution, and new hardware features will keep creating demand for specialized kernels, as illustrated by FlashMask [Wang2024FlashMask] and RingAttention [Liu2024RingAttention].

contestedc-5326dba6ea
The counterargument is that compiler-style abstractions do not cover every case; once mask/score semantics change the memory-access structure, or cross-device blockwise execution is required, specialized implementations can still be faster; see [Wang2024FlashMask][Liu2024RingAttention].
Source papers · 2 [Wang2024FlashMask][Liu2024RingAttention]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.positions[1].counter · gpt-5.4

    [counter to Camp B: The main line should move from hand-written CUDA to …] The counterargument is that compiler-based abstraction does not cover every case; once mask/score semantics alter memory access, or distributed blockwise execution is required, specialized implementations can still win [Wang2024FlashMask][Liu2024RingAttention].

contestedc-e10e20ec56
The counterargument is that in general LLM settings, retrieval and in-context learning remain attention's strengths; the comparisons in Waleffe et al. [Waleffe2024MambaStudy] do not support "full replacement," and RingAttention [Liu2024RingAttention] shows attention can keep extending context through systems design.
Source papers · 2 [Waleffe2024MambaStudy][Liu2024RingAttention]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.positions[2].counter · gpt-5.4

    [counter to Camp C: Attention itself should be replaced by SSM, linear, …] The counterargument is that retrieval and in-context learning remain strong points for attention in general LLM settings; the comparisons in Waleffe et al. [Waleffe2024MambaStudy] do not support full replacement, and RingAttention [Liu2024RingAttention] shows attention can keep scaling context via systems design.

contestedc-5e6b5fc3f2
The counterargument is that lock-in cannot be discussed in the abstract; it depends on hardware share and payoff. If H100/H200 are the training mainstay, refusing FA3 [Shah2024FA3] means voluntarily giving up available throughput, and Triton/FlexAttention [Dong2024Flex] can soften part of the portability concern.
Source papers · 2 [Shah2024FA3][Dong2024Flex]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.positions[3].counter · gpt-5.4

    [counter to Camp D: FA3 embodies NVIDIA generation-specific lock-in, so …] The counterargument is that lock-in cannot be discussed in the abstract; it depends on hardware share and payoff. If H100/H200 already dominate training, avoiding FA3 forfeits available throughput, while Triton/FlexAttention can soften the portability problem.

consensusc-1af683f1b7
Token-level weighting (e.g., Rho-1) can beat example-level filtering on data efficiency, but it moves training-system complexity from the data gate into the training loop (statistics, scheduling, packing interactions); for teams without observability and regression tests, the net gain is easily swallowed by engineering noise. [Rho12024][WhatsInMyBigData2023] (See the selection sketch below.)
Source papers · 2 [Rho12024][WhatsInMyBigData2023]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.headline_claims · gpt-5.2

    Token-level weighting (e.g., Rho-1) can beat example-level filtering on data efficiency, but it shifts complexity from the data gate to the training loop (stats, scheduling, packing interactions); without observability and regression tests, net gains are easily swallowed by engineering noise.
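    The mechanism in one sketch (the 0.6 keep ratio and the reference model are illustrative assumptions): score each token by excess loss over a reference model and keep gradient only on the top fraction; this is precisely the complexity that migrates into the training loop.

    ```python
    # Sketch: Rho-1-style selective language modeling. Tokens are scored by
    # "excess loss" vs. a reference model; only the top fraction trains.
    import torch

    def selective_lm_loss(loss_tok: torch.Tensor,      # (B, T) current CE
                          ref_loss_tok: torch.Tensor,  # (B, T) reference CE
                          keep_frac: float = 0.6) -> torch.Tensor:
        excess = loss_tok - ref_loss_tok               # high = worth learning
        k = max(1, int(excess.numel() * keep_frac))
        thresh = excess.flatten().topk(k).values.min()
        mask = (excess >= thresh).float()              # interacts with
        return (loss_tok * mask).sum() / mask.sum()    # packing and stats

    loss = torch.rand(2, 8, requires_grad=True)
    ref = torch.rand(2, 8)
    selective_lm_loss(loss, ref).backward()
    ```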

contestedc-fbcd4028d2
Two risks are routinely underestimated: first, a proxy's domain of validity and scale drift can leave "old classifiers ruling the data gate of new models"; second, gains can reverse over long training horizons, so an early ladder that does not cover the late stage will misjudge. Nemotron-CC and influence drift both flag these. [NemotronCC2024][AnthropicInfluence2023]
Source papers · 2 [NemotronCC2024][AnthropicInfluence2023]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.positions[0].counter · gpt-5.2

    [counter to Camp A: classifiers + ablation ladders are sufficient (bet o…)] Two risks are often underestimated: (1) proxy domain-of-validity and scale drift can let "old classifiers dominate new-model data gates"; (2) long-horizon training can reverse gains, so early ladders that miss the late stage misjudge [NemotronCC2024][AnthropicInfluence2023].

contestedc-49395cf18a
The strongest negative evidence also comes from influence: <10% overlap of top-influence examples across scales means the "key example set" is not a stable object, and transferring small-model attribution results to large models drifts systematically. TRAK demonstrates deployability, but it does not solve cross-scale stability or the scarcity of end-to-end closed loops. [AnthropicInfluence2023][TRAK2023]
Source papers · 2 [AnthropicInfluence2023][TRAK2023]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.positions[1].counter · gpt-5.2

    [counter to Camp B: influence/attribution is the main path (infer recipe…)] The strongest negative evidence also comes from influence: <10% cross-scale top-influence overlap implies the "key example set" is not stable; transferring small-model attribution to large models shifts systematically; TRAK proves deployability but not cross-scale stability [AnthropicInfluence2023][TRAK2023].

contestedc-81dd04ec60
Identification assumptions are hard costs: IV exclusion restrictions, instrument strength, and the modeling of evaluator bias can all fail and yield directionally wrong answers. The common engineering failure is not "high estimator variance" but "confidently picking the wrong direction." [CausalLL2024][LengthControlledAlpacaEval2024]
Source papers · 2 [CausalLL2024][LengthControlledAlpacaEval2024]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.positions[2].counter · gpt-5.2

    [counter to Camp C: full causal identification is the future (solve conf…)] Identification assumptions are hard costs: IV exclusion restrictions, instrument strength, and modeling evaluator bias can fail and yield directional errors. In practice the common failure is not high variance but confidently wrong direction [CausalLL2024][LengthControlledAlpacaEval2024].

contestedc-288d1e9a94
Two engineering realities constrain pure scale-first: first, token cost and data-availability constraints are harder in 2026; second, coverage is not effective coverage, and noise plus contamination feed directly into memorization and eval-leakage risk. Dedup and auditing are shown across multiple works to improve quality and reduce memorization risk. [Dedup2021][WhatsInMyBigData2023][Memorization2023]
Source papers · 3 [Dedup2021][WhatsInMyBigData2023][Memorization2023]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.positions[3].counter · gpt-5.2

    [counter to Camp D: skip measurement, rely on intuition and scale (scali…)] Two engineering realities constrain pure scale-first: (1) token cost and data availability are harder constraints in 2026; (2) coverage is not effective coverage: noise and contamination flow into memorization and eval-leakage risk; dedup and audits demonstrably help [Dedup2021][WhatsInMyBigData2023][Memorization2023].

consensusc-b764629f1c
Under mid-scale compute, offline mixture search behaves more like "calibration" than "global optimization": RegMix/mixing laws can fit a local response surface from a few points, but the regression target shifts once bucket definitions or dedup versions drift; Held et al. show several utility estimators underperform simple heuristics, so complexity must be justified by ablation [RegMix2024][BiMix2024][Held2025UtilityMix].
Source papers · 3 [RegMix2024][BiMix2024][Held2025UtilityMix]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.headline_claims · gpt-5.2

    Offline mixture search under mid compute behaves more like calibration than global optimization: RegMix/mixing laws fit local response surfaces from few points, but the target drifts when buckets or dedup versions drift; Held et al. show several utility estimators underperform simple heuristics, so complexity must earn its keep via ablations [RegMix2024][BiMix2024][Held2025UtilityMix].

contestedc-be892c99db
The counterexample is "complex but unstable estimators": Held et al. show several utility estimators underperform simple heuristics; these methods are also sensitive to bucket definitions, dedup/filter versions, and training-horizon drift, and when transfer fails there is no interpretable state to roll back to [Held2025UtilityMix][D42023].
Source papers · 1 [Held2025UtilityMix]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[0].counter · gpt-5.2

    [counter to Camp A: Formal search first (laws/regression/robust optimiza…)] A counterpoint is "complex estimators can be unstable": Held et al. show several utility estimators underperform simple heuristics; these methods are also sensitive to bucket definitions, dedup/filter versions, and horizon drift, with no interpretable rollback when transfer fails [Held2025UtilityMix][D42023].

contestedc-91585ddd29
The heuristics' blind spots are non-transferability and non-interpretability: with finer buckets, or when the objective shifts from average metrics to worst-bucket or specific capabilities, pure recipes can require heavy trial-and-error; lacking offline calibration also makes budget planning harder [MixMax2024][RegMix2024].
Source papers · 2 [MixMax2024][RegMix2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Heuristics + curricula are more robust; a few ablati…] The blind spot is transferability and interpretability: with finer buckets, or when objectives shift from average metrics to worst-bucket or specific capabilities, pure recipes may need heavy trial-and-error, and budget planning is harder without offline calibration [MixMax2024][RegMix2024].

contestedc-2851e12c56
The main risks are cost and accounting: extra phases, stable domain accounting, and transfer assumptions (similar bucket definitions and training horizons). With coarse or drifting buckets, online reweighting can average away conflicts between sub-domains within a bucket, yielding behavior that "looks adaptive but is actually uncontrolled" [OrganizeWeb2025][AboutMe2024].
Source papers · 2 [OrganizeWeb2025][AboutMe2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[2].counter · gpt-5.2

    [counter to Camp C: Online/adaptive mixing beats one-shot offline search…] The main risks are cost and accounting: extra phases, stable domain accounting, and transfer assumptions (similar buckets and horizons). With coarse or drifting buckets, online reweighting averages away intra-bucket conflicts, looking adaptive while being uncontrolled [OrganizeWeb2025][AboutMe2024].

contestedc-ba5cbe596c
Two counterexamples must be held at once: first, "cleaner" can delete the coverage a long training horizon needs, and Nemotron-CC argues explicitly against over-filtering [NemotronCC2024]; second, the filtering/valuation method itself can mislead, as Data Shapley fails on selection tasks, showing that "quality" also needs constraints and validation [RethinkingDataShapley2024].
Source papers · 2 [NemotronCC2024][RethinkingDataShapley2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[3].counter · gpt-5.2

    [counter to Camp D: Ratios are second-order; quality/selection is first-…] Two counterexamples matter simultaneously: (1) "cleaner" can remove coverage needed for long-horizon training; Nemotron-CC explicitly argues against over-filtering [NemotronCC2024]; (2) valuation itself can mislead: Data Shapley fails on selection tasks, so "quality" needs constraints and validation [RethinkingDataShapley2024].

consensusc-42072da9a5
Raising the RoPE base from 10000 straight to the scale of the target window (e.g., ~500000 for 128K) is not a "mystery hyperparameter": when the base is too small, the low-frequency dimensions lack phase coverage at the target length, far positions become nearly indistinguishable, and this caps the learnable effective context length. [Xu2024RoPEBaseBounds][Dubey2024Llama3] (See the wavelength check below.)
Source papers · 2 [Xu2024RoPEBaseBounds][Dubey2024Llama3]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.headline_claims · gpt-5.2

    Setting RoPE base from 10000 directly to the target-window scale (e.g., ~500000 for 128K) is not a magic knob: if base is too small, low-frequency dimensions lack phase coverage at the target length, making far positions indistinguishable and capping the learnable effective context length.
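    A sketch of the underlying arithmetic (pure numpy; framing follows the phase-coverage argument, with illustrative sizes): the i-th RoPE dimension pair has wavelength 2π·base^(2i/d); pairs whose wavelength exceeds the window stay within one period and can still order far positions, and at base 10000 none survive at 128K.

    ```python
    # Sketch: count RoPE dimension pairs that complete < 1 full rotation
    # over the target window; those are the pairs that can still separate
    # far positions monotonically.
    import numpy as np

    def slow_dim_pairs(base: float, d_head: int, target_len: int) -> int:
        i = np.arange(d_head // 2)
        wavelength = 2 * np.pi * base ** (2 * i / d_head)  # tokens/period
        return int((wavelength >= target_len).sum())

    for base in (10_000, 500_000):
        n = slow_dim_pairs(base, d_head=128, target_len=131_072)
        print(f"base={base:>7}: {n}/64 dim pairs stay within one period "
              f"over 128K")  # 0/64 at base=10000, ~15/64 at base=500000
    ```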

consensusc-91a7a442cd
Once the target crosses 512K–2M, a single global scaling (whether PI-style interpolation or a global NTK-aware scale) produces per-dim mismatch across frequency bands; LongRoPE instead searches for non-uniform per-dim scales and pairs them with a longer fine-tuning budget, turning "can run" into "usable with more control." [Ding2024LongRoPE]
Source papers · 1 [Ding2024LongRoPE]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.headline_claims · gpt-5.2

    At 512K–2M targets, a single global scaling (PI-style interpolation or a global NTK-aware scale) induces per-dimension mismatch across frequency bands; LongRoPE learns non-uniform per-dim scales via search and pairs it with a larger fine-tuning budget. [Ding2024LongRoPE]

consensusc-d54f0e452d
"Advertised 128K" does not imply "effective 128K": RULER measures many models as effectively ~32K across multiple task families; acceptance should therefore rest on cross-length task curves from RULER/LongBench/LV-Eval, not on PPL or a single needle point. [Hsieh2024RULER][Bai2023LongBench][Yuan2024LVEval]
Source papers · 3 [Hsieh2024RULER][Bai2023LongBench][Yuan2024LVEval]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.headline_claims · gpt-5.2

    "Advertised 128K" does not imply "effective 128K": RULER finds many models effectively usable only up to ~32K across task families; acceptance should rely on cross-length task curves from RULER/LongBench/LV-Eval rather than PPL or single-point needle tests.

consensusc-40c4659322
Routes that bypass RoPE/attention are cheaper in complexity (e.g., linear-time Mamba, memory-augmented RMT, Infini-attention), but on recall-heavy multi-hop tracing tasks they cannot be justified by throughput or max window alone; they need same-benchmark comparisons against natively long-window trained or continually pretrained models to judge whether they truly substitute. [Gu2023Mamba][Bulatov2023RMT][Munkhdalai2024InfiniAttention]
Source papers · 3 [Gu2023Mamba][Bulatov2023RMT][Munkhdalai2024InfiniAttention]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.headline_claims · gpt-5.2

    Bypass routes can be cheaper in complexity (e.g., linear-time Mamba, memory-augmented RMT, Infini-attention), but for recall-heavy multi-hop tracing they cannot be justified by throughput or max window alone; they need benchmark-matched comparisons against native long-window baselines.

contestedc-c93832b31f
Cost and data are hard constraints: continual pretraining needs long-text data and systems throughput, and it is unfriendly to already-deployed models; in the 32K–128K range, retrofits usually ship faster and are already sufficient for many task families. [Peng2023YaRN][Zhu2023PoSE]
Source papers · 2 [Peng2023YaRN][Zhu2023PoSE]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[0].counter · gpt-5.2

    [counter to Camp A: native long-context (ABF + curriculum) is the cleane…] Cost and data are hard constraints: continual pretraining needs long-text data and systems throughput, and is unfriendly to existing deployed models; in the 32K–128K range retrofits ship faster and suffice for many task families [Peng2023YaRN][Zhu2023PoSE].

contestedc-20d0b5fb09
PI's advantage is simplicity and a low barrier to entry: minimal implementation changes and few fine-tuning steps, suited to quick "get the window running" validation; if tasks are mostly sparse retrieval, PI's side effects may not surface immediately. [Chen2023PI][Bai2023LongBench]
Source papers · 2 [Chen2023PI][Bai2023LongBench]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[1].counter · gpt-5.2

    [counter to Camp B: YaRN, not PI, should be the default for 32K–128K ret…] PI's advantage is simplicity and low barrier: minimal implementation changes and fewer fine-tune steps, useful for quick "make the window run" validation; if tasks are mostly sparse retrieval, PI's side effects may not show up immediately [Chen2023PI][Bai2023LongBench].

contestedc-d6fd59bd0e
Search plus long fine-tuning drives up cost, and systems bottlenecks (KV cache/communication) may dominate before RoPE parameterization does; for many product tasks, the marginal gain beyond 512K may not cover the cost. [Liu2024RingAttentionWorldModel][Hsieh2024RULER]
Source papers · 1 [Hsieh2024RULER]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[2].counter · gpt-5.2

    [counter to Camp C: ≥512K needs LongRoPE-style per-dim search; global fo…] Search and long fine-tuning raise costs, and systems bottlenecks (KV cache/communication) may dominate before RoPE parameterization does; for many product tasks, marginal gains beyond 512K may not cover their cost.

contestedc-d8f49214d5
Cost advantages do not automatically convert into recall-heavy quality: summarize/write-back introduces information bottlenecks, retrieval propagates recall errors, and SSM information routing works differently; without comparisons against native long windows on task families like RULER/LongBench, it is hard to tell whether the gain is anything beyond throughput. [Hsieh2024RULER][Bai2023LongBench][Liu2023LostInTheMiddle]
Source papers · 3 [Hsieh2024RULER][Bai2023LongBench][Liu2023LostInTheMiddle]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[3].counter · gpt-5.2

    [counter to Camp D: bypassing RoPE/attention (SSM/external memory/retrie…] Cost advantages do not automatically translate to recall-heavy quality: summarization/write-back introduces information bottlenecks, retrieval introduces recall-error propagation, and SSM routing differs; without RULER/LongBench-style matched comparisons it may just be throughput [Hsieh2024RULER][Bai2023LongBench][Liu2023LostInTheMiddle].

consensusc-a8051d8272
In the 32K–128K regime, positional extrapolation is no longer the main bottleneck; data distribution and evaluation protocol more often set the ceiling on "effective context" [PI2023][YaRN2023][Fu2024DE128K][Gao2024EffectiveLongCtx].
Source papers · 5 [PI2023][YaRN2023][Fu2024DE128K][Gao2024EffectiveLongCtx] · arXiv 2309.00071 (arxiv.org)
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2309.00071 · report.headline_claims · gpt-5.4

    In the 32K–128K regime, positional extrapolation is no longer the main bottleneck; data distribution and evaluation protocol more often determine the ceiling of effective context [PI2023][YaRN2023][Fu2024DE128K][Gao2024EffectiveLongCtx].

consensusc-33c655654f
Passing NIAH or perplexity is not enough to show a model will use mid- and far-context evidence on real long tasks; RULER, LV-Eval, LongBench, NoCha, and RepoQA all supply counterexamples [RULER2024][LVEval2024][LongBench2023][NoCha2024][RepoQA2024].
Source papers · 6 [RULER2024][LVEval2024][LongBench2023][NoCha2024][RepoQA2024] · arXiv 2404.06654 (arxiv.org)
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2404.06654 · report.headline_claims · gpt-5.4

    Passing NIAH or perplexity is not enough to show that a model will use mid- and far-context evidence on real long tasks; RULER, LV-Eval, LongBench, NoCha, and RepoQA all provide counterexamples [RULER2024][LVEval2024][LongBench2023][NoCha2024][RepoQA2024].

consensusc-3adbf9a261
Raising the long-document share to roughly 25%+, controlling the packing truncation rate, and growing the window in stages is usually steadier than simply appending more long tokens [Fu2024DE128K][Xiong2023EffectiveLongContext][Ding2024Packing][Xu2025UltraLong].
Source papers · 5 [Fu2024DE128K][Xiong2023EffectiveLongContext][Ding2024Packing][Xu2025UltraLong] · arXiv 2309.16039 (arxiv.org)
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2309.16039 · report.headline_claims · gpt-5.4

    Raising long-document ratio to roughly 25%+, controlling packing truncation, and using staged window growth are usually more reliable than simply adding more long tokens [Fu2024DE128K][Xiong2023EffectiveLongContext][Ding2024Packing][Xu2025UltraLong].

consensusc-b97a848aaa
At 1M+, the problem shifts back to systems stability: kernels, parallelism, and positional-scaling error jointly determine whether training can still preserve short-context ability [FA2][LongRoPE2024][Gemini15][Xu2025UltraLong].
Source papers · 3 [LongRoPE2024][Xu2025UltraLong] · arXiv 2402.13753 (arxiv.org)
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2402.13753 · report.headline_claims · gpt-5.4

    At 1M+, the problem shifts back to systems stability: kernels, parallelism, and positional-scaling error jointly determine whether training can preserve short-context ability [FA2][LongRoPE2024][Gemini15][Xu2025UltraLong].

contestedc-4b862a5ec5
Press et al. [ALiBi2021] already provide a non-RoPE counterexample for length extrapolation; Kazemnejad et al. [Kazemnejad2023PE] and Lu et al. [Lu2024ControlledLongContext] likewise show that the positional method matters but is not the sole determinant, since data and task protocol can rewrite the conclusion.
Source papers · 2 [ALiBi2021][Kazemnejad2023PE]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.positions[0].counter · gpt-5.4

    [counter to Camp A: explicit positional design is required, and RoPE ext…] Press et al. [ALiBi2021] already provide a non-RoPE counterexample for length extrapolation; Kazemnejad et al. [Kazemnejad2023PE] and Lu et al. [Lu2024ControlledLongContext] show positional method matters but is not the only determinant; data and protocol can rewrite conclusions.

contestedc-9dd7f457f3
Systems optimization does decide whether training enters a workable regime, but Hsieh et al. [RULER2024], Gao et al. [Gao2024EffectiveLongCtx], and Fu et al. [Fu2024DE128K] show that trainable is not the same as usable. Many 128K failures start with data distribution, packing, and task supervision, not the parallelism strategy.
Source papers · 4 [RULER2024][Gao2024EffectiveLongCtx][Fu2024DE128K] · arXiv 2404.06654 (arxiv.org)
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2404.06654 · report.positions[1].counter · gpt-5.4

    [counter to Camp B: scaling to million tokens is mostly a parallelism pr…] Systems optimization does determine whether training enters a workable regime, but Hsieh et al. [RULER2024], Gao et al. [Gao2024EffectiveLongCtx], and Fu et al. [Fu2024DE128K] show trainable ≠ usable; many 128K failures start in data distribution, packing, and supervision, not parallelism.

contestedc-2dac674e5a
Zhou et al. [BenchmarkCheater2023] do warn that benchmarks can be contaminated, but that is no reason to fall back to a single proxy. Hsieh et al. [RULER2024], Karpinska et al. [NoCha2024], Xu et al. [RepoQA2024], and Gao et al. [Gao2024EffectiveLongCtx] have repeatedly documented concrete counterexamples of proxy distortion.
Source papers · 5 [BenchmarkCheater2023][RULER2024][NoCha2024][RepoQA2024][Gao2024EffectiveLongCtx]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.positions[2].counter · gpt-5.4

    [counter to Camp C: NIAH/perplexity are sufficient; task benchmarks are …] Zhou et al. [BenchmarkCheater2023] do warn that benchmarks can be contaminated, but that is not a reason to fall back to a single proxy. Hsieh et al. [RULER2024], Karpinska et al. [NoCha2024], Xu et al. [RepoQA2024], and Gao et al. [Gao2024EffectiveLongCtx] repeatedly document concrete proxy distortions.

contestedc-6255e6663e
Alternative architectures are attractive on complexity grounds, but public evidence still speaks mostly to feasibility and efficiency, not head-to-head comparisons against window-extended Transformers on real tasks. Lu et al. [LongHeads2024] even argue that standard multi-head attention itself has underrated long-context capacity.
Source papers · 1 [LongHeads2024]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.positions[3].counter · gpt-5.4

    [counter to Camp D: sparse, memory, or linear-attention alternatives wil…] Alternative architectures are attractive on complexity grounds, but public evidence still leans more toward feasibility and efficiency than head-to-head comparisons against window-extended Transformers on real tasks; [LongHeads2024] even argues standard multi-head attention is underrated for long context.

consensusc-c354d502ed
Decoupling load balancing from the main loss (bias-EMA / sign-gradient updates) usually cuts the risk of "implementation details dominating the balancing outcome" by an order of magnitude: the aux loss's statistics scope, DP synchronization, and detach choices can shift usage CV and specialization more than the aux coefficient itself [Qiu2025DemonsLBL], whereas bias EMA writes the objective directly as outer-loop control of the router bias [Wang2024AuxFree][DeepSeekAI2024V3]. (See the update sketch below.)
Source papers · 3 [Qiu2025DemonsLBL][Wang2024AuxFree][DeepSeekAI2024V3]
2 observations · MOE Landscape
Evidence (2)
  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    Decoupling load balancing from the main loss (bias EMA / sign-gradient updates) typically reduces "implementation details dominate balancing" risk by an order: aux-loss stat scope, DP sync, and detach choices can change usage CV and specialization more than the aux coefficient itself.

  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    Decoupling balancing from main gradients (bias EMA / sign updates) demotes aux-loss implementation sensitivity from a dominant to a secondary factor: stat scope, DP sync, and detach choices can change usage CV and specialization more than the aux coefficient itself.
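    A sketch of the decoupling (the update rate is a placeholder; DeepSeek-V3 uses a sign-based rule of this shape): the bias enters only top-k selection, never the mixing weights, so balancing pressure stays out of the main gradients.

    ```python
    # Sketch: aux-loss-free load balancing via an outer-loop router bias,
    # in the spirit of [Wang2024AuxFree]/[DeepSeekAI2024V3].
    import torch

    def route(scores: torch.Tensor, bias: torch.Tensor, k: int = 2):
        _, idx = (scores + bias).topk(k, dim=-1)          # biased selection
        gate = torch.gather(scores, -1, idx).softmax(-1)  # unbiased mixing
        return idx, gate

    @torch.no_grad()
    def update_bias(bias: torch.Tensor, idx: torch.Tensor,
                    n_experts: int, rate: float = 1e-3):
        load = torch.bincount(idx.flatten(), minlength=n_experts).float()
        err = load.mean() - load           # overloaded expert -> negative
        bias += rate * err.sign()          # sign update, outer loop only

    E, T = 8, 4096
    bias, scores = torch.zeros(E), torch.randn(T, E)
    for _ in range(100):
        idx, gate = route(scores, bias)
        update_bias(bias, idx, E)
    print(torch.bincount(idx.flatten(), minlength=E))  # loads flatten out
    ```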

consensusc-7340a446ee
Fine-grained experts (64–128, up to 256 where needed) plus one shared expert works more like a structural lever that lowers the specialization barrier: the shared expert absorbs high-frequency common patterns, so routed experts are not forced to learn generic functions, mitigating expert homogenization and collapse; this delivers zero-shot gains (reported range 1.8–3.4 pp) more reliably than merely swapping the routing formula [Dai2024DeepSeekMoE][DeepSeekAI2024V2].
Source papers · 2 [Dai2024DeepSeekMoE][DeepSeekAI2024V2]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    Fine-grained experts (64–128, up to 256) plus one shared expert is a structural way to lower the specialization barrier: the shared expert carries high-frequency common patterns, reducing pressure for routed experts to learn generic functions, with reported zero-shot gains of 1.8–3.4 pp.

consensusc-61819042bb
The main value of router z-loss is numerical stability, not "smarter routing": it suppresses router-logit blow-up and lowers the odds of early traffic collapsing onto a few experts; in stacks that gate hard on token drop and dead experts, z-loss (commonly 1e-3) acts more like a fuse [Zoph2022STMoE][DeepSeekAI2024V3]. (See the sketch below.)
Source papers · 2 [Zoph2022STMoE][DeepSeekAI2024V3]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    The main value of router z-loss is numerical stability rather than "smarter routing": it prevents router-logit blow-up and reduces early traffic collapse to a few experts; in stacks that gate on token drop and dead experts, z-loss (often ~1e-3) acts more like a fuse.
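    The z-loss itself is nearly a one-liner: it penalizes the squared log-partition-function of the router logits so their scale cannot drift upward (the 1e-3 coefficient is the commonly cited value, not a tuned recommendation):

    ```python
    # Sketch: router z-loss as defined in ST-MoE [Zoph2022STMoE].
    import torch

    def router_z_loss(logits: torch.Tensor, coef: float = 1e-3) -> torch.Tensor:
        # logits: (tokens, n_experts)
        z = torch.logsumexp(logits, dim=-1)  # log partition fn per token
        return coef * (z ** 2).mean()

    logits = 30 * torch.randn(4096, 8, requires_grad=True)  # drifting scale
    loss = router_z_loss(logits)
    loss.backward()                          # gradient pushes |logits| down
    print(loss.item())
    ```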

consensusc-31e2bb5eea
"Dense→MoE upcycling is always cheaper" does not hold: upcycling scaling laws show regimes where gains saturate early, yielding "more experts without more quality" [Liew2025Upcycling]; ROI evaluation must therefore also count extra stabilization steps, all-to-all communication, and dispatch-kernel efficiency [He2024UpcyclingLLMtoMoE][Tan2024ScatterMoE].
Source papers · 4 [Liew2025Upcycling][He2024UpcyclingLLMtoMoE][Tan2024ScatterMoE] · arXiv 2502.03009 (arxiv.org)
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · arXiv 2502.03009 · report.headline_claims · gpt-5.2

    "Dense→MoE upcycling is always cheaper" does not hold: upcycling scaling laws show early saturation regimes with "more experts without more quality" [Liew2025Upcycling]; ROI must include extra stabilization steps, all-to-all communication, and dispatch-kernel efficiency [He2024UpcyclingLLMtoMoE][Tan2024ScatterMoE].

consensusc-d64fc0277f
Evidence that "learned routing/balancing is always necessary" is thin: systematic experiments find frozen random routers can approach learned routers in some settings [Fan2024EmpiricalMoEChoices]; but a second strong LLM-scale replication is missing, so treating learned routing as a default premise remains risky.
Source papers · 1 [Fan2024EmpiricalMoEChoices]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    Evidence is insufficient that "learned routing/balancing is always necessary": systematic experiments show frozen random routers can be close to learned routers in some settings [Fan2024EmpiricalMoEChoices]; but a second strong LLM-scale replication is missing, so learned routing as a default premise stays risky.

contestedc-f8f0b9a22e
The rebuttal centers on full-lifecycle ROI: all-to-all, dispatch, and router stabilization phases add wall-clock and engineering cost that can cancel the compute advantage; and upcycling scaling laws show a saturation regime for "reuse dense + add experts" [Liew2025Upcycling][Tan2024ScatterMoE].
Source papers · 2 [Liew2025Upcycling][Tan2024ScatterMoE]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[0].counter · gpt-5.2

    [counter to Camp A: MoE becomes the default backbone (dense kept for sma…] Counterarguments focus on full-lifecycle ROI: all-to-all, dispatch, and router stabilization can negate compute advantages in wall-clock and engineering cost; upcycling laws show saturation for reuse-dense-add-experts [Liew2025Upcycling][Tan2024ScatterMoE].

contestedc-1df7c66bbf
The rebuttal is extrapolation risk: existing results lack a second strong LLM-scale replication and do not cover training control of the DeepSeek-V3 kind, where dead experts and token drop are hard gates; under real all-to-all and long training horizons, routing learning may matter mainly for "avoiding incidents" rather than average validation loss. [DeepSeekAI2024V3][Zoph2022STMoE]
Source papers · 2 [DeepSeekAI2024V3][Zoph2022STMoE]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[2].counter · gpt-5.2

    [counter to Camp C: learned routing/balancing is overrated (random/froze…] The counter is extrapolation risk: current evidence lacks a second strong LLM-scale replication and does not cover training control like DeepSeek-V3's hard gates on dead experts and token drop; at real scale, learned routing may mainly buy incident avoidance, not average loss.

contestedc-fa2a0b6861
The rebuttal is an evidence gap: there are no public LLM-scale matched comparisons of "sparse SFT/RLHF vs dense SFT/RLHF"; systems intuition alone cannot decide the stage-switch policy, and the DeepSeek series has not released systematic post-training sparsity controls either. [DeepSeekAI2024V3]
Source papers · 1 [DeepSeekAI2024V3]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[3].counter · gpt-5.2

    [counter to Camp D: MoE is mainly for pretraining; post-training should …] The counter is an evidence gap: there are few public LLM-scale matched comparisons of sparse SFT/RLHF vs dense SFT/RLHF; systems intuition alone is insufficient to decide stage-switching policy; DeepSeek has not released systematic post-training sparsity controls [DeepSeekAI2024V3].

consensusc-344c9c9f47
In modern transformer stacks, original µP is not "usable by default": without corrections for modules like QK-Norm, tied embeddings, and residual scaling, LR transfer drifts in architecture-dependent ways, and a coord check usually exposes the mismatch first [Yang2022muP][Cerebras2024CompleteP][Lingle2024EmpiricalMuP]. (See the coord-check sketch below.)
Source papers · 3 [Yang2022muP][Cerebras2024CompleteP][Lingle2024EmpiricalMuP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.headline_claims · gpt-5.4

    On modern transformer stacks, original µP is not 'usable by default'; without fixing modules such as QK-Norm, tied embeddings, and residual scaling, LR transfer shows architecture-dependent drift, and coord check usually reveals the mismatch first.
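    A coord check in sketch form (toy model; "flat across widths" as the pass criterion is the standard reading, the rest is assumption): train a few steps at several widths and record hidden-activation RMS; under a correct parameterization the curve stays O(1) in width, and modules that break the scaling show up as RMS drifting with width.

    ```python
    # Sketch: a minimal coord check over widths.
    import torch
    import torch.nn as nn

    def coord_check(widths=(64, 128, 256, 512), steps=5, lr=1e-2):
        stats = {}
        for w in widths:
            torch.manual_seed(0)
            net = nn.Sequential(nn.Linear(32, w), nn.ReLU(), nn.Linear(w, 1))
            opt = torch.optim.SGD(net.parameters(), lr=lr)
            x, y = torch.randn(256, 32), torch.randn(256, 1)
            for _ in range(steps):
                opt.zero_grad()
                ((net(x) - y) ** 2).mean().backward()
                opt.step()
            h = net[1](net[0](x))             # post-activation hidden state
            stats[w] = h.pow(2).mean().sqrt().item()
        return stats  # flat across widths = pass; growth/shrink = drift

    print(coord_check())  # SP + a fixed lr typically drifts with width
    ```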

consensusc-aacd8bd581
With the SP recipe held fixed, empirical formulas plus 1–2 small sweeps can often get target-scale LR/batch initial values within ~10% error; but once aspect ratio, training duration, or inference-cost objectives change, the formulas drift markedly, and the optimal token:param point can move by more than 3× [Bi2024DeepSeekLLM][Dey2023CerebrasGPT][McLeish2025Gemstones][Porian2024ResolvingScaling].
Source papers · 4 [Bi2024DeepSeekLLM][Dey2023CerebrasGPT][McLeish2025Gemstones][Porian2024ResolvingScaling]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.headline_claims · gpt-5.4

    When the SP recipe is fixed, empirical formulas plus 1–2 small sweeps often get target-scale LR/batch initialization within about 10%, but once aspect ratio, training duration, or inference-cost objectives change, the fit drifts substantially and the optimal token:param point can move by more than 3×.

consensusc-e815af240c
Under AdamW, weight decay is an independent control axis and should not be treated as a fixed background parameter; in several practical settings, wd affects LR-transfer error more than the parameterization itself [Kosson2025WDMoreThanMuP][Wang2024AdamWWD][Loshchilov2017AdamW][Kosson2023RotationalEquilibrium].
Source papers · 4 [Kosson2025WDMoreThanMuP][Wang2024AdamWWD][Loshchilov2017AdamW] · arXiv 2510.19093 (arxiv.org)
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · arXiv 2510.19093 · report.headline_claims · gpt-5.4

    Under AdamW, weight decay is an independent control axis and should not be treated as a fixed background setting; in several practical settings, wd can affect LR transfer error more than the parameterization itself [Kosson2025WDMoreThanMuP][Wang2024AdamWWD][Loshchilov2017AdamW].

consensusc-053bcc5939
Depth and precision are not secondary corrections to width transfer: when the layer count doubles, first run a small-scale controlled comparison under depthwise rules; when moving from bf16/fp16 to fp8, fold activation/gradient/weight RMS monitoring into the transfer target rather than only retuning LR [Bordelon2023DepthwiseTransfer][Yang2023TPVI][Blake2024UMUP][Micikevicius2017MixedPrecision].
Source papers · 4 [Bordelon2023DepthwiseTransfer][Yang2023TPVI][Blake2024UMUP][Micikevicius2017MixedPrecision]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.headline_claims · gpt-5.4

    Depth and precision are not secondary corrections to width transfer: when depth doubles, first run a small controlled comparison using depthwise rules; when moving from bf16/fp16 to fp8, include activation/gradient/weight RMS monitoring in the transfer target rather than tuning LR alone.

contestedc-926bbc86f3
The counterargument is that in modern training stacks, modules, optimizers, precision, and norm design all keep changing, and few teams truly satisfy those assumptions; under such conditions, empirical formulas plus small sweeps are often cheaper and steadier [Lingle2024EmpiricalMuP][Bi2024DeepSeekLLM][Dey2023CerebrasGPT].
Source papers · 3 [Lingle2024EmpiricalMuP][Bi2024DeepSeekLLM][Dey2023CerebrasGPT]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[0].counter · gpt-5.4

    [counter to Camp A: Complete-P should be the default starting point, and…] The counterargument is that modern training stacks keep changing modules, optimizers, precision, and norm design, and few teams actually satisfy these assumptions. Under those conditions, empirical formulas plus small sweeps are often cheaper and more stable [Lingle2024EmpiricalMuP][Bi2024DeepSeekLLM][Dey2023CerebrasGPT].

contestedc-9d18e979e3
The counterargument is that formula stability depends heavily on an unchanged recipe; once aspect ratio, training duration, or inference-cost objectives change, the optimum drifts and old fits can no longer be trusted [McLeish2025Gemstones][Porian2024ResolvingScaling][Sardana2023InferenceScaling].
Source papers · 3 [McLeish2025Gemstones][Porian2024ResolvingScaling][Sardana2023InferenceScaling]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[1].counter · gpt-5.4

    [counter to Camp B: empirical formulas plus small sweeps are enough; µP …] The counterargument is that formula stability depends heavily on an unchanged recipe; once aspect ratio, training duration, or inference-cost objectives change, the optimum drifts and old fits stop being trustworthy [McLeish2025Gemstones][Porian2024ResolvingScaling][Sardana2023InferenceScaling].

contestedc-1379d7a002
The counterargument is that end-to-end automation is still too expensive at LLM budgets, and starting from a bad initial point burns large amounts of budget in clearly suboptimal regions; the more realistic workflow is to pull the starting point close with a parameterization or formulas first, then let BO patch the gaps locally [CARBS2024][Bi2024DeepSeekLLM][Cerebras2024CompleteP].
Source papers · 3 [CARBS2024][Bi2024DeepSeekLLM][Cerebras2024CompleteP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[2].counter · gpt-5.4

    [counter to Camp C: end-to-end BO / automatic optimization will replace …] The counterargument is that end-to-end automation is still too expensive at LLM budgets, and starting from a poor initialization wastes compute in obviously suboptimal regions; pull the start point close first, then let BO patch locally [CARBS2024][Bi2024DeepSeekLLM][Cerebras2024CompleteP].

contestedc-eb105300ee
The counterargument is that if every hyperparameter is modeled separately, transfer's simplicity disappears and the process collapses back into full search; the value of a parameterization is precisely that it compresses away the dominant axes first [Yang2022muP][Cerebras2024CompleteP].
Source papers · 2 [Yang2022muP][Cerebras2024CompleteP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[3].counter · gpt-5.4

    [counter to Camp D: transfer error is dominated by non-transferable hype…] The counterargument is that if every hyperparameter is modeled separately, the simplicity of transfer disappears and the process collapses back into full search. The value of a parameterization is precisely that it compresses away the dominant axes first [Yang2022muP][Cerebras2024CompleteP].

consensusc-08258626fa
The main risk variable of repetition is not "whether repeats exist" but "how total exposure is distributed globally"; cross-corpus overlap means single-pool dedup underestimates true exposure [Elazar2023WhatsInMyBigData].
Source papers · 2 [Elazar2023WhatsInMyBigData] · arXiv 2310.20707 (arxiv.org)
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2310.20707 · report.headline_claims · gpt-5.4

    The main risk variable is not simply whether repetition exists, but how total exposure is distributed globally; cross-corpus overlap makes per-pool dedup underestimate true exposure [Elazar2023WhatsInMyBigData].

consensusc-07c93e5b71
For benchmarks, PII, and copyrighted text, a default of 0–1 exposures is steadier than mixing a small share into the main pool and training multiple passes, because repeated exposure raises extractable-memorization risk as model size and context length grow [Deng2023BenchmarkContamination][Carlini2022Memorization].
Source papers · 1 [Carlini2022Memorization]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.headline_claims · gpt-5.4

    For benchmarks, PII, and copyrighted text, a default of 0–1 exposure is safer than mixing them into the main pool and training for multiple passes, because repeated exposure raises extractable memorization risk as model size and context length grow.

consensusc-d332095e56
Semantic dedup targets a second redundancy layer that exact dedup cannot reach, but it should be built on top of strong exact/near-exact dedup, not replace it [SemDeDup2023][D42023].
Source papers · 2 [SemDeDup2023] · arXiv 2303.09540 (arxiv.org)
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2303.09540 · report.headline_claims · gpt-5.4

    Semantic dedup addresses a second redundancy layer that exact dedup misses, but it should sit on top of strong exact/near-exact dedup rather than replace it [SemDeDup2023][D42023].

contestedc-19e7735e3b
The problem is that it easily lumps "passive web redundancy" together with "deliberate multi-epoch training on a finite pool." Muennighoff et al. [Muennighoff2023DataConstrained] show that in a high-quality finite pool, the first 2–4 uniform repetitions are not equivalent to pure noise.
Source papers · 1 [Muennighoff2023DataConstrained]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.positions[0].counter · gpt-5.4

    [counter to Camp A: Dedup as much as possible; repetition is mostly nois…] The weakness is that it can collapse "passive web redundancy" and "active multi-epoch training on a finite pool" into one bucket. Muennighoff et al. [Muennighoff2023DataConstrained] show that in a high-quality finite pool, the first 2–4 uniform epochs are not pure noise.

contestedc-a0c71644f9
The problem is that exact/near-exact duplication remains a cheap, high-density redundancy source. Lee et al. [Lee2021Dedup] already show that substring- and MinHash-level dedup yields direct gains; skipping it and going straight to semantic dedup costs more and does not necessarily attack the largest mass first. (See the MinHash sketch below.)
Source papers · 2 [Lee2021Dedup] · arXiv 2107.06499 (arxiv.org)
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2107.06499 · report.positions[2].counter · gpt-5.4

    [counter to Camp C: The real battleground is semantic dedup; exact dedup…] The weakness is that exact/near-exact duplication remains a cheap and dense redundancy source. Lee et al. [Lee2021Dedup] already show that substring- and MinHash-level dedup pays off directly; jumping straight to semantic dedup is costlier and may not hit the biggest mass first.
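    A self-contained MinHash sketch of the near-exact layer this camp says should come first (shingle size, num_perm, and any matching threshold are placeholder choices, not the [Lee2021Dedup] configuration):

    ```python
    # Sketch: MinHash near-duplicate detection over character shingles.
    import hashlib

    def shingles(text: str, n: int = 5) -> set[str]:
        return {text[i:i + n] for i in range(max(1, len(text) - n + 1))}

    def minhash(text: str, num_perm: int = 64) -> list[int]:
        sig = []
        for seed in range(num_perm):
            sig.append(min(
                int(hashlib.blake2b(f"{seed}:{s}".encode(),
                                    digest_size=8).hexdigest(), 16)
                for s in shingles(text)))
        return sig

    def est_jaccard(a: list[int], b: list[int]) -> float:
        # fraction of matching signature slots estimates Jaccard similarity
        return sum(x == y for x, y in zip(a, b)) / len(a)

    doc1 = "the quick brown fox jumps over the lazy dog " * 3
    doc2 = doc1.replace("lazy", "sleepy")
    print(est_jaccard(minhash(doc1), minhash(doc2)))  # high -> near-dup
    ```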

contestedc-05d497bfa9
The usual counter is: with a small enough mixture share and a large enough total token count, the risk is negligible. But Carlini et al. [Carlini2022Memorization] rebut exactly this averaging mindset: repeated exposure and long contexts amplify localized risk.
Source papers · 2 [Carlini2022Memorization] · arXiv 2202.07646 (arxiv.org)
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2202.07646 · report.positions[3].counter · gpt-5.4

    [counter to Camp D: For sensitive/eval/copyrighted data, zero repetition…] The usual counterargument is that if the mixture ratio is tiny and the total token count is huge, the risk is negligible. Carlini et al. [Carlini2022Memorization] argue against exactly this averaging: repeated exposure and longer contexts amplify localized risk.

consensusc-e1f21a3b33
Within a training loop that holds tokenizer, objective, and model family fixed, power-law fits of validation cross-entropy (equivalently, log PPL) against compute, parameter count, and token count reproduce across multiple orders of magnitude; the extrapolation error when used for budget allocation and early stopping is usually smaller than the cost error of just running one big training. [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Porian2024ResolvingDiscrepancies] (See the fit sketch below.)
Source papers · 2 [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Within the same tokenizer/objective/model family, validation cross-entropy (log PPL) follows reproducible power-law fits over multiple orders of magnitude and is cost-effective for budgeting and early stopping; the extrapolation error is typically smaller than the cost of simply running one large training.
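    A minimal fit of the kind implied (synthetic data; a fixed exponent grid plus linear least squares as a stand-in for the full Chinchilla fitting machinery):

    ```python
    # Sketch: fit L(N, D) = E + A/N^alpha + B/D^beta to synthetic losses.
    import numpy as np

    rng = np.random.default_rng(0)
    N = 10 ** rng.uniform(7, 9, size=40)           # parameter counts
    D = 10 ** rng.uniform(9, 11, size=40)          # token counts
    true = 1.7 + 400 * N ** -0.34 + 1e4 * D ** -0.28
    loss = true + rng.normal(0, 0.01, size=40)     # noisy observations

    best = None
    for alpha in np.linspace(0.2, 0.5, 31):        # grid over exponents
        for beta in np.linspace(0.2, 0.5, 31):
            X = np.stack([np.ones_like(N), N ** -alpha, D ** -beta], axis=1)
            coef, *_ = np.linalg.lstsq(X, loss, rcond=None)
            r = ((X @ coef - loss) ** 2).sum()
            if best is None or r < best[0]:
                best = (r, alpha, beta, coef)
    print("alpha=%.2f beta=%.2f E=%.2f" % (best[1], best[2], best[3][0]))
    ```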

consensusc-fb81ddda0d
Compute-optimal conclusions are sensitive to training regime (overtraining or not, duration assumptions) and fitting details; when PPL is the "budget decision-maker," training-duration assumptions and fit uncertainty must be reported alongside it, otherwise the same data can yield mutually conflicting optimal ratios. [Porian2024ResolvingDiscrepancies][Besiroglu2024ChinchillaReplication]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Compute-optimal conclusions are sensitive to training regime (overtraining vs not, duration assumptions) and fitting details; if PPL drives budgeting, training-duration assumptions and fit uncertainty must be reported, otherwise the same evidence can yield conflicting optima.

consensusc-c89ac11084
There is reproducible evidence that near the same pretraining loss, downstream scores can differ materially; the strong form of "same PPL means same capability" is therefore false, and PPL behaves more like a necessary condition than a sufficient one. [HongLiu2022SameLossBetterDownstream]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    There is reproducible evidence that downstream scores can differ materially at near-identical pretraining loss; therefore the strong claim "same PPL implies same capability" is false, and PPL behaves more like a necessary condition than a sufficient one.

consensusc-3789d1fd7f
Acceptance of compression/sparsification cannot be replaced by "PPL barely moved": at high sparsity, PPL can stay nearly flat while task scores fall, and behavioral-distribution drift (e.g., JS divergence) tracks the risk surface more closely. [KhanalCapone2024CompressionTasks]
Source papers · 1 [KhanalCapone2024CompressionTasks]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Compression/sparsification acceptance cannot be replaced by "PPL barely changed": at high sparsity PPL can stay nearly flat while task scores drop, and behavioral distribution drift (e.g., JS divergence) tracks risk more closely. [KhanalCapone2024CompressionTasks]

consensusc-a9dd4eb9b8
Defining the scaling-law target directly as per-task scores can fit and extrapolate stably under settings like overtraining; the more robust engineering decision variable is "task curve + extrapolation error," not PPL as a task proxy. [Isik2024DownstreamScalingLaws][Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders]
Source papers · 1 [Bhagia2024TaskScalingLadders]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Defining scaling targets directly on per-task scores can yield stable fits and extrapolations under regimes like overtraining; the more robust engineering decision variable is "task curve + extrapolation error," not PPL as a task proxy. [Isik2024DownstreamScalingLaws][Gadre2024OvertrainingDownstream][Bhagia2024TaskScalingLadders]

contestedc-e5e5def1d7
Counterexamples of "same loss, different downstream" show that "PPL-led selection" can fail under recipe tweaks, optimization-path changes, or post-training intervention; in addition, compute-optimal fits are sensitive to training-duration assumptions and fitting details, and missing uncertainty reporting amplifies misuse risk. [HongLiu2022SameLossBetterDownstream][Porian2024ResolvingDiscrepancies][Besiroglu2024ChinchillaReplication]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[0].counter · gpt-5.2

    [counter to Camp A: PPL remains the most reliable primary variable (at l…] Counterexamples where downstream differs at the same loss show that "PPL-driven selection" can fail under recipe tweaks, optimization-path changes, or post-training; compute-optimal fits are also sensitive to duration assumptions, and missing uncertainty reporting amplifies misuse.

contestedc-f92d94c6e6
Per-task modeling is expensive and panels drift; task metrics often have thresholds and discrete jumps, so extrapolation can be unstable, making "curve-driven" workflows hard to scale in engineering practice. [Schaeffer2025WhyElusive]
Source papers · 1 [Schaeffer2025WhyElusive]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[1].counter · gpt-5.2

    [counter to Camp B: PPL is stage-1 only; stage-2 must use per-task scali…] Per-task modeling is expensive and panels drift; many task metrics have thresholds and discrete jumps, making extrapolation unstable and "curve-driven" workflows hard to scale.

contestedc-6f6405d7ee
Multi-panel suites easily become "metric sprawl" without a unified budgeting and extrapolation framework; the training stage still needs a cheap continuous signal for its control loop, and there PPL remains irreplaceable. [Kaplan2020ScalingLaws]
Source papers · 1 [Kaplan2020ScalingLaws]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[2].counter · gpt-5.2

    [counter to Camp C: stop searching for a scalar; define quality via stan…] Multi-panels can devolve into metric sprawl without a unified budgeting/extrapolation framework; training still needs a cheap continuous signal for control loops, where PPL remains irreplaceable [Kaplan2020ScalingLaws].

contestedc-7d340f6e74
The more complex the objective, the harder it becomes to build stable scaling laws and reproducible budget planning; training still needs a monitor aligned with the optimized objective, and alignment objectives lack a "general-purpose gauge" as mature as PPL. [Porian2024ResolvingDiscrepancies]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[3].counter · gpt-5.2

    [counter to Camp D: the issue is ontological—next-token loss is not the …] More complex objectives make stable scaling laws and reproducible budgeting harder; training still needs a monitor aligned with the optimized objective, and alignment objectives lack a gauge as mature as PPL [Porian2024ResolvingDiscrepancies].

consensusc-b678745d43
In packing scenarios, naive concatenation with cross-doc visibility systematically underestimates training loss and rewrites the objective from "within-document conditional likelihood" to "cross-document conditional likelihood"; a per-doc causal mask is the objective that stays consistent with per-document/per-sample evaluation [Krell2021Packing]. (See the mask sketch below.)
Source papers · 1 [Krell2021Packing]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.headline_claims · gpt-5.2

    Under packing, naive concatenation with cross-doc visibility systematically underestimates training loss and changes the objective from within-document conditional likelihood to cross-document likelihood; per-doc causal masking keeps the objective aligned with per-document evaluation.
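    The objective difference is just the mask; a sketch building the per-doc block-causal mask from packed document ids:

    ```python
    # Sketch: per-document causal mask for a packed sequence vs. naive
    # full-causal concat. doc_ids marks which packed document each token
    # belongs to; tokens may attend only within their own document.
    import torch

    def per_doc_causal_mask(doc_ids: torch.Tensor) -> torch.Tensor:
        T = doc_ids.numel()
        causal = torch.tril(torch.ones(T, T, dtype=torch.bool))
        same_doc = doc_ids[:, None] == doc_ids[None, :]
        return causal & same_doc      # True = allowed to attend

    doc_ids = torch.tensor([0, 0, 0, 1, 1, 2, 2, 2])  # 3 docs in one seq
    print(per_doc_causal_mask(doc_ids).int())
    # Naive concat would use just the causal triangle, letting doc 1
    # condition on doc 0 -- the "cross-document likelihood" objective
    # the claim warns about.
    ```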

consensusc-345aab604a
Reducing truncation and preserving document integrity yields measurable LM gains, so the steadier treatment for over-length docs is split-then-pack, not truncate-and-drop or concat-then-chunk [Ding2024FewerTruncations].
Source papers · 2 [Ding2024FewerTruncations] · arXiv 2404.10830 (arxiv.org)
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · arXiv 2404.10830 · report.headline_claims · gpt-5.2

    Fewer truncations and better document integrity yield measurable LM gains, so split-then-pack is a safer default for over-length docs than truncate-and-drop or concat-then-chunk [Ding2024FewerTruncations].

consensusc-d8c2be3211
Long-window capability behaves more like a late-stage purchase: public recipes commonly spend 85–95% of compute at 8K and the final 5–15% on a 128K mid-train; this requires the data pipeline to be stable and controllable in long-sample construction and sampling [Llama3Blog2024][Fu2024DataEngineering128K].
Source papers · 2 [Llama3Blog2024][Fu2024DataEngineering128K]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.headline_claims · gpt-5.2

    Long-context behaves like a late-stage purchase: public recipes often allocate 85–95% compute at 8K and 5–15% to a 128K mid-train; this requires stable, controllable long-sample construction and sampling in the data pipeline [Llama3Blog2024][Fu2024DataEngineering128K].

consensusc-13774067fb
RoPE extrapolation is not a single-point trick: adjusting the RoPE base (e.g., 10K→1M) and positional interpolation are different levers, the former changing the frequency distribution and the latter remapping positions; they also differ in how much they depend on mid-training [Qwen2_5Report2024][Chen2023PositionInterpolation][RoFormer2021].
Source papers · 1 [RoFormer2021]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.headline_claims · gpt-5.2

    RoPE extrapolation is not a single trick: changing RoPE base (e.g., 10K→1M) and positional interpolation are different levers—one changes the frequency distribution, the other remaps positions—and they differ in how much they rely on mid-training.

consensusc-db8045f982
FIM as a default for code models has a mature recipe (StarCoder's 50% rate), but promoting FIM to "the default objective for all models" changes training from pure causal to a mixed objective; absent an NL-only rate-sweep control, the steadier default is FIM on for code and off for NL [StarCoder2023][UL22022]. (See the transform sketch below.)
Source papers · 2 [StarCoder2023][UL22022]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.headline_claims · gpt-5.2

    FIM has a mature default recipe for code models (StarCoder 50% rate), but making it universal changes training from pure causal to mixed objectives; without NL-only rate-sweep ablations, a safer default is on for code and off for NL [StarCoder2023][UL22022].
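    The transform itself is small; a sketch of PSM-style FIM at a 50% rate (the sentinel strings follow the common convention and are not guaranteed to exist in any particular tokenizer):

    ```python
    # Sketch: PSM-style fill-in-the-middle transform, applied at rate 0.5
    # as popularized for code models (StarCoder uses a 0.5 FIM rate).
    import random

    PRE, MID, SUF = "<fim_prefix>", "<fim_middle>", "<fim_suffix>"

    def maybe_fim(doc: str, rate: float = 0.5, rng=random) -> str:
        if rng.random() >= rate:
            return doc                    # plain causal example
        a, b = sorted(rng.sample(range(len(doc)), 2))
        prefix, middle, suffix = doc[:a], doc[a:b], doc[b:]
        # PSM order: the model sees prefix+suffix, then fills the middle.
        return f"{PRE}{prefix}{SUF}{suffix}{MID}{middle}"

    random.seed(0)
    print(maybe_fim("def add(a, b):\n    return a + b\n"))
    ```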

contestedc-61eeb3088c
Existing public evidence leans more toward curriculum-style acceleration and budgeted long windows; and without near-zero-padding packing at the systems level, always-mixed lengths scramble the compute accounting and make equal-compute controls hard to guarantee [FlashAttention2Varlen][GrowLength2023].
Source papers · 1 [GrowLength2023]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.positions[1].counter · gpt-5.2

    [counter to Camp B: always-mixed lengths (anti-curriculum)] Public evidence leans toward curriculum-style acceleration and budgeted long-context; and without near-zero-padding packing, always-mixed distorts compute accounting and makes equal-compute controls hard to guarantee [FlashAttention2Varlen][GrowLength2023].

contestedc-bacea4e814
It systematically underestimates loss and rewrites the objective, so training and evaluation metrics no longer share semantics; it also turns cross-doc dependence from a controlled construction into implicit leakage, making ratio and quality control hard [Krell2021Packing][Ding2024FewerTruncations].
Source papers · 3 [Krell2021Packing][Ding2024FewerTruncations] · arXiv 2107.02027 (arxiv.org)
2 observations · Packing Masking Length
Evidence (2)
  • topic_report packing-masking-length · report.positions[2].counter · gpt-5.2

    [counter to Camp C: cross-doc visibility by default (beyond-boundaries c…] It systematically underestimates loss and rewrites the objective, breaking semantic alignment between training and evaluation metrics; it also turns cross-doc dependence from controlled construction into implicit leakage.

  • topic_report packing-masking-length · arXiv 2107.02027 · report.positions[2].counter · ep-20260214160829-csjmc

    [counter to Camp C: cross-doc visibility by default (beyond-boundaries c…] Krell et al. [Krell2021Packing] prove that cross-doc visibility systematically underestimates training loss and rewrites the pretraining objective, leading to inconsistency between training and evaluation metrics.

contestedc-4ac1f3d7ca
Making FIM the default changes tokenizer conventions, data transforms, and evaluation framing; it has a mature recipe for code, but pure NL lacks no-regression evidence from rate sweeps, so pure-generation/perplexity metrics can be sacrificed without anyone noticing [StarCoder2023].
Source papers · 1 [StarCoder2023]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.positions[3].counter · gpt-5.2

    [counter to Camp D: FIM/denoising-style objectives as default (infilling…] Making FIM the default changes tokenizer conventions, data transforms, and evaluation framing; it has a mature recipe for code, but NL lacks no-regression rate sweeps, risking silent losses on pure generation/perplexity [StarCoder2023].

consensusc-aa2af33c51
On the "exact copy/recall" paradigm, pure SSMs are more prone to interference accumulation and blurring: Jelassi et al. [Jelassi2024RepeatAfterMe] give systematic evidence that Transformers outperform generalized SSMs on copying, and Mamba's ICL also trails same-size Transformers across several comparisons [Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn].
Source papers · 3 [Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL] · arXiv 2402.01032 (arxiv.org)
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · arXiv 2402.01032 · report.headline_claims · gpt-5.2

    On exact copying/recall paradigms, pure SSMs more readily accumulate interference and blur: Jelassi et al. [Jelassi2024RepeatAfterMe] provide systematic evidence that Transformers outperform generalized SSMs on copying; Mamba also lags size-matched Transformers on ICL [Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn].

consensusc-4944a3b515
The key to hybrids is not "make attention cheaper" but "use less attention": Jamba keeps downstream quality with a 1:7 attention:Mamba ratio plus MoE while fitting 256K inference onto a single 80GB GPU [Lieber2024Jamba]; Zamba goes further, reusing one shared attention block at multiple depths to preserve a global-interaction channel [Glorioso2024Zamba].
Source papers · 3 [Lieber2024Jamba][Glorioso2024Zamba] · arXiv 2403.19887 (arxiv.org)
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · arXiv 2403.19887 · report.headline_claims · gpt-5.2

    The core of Hybrids is not "make attention cheaper" but "use less attention": Jamba keeps downstream quality with a 1:7 (attention:Mamba)+MoE recipe while fitting 256K inference on a single 80GB GPU [Lieber2024Jamba]; Zamba goes further by reusing a single shared attention block across layers [Glorioso2024Zamba].

consensusc-d1dba83544
Once context reaches the 128K scale, end-to-end training throughput is dominated by kernel form: Samba reports ~4× training speedup at 128K context [Ren2024Samba]; SSD dualizes SSMs with structured-mask attention, explaining why matmul-friendly implementations can land in the 2–8× training-speed range [Dao2024TransformersAreSSMs].
Source papers · 3 [Ren2024Samba][Dao2024TransformersAreSSMs] · arXiv 2406.07522 (arxiv.org)
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · arXiv 2406.07522 · report.headline_claims · gpt-5.2

    At ~128K context, end-to-end training throughput is dominated by kernel form: Samba reports ~4× faster training at 128K [Ren2024Samba]; SSD duality explains why matmul-friendly implementations can yield 2–8× training speed ranges by relating SSMs to structured masked attention [Dao2024TransformersAreSSMs].

consensusc-ca20ec2ea5
"Usable long context" cannot be proxied by perplexity: Lost in the Middle shows systematic degradation in the use of mid-context information [Liu2023LostInTheMiddle], so Hybrid/SSM gains must be compared on standardized long-context benchmarks like L-Eval rather than training loss alone [An2023LEval].
Source papers · 3 [Liu2023LostInTheMiddle][An2023LEval] · arXiv 2307.03172 (arxiv.org)
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · arXiv 2307.03172 · report.headline_claims · gpt-5.2

    "Usable long context" cannot be substituted by perplexity: Lost in the Middle shows systematic degradation in using mid-context information [Liu2023LostInTheMiddle], so Hybrid/SSM gains must be compared on standardized long-context suites like L-Eval [An2023LEval].

consensusc-7cd40ce3fd
RWKV/linear recurrence is strongest for constant-memory inference and streaming, but the boundary of its "addressable retrieval" ability still lacks matched-scale comparisons against hybrids; existing theory suggests constant-memory recurrence needs stronger state structure or an external interaction channel to express certain discrete behaviors [Merrill2020RNNHierarchy][Peng2023RWKV].
Source papers · 2 [Merrill2020RNNHierarchy][Peng2023RWKV]
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · report.headline_claims · gpt-5.2

    RWKV/linear recurrence is strongest for constant-memory inference and streaming, but matched-scale comparisons against Hybrids on addressable retrieval remain thin; theory suggests constant-memory recurrence may need stronger state structure or external interaction channels [Merrill2020RNNHierarchy][Peng2023RWKV].

contestedc-4f6eafd477
Exact recall/copying and ICL look more like structural gaps: on copying tasks, Transformers more readily learn near-discrete retrieval [Jelassi2024RepeatAfterMe], and Mamba trails in ICL comparisons [Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn]. Without an explicit addressable interaction channel, kernel optimization alone gives little reason to expect these gaps to vanish.
Source papers · 3 [Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL] · arXiv 2402.01032 (arxiv.org)
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · arXiv 2402.01032 · report.positions[0].counter · gpt-5.2

    [counter to Camp A: Pure SSM will be the endgame for long context] Exact recall/copying and ICL look like structural gaps: Transformers more readily learn near-discrete retrieval on copying tasks [Jelassi2024RepeatAfterMe], and Mamba lags on ICL [Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn]; kernel optimization alone does not explain these gaps away.

contestedc-ef2c4e3757
Hybrids introduce implementation complexity: two kernel stacks, two numerical-stability regimes, and (with MoE) more complex parallelism and load balancing [Lieber2024Jamba]. At context ≤8K, many of the gains never materialize, while maintenance cost still rises.
Source papers · 1 [Lieber2024Jamba]
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Hybrids are the production default (tunable 1:3 to 1…] Hybrids add implementation complexity: two kernel stacks, two numerical-stability regimes, and (with MoE) more complex parallelism and load balancing [Lieber2024Jamba]; at ≤8K context much of the payoff never materializes.

contestedc-6150aefdef
Two classes of problems remain unsolved by “only changing the attention form”: (i) length extrapolation is highly sensitive to positional encoding and positional descriptions [Kazemnejad2023PELengthGeneralization][Ruoss2023RandomizedPE]; (ii) long-context usage efficiency, where Lost in the Middle shows degraded use of mid-context information [Liu2023LostInTheMiddle]. Under 128K+ throughput constraints, Hybrids more easily drive down end-to-end cost by swapping most layers for linear modules [Ren2024Samba].
Source papers · 3: [Ruoss2023RandomizedPE][Liu2023LostInTheMiddle][Ren2024Samba]
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · report.positions[2].counter · gpt-5.2

    [counter to Camp C: No architecture change—Transformer + long-context ex] Two issues remain unresolved by “only changing attention form”: (i) length extrapolation is highly sensitive to positional encoding and positional descriptions [Kazemnejad2023PELengthGeneralization][Ruoss2023RandomizedPE]; (ii) long-context usage efficiency, where Lost in the Middle shows mid-context degradation [Liu2023LostInTheMiddle]; under 128K+ throughput constraints, Hybrids cut end-to-end cost by swapping most layers for linear modules [Ren2024Samba].

contestedc-801b4dd35f
Matched-scale exact-recall comparisons against Hybrids are missing, so “quality will catch up” is hard to turn into testable engineering decisions; meanwhile, theory suggests constant-memory recurrence faces structural thresholds in expressing certain addressable behaviors [Merrill2020RNNHierarchy], consistent in direction with the empirical copying/ICL gaps [Jelassi2024RepeatAfterMe].
Source papers · 2: [Merrill2020RNNHierarchy][Jelassi2024RepeatAfterMe]
1 observation · SSM Hybrid Architectures
Evidence (1)
  • topic_report ssm-hybrid-architectures · report.positions[3].counter · gpt-5.2

    [counter to Camp D: RWKV/linear RNN is the correct RNN revival path] Matched-scale exact-recall comparisons against Hybrids are missing, making “quality will catch up” hard to turn into testable engineering decisions; theory also suggests structural thresholds for constant-memory recurrence on addressable behaviors [Merrill2020RNNHierarchy], consistent with the empirical copying/ICL gaps [Jelassi2024RepeatAfterMe].

consensusc-dbaefb9f1c
Under fixed HP-search-budget protocols, many “new optimizer beats AdamW” gaps shrink markedly; when schedule family and trial counts go unreported, the gap looks more like tuning capacity than algorithmic capability [Dahl2023AlgoPerf][Agarwal2020LRConfound].
Source papers · 3: [Dahl2023AlgoPerf][Agarwal2020LRConfound] · arXiv 2306.07179
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · arXiv 2306.07179 · report.headline_claims · gpt-5.2

    Under fixed HP-search budgets, many “new optimizer beats AdamW” gaps shrink materially; without reporting schedule family and trial counts, gaps often reflect tuning capacity rather than algorithmic advantage [Dahl2023AlgoPerf][Agarwal2020LRConfound].

consensusc-aad4e2f731
Muon’s deployability lies in “hybrid routing” rather than wholesale replacement: apply Newton–Schulz approximately-orthogonalized updates only to hidden 2D weights and fall back to AdamW for all other tensors, concentrating failure modes into one tensor class [Jordan2024Muon].
Source papers · 1: [Jordan2024Muon]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.headline_claims · gpt-5.2

    Muon’s deployability comes from hybrid routing, not wholesale replacement: orthogonalize updates only for hidden 2D weights via a Newton–Schulz approximation and fall back to AdamW for other tensors, concentrating failure modes into one tensor class [Jordan2024Muon].
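
A minimal sketch of this routing rule (module/parameter names are hypothetical; the Newton–Schulz loop below is the plain cubic iteration, not Muon’s tuned polynomial):

```python
import torch

def newton_schulz_orthogonalize(g: torch.Tensor, steps: int = 5) -> torch.Tensor:
    """Approximately orthogonalize a 2D update matrix.

    Plain cubic Newton-Schulz iteration; Muon uses tuned coefficients,
    so treat this as an illustrative stand-in, not the exact recipe.
    """
    x = g / (g.norm() + 1e-7)            # bring singular values into (0, 1]
    flip = x.shape[0] > x.shape[1]
    if flip:
        x = x.T
    for _ in range(steps):
        x = 1.5 * x - 0.5 * (x @ x.T) @ x
    return x.T if flip else x

def route(name: str, p: torch.Tensor) -> str:
    # Hybrid routing: only hidden 2D weights get the orthogonalized update;
    # embeddings, output head, norms, and biases fall back to AdamW.
    if p.ndim == 2 and "embed" not in name and "lm_head" not in name:
        return "muon"
    return "adamw"
```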

consensusc-038efb360a
SOAP compresses Shampoo’s “hard to tune / unstable” burden into one extra hyperparameter: run Adam’s momentum and adaptivity in Shampoo’s eigenbasis, reaching near-AdamW wall-clock with lower loss at 360M–1.3B [Vyas2024SOAP][Gupta2018Shampoo].
Source papers · 3: [Vyas2024SOAP][Gupta2018Shampoo] · arXiv 2409.11321
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · arXiv 2409.11321 · report.headline_claims · gpt-5.2

    SOAP compresses Shampoo’s “hard to tune / unstable” burden into one extra HP by running Adam’s momentum/adaptivity in Shampoo’s eigenbasis, achieving near-AdamW wall-clock and lower loss at 360M–1.3B [Vyas2024SOAP][Gupta2018Shampoo].

consensusc-7b0adf1a66
When VRAM is tight, preferring low-state methods that leave the training loop unchanged (Apollo, Adam-mini) is usually safer than introducing second-order matrix state; GaLore changes the gradient representation, raising upside and engineering complexity together [Zhu2024Apollo][Zhang2024AdamMini][Zhao2024GaLore].
Source papers · 3: [Zhu2024Apollo][Zhang2024AdamMini][Zhao2024GaLore]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.headline_claims · gpt-5.2

    When VRAM is tight, low-state methods that don’t change the training loop (Apollo, Adam-mini) are often a safer first move than adding second-order matrix state; GaLore changes gradient representation and increases both upside and engineering complexity [Zhu2024Apollo][Zhang2024AdamMini][Zhao2024GaLore].

contestedc-a45c8efb8b
If second-order or hybrid routing can win reliably on wall-clock without adding tuning degrees of freedom, the default should move; staying on AdamW may just be paying for historical baggage.
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.positions[0].counter · gpt-5.2

    [counter to Camp A: AdamW won’t be retired (highest default priority)] If second-order or hybrid routing can reliably win on wall-clock without increasing tuning degrees of freedom, the default should move; sticking to AdamW could be paying for historical baggage.

contestedc-632591d9b2
Public evidence still skews toward speedruns and small-to-mid scale, and lacks systematic ablations of why non-2D tensors break; at ≥7B/≥30B, long context, and distributed communication, the wall-clock advantage may be eaten by systems overhead [Zhang2024CBSScaling].
Source papers · 1: [Zhang2024CBSScaling]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Muon is the next default (but only as a hybrid)] Public evidence is still skewed toward speedrun/mid-small scale, and systematic ablations on why non-2D tensors break are missing. At ≥7B/≥30B, long context, and distributed communication, the wall-clock advantage may be consumed by systems overhead [Zhang2024CBSScaling].

contestedc-8b903cac0f
Public ≥7B head-to-heads (equal-budget HPO + wall-clock) are still missing; μP-style LR transfer for second-order methods is also not systematically validated, so tuning cost may cancel out the token savings [Ishikawa2023SecondOrderParam][Lingle2024muPTransfer].
Source papers · 2: [Ishikawa2023SecondOrderParam][Lingle2024muPTransfer]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.positions[2].counter · gpt-5.2

    [counter to Camp C: Shampoo/SOAP is the canonical endgame (second-order ] Public ≥7B head-to-heads (equal-budget HPO + wall-clock) are still missing; μP-style LR transfer for second-order is not systematically validated, so tuning cost may offset the token savings [Ishikawa2023SecondOrderParam][Lingle2024muPTransfer].

consensusc-6444f7d112
On copy/retrieval/dense entity binding, pure recurrence (fixed-dimension state) shows a reproducible gap; recall/copy-style metrics detect it consistently, while perplexity often does not [Zoology2023][Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn].
Source papers · 4: [Zoology2023][Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL] · arXiv 2312.04927
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · arXiv 2312.04927 · report.headline_claims · gpt-5.2

    On copy/retrieval/dense entity binding, pure recurrence with fixed-size state shows a reproducible gap; recall/copy metrics surface it reliably while perplexity often does not [Zoology2023][Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn].

consensusc-ed2053a24e
Hybrids of “a few attention layers + mostly recurrent/SSM layers” are more robust in engineering terms: attention layers handle exact addressing/copying while the rest handle routing and compression; Jamba and Griffin give reusable implementation shapes [Lieber2024Jamba][De2024Griffin].
Source papers · 2: [Lieber2024Jamba][De2024Griffin]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · report.headline_claims · gpt-5.2

    Hybrids with “a few attention layers + many recurrent/SSM layers” are more stable in practice: attention handles exact addressing/copying, the rest handles routing/compression; Jamba and Griffin provide reusable instantiations [Lieber2024Jamba][De2024Griffin].

consensusc-9c183eca3b
Linear attention and SSMs map onto each other mathematically via duality/rewriting, but engineering equivalence depends on state form (vector vs matrix), invertibility under finite dimension, and bandwidth/numerical-stability costs [Dao2024TransformersAreSSMs][Yang2023GLA][Ali2024HiddenAttentionMamba].
Source papers · 3: [Dao2024TransformersAreSSMs][Yang2023GLA][Ali2024HiddenAttentionMamba]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · report.headline_claims · gpt-5.2

    Linear attention and SSMs are mathematically mappable via duality/rewriting, but engineering equivalence depends on state form (vector vs matrix), reversibility under finite dimension, and bandwidth/numerical-stability costs [Dao2024TransformersAreSSMs][Yang2023GLA][Ali2024HiddenAttentionMamba].

consensusc-3e0c82e7c3
Subquadratic backbones need not insist on pretraining from scratch: Transformer→SSM/hybrid distillation and layer replacement can transfer interaction patterns, turning the training budget from high-risk exploration into a controlled retrofit [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Hinton2015Distilling].
Source papers · 3: [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Hinton2015Distilling]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · report.headline_claims · gpt-5.2

    Subquadratic backbones need not be pretrained from scratch: Transformer→SSM/hybrid distillation and layer replacement can transfer interaction patterns, turning training budget from high-risk exploration into controlled retrofit [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Hinton2015Distilling].

contestedc-4d4c6e2335
Counterexamples concentrate on “exact addressing/copying”: the fixed-state bottleneck is reproducible on copying and dense binding, and recall metrics detect it consistently; such gaps do not necessarily vanish with scale [Zoology2023][Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn]. Pretraining Without Attention likewise shows that purely attention-free pretraining on NLP either loses accuracy or needs extra mechanisms to compensate.
Source papers · 3: [Zoology2023][Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · report.positions[0].counter · gpt-5.2

    [counter to Camp A: Pure SSM/RWKV will eventually replace Transformers] Counterexamples concentrate on exact addressing/copy: the fixed-state bottleneck is reproducible on copying and dense binding, and recall metrics surface it reliably; these gaps do not necessarily vanish with scale [Zoology2023][Jelassi2024RepeatAfterMe][Grazzi2024IsMambaICL][Park2024CanMambaLearnHowToLearn].

contestedc-201f979d6f
In engineering, “rewritable” does not mean “interchangeable”: GLA’s matrix state moves the budget from O(d) to O(d^2)-scale read/write and storage constants, and hardware bandwidth plus numerical stability change the usable state size [Yang2023GLA]. Meanwhile, the fixed-state bottleneck on copying/retrieval is a behavioral failure, not a question of whether the expression can be rewritten; without an explicit addressing channel, an equivalent rewrite may still fail to recover passkey/needle-style exact retrieval [Jelassi2024RepeatAfterMe][Zoology2023].
Source papers · 3: [Yang2023GLA][Jelassi2024RepeatAfterMe][Zoology2023]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Linear attention and SSMs are the same class and mut] In engineering, “rewritable” is not “interchangeable”: GLA’s matrix state shifts the budget from O(d) to O(d^2)-scale read/write and storage constants, where hardware bandwidth and numerical stability change the usable state size [Yang2023GLA]; without an explicit addressing channel, equivalent rewrites may not recover passkey/needle-style exact retrieval [Jelassi2024RepeatAfterMe][Zoology2023].

contestedc-2981a14d2c
Recent evidence favors “controlled retrofit”: Bick et al. aim the distillation objective at attention structure and transfer it stage by stage [Bick2024TransformersToSSMs]; Wang et al. start directly from a trained Transformer, performing hybrid replacement and showing an acceleration path [Wang2024MambaInTheLlama]. More importantly, the distillation route lets acceptance metrics move up front to recall/retrieval curves, whereas from-scratch pretraining often exposes the exact-addressing gap only when training ends [Zoology2023][Liu2023LEval].
Source papers · 5: [Bick2024TransformersToSSMs][Wang2024MambaInTheLlama][Zoology2023][Liu2023LEval] · arXiv 2408.10189
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · arXiv 2408.10189 · report.positions[2].counter · gpt-5.2

    [counter to Camp C: Subquadratic models must be pretrained from scratch;] Recent evidence supports controlled retrofit: Bick et al. distill attention structure progressively [Bick2024TransformersToSSMs]; Wang et al. retrofit pretrained Transformers via hybrid replacement and show an acceleration path [Wang2024MambaInTheLlama]; distillation also moves acceptance metrics forward to recall/retrieval curves, while from-scratch pretraining exposes exact-addressing gaps only at the end [Zoology2023][Liu2023LEval].

contestedc-6bde3e5deb
The main weakness is the lack of a reproducible specification for a “minimal attention configuration”: current practice is recipe-driven, with no unified sweeps over attention layer count/placement or full-vs-local choices, and no passkey/needle recovery curves [Liu2023LEval][Shaham2022SCROLLS]. Moreover, the linear-attention/SSM duality may offer better hybrid placement strategies, but evidence translating theoretical boundaries into engineering rules is still missing [Dao2024TransformersAreSSMs].
Source papers · 3: [Liu2023LEval][Shaham2022SCROLLS][Dao2024TransformersAreSSMs]
1 observation · SSM Mamba Rwkv
Evidence (1)
  • topic_report ssm-mamba-rwkv · report.positions[3].counter · gpt-5.2

    [counter to Camp D: The engineering optimum is hybrid; minimize attentio] The main weakness is the lack of a reproducible specification for the minimal attention configuration: current practice is recipe-driven, with insufficient systematic sweeps over attention count/placement and full-vs-local choices, and no passkey/needle recovery curves [Liu2023LEval][Shaham2022SCROLLS]; duality may guide hybrid placement, but theory-to-rules evidence is missing [Dao2024TransformersAreSSMs].

consensusc-447f63c608
Compute-optimal tokens/param is not a fixed constant; public refits already place the optimum anywhere from roughly 5 to roughly 100, driven mainly by training steps, batch/LR schedules, data quality, and deduplication [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][DeepSeek2024LLM].
Source papers · 4: [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][DeepSeek2024LLM] · arXiv 2001.08361
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · arXiv 2001.08361 · report.headline_claims · gpt-5.4

    Compute-optimal tokens/param is not a fixed constant; public refits already place the optimum anywhere from about 5 to about 100, driven mainly by training steps, batch/LR schedules, data quality, and deduplication [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][DeepSeek2024LLM].
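
A back-of-envelope sketch of why this ratio matters so much, using the common C ≈ 6·N·D approximation (the ratio r = D/N is exactly the contested quantity; the 1e24 FLOPs budget below is illustrative):

```python
def split_budget(compute_flops: float, tokens_per_param: float):
    # Under C ~= 6*N*D with D = r*N, solve N = sqrt(C / (6r)), D = r*N.
    n = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
    return n, tokens_per_param * n

for r in (5, 20, 100):   # roughly the span public refits cover
    n, d = split_budget(1e24, r)
    print(f"r={r:>3}: N = {n / 1e9:6.1f}B params, D = {d / 1e12:5.2f}T tokens")
```

At a fixed budget, moving r from 5 to 100 shifts the plan from a much larger model on fewer tokens to a far smaller model trained much longer, which is why the refit disagreements are operationally significant.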

consensusc-6ad3ed461a
In data-constrained regimes, total tokens must be decomposed into fresh tokens × repetitions; repeated training is roughly equivalent to fresh tokens up to ≤4 epochs, after which returns decay at a fittable rate [Muennighoff2023DataConstrained].
Source papers · 2: [Muennighoff2023DataConstrained] · arXiv 2305.16264
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · arXiv 2305.16264 · report.headline_claims · gpt-5.4

    Under data constraints, total tokens must be decomposed into fresh tokens × repetition; repetition is roughly comparable to fresh tokens up to ≤4 epochs, after which gains decay at a fittable rate [Muennighoff2023DataConstrained].
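
A hedged sketch of the fitted effective-data form (R* below is Muennighoff et al.'s fitted decay constant, roughly 15.4 in the paper; treat the exact value as recipe-dependent):

```python
import math

def effective_unique_tokens(u: float, epochs: float, r_star: float = 15.4) -> float:
    # D' = U + U * R* * (1 - exp(-R / R*)), with R = epochs - 1 repetitions:
    # early repeats count almost like fresh tokens, later ones decay away.
    r = max(epochs - 1.0, 0.0)
    return u * (1.0 + r_star * (1.0 - math.exp(-r / r_star)))

for e in (1, 4, 16, 64):
    print(f"{e:>2} epochs -> {effective_unique_tokens(1.0, e):5.2f}x unique data")
```

At 4 epochs the effective multiplier is still ~3.7× (near-fresh), while at 64 epochs it saturates near 16×, matching the “roughly free up to ≤4 epochs, then diminishing” reading.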

consensusc-a0e7070f12
Loss scaling does not directly yield per-task scaling; in the over-training regime, validation loss still extrapolates smoothly, but individual benchmark scores fluctuate noticeably until loss crosses a threshold and stabilizes [Gadre2024OverTraining].
Source papers · 1: [Gadre2024OverTraining]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.headline_claims · gpt-5.4

    Loss scaling does not directly yield per-task scaling; in the over-training regime, validation loss remains smoothly extrapolatable, while individual benchmark scores fluctuate substantially until a loss threshold is crossed [Gadre2024OverTraining].

consensusc-0e32bbeaff
Per-task capability can be predicted, but task heterogeneity must be modeled explicitly; a model ladder plus two-stage regression has pushed average error down to about 1.9% across several multiple-choice tasks, clearly beating direct extrapolation from pretraining loss [Bhagia2024TaskLadders][Isik2024DownstreamScaling].
Source papers · 2: [Bhagia2024TaskLadders][Isik2024DownstreamScaling]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.headline_claims · gpt-5.4

    Per-task capability can be predicted, but only by modeling task heterogeneity explicitly; model ladders plus a two-stage regression reduce average error to about 1.9% on multiple-choice tasks, clearly outperforming direct extrapolation from pretraining loss [Bhagia2024TaskLadders][Isik2024DownstreamScaling].
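
A minimal numpy sketch of the two-stage idea, with hypothetical ladder data (stage 1: power-law fit of task loss vs compute; stage 2: logit-linear map from task loss to accuracy; this is the shape of the method, not the cited papers' exact estimator):

```python
import numpy as np

# Hypothetical ladder of small models: compute, task loss, task accuracy.
C = np.array([1e19, 3e19, 1e20, 3e20, 1e21])
task_loss = np.array([2.10, 1.90, 1.70, 1.55, 1.42])
acc = np.array([0.28, 0.33, 0.41, 0.49, 0.56])

# Stage 1: loss = a * C^b, fit linearly in log-log space.
b, log_a = np.polyfit(np.log(C), np.log(task_loss), 1)
pred_loss = np.exp(log_a) * (1e22 ** b)

# Stage 2: accuracy from loss, fit linearly on the logit scale.
w, w0 = np.polyfit(task_loss, np.log(acc / (1 - acc)), 1)
pred_acc = 1.0 / (1.0 + np.exp(-(w * pred_loss + w0)))
print(f"ladder prediction at 1e22 FLOPs: loss~{pred_loss:.2f}, acc~{pred_acc:.2f}")
```

The point of the intermediate loss variable is that the loss-to-accuracy mapping is task-specific, which is exactly the heterogeneity that direct loss extrapolation ignores.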

consensusc-62aa96e86a
Data mixture is an independent scaling axis; with model and tokens fixed, changing only filtering, deduplication, and proportions can produce downstream gaps of at least 7 points, a magnitude large enough to swamp many parameter-scale increments [Li2024DCLM][Albalak2024DataSelectionSurvey].
Source papers · 2: [Li2024DCLM][Albalak2024DataSelectionSurvey]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.headline_claims · gpt-5.4

    Data mixture is an independent scaling axis; at fixed model and token budgets, changing filtering, deduplication, and mixture alone can move downstream results by at least 7 points, large enough to dominate many parameter-scale increments [Li2024DCLM][Albalak2024DataSelectionSurvey].

contestedc-94d1239840
Hoffmann et al. [Hoffmann2022Chinchilla] rebut [Kaplan2020ScalingLaws] because the Kaplan setup folded undertraining into the inferred optimum; DeepSeek-AI et al. [DeepSeek2024LLM] further show the optimal ratio slides with schedule and data recipe, so it cannot be treated as a portable constant.
Source papers · 4: [Hoffmann2022Chinchilla][Kaplan2020ScalingLaws][DeepSeek2024LLM] · arXiv 2203.15556
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · arXiv 2203.15556 · report.positions[0].counter · gpt-5.4

    [counter to Camp A: Kaplan-style — portable exponents, fixed-compute sho] Hoffmann et al. [Hoffmann2022Chinchilla] rebut [Kaplan2020ScalingLaws] by showing that the Kaplan setup folded undertraining into the inferred optimum; DeepSeek-AI et al. [DeepSeek2024LLM] further show the optimum slides with schedule and data recipe rather than being a portable constant.

contestedc-829ac44898
Muennighoff et al. [Muennighoff2023DataConstrained] and Li et al. [Li2024DCLM] rebut the direct portability of “20” because token effectiveness depends on freshness, repetition, and mixture; DeepSeek-AI et al. [DeepSeek2024LLM] likewise show that under public refits the optimum is not locked at 20.
Source papers · 4: [Muennighoff2023DataConstrained][Li2024DCLM][DeepSeek2024LLM] · arXiv 2305.16264
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · arXiv 2305.16264 · report.positions[1].counter · gpt-5.4

    [counter to Camp B: Chinchilla-style — balance N and D under fixed compu] Muennighoff et al. [Muennighoff2023DataConstrained] and Li et al. [Li2024DCLM] rebut the direct portability of “20” because token effectiveness depends on freshness, repetition, and mixture; DeepSeek-AI et al. [DeepSeek2024LLM] also show public refits do not lock the optimum at 20.

contestedc-b6e8916238
The opposing side says data experiments are noisy and hard to standardize, so doing model scaling first is cleaner. But Li et al. [Li2024DCLM] weaken exactly that “data is too messy to compare” excuse through a controlled design that fixes model and tokens.
Source papers · 1: [Li2024DCLM]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.positions[2].counter · gpt-5.4

    [counter to Camp C: Data-mixture pragmatists — data is the first axis, g] The counterargument is that data experiments are noisy and hard to standardize, so model scaling is the cleaner first step. But Li et al. [Li2024DCLM] weaken exactly that excuse via a controlled design that fixes model and token budgets.

contestedc-dc42bfc0b9
The opposing side cites capability-jump cases like PaLM [Chowdhery2022PaLM] and the GPT-4 technical report [OpenAI2023GPT4TR], arguing that loss cannot explain everything. The steadier reading is not to deny the jumps but to decompose them into three parts: loss thresholds, task structure, and evaluation design.
Source papers · 3: [Chowdhery2022PaLM][OpenAI2023GPT4TR] · arXiv 2204.02311
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · arXiv 2204.02311 · report.positions[3].counter · gpt-5.4

    [counter to Camp D: Against emergence-as-magic — many “emergent” effects] The opposing side points to capability jumps in cases like PaLM [Chowdhery2022PaLM] and the GPT-4 technical report [OpenAI2023GPT4TR], arguing that loss cannot explain all phenomena; the steadier account decomposes jumps into loss thresholds, task structure, and evaluation design.

consensusc-489e1f0f1f
Any HumanEval/MBPP result without EvalPlus’s augmented tests systematically overestimates correctness; HumanEval is therefore better used as a regression-test item than as the headline metric for claiming “engineering capability gains”. [EvalPlus2023][Chen2021Codex][Austin2021MBPP]
Source papers · 4: [EvalPlus2023][Chen2021Codex][Austin2021MBPP] · arXiv 2305.01210
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · arXiv 2305.01210 · report.headline_claims · gpt-5.2

    Any HumanEval/MBPP result without EvalPlus augmented tests systematically overestimates correctness; HumanEval is better treated as a regression check than the primary claim for “engineering capability gains” [EvalPlus2023][Chen2021Codex][Austin2021MBPP].

contestedc-8b9650909e
Two counterexamples are more direct. First, weak tests conflate “passes the samples” with “correct”; EvalPlus shows false positives are not a corner case [EvalPlus2023]. Second, real SWE needs repo-level context and an execution-feedback loop; SWE-bench’s patch+tests and RepoBench’s cross-file dependencies both fall outside what function-level problems cover [SWEbench2023][RepoBench2023].
Source papers · 4: [EvalPlus2023][SWEbench2023][RepoBench2023] · arXiv 2305.01210
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · arXiv 2305.01210 · report.positions[0].counter · gpt-5.2

    [counter to Camp A: HumanEval/MBPP is sufficient to represent coding abi] Two counterexamples are more direct. First, weak tests turn “passes samples” into “correct”; EvalPlus shows false positives are not a corner case [EvalPlus2023]. Second, real SWE needs repo-level context and execution-feedback loops; SWE-bench’s patch+tests and RepoBench’s cross-file dependencies fall outside function-level problems [SWEbench2023][RepoBench2023].

contestedc-4fe875f752
Two points distort the “only ground truth” reading. First, Verified remains sensitive to the harness and search budget, and agent-loop knobs change the effective search space [SWEbenchVerified2024][CodeT2022][SelfRepair2023]. Second, Verified’s ecosystem skews Python; Multi-SWE-Bench shows cross-language extrapolation needs separate validation, otherwise high scores may just reflect ecosystem match [MultiSWEBench2025].
Source papers · 4: [SWEbenchVerified2024][CodeT2022][SelfRepair2023][MultiSWEBench2025]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.positions[1].counter · gpt-5.2

    [counter to Camp B: SWE-bench Verified is the only trustworthy ground tr] Two issues distort the “only truth” reading. First, Verified remains sensitive to harness and search budget; agent-loop knobs change the effective search space [SWEbenchVerified2024][CodeT2022][SelfRepair2023]; second, the ecosystem skews Python, and Multi-SWE-Bench shows cross-language extrapolation needs separate validation [MultiSWEBench2025].

contestedc-c7fee35c3f
Existing evidence favors “task constraints are needed” over “likelihood alone”: CRUXEval explicitly constrains execution semantics, and RepoBench/RepoCoder constrain repo-level dependencies; both point at structural blind spots that likelihood cannot cover [CRUXEval2024][RepoBench2023][RepoCoder2023]. Anchoring on patch-PPL therefore invites misreading “better language modeling” as “can run, can edit, can work across files”.
Source papers · 3: [CRUXEval2024][RepoBench2023][RepoCoder2023]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.positions[2].counter · gpt-5.2

    [counter to Camp C: patch-PPL/code BPB in pretraining is the best predic] Existing evidence favors “task constraints are needed” rather than “likelihood alone”: CRUXEval explicitly constrains execution semantics, and RepoBench/RepoCoder constrain repo-level dependencies; both expose structural blind spots likelihood cannot cover [CRUXEval2024][RepoBench2023][RepoCoder2023].

contestedc-67b8b5d6dc
If task-level ground truth is abandoned entirely, UX metrics degenerate into a contest of “searches better / spends more”: CodeT and self-repair-style methods show that search budget and selectors can change outcomes substantially [CodeT2022][SelfRepair2023]. Without an anchor like Verified, it is hard to separate “capability gains” from “budget gains”.
Source papers · 2: [CodeT2022][SelfRepair2023]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.positions[3].counter · gpt-5.2

    [counter to Camp D: deployment UX metrics reflect SWE-agent value better] If task-level ground truth is abandoned, UX metrics can devolve into a contest of “searches more / spends more”: CodeT and self-repair show search budget and selectors substantially change outcomes [CodeT2022][SelfRepair2023]; without a Verified-style anchor, capability gains and budget gains are hard to separate.

consensusc-224eea2d80
At fixed parameter scale, pushing the continued-pretrain code ratio from a general mix (~25–30%) to ≥70% is more likely to change the failure mode on SWE-bench-like tasks: from “wrong file located / wrong abstraction level edited” to “last-mile detail errors”, which scaffolding or verifiers can fix more easily [DeepSeekCoder2024][DeepSeekCoderV22024][Agentless2024].
Source papers · 3: [DeepSeekCoder2024][DeepSeekCoderV22024][Agentless2024]
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report swe-agent-pretraining · report.headline_claims · gpt-5.2

    At fixed parameter scale, raising the continued-pretrain code ratio from a general mix (~25–30%) to ≥70% is more likely to change failure modes on SWE-bench-like tasks: from wrong localization/wrong abstraction level to last-mile detail errors, which scaffolding or verifiers correct more easily [DeepSeekCoder2024][DeepSeekCoderV22024][Agentless2024].

contestedc-becfa1cb84
Agentless [Agentless2024] shows many “agent gains” can be reproduced by simpler pipelines, hinting that the bottleneck is often base-model priors rather than scaffold complexity; meanwhile, if repo-level dependencies and patch insertion appear only sparsely during training, even elaborate inference looks more like searching than understanding [StarCoder2TheStackV22024][FIM2022].
Source papers · 4: [Agentless2024][StarCoder2TheStackV22024][FIM2022] · arXiv 2407.01489
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report swe-agent-pretraining · arXiv 2407.01489 · report.positions[0].counter · gpt-5.2

    [counter to Camp A: scaffolding and test-time compute are everything] Agentless [Agentless2024] shows many “agent gains” can be reproduced by simpler pipelines, suggesting the bottleneck is often base-model priors rather than scaffold complexity; if repo-level dependencies and patch insertion are sparse in training, complex inference looks more like search than understanding [StarCoder2TheStackV22024][FIM2022].

contestedc-ac60627fd2
Execution feedback does deliver gains [RLEF2024], but correcting “structural errors” depends on longer exploration chains and denser environment interaction, with costs amplified by repo scale [SWEGym2024]; if pretraining never saw the token distributions of commits/PRs and failure logs, RL must first patch up language modeling before improving the policy, at worse sample efficiency [LingmaSWEGPT2024][LLama3Herd2024].
Source papers · 5: [RLEF2024][SWEGym2024][LingmaSWEGPT2024][LLama3Herd2024] · arXiv 2410.02089
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report swe-agent-pretraining · arXiv 2410.02089 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: RL and verifiers are the true drivers] Execution feedback does yield gains [RLEF2024], but correcting structural failures requires longer exploration chains and denser environment interaction, with costs amplified by repo scale [SWEGym2024]; if pretraining never covered commit/PR and failure-log token distributions, RL must first fix language modeling before policy improvement, at worse sample efficiency [LingmaSWEGPT2024][LLama3Herd2024].

contestedc-8174666386
Textbooks Are All You Need [Textbooks2023] is a counterexample showing quality and structure change sample efficiency; more importantly, SWE-bench’s target distribution includes issues, diffs, and tests [SWEbench2023], token shapes that are rare in pure code dumps, so a model that “can write code” may still not “edit code the way engineers do”.
Source papers · 3: [Textbooks2023][SWEbench2023] · arXiv 2306.11644
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report swe-agent-pretraining · arXiv 2306.11644 · report.positions[2].counter · gpt-5.2

    [counter to Camp C: just mix more code (code is all you need)] Textbooks Are All You Need [Textbooks2023] is a counterexample showing quality/structure changes sample efficiency; more importantly, SWE-bench’s target distribution includes issues, diffs, and tests [SWEbench2023], which are rare in pure code dumps, so a model may write code yet not edit it the way engineers do.

contestedc-393ea3dc3c
The weakness of public evidence is the lack of strict ablations: many technical reports change model size, token scale, filtering strategy, and post-training recipe simultaneously, making “data shape” and “post-training” hard to separate causally [Qwen25Coder2024][DeepSeekCoderV22024].
Source papers · 2: [Qwen25Coder2024][DeepSeekCoderV22024]
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_report swe-agent-pretraining · report.positions[3].counter · gpt-5.2

    [counter to Camp D: data shape first (repo/patch/process/execution first] The weakness in public evidence is the lack of strict ablations: many technical reports change model size, token scale, filtering, and post-training recipes together, so data shape and post-training cannot be causally separated [Qwen25Coder2024][DeepSeekCoderV22024].

consensusc-00e946b0e8
When the mid-train budget is squeezed below 10% of total compute, the common outcome is “looks like a gain but unstable”: gains get confounded with backbone undertraining, and the pull toward the target distribution is too weak; 10–30% more readily produces a reproducible distribution shift [Chinchilla2022][Phi3Report][Llama3Herd].
Source papers · 2: [Chinchilla2022] · arXiv 2203.15556
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · arXiv 2203.15556 · report.headline_claims · gpt-5.2

    When mid-train budget is <10% of total compute, gains often look unstable: they are confounded with backbone undertraining and the distribution pull is weak; 10–30% more reliably produces a reproducible shift [Chinchilla2022][Phi3Report][Llama3Herd].

contestedc-9e8060b454
This route’s weak point is scale extrapolation: success on small models or narrow tasks does not equal a sustainable recipe for large general models. Without strong verifiers, erroneous samples and a single teacher’s writing style more easily become systematic bias as scale grows [MAD2023][GoMAD2023][Cosmopedia2024].
Source papers · 3: [MAD2023][GoMAD2023][Cosmopedia2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · report.positions[0].counter · gpt-5.2

    [counter to Camp A: synthetic-first can be a primary route (especially u] The weak point is scale extrapolation: success on small models or narrow tasks does not imply a sustainable recipe for large general models. Without strong verifiers, erroneous samples and single-teacher style harden into systematic bias at scale [MAD2023][GoMAD2023][Cosmopedia2024].

contestedc-9f0afd9092
The main challenge is budgets and controls: public materials rarely give the mid-train fraction or matched-budget filtering baselines, so it is easy to misattribute “mid-train works” when the gain actually comes from compensating an undertrained backbone or from stronger filtering [Chinchilla2022][DataCompLM2024].
Source papers · 2: [Chinchilla2022][DataCompLM2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · report.positions[1].counter · gpt-5.2

    [counter to Camp B: web-heavy backbone + (real/synthetic) mid-train is t] The main challenge is budgeting and controls: public materials rarely provide explicit mid-train fractions and matched-budget filtering baselines, making it easy to misattribute gains that actually come from compensating an undertrained backbone or from stronger filtering [Chinchilla2022][DataCompLM2024].

contestedc-8d1bf98659
Filtering alone struggles with “target-distribution scarcity”: long context needs long texts and long-sequence training, and math/reasoning needs high-density verifiable signal; these look more like explicit distribution-shift problems for mid-train than things filtering naturally covers [LongContextScaling2023][DeepSeekMath2024][Llemma2023].
Source papers · 3: [LongContextScaling2023][DeepSeekMath2024][Llemma2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · report.positions[2].counter · gpt-5.2

    [counter to Camp C: avoid synthetic as much as possible; stronger filter] Filtering alone struggles with target-distribution scarcity: long-context needs long texts and long-sequence training; math/reasoning needs high-density verifiable signal; these are explicit mid-train distribution shifts, not something filtering naturally covers [LongContextScaling2023][DeepSeekMath2024][Llemma2023].

contestedc-cfa2498eb0
Public evidence is closer to the opposite: recursive replacement drops tail modes first and narrows generation by generation [MAD2023][GoMAD2023]; and the condition for “no collapse” is accumulating real data, not unbounded synthetic substitution [CollapseInevitable2024].
Source papers · 3: [MAD2023][GoMAD2023][CollapseInevitable2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_report synthetic-data-midtrain · report.positions[3].counter · gpt-5.2

    [counter to Camp D: synthetic scales almost without bound; collapse is m] Public evidence is closer to the opposite: recursive replacement loses tail modes first and narrows generation by generation [MAD2023][GoMAD2023]; and the condition for avoiding collapse is accumulating real data, not unbounded synthetic replacement [CollapseInevitable2024].

consensusc-e67c12a4a4
Drawing conclusions across tokenizers with per-token PPL is systematically misleading: the tokenizer changes the denominator (number of token steps); comparable metrics should return to BPB or character-string likelihood [Vieira2024Characters][Gastaldi2024Foundations].
Source papers · 3: [Vieira2024Characters][Gastaldi2024Foundations] · arXiv 2412.03719
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_report tokenizer-scaling · arXiv 2412.03719 · report.headline_claims · gpt-5.2

    Per-token PPL is systematically misleading across tokenizers because the denominator (token steps) changes; comparable evaluation should use BPB or character-string likelihood [Vieira2024Characters][Gastaldi2024Foundations].

  • topic_report tokenizer-scaling · arXiv 2412.03719 · report.headline_claims · gpt-5.2

    Per-token PPL is not comparable across tokenizers because the denominator and reachable token-string set change; primary comparable metrics should be BPB/character-string likelihood [Vieira2024Characters][Gastaldi2024FoundationsTokenization].
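
A minimal sketch of the comparable vs non-comparable metrics (sum_nll is the total negative log-likelihood in nats over a text; function names are ours):

```python
import math

def bits_per_byte(sum_nll_nats: float, text: str) -> float:
    # Tokenizer-neutral: normalize total NLL by UTF-8 byte count, in bits.
    return sum_nll_nats / (len(text.encode("utf-8")) * math.log(2))

def per_token_ppl(sum_nll_nats: float, n_tokens: int) -> float:
    # NOT comparable across tokenizers: the denominator is tokenizer-defined,
    # so a bigger vocab can lower PPL just by shortening the sequence.
    return math.exp(sum_nll_nats / n_tokens)
```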

consensusc-7c10c72c2c
The main shippable benefit of a 128K vocab is not “understanding language better” but sequence shortening via bytes-per-token/fertility: an industrial training report gives 0.02–0.04 nats lower loss and explicitly ties it to shorter non-English and code sequences, which reduces KV-cache pressure and raises throughput [Dubey2024Llama3].
Source papers · 1: [Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    The main shippable benefit of a 128K vocab is sequence shortening via bytes-per-token/fertility, not a vague “better language understanding”: an industrial report shows 0.02–0.04 nats lower loss and explicitly ties it to shorter non-English and code sequences, reducing KV-cache pressure and raising throughput [Dubey2024Llama3].

consensusc-72e72f8b41
Bigger vocab is not monotonically better: merging multi-digit numbers or date fragments into tokens creates 10–20 pp gaps on 3–5 digit, carry-sensitive arithmetic and temporal reasoning; single-digit numeral tokenization is more robust [Singh2024TokenizationCounts][Bhatia2025DateFragments].
Source papers · 2: [Singh2024TokenizationCounts][Bhatia2025DateFragments]
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    Bigger vocab is not monotonic: merging multi-digit numbers or date fragments into tokens creates 10–20 pp gaps on 3–5 digit, carry-sensitive arithmetic and temporal reasoning; single-digit numeral tokenization is more robust [Singh2024TokenizationCounts][Bhatia2025DateFragments].

  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    ‘Bigger vocab is always better’ has stable counterexamples on digits/dates: merging multi-digit numerals or date fragments creates 10–20 pp gaps on 3–5 digit carry-sensitive arithmetic and temporal reasoning, with no clear automatic convergence at larger scale [Singh2024TokenizationCounts][Bhatia2025DateFragments].
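
A tiny vocabulary audit for exactly this failure mode (assumes you can enumerate the tokenizer's token strings; the regex and sample vocab are ours):

```python
import re

def multi_digit_merges(vocab_strings):
    # Flag tokens that merge 2+ digits (optionally with a leading space),
    # the pattern behind carry-sensitive arithmetic and date-fragment gaps.
    return [t for t in vocab_strings if re.fullmatch(r"\s?\d{2,}", t)]

print(multi_digit_merges(["1", " 7", "23", "2023", "19", "ab3"]))
# -> ['23', '2023', '19']
```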

consensusc-bc92badbb2
Non-unique encoding is an underpriced stability risk: multiple token trajectories for the same surface string make reasoning and RL treat equivalent trajectories as different sequences, injecting inconsistency directly at the objective level; larger vocabs tend to enlarge this ambiguity space [LiuEllis2026SayAnything].
Source papers · 1: [LiuEllis2026SayAnything]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    Non-unique encodings are an underpriced stability risk: multiple token trajectories for the same surface string make reasoning and RL treat equivalent trajectories as different sequences, injecting inconsistency into the objective; larger vocabularies tend to enlarge this ambiguity space [LiuEllis2026SayAnything].
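
A round-trip check sketch for this risk, assuming an HF-style tokenizer exposing encode/decode:

```python
def is_non_canonical(tokenizer, token_ids):
    # If decode -> encode does not reproduce the ids, the sampled trajectory
    # is a non-canonical encoding of its own surface string - exactly the
    # case where RL scores "different" sequences for equivalent text.
    return tokenizer.encode(tokenizer.decode(token_ids)) != list(token_ids)
```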

contestedc-ddc84a46b0
Non-monotonicity comes from reproducible failure modes: merging multi-digit numbers and date fragments creates 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments]; non-unique encoding makes reasoning/RL treat equivalent trajectories as different sequences, amplifying training noise [LiuEllis2026SayAnything]; and larger vocabularies also breed large numbers of under-trained tail tokens that need an extra debt-repair workflow [LandBartolo2024Magikarp].
Source papers · 4: [Singh2024TokenizationCounts][Bhatia2025DateFragments][LiuEllis2026SayAnything] · arXiv 2402.14903
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · arXiv 2402.14903 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: bigger vocab is always better; default to 256K+] Non-monotonicity is driven by reproducible failure modes: merging multi-digit numbers and date fragments yields 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments]; non-unique encoding amplifies training noise [LiuEllis2026SayAnything]; larger vocabularies breed under-trained tail tokens that need repair workflows [LandBartolo2024Magikarp].

contestedc-55ffd4f7cf
Existing evidence is stronger on “evaluation and formalization” than on a “shippable replacement”: character-string likelihood fixes comparability, but does not automatically provide the ledger for byte/char models on throughput, KV cache, and training stability [Vieira2024Characters]. Meanwhile, industrial reports show subword + 128K already captures sequence-compression gains on multilingual and code data [Dubey2024Llama3], so byte/char schemes must explicitly answer how sequence-length inflation affects inference cost.
Source papers · 2: [Vieira2024Characters][Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · report.positions[2].counter · gpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Current evidence is stronger on evaluation/formalization than on a shippable replacement: character-string likelihood fixes comparability but does not automatically provide the throughput/KV-cache/training-stability ledger for byte/char models [Vieira2024Characters]; subword + 128K already captures sequence-compression gains [Dubey2024Llama3], so byte/char schemes must account for sequence-length inflation.

contestedc-418d9061f7
Tail tokens and non-unique encoding do hurt stability [LandBartolo2024Magikarp][LiuEllis2026SayAnything], but “shrink the vocab” is not the only fix: ambiguity can be reduced by forbidding digit/date merges and constraining encoding uniqueness [Singh2024TokenizationCounts][Bhatia2025DateFragments]; meanwhile, industrial data shows 128K markedly shortens multilingual and code sequences, making it easier to recoup on the systems ledger [Dubey2024Llama3].
Source papers · 5: [LandBartolo2024Magikarp][LiuEllis2026SayAnything][Singh2024TokenizationCounts][Bhatia2025DateFragments] · arXiv 2405.05417
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · arXiv 2405.05417 · report.positions[3].counter · gpt-5.2

    [counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Tail tokens and non-unique encodings do harm stability [LandBartolo2024Magikarp][LiuEllis2026SayAnything], but shrinking vocab is not the only lever: forbidding digit/date merges and constraining encoding uniqueness reduce ambiguity [Singh2024TokenizationCounts][Bhatia2025DateFragments], while a 128K vocab shortens multilingual and code sequences enough to pay back on the systems ledger [Dubey2024Llama3].

consensusc-94c0418328
For decoder-only models ≤70B under common serving setups, swapping MHA for GQA is usually the “low-risk default”: KV-cache drops roughly linearly with KV heads, while quality degradation only clearly accelerates near MQA (KV heads = 1); h_kv ≈ h/8 is therefore often the first ratio worth sweeping [Ainslie2023GQA][Shazeer2019MQA].
Source papers · 2: [Ainslie2023GQA][Shazeer2019MQA]
2 observations · Transformer Arch Improvements
Evidence (2)
  • topic_report transformer-arch-improvements · report.headline_claims · gpt-5.2

    For decoder-only models ≤70B under typical serving constraints, replacing MHA with GQA is a low-risk default: KV-cache drops roughly linearly with the number of KV heads, while quality degradation tends to accelerate mainly near the MQA extreme (KV heads = 1), so h_kv ≈ h/8 is often the first ratio worth sweeping [Ainslie2023GQA][Shazeer2019MQA].

  • topic_report transformer-arch-improvements · report.headline_claims · gpt-5.2

    In decoder-only inference, KV-cache scales roughly linearly with KV heads: reducing KV heads from h to ~h/8 (a typical GQA sweep point) cuts attention cache to ~1/8, while quality degradation tends to accelerate mainly near KV heads=1 (MQA).
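
The linear dependence on KV heads in one formula (the 70B-ish shapes below are illustrative assumptions, not taken from any cited paper):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq, batch, dtype_bytes=2):
    # 2x for K and V; linear in kv_heads, which is the whole GQA lever.
    return 2 * layers * kv_heads * head_dim * seq * batch * dtype_bytes

gib = 2 ** 30
mha = kv_cache_bytes(80, 64, 128, 32_768, 1) / gib   # MHA: h_kv = h = 64
gqa = kv_cache_bytes(80,  8, 128, 32_768, 1) / gib   # GQA: h_kv = h / 8
print(f"MHA {mha:.1f} GiB vs GQA {gqa:.1f} GiB per 32K sequence")
# -> MHA 80.0 GiB vs GQA 10.0 GiB
```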

consensusc-74edf5d71e
Loss spikes are not random noise: Wortsman et al. localize the root cause to attention-logit variance and output-norm growth, making QK-Norm a targeted suppressor; by contrast, sandwich norm currently looks more like a correlated ingredient in the Gemma 3 recipe, lacking independent cross-team ablations to price it [Wortsman2023Instabilities][Gemma3Report].
Source papers · 1: [Wortsman2023Instabilities]
2 observations · Transformer Arch Improvements
Evidence (2)
  • topic_report transformer-arch-improvements · report.headline_claims · gpt-5.2

    Loss spikes are not random noise: Wortsman et al. attribute them to attention-logit variance and output-norm growth, making QK-Norm a targeted mitigation; in contrast, sandwich norm is currently more of a correlated ingredient in the Gemma 3 recipe, lacking independent cross-team ablations [Wortsman2023Instabilities][Gemma3Report].

  • topic_report transformer-arch-improvements · report.headline_claims · gpt-5.2

    Loss spikes are not random noise: Wortsman et al. tie them to attention-logit variance and output-norm growth, making QK-Norm a targeted suppressor; in contrast, sandwich norm still lacks broad independent ablations for pricing [Wortsman2023Instabilities][Gemma3Report].
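
A minimal QK-Norm sketch (L2-style normalization of q and k per head before the dot product; real recipes typically use RMSNorm with a learned scale, which is omitted here for brevity):

```python
import torch
import torch.nn.functional as F

def qk_norm_attention(q, k, v, eps: float = 1e-6):
    # Normalizing q and k bounds each logit by the scale factor, which is
    # the mechanism-level fix for attention-logit variance growth.
    # Shapes assumed: (batch, heads, seq, head_dim).
    q = q / (q.norm(dim=-1, keepdim=True) + eps)
    k = k / (k.norm(dim=-1, keepdim=True) + eps)
    logits = (q @ k.transpose(-2, -1)) * (q.shape[-1] ** 0.5)
    return F.softmax(logits, dim=-1) @ v
```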

consensusc-9db0fc90e0
With a stable base in hand, depth-up-scaling / block expansion is often closer to “pay as you go” than from-scratch training: SOLAR supports 7B→10.7B depth growth with 200B tokens of continued pretraining [Kim2023Solar], and LLaMA Pro trains only the inserted blocks to reduce disturbance to existing capabilities [LLaMAPro2024].
Source papers · 3: [Kim2023Solar][LLaMAPro2024] · arXiv 2312.15166
2 observations · Transformer Arch Improvements
Evidence (2)
  • topic_report transformer-arch-improvements · arXiv 2312.15166 · report.headline_claims · gpt-5.2

    With a stable base, depth-up-scaling / block expansion often behaves closer to pay-as-you-go than from-scratch: SOLAR supports 7B→10.7B depth growth with 200B tokens of continued pretraining [Kim2023Solar], and LLaMA Pro reduces disruption by training only the inserted blocks [LLaMAPro2024].

  • topic_report transformer-arch-improvements · report.headline_claims · gpt-5.2

    With a stable base and target size not exceeding ~30B, depth-up-scaling / block expansion often behaves like “pay-as-you-grow”: SOLAR uses 200B tokens to support 7B→10.7B, and LLaMA Pro trains only inserted blocks to reduce disruption to base capabilities [Kim2023Solar][LLaMAPro2024].

contestedc-89d515780d
This logic fits “training-compute-dominated” metrics but distorts “serving-cost-dominated” scenarios: under long context and high-concurrency inference, KV-cache and bandwidth are hard ceilings, and GQA/SWA/local-global/MLA directly change per-token cost [Ainslie2023GQA][Mistral2023][Gemma3Report][DeepSeekV2].
Source papers · 2: [Ainslie2023GQA][Mistral2023]
2 observations · Transformer Arch Improvements
Evidence (2)
  • topic_report transformer-arch-improvements · report.positions[0].counter · gpt-5.2

    [counter to Camp A: architecture is mostly done; keep scaling from scrat] This logic fits regimes where training compute dominates, but it distorts serving-dominated regimes: under long context and high-concurrency inference, KV-cache and bandwidth are hard ceilings, and GQA/SWA/local-global/MLA directly change per-token cost [Ainslie2023GQA][Mistral2023][Gemma3Report][DeepSeekV2].

  • topic_report transformer-arch-improvements · report.positions[0].counter · gpt-5.2

    [counter to Camp A: architecture details are mostly constants; keep clea] This reading fits training-compute-dominated objectives, but can be misleading when serving cost dominates: under long context and high-concurrency inference, KV-cache and bandwidth become hard ceilings and attention variants directly change per-token cost [Ainslie2023GQA][Mistral2023][Gemma3Report][DeepSeekV2].

contestedc-d791c121d2
The gap in public evidence is budget-matched, data-matched, evaluation-matched head-to-heads: at LLM scale under real serving constraints, many claimed gains can be covered by “componentized sparsity / KV compression inside the Transformer”, e.g., GQA plus local/global interleaving already drives down the main bill items [Ainslie2023GQA][Gemma3Report]. Migration cost, moreover, is not just training: kernels, KV-cache management, tooling, and ecosystem compatibility all count.
Source papers · 1: [Ainslie2023GQA]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_report transformer-arch-improvements · report.positions[1].counter · gpt-5.2

    [counter to Camp B: the next backbone should move to SSM / RetNet / Mamb] The public weak spot is budget-matched, data-matched, evaluation-matched head-to-head comparisons at LLM scale under real serving constraints. Many claimed gains can be covered by in-Transformer sparsity/KV compression, e.g., GQA plus local/global interleaving [Ainslie2023GQA][Gemma3Report]; migration cost also spans kernels, KV-cache management, tooling, and ecosystem compatibility.

contestedc-2f6f7e9381
Public evidence skews toward positive cases and lacks systematic negatives: which data distributions make growth introduce unrecoverable bias, whether the compute advantage persists at larger scale, and whether capabilities like long-range consistency or tool calling hit structural ceilings are all hard to price from current reports [Wang2023Grow][Yao2023MaskedGrowth].
Source papers · 2: [Wang2023Grow][Yao2023MaskedGrowth]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_report transformer-arch-improvements · report.positions[2].counter · gpt-5.2

    [counter to Camp C: a second scaling path should be default—grow first, ] Public evidence is skewed toward positive cases, with limited systematic negatives: which data distributions make growth introduce irrecoverable bias, whether compute advantages persist at larger scale, and whether long-range consistency or tool use hit structural ceilings remain unpriced [Wang2023Grow][Yao2023MaskedGrowth].

contestedc-cd5f762a46
Wortsman et al. [Wortsman2023Instabilities] localize the root cause of loss spikes to attention-logit variance and output-norm growth, making QK-Norm a targeted intervention rather than “one more trick”. Gemma 3 puts QK-Norm into a reproducible recipe [Gemma3Report], showing at minimum that its engineering risk is manageable in public dense training. Evidence for sandwich norm is weaker, but that argues for independent ablations, not for filing it as dispensable [Gemma3Report].
Source papers · 2: [Wortsman2023Instabilities] · arXiv 2309.14322
2 observations · Transformer Arch Improvements
Evidence (2)
  • topic_report transformer-arch-improvements · arXiv 2309.14322 · report.positions[3].counter · gpt-5.2

    [counter to Camp D: QK-Norm / sandwich norm are optional details] Wortsman et al. [Wortsman2023Instabilities] localize loss spikes to attention-logit variance and output-norm growth, making QK-Norm a targeted intervention rather than “one more trick” [Gemma3Report].

  • topic_report transformer-arch-improvements · arXiv 2309.14322 · report.positions[3].counter · gpt-5.2

    [counter to Camp D: stability is mostly LR/optimizer/data; QK-Norm/sandw] Wortsman et al. [Wortsman2023Instabilities] provide an observable mechanism: attention-logit variance and output-norm growth trigger loss spikes and can be reproduced; weaker sandwich-norm evidence argues for independent ablations rather than dismissal [Gemma3Report].

consensusc-966d57b217
On compositional long-context tasks, the effective context of nominal-128K models often collapses to ~32K; RULER’s variable tracking / multi-hop / aggregation separates this collapse from NIAH-style surface retrieval hits. [Hsieh2024RULER]
Source papers · 2: [Hsieh2024RULER] · arXiv 2404.06654
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2404.06654 · report.headline_claims · gpt-5.2

    On compositional long-context tasks, nominal-128K models often collapse to ~32K effective context; RULER’s variable tracking / multi-hop / aggregation separates this collapse from NIAH-style surface retrieval hits.[Hsieh2024RULER]

consensusc-6dc2c48f41
RoPE extrapolation alone (PI/YaRN-style) usually gets a model running stably at 32K, but provides no direct supervision for whether far tokens are actually used; controlled ablations show long-task recovery is clearly limited unless the long-document ratio and distribution preservation change too. [Chen2023PI][Peng2023YaRN][Fu2024DataEngineering][Xiong2023EffectiveLongCtx]
Source papers · 4: [Chen2023PI][Peng2023YaRN][Fu2024DataEngineering][Xiong2023EffectiveLongCtx]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.headline_claims · gpt-5.2

    RoPE extrapolation (PI/YaRN-style) usually makes 32K stable, but provides weak supervision for actually using far tokens; controlled ablations show long-task recovery is limited without changing long-doc ratio and distribution preservation.
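
Position Interpolation in a few lines (a sketch: scale positions back into the trained range before computing RoPE angles; the function and names are ours, and YaRN refines this per frequency band):

```python
import torch

def rope_angles(positions, head_dim, train_len, target_len, base=10000.0):
    # PI-style interpolation: shrink positions by train_len/target_len so a
    # 32K input reuses the angle range the model saw at 8K, instead of
    # extrapolating past it. Angles feed the usual sin/cos rotation.
    scale = min(1.0, train_len / target_len)
    inv_freq = base ** (-torch.arange(0, head_dim, 2).float() / head_dim)
    return torch.outer(positions.float() * scale, inv_freq)
```

Note how this changes only the position-to-angle map, which is exactly why it supplies no supervision about whether the model actually reads far tokens.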

consensusc-00d33b1330
The 20+ pp “lost in the middle” drop is not pure PE distortion: it looks more like the attention budget being diluted by noise under long inputs, and it is amplified by “same task, add irrelevant tokens” controls. [Liu2023LostInMiddle][Levy2024SameTaskMoreTokens]
Source papers · 3: [Liu2023LostInMiddle][Levy2024SameTaskMoreTokens] · arXiv 2307.03172
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2307.03172 · report.headline_claims · gpt-5.2

    The 20+ pp “lost in the middle” drop is not just PE distortion: it resembles attention budget dilution under long noisy inputs, and is amplified by “same task, more irrelevant tokens” controls. [Liu2023LostInMiddle][Levy2024SameTaskMoreTokens]

consensusc-bde6ebaaa0
Related-document packing raises the density of repetition/alignment events, triggering induction-head-style “copy circuits” more often; this provides a testable mechanism chain for ICL and cross-span reference. [Shi2023InContextPretraining][Chan2022DataDist][Olsson2022InductionHeads]
Source papers · 4: [Shi2023InContextPretraining][Chan2022DataDist][Olsson2022InductionHeads] · arXiv 2309.16039
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · arXiv 2309.16039 · report.headline_claims · gpt-5.2

    Related-doc packing increases repetition/alignment event density, triggering induction-head-like copying circuits more often; this yields a testable mechanism chain for ICL and cross-span reference. [Shi2023InContextPretraining][Chan2022DataDist][Olsson2022InductionHeads]

contestedc-bc0a812df5
Refining c-6a2e99f979 / c-435bd5ac5f: PE solves “can run long”, but on compositional tasks that is not equivalent to “can use long”. RULER shows many nominal-128K models stall at ~32K on compositional tasks [Hsieh2024RULER]. Fu et al.’s controlled ablations show PE-only changes struggle to recover long tasks; the data recipe is the main variable [Fu2024DataEngineering]. Moreover, “same task, add tokens” degrades reasoning [Levy2024SameTaskMoreTokens], indicating the damage from length is not merely positional.
Source papers · 3: [Hsieh2024RULER][Fu2024DataEngineering][Levy2024SameTaskMoreTokens]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.positions[0].counter · gpt-5.2

    [counter to Camp A: PE/extrapolation is enough; long context is mainly a] Refining c-6a2e99f979 / c-435bd5ac5f: PE makes it runnable, but does not imply compositional usability. RULER shows many nominal-128K models plateau at ~32K on compositional tasks [Hsieh2024RULER]; PE-only ablations recover little without data-recipe changes [Fu2024DataEngineering], and added tokens alone degrade reasoning [Levy2024SameTaskMoreTokens].

contestedc-cf55898f2a
Rebutting the strong version of “the data recipe explains everything”: with the same data pool, sequence construction changes the prior over cross-span relatedness; related-document packing makes cross-document reference a routine input, and the gains in [Shi2023InContextPretraining] align with distribution/circuit accounts of ICL [Chan2022DataDist][Olsson2022InductionHeads]. Also, evaluating only retrieval-style metrics overestimates how far data changes extend effective context; RULER/NoCha checks are needed. [Hsieh2024RULER][Karpinska2024NoCha]
Source papers · 4: [Shi2023InContextPretraining][Chan2022DataDist][Olsson2022InductionHeads][Hsieh2024RULER]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.positions[1].counter · gpt-5.2

    [counter to Camp B: data recipe is the main variable; long-doc ratio and] Countering the strong version “data explains everything”: even with the same data pool, sequence construction changes the prior of cross-span relatedness; related-doc packing makes cross-document reference routine, aligning with distribution/circuit accounts of ICL [Shi2023InContextPretraining][Chan2022DataDist][Olsson2022InductionHeads]; retrieval-only metrics overestimate effective-context gains, so RULER/NoCha checks are needed [Hsieh2024RULER].

contestedc-75a3bb7074
Refining the “zero cost” framing in c-2d04dd042e: packing usually does not add tokens, but it introduces retrieval/clustering pipelines, separator-token design, and trickier deduplication and leakage risks; meanwhile, long ICL still fails on several benchmarks [Li2024LongICLStruggle], showing that “more related sequences” is not automatically “stronger long-range learning”. Packing gains therefore need to be demonstrated with compositional tasks and noise controls, not NIAH alone. [Li2024LongICLStruggle][Levy2024SameTaskMoreTokens]
Source papers · 1: [Li2024LongICLStruggle]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.positions[2].counter · gpt-5.2

    [counter to Camp C: packing/concatenation is underused; sequence constru] Refining the “zero-cost” framing in c-2d04dd042e: packing may not increase token count, but it adds retrieval/clustering pipelines, separator design, and harder dedup/leakage control; long ICL still fails on several benchmarks, so gains must be shown with compositional tasks and noise controls rather than NIAH alone [Li2024LongICLStruggle][Levy2024SameTaskMoreTokens].

contestedc-e83e53710d
Rebutting the strong version of “switching architectures yields stronger effective context”: existing controls more often prove efficiency/maximum length, not across-the-board wins on utilization benchmarks like RULER/NoCha [Hsieh2024RULER][Karpinska2024NoCha]. Zoology’s unified pretraining controls show attention-free models still trail attention on recall [Arora2023Zoology], meaning a redesigned read/write mechanism must solve training-event distribution and circuit realizability together, not just complexity.
Source papers · 3: [Hsieh2024RULER][Karpinska2024NoCha][Arora2023Zoology]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_report context-scaling-pretrain · report.positions[3].counter · gpt-5.2

    [counter to Camp D: switch architectures to bypass Transformer long-rang] Countering the strong version “architecture switch yields better effective context”: existing evidence more often proves efficiency/max length, not consistent wins on utilization suites like RULER/NoCha [Hsieh2024RULER][Karpinska2024NoCha]; Zoology’s unified controls show attention-free models trail on recall [Arora2023Zoology].

consensusc-674cedeb2d
Expanding vocab from 32K to 128K co-occurs with 0.02–0.04 nats lower training loss, with gains mainly from improved bytes-per-token/fertility on non-English and code; this kind of sequence shortening linearly reduces KV-cache footprint and raises throughput [Dubey2024Llama3].
Source papers · 1: [Dubey2024Llama3]
3 observations · Tokenizer Scaling
Evidence (3)
  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    Moving from 32K to 128K vocab is reported alongside 0.02–0.04 nats lower training loss, with gains largely explained by improved bytes-per-token/fertility for non-English and code; such sequence shortening linearly reduces KV-cache footprint and improves throughput [Dubey2024Llama3].

  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    Public industrial evidence for moving from ~32K to 128K vocab can be stated as 0.02–0.04 nats lower training loss [Dubey2024Llama3]; gains are primarily from improved bytes-per-token/fertility on non-English and code, not from “better English understanding”.

  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    128K vocab gains can be expressed in loss and systems terms: ~0.02–0.04 nats lower training loss [Dubey2024Llama3], largely from better bytes-per-token/fertility for non-English and code, which linearly reduces KV-cache footprint and improves throughput.

consensusc-ccd977a5a4
Larger vocabularies amplify tail debt: several models carry 3k–10k+ under-trained tokens that can be located by scanning and repaired with short continued pretraining; tokenizers therefore need a post-training governance workflow rather than one-shot finalization [LandBartolo2024Magikarp].
Source papers · 2: [LandBartolo2024Magikarp] · arXiv 2405.05417
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_report tokenizer-scaling · report.headline_claims · gpt-5.2

    Larger vocabularies amplify tail debt: multiple LMs exhibit 3k–10k+ under-trained tokens that can be detected via scanning and repaired with short continued pretraining; tokenizers therefore require post-training governance rather than one-shot finalization [LandBartolo2024Magikarp].

  • topic_report tokenizer-scaling · arXiv 2405.05417 · report.headline_claims · gpt-5.2

    3k–10k+ under-trained tail tokens are manageable training debt: they can be detected and repaired with short continued pretraining [LandBartolo2024Magikarp], so tokenizers need post-training workflows rather than one-shot finalization.
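
A sketch of the scanning step (assumption: under-trained tokens sit near their initialization, so their embedding norms are outliers relative to the vocabulary; the real Magikarp-style pipeline verifies candidates behaviorally afterward):

```python
import torch

def undertrained_token_candidates(embedding_matrix: torch.Tensor,
                                  z_thresh: float = -2.5):
    # Flag tokens whose embedding-norm z-score sits far below the
    # vocabulary mean - a cheap first-pass detector, not a verdict.
    norms = embedding_matrix.norm(dim=1)
    z = (norms - norms.mean()) / norms.std()
    return torch.nonzero(z < z_thresh).flatten().tolist()
```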

contestedc-13479d1fcd
Current public evidence favors “structure before size”: local merges of digits/dates create 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments], and larger vocabs amplify the under-training debt of tail tokens [LandBartolo2024Magikarp]. Moreover, compression rate and performance do not map one-to-one [Goldman2024UnpackingTokenization], so “chasing compression alone” leaves these risks unpriced.
Source papers · 4: [Singh2024TokenizationCounts][Bhatia2025DateFragments][LandBartolo2024Magikarp] · arXiv 2402.14903
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · arXiv 2402.14903 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: bigger vocab is near-monotonic; default to 256K+] Public evidence supports “structure before size”: local merges for digits/dates create 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments], and larger vocabularies amplify tail-token under-training debt [LandBartolo2024Magikarp]; compression rate and performance are not one-to-one [Goldman2024UnpackingTokenization].

contestedc-c7d6e8196b
Tokenizer-free’s main cost is longer sequences and higher systems sensitivity; in long-context and online settings, tokenization/runtime and KV cache are both hard constraints, requiring compute-normalized comparisons on the same ledger [Dubey2024Llama3][Kadamba2026GPUTOK]. Meanwhile, cross-tokenizer transfer and distillation are already lowering lock-in costs [Minixhofer2024ZeroShotTokenizerTransfer].
Source papers · 2: [Dubey2024Llama3][Kadamba2026GPUTOK]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · report.positions[2].counter · gpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] The main cost of tokenizer-free is longer sequences and higher systems sensitivity; in long-context and online settings, both tokenization/runtime and KV cache are hard constraints requiring compute-normalized comparison on one ledger [Dubey2024Llama3][Kadamba2026GPUTOK]; cross-tokenizer transfer and distillation are already lowering lock-in costs [Minixhofer2024ZeroShotTokenizerTransfer].

contestedc-7c4bef78b5
Public evidence still lacks controlled quantification of “prune the tail → more stable RL”; what is more reliable today is turning tail tokens and non-unique encoding into scannable, constrainable regression items [LandBartolo2024Magikarp][LiuEllis2026SayAnythingButThis], while explicitly booking the KV-cache and throughput costs of longer sequences on the systems ledger [Dubey2024Llama3]. Pruning can be done, but its benefit must land on measurable RL-stability metrics and online incident rates, not intuition.
Source papers · 3: [LandBartolo2024Magikarp][LiuEllis2026SayAnythingButThis][Dubey2024Llama3]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · report.positions[3].counter · gpt-5.2

    [counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Public evidence still lacks controlled quantification of “pruning → more stable RL”. What is more reliable today is to make tail tokens and non-unique encoding scannable, constrainable regression items [LandBartolo2024Magikarp][LiuEllis2026SayAnythingButThis], with the KV-cache and throughput costs of longer sequences booked explicitly [Dubey2024Llama3]; pruning must prove out on measurable RL-stability metrics and incident rates.

consensusc-1aa93e7a4d
Stretching TP across IB is usually a worse trade than adding more PP: TP’s per-layer collective frequency scales linearly with layer count [Shoeybi2019Megatron], and MegaScale pins TP=8 inside the NVLink domain while reproducing 55.2% MFU for dense 175B at >10K GPUs [Jiang2024MegaScale].
Source papers · 3: [Shoeybi2019Megatron][Jiang2024MegaScale] · arXiv 1909.08053
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 1909.08053 · report.headline_claims · gpt-5.2

    Placing TP across IB is often worse than adding PP: TP’s per-layer collectives scale linearly with depth [Shoeybi2019Megatron], and MegaScale pins TP=8 within NVLink domains while reproducing 55.2% MFU for dense 175B at >10K GPUs [Jiang2024MegaScale].

consensusc-1b3549a2ac
Attention time share is the more reliable trigger for SP/CP: SP cuts recompute overhead from ~36% to ~2% without raising the communication order [Korthikanti2022SP], so SP usually comes first once attention >30%; once attention >50% or L≥32K, CP of the Ring/Ulysses/USP kind needs to enter the mesh [Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP].
Source papers · 5: [Korthikanti2022SP][Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP] · arXiv 2205.05198
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 2205.05198 · report.headline_claims · gpt-5.2

    Attention time share is a more reliable trigger for SP/CP: SP cuts recompute overhead from ~36% to ~2% without increasing communication order [Korthikanti2022SP], so enable SP when attention>30%; when attention>50% or L≥32K, CP (Ring/Ulysses/USP) needs to enter the mesh [Liu2023RingAttn][Jacobs2023Ulysses][Fang2024USP].
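
The report's thresholds written out as an explicit rule of thumb (the numbers encode this claim, not a universal law; function name is ours):

```python
def long_context_plan(attn_time_share: float, seq_len: int) -> list:
    plan = ["DP/TP/PP baseline"]
    if attn_time_share > 0.30:           # recompute overhead worth removing
        plan.append("enable SP")
    if attn_time_share > 0.50 or seq_len >= 32_768:
        plan.append("add CP (Ring/Ulysses/USP) to the mesh")
    return plan

print(long_context_plan(0.42, 16_384))    # -> baseline + SP
print(long_context_plan(0.55, 131_072))   # -> baseline + SP + CP
```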

consensusc-745821a100
The main gap for auto-parallel and FSDP-only is not “can it run” but the missing public head-to-heads at 100B+, >1K GPUs with comparable MFU/topology: Alpa/GSPMD give a compiler path [Zheng2022Alpa][Xu2021GSPMD], the FSDP line gives a low-intrusion sharding path [Zhao2023PyTorchFSDP][Wang2026veScaleFSDP], but both lack matched-scale reproduction baselines against hand-tuned 4D [Jiang2024MegaScale].
Source papers · 5: [Zheng2022Alpa][Xu2021GSPMD][Zhao2023PyTorchFSDP][Wang2026veScaleFSDP] · arXiv 2201.12023
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 2201.12023 · report.headline_claims · gpt-5.2

    The main gap for auto-parallel and FSDP-only is not feasibility but missing public head-to-head at 100B+, >1K GPUs with comparable MFU/topology: Alpa/GSPMD provide a compiler path [Zheng2022Alpa][Xu2021GSPMD], FSDP provides a low-intrusion sharding path [Zhao2023PyTorchFSDP][Wang2026veScaleFSDP], but both lack matched-scale baselines against hand-tuned 4D [Jiang2024MegaScale].

consensusc-07276260bb
The 2026 “healthy MFU band” works as a triage threshold: dense 40–55% (MegaScale reports 55.2%) [Jiang2024MegaScale]; MoE 25–45% (public MoE systems reports commonly land here) [DeepSeek2024V3TechReport]. When clearly below the band, inspect mesh/topology mapping and schedule/kernel alignment first.
Source papers · 3: [Jiang2024MegaScale][DeepSeek2024V3TechReport] · arXiv 2402.15627
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 2402.15627 · report.headline_claims · gpt-5.2

    A 2026 “healthy MFU band” is a practical debugging threshold: dense 40–55% (MegaScale reports 55.2%) [Jiang2024MegaScale]; MoE 25–45% (public MoE system reports often fall here) [DeepSeek2024V3TechReport]. If far below, first inspect mesh/topology mapping and schedule/kernel alignment.
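
A quick MFU triage helper against those bands (uses the common ≈6·N FLOPs/token estimate for dense decoder-only training, which ignores attention FLOPs; all run numbers in the example are hypothetical, with an A100-class BF16 peak assumed):

```python
def mfu(tokens_per_sec, n_params, n_gpus, peak_flops_per_gpu):
    # Dense-training estimate: ~6*N FLOPs per token (fwd + bwd), so MFU is
    # achieved FLOPs over aggregate hardware peak. Slight underestimate
    # since attention FLOPs are ignored.
    return 6.0 * n_params * tokens_per_sec / (n_gpus * peak_flops_per_gpu)

m = mfu(tokens_per_sec=150_000, n_params=175e9,
        n_gpus=1024, peak_flops_per_gpu=312e12)
print(f"MFU = {m:.1%}  (healthy dense band: 40-55%, MoE: 25-45%)")
```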

contestedc-383e5663af
The cost is engineering complexity and manual sweeps: mesh shape, topology binding, schedule, and kernels all need explicit decisions; when model structure changes quickly (heterogeneous MoE, mixed long-context), the maintenance cost of hand-written plans climbs.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.positions[0].counter · gpt-5.2

    [counter to Camp A: hand-tuned 4D with topology-aware mapping (Megatron/] The cost is engineering complexity and manual sweeps: mesh shape, topology binding, schedules, and kernels require explicit decisions; with rapidly changing architectures (heterogeneous MoE, long-context mixes), the maintenance cost of hand-written plans rises.

contestedc-ccecefb254
Refining c-b301c4c17a / c-76982c2425 / c-c512acfaf7: public artifacts still lack matched-scale MFU comparisons against hand-tuned 4D at 100B+, >1K GPUs under the same topology constraints; moreover, cost models often under-specify mixed precision (FP8/low-precision attention) and heterogeneous MoE, leaving plan interpretability and auditability weaker than hand topology binding.
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.positions[1].counter · gpt-5.2

    [counter to Camp B: auto-parallel (Alpa / GSPMD / compiler-search)] Counter to c-b301c4c17a / c-76982c2425 / c-c512acfaf7: public artifacts still lack matched-scale MFU head-to-head against hand-tuned 4D at 100B+ and >1K GPUs under comparable topology; cost models also under-specify FP8/low-precision attention and heterogeneous MoE, weakening plan interpretability and auditability.

contestedc-546121ee24
Countering c-225cae10ca / refining c-1f3caad103: once TP/PP/CP become required dimensions, FSDP-only tends to push the bottleneck into cross-node synchronization and fragmentation/scheduling overhead; under long context, CP solves the distributed shape of KV, not parameter memory, which FSDP cannot substitute for [Liu2023RingAttn][Jacobs2023Ulysses]. Public evidence also still lacks quantified comparisons of dense-pretraining MFU ceilings and inflection points (when TP/PP/CP must come in) [Zhao2023PyTorchFSDP][Wang2026veScaleFSDP].
Source papers · 3: [Liu2023RingAttn][Jacobs2023Ulysses][Zhao2023PyTorchFSDP]
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · report.positions[2].counter · gpt-5.2

    [counter to Camp C: FSDP-only / ZeRO-only (low-intrusion first)] Counter to c-225cae10ca / refinement of c-1f3caad103: once TP/PP/CP becomes necessary, FSDP-only often pushes bottlenecks into cross-node synchronization and fragmentation/scheduling overhead; under long context, CP addresses KV distribution rather than parameter memory [Liu2023RingAttn][Jacobs2023Ulysses], and quantified MFU ceilings/inflection points for dense pretraining are still missing [Zhao2023PyTorchFSDP].

contestedc-2dd2a324f2
Countering c-ab68e1f89c / c-ac979a61a5 / c-4086500505: once attention share rises above 30%, SP is often the low-risk win even before CP [Korthikanti2022SP]; once attention >50% or L≥32K, CP solves the shape problem that “KV must be held distributed”, and kernels can only postpone that inflection point [Liu2023RingAttn][Jacobs2023Ulysses]. Treating SP/CP as “optional patches” therefore gets expensive in long-context regimes.
Source papers · 4: [Korthikanti2022SP][Liu2023RingAttn][Jacobs2023Ulysses] · arXiv 2205.05198
1 observation · 4d Parallelism Megatron
Evidence (1)
  • topic_report 4d-parallelism-megatron · arXiv 2205.05198 · report.positions[3].counter · gpt-5.2

    [counter to Camp D: classic 3D (DP+TP+PP) is enough; SP/CP are optional] Counter to c-ab68e1f89c / c-ac979a61a5 / c-4086500505: when attention share rises above >30%, SP is often a low-risk win even before CP [Korthikanti2022SP]; when attention >50% or L≥32K, CP addresses distributed KV residency, which kernels can only postpone [Liu2023RingAttn][Jacobs2023Ulysses].

consensusc-ecf74e5dfd
Numeric deviation in fused attention is not a “local error”: it can accumulate across steps through optimizer state and gradient statistics and surface as loss spikes during training; kernel numerics must therefore enter stability regressions and diagnostic metrics [Golden2024FAStability][Dao2022FlashAttention].
Source papers · 2: [Golden2024FAStability][Dao2022FlashAttention]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.headline_claims · gpt-5.2

    Numeric deviations in fused attention are not “local errors”: they can accumulate across steps via optimizer state and gradient statistics and surface as loss spikes; kernel numerics must be part of stability regression and diagnostics [Golden2024FAStability][Dao2022FlashAttention].

consensusc-67abb74f50
FP8 stability shows “late drift” in trillion-token training that short-range ablations cannot cover; writing per-block scaling, FP32 masters, and accumulation paths into reproducible recipes (MXFP8/MXFP4) collapses the problem into a testable contract [Fishman2024FP8Scale][Mishra2025MXFP8Recipes][Tseng2025MXFP4][Rouhani2023Microscaling].
Source papers · 4: [Fishman2024FP8Scale][Mishra2025MXFP8Recipes][Tseng2025MXFP4][Rouhani2023Microscaling]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.headline_claims · gpt-5.2

    FP8 stability can show “late drift” at trillion-token scale, which short-run ablations cannot cover; specifying per-block scaling, FP32 masters, and accumulation paths as reproducible recipes (MXFP8/MXFP4) turns the problem into a testable contract [Fishman2024FP8Scale][Mishra2025MXFP8Recipes][Tseng2025MXFP4][Rouhani2023Microscaling].
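
A per-block scaling sketch in PyTorch (assumes a recent PyTorch with torch.float8_e4m3fn; E4M3 max normal is ~448; the FP32 master weights and the accumulation path, which the recipes also pin down, live outside this function):

```python
import torch

E4M3_MAX = 448.0

def quantize_fp8_blockwise(x: torch.Tensor, block: int = 128):
    # One scale per block keeps a single outlier from clipping the whole
    # tensor. Assumes x.numel() is divisible by `block` for brevity.
    xb = x.float().reshape(-1, block)
    scale = xb.abs().amax(dim=1, keepdim=True).clamp_min(1e-12) / E4M3_MAX
    q = (xb / scale).to(torch.float8_e4m3fn)
    return q, scale   # dequantize: q.float() * scale, then reshape back
```

The point of making this a named, testable function is exactly the “contract” framing above: the block size, scale rule, and dtype become regression items rather than implicit kernel behavior.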

consensusc-eefda86e84
Improving long-context capability does not always require kernel rewrites: positional interpolation and RoPE scaling/continual pretraining can push context to the ~32K scale with modest systems changes; the boundary for “kernel co-design needed” should sit at whether bytes/token and the IO dataflow become the dominant bottleneck [Chen2023PI][Peng2023YaRN][Xiong2023LongContextScaling].
Source papers · 3: [Chen2023PI][Peng2023YaRN][Xiong2023LongContextScaling]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.headline_claims · gpt-5.2

    Long-context capability does not always require kernel rewrites: positional interpolation and RoPE scaling/continual pretraining can reach ~32K context with limited systems changes; the boundary for “kernel co-design is needed” should be whether bytes/token and IO dataflow become the dominant bottleneck [Chen2023PI][Peng2023YaRN][Xiong2023LongContextScaling].

consensusc-d039d4c38c
Reversing the optimization order (fusion/launch tuning first, roofline classification later) usually yields only marginal gains on memory-bound kernels; roofline-first decides faster whether to change structure, dataflow, or the numeric contract [Williams2008Roofline][Dao2023FlashAttention2].
Source papers · 2: [Williams2008Roofline][Dao2023FlashAttention2]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.headline_claims · gpt-5.2

    Reversing the optimization order (fusion/launch tuning before roofline classification) typically yields only marginal gains on memory-bound kernels; roofline-first more quickly decides whether to change structure, dataflow, or numeric contract [Williams2008Roofline][Dao2023FlashAttention2].
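
Roofline-first triage in a few lines (the peak figures in the example are assumed A100-class numbers, for illustration only):

```python
def roofline_bound(flops: float, bytes_moved: float,
                   peak_flops: float, peak_bw: float) -> str:
    # Compare arithmetic intensity (FLOPs/byte) against the machine balance
    # point: memory-bound kernels call for dataflow/structure changes,
    # compute-bound kernels for better math throughput.
    intensity = flops / bytes_moved
    ridge = peak_flops / peak_bw
    return "compute-bound" if intensity >= ridge else "memory-bound"

# e.g. a kernel doing 4 GFLOP over 100 MB on ~312 TFLOP/s, ~2 TB/s hardware:
print(roofline_bound(4e9, 1e8, 312e12, 2.0e12))  # -> memory-bound
```

Running this classification first is the cheap step that tells you whether fusion/launch tuning can even help, which is the ordering argument above.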

contestedc-ea43d4e803
Refining c-aadcc38d4b / c-d75cab8d5c: high-level compilers and generated kernels can cover part of the performance delta, but they do not automatically handle “numeric-path changes alter the training trajectory”. FlexAttention’s generation space needs to be bound to stability regressions, or it merely replays the same class of risk at a higher level [Dong2024FlexAttention][Golden2024FAStability].
Source papers · 2: [Dong2024FlexAttention][Golden2024FAStability]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.positions[0].counter · gpt-5.2

    [counter to Camp A: algorithms and kernels must be co-designed (bytes/FL] Correction to c-aadcc38d4b / c-d75cab8d5c: high-level compilers and generated kernels can cover part of the performance delta, but they do not automatically handle “numeric-path changes alter the training trajectory”; FlexAttention’s generation space must be tied to stability regressions [Dong2024FlexAttention][Golden2024FAStability].

contestedc-5eff21232c
Rebutting c-aadcc38d4b: at training time the key is not “can you generate a kernel” but “does the generated kernel preserve the numeric contract”. FlashAttention’s stability issues show that even semantically equivalent implementations leak numeric-path differences into optimizer state [Golden2024FAStability]; low-precision recipes further require implementations to expose per-block scaling and accumulation paths [Mishra2025MXFP8Recipes][Fishman2024FP8Scale].
Source papers · 2: [Golden2024FAStability][Mishra2025MXFP8Recipes]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.positions[1].counter · gpt-5.2

    [counter to Camp B: PyTorch/graph-compiler level is sufficient; handwrit] Rebuttal to c-aadcc38d4b: the train-time key is not “can you generate a kernel,” but “does the generated kernel preserve the numeric contract.” FlashAttention stability issues show numeric-path differences leak into optimizer state even under semantic equivalence [Golden2024FAStability]; low-precision recipes require exposing per-block scaling and accumulation paths [Mishra2025MXFP8Recipes][Fishman2024FP8Scale].

contestedc-5e583c0aff
Rebutting c-a986f239ff / c-bf6e936d9d: even if compute gets cheaper, training budgets get reallocated to more tokens and longer contexts; Chinchilla’s compute-optimal result shows “algorithmic allocation” is not automatically erased by hardware [Hoffmann2022Chinchilla]. More critically, low precision is not a smooth extension of BF16: FP8 drift only becomes visible at trillion-token scale, demanding that scale/accumulation be written as a contract [Fishman2024FP8Scale].
Source papers · 1: [Hoffmann2022Chinchilla]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.positions[2].counter · gpt-5.2

    [counter to Camp C: hardware will get faster; dense MHA + BF16 need not ] Rebuttal to c-a986f239ff / c-bf6e936d9d: even if compute gets cheaper, budgets get reallocated to more tokens and longer contexts; Chinchilla’s compute-optimal result shows algorithmic allocation is not erased by hardware [Hoffmann2022Chinchilla]; FP8 drift appears only at trillion-token scale, so scale/accumulation must be written as a contract [Fishman2024FP8Scale].

contestedc-b4d780aac7
Refining c-c34d4fa100 / c-7d9c971185: portability holds more easily for bandwidth-limited operators, but peak attention/GEMM performance typically depends on hardware-specific tiles, async copies, and datatype paths; such differences directly affect bytes/FLOPs and the numeric contract [Dao2023FlashAttention2][Mishra2025MXFP8Recipes]. “Pursuing portability” should therefore not mean “ignoring kernel physics”; write the contract at a higher level instead: an explicit IO ledger and numeric path, with each backend free to implement its own.
Source papers · 2: [Dao2023FlashAttention2][Mishra2025MXFP8Recipes]
1 observation · Cuda Kernel Pretrain
Evidence (1)
  • topic_report cuda-kernel-pretrain · report.positions[3].counter · gpt-5.2

    [counter to Camp D: non-NVIDIA hardware will catch up; CUDA will stop be] Correction to c-c34d4fa100 / c-7d9c971185: portability is easier for bandwidth-limited kernels, but best-in-class attention/GEMM performance often depends on hardware-specific tiles, async copies, and datatype paths, which shift bytes/FLOPs and the numeric contract [Dao2023FlashAttention2][Mishra2025MXFP8Recipes].

consensusc-62f91000f9
The main lever in bulk filtering is “which proxy family” rather than “tuning thresholds on the same proxy”: DCLM’s controlled ablations show 4–6 pp gaps (on MMLU) across filter families, while threshold sweeps are typically <1 pp. [DCLM2024]
Source papers · 2: [DCLM2024] · arXiv 2406.11794
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · arXiv 2406.11794 · report.headline_claims · gpt-5.2

    The main lever in bulk filtering is proxy family choice, not threshold tuning: DCLM controlled ablations show 4–6 pp gaps (e.g., on MMLU) across filter families, while threshold sweeps are typically <1 pp.[DCLM2024]

contestedc-f298bcf4e9
Refining c-2a5bd5b489: bulk+ladder covers a lot of surface, but “95% of decisions” does not hold for code/math/reasoning. Proxy-to-target flips are more common on the HumanEval line; the Spearman ≈ 0.41 in [DCLM2024] already signals that acceptance needs higher scale or dedicated capability lines.
Source papers · 1: [DCLM2024]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.positions[0].counter · gpt-5.2

    [counter to Camp A: classifiers + ablation ladders are sufficient (contr] Refines c-2a5bd5b489: bulk+ladder covers a large surface area, but “95% of decisions” does not hold for code/math/reasoning. Proxy-to-target flips are more common on the HumanEval line; Spearman≈0.41 in [DCLM2024] signals acceptance needs larger scale or dedicated capability lines.

contestedc-9405514750
Countering c-230595d964 / c-bb37143a19: attribution can answer “which training snippets are associated with a behavior”, but that is not equivalent to “the net effect after deletion or reweighting”. The cross-scale top-overlap <10% in [AnthropicInfluence2023] shows attribution rankings cannot be reused across generations; the non-additivity in [Simfluence2023] shows “delete top-k” breaks interaction terms. Attribution better fits candidate generation and debugging than replacing the ladder as a merge gate.
Source papers · 2: [AnthropicInfluence2023][Simfluence2023]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.positions[1].counter · gpt-5.2

    [counter to Camp B: example-level influence/attribution is the main path] Counters c-230595d964 / c-bb37143a19: attribution can answer “which training snippets are associated with a behavior,” but not the net effect after deletion/weighting; cross-scale top-overlap <10% and non-additivity mean attribution suits candidate generation and debugging, not merge gating [AnthropicInfluence2023][Simfluence2023].

contestedc-97e0ca5b63
Refines c-cdd0499162 / c-e385bd2b45: confounding is real, but “use IV directly as the gate” is the higher-risk move. Instrument assumptions are hard to audit on web data, and [CausalLL2024] shows that a failed identification is directionally wrong; the more robust use of causal methods is evaluation de-biasing and stratified reporting, where controls like [LengthControlledAlpacaEval2024] directly reduce selection bias.
Source papers · 2 [CausalLL2024][LengthControlledAlpacaEval2024]
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · report.positions[2].counter · gpt-5.2

    [counter to Camp C: full causal identification is the future (IV/mediators)] Refines c-cdd0499162 / c-e385bd2b45: confounding is real, but “use IV as the gate” is higher risk. Instrument assumptions are hard to audit on web data; [CausalLL2024] can be directionally wrong once identification fails.

contestedc-d6b484f5b1
Counters c-0cc7a00d89: scaling plans provide floors, but “no tooling, rely on intuition” turns data bias into invisible risk. [DCLM2024]’s thousands of ablations show that even experienced teams make multi-pp errors on mixture/filtering decisions; [FineWeb2024]’s public recipe shows classifier-free baselines trail by several pp (on MMLU).
Source papers · 3 [DCLM2024][FineWeb2024] · arXiv 2406.11794
1 observation · Data Value Causality
Evidence (1)
  • topic_report data-value-causality · arXiv 2406.11794 · report.positions[3].counter · gpt-5.2

    [counter to Camp D: skip measurement, rely on intuition and scale (coverage wins)] Counters c-0cc7a00d89: scaling plans provide floors, but “no tooling, rely on intuition” turns data bias into invisible risk. [DCLM2024]’s thousands of ablations show multi-pp errors even for experienced teams.

consensusc-d17e78c9b0
With exact softmax attention semantics held fixed, the first-order source of throughput differences shifted from “HBM I/O” to “parallel scheduling”: FA1 [Dao2022FA1] cuts I/O by never materializing the L×L score matrix; FA2 [Dao2023FA2] changes only the work partitioning and yields roughly 2× throughput on A100.
Source papers · 3 [Dao2022FA1][Dao2023FA2] · arXiv 2205.14135
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2205.14135 · report.headline_claims · gpt-5.2

    Under exact softmax attention semantics, the first-order source of throughput differences shifted from “HBM I/O” to “parallel scheduling”: FA1 [Dao2022FA1] reduces I/O by never writing the L×L score matrix; FA2 [Dao2023FA2] changes work partitioning alone and yields about 2× throughput on A100.

consensusc-3a00fd1df0
Decode bottlenecks are not solved by training-shaped attention kernels: Flash-Decoding [Dao2023FlashDecoding] restores parallelism by chunking along the KV dimension; FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] productizes this strategy into a configurable engine, which in engineering terms often beats writing your own kernel.
Source papers · 3 [Dao2023FlashDecoding][Ye2024FlashInfer][Ye2025FlashInferEngine]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.headline_claims · gpt-5.2

    Decode bottlenecks are not solved by training-shaped attention kernels: Flash-Decoding [Dao2023FlashDecoding] restores parallelism by chunking KV; FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] productizes these ideas into a configurable engine, often a better deal than hand-written kernels.

consensusc-aa343b488c
Variant maintenance cost is the long-run driver: FlexAttention [Dong2024Flex] lifts variant expression into score_mod/mask_mod semantics and reaches throughput close to hand-written fused kernels across many mask variants; but when mask semantics change tile reusability or memory-access patterns, specialized implementations can still be faster [Wang2024FlashMask].
Source papers · 3 [Dong2024Flex][Wang2024FlashMask] · arXiv 2412.05496
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2412.05496 · report.headline_claims · gpt-5.2

    Variant maintenance cost is the long-run driver: FlexAttention [Dong2024Flex] lifts variants into score_mod/mask_mod semantics and reaches throughput close to hand-written fused kernels on many masks; but when mask semantics change tile reuse or access patterns, specialized kernels can still win [Wang2024FlashMask].

consensusc-ccff547516
FP8 risk comes mainly from error paths rather than the format itself: the FP8 formats and scaling assumptions were fixed by Micikevicius et al. [Micikevicius2022FP8], but Fujii et al. [Fujii2024FP8vsBF16] show stability correlates strongly with rescaling, accumulation precision, and long-sequence error accumulation; a BF16 control run and an outlier-token check are required before launch.
Source papers · 3 [Micikevicius2022FP8][Fujii2024FP8vsBF16] · arXiv 2209.05433
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2209.05433 · report.headline_claims · gpt-5.2

    FP8 risk is dominated by error paths rather than the format itself: FP8 formats and scaling assumptions are set by Micikevicius et al. [Micikevicius2022FP8], while Fujii et al. [Fujii2024FP8vsBF16] shows stability strongly depends on rescaling, accumulation precision, and long-sequence error accumulation.

contestedc-b47139f6c3
Counter to c-c3d509c165: variants and serving are not “derivative problems.” FlexAttention [Dong2024Flex] and FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] shift the ecosystem’s center of gravity from “writing kernels” to “writing semantics / backend policy,” which changes team structure and long-run cost; likewise, distributed blockwise execution for long context (RingAttention [Liu2023RingAttention]) is not a simple port.
Source papers · 5 [Dong2024Flex][Ye2024FlashInfer][Ye2025FlashInferEngine][Liu2023RingAttention] · arXiv 2412.05496
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2412.05496 · report.positions[0].counter · gpt-5.2

    [counter to Camp A: FA1/FA2/FA3 already solved the core; what remains is maintenance] Counter to c-c3d509c165: variants and serving are not “derivatives.” FlexAttention [Dong2024Flex] and FlashInfer [Ye2024FlashInfer][Ye2025FlashInferEngine] shift the center of gravity from writing kernels to writing semantics and backend policy.

contestedc-23e4f6593f
Refinement of c-fdcdf25f0c / c-f00a0f8184: a lower authoring barrier does not mean peak performance has been replaced. FA3 [Shah2024FA3] folds Hopper’s TMA, warp specialization, and FP8 deep into the kernel; FlexAttention may still trail these generation-specific specializations by 5–15% (especially on FP8 paths). When training sits at critical throughput, compiler paths need explicit head-to-head evidence before they can replace hand-written CUDA.
Source papers · 2 [Shah2024FA3] · arXiv 2407.08608
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · arXiv 2407.08608 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: the mainline should move from hand-written CUDA to Triton/compilers] Refinement of c-fdcdf25f0c / c-f00a0f8184: a lower authoring barrier does not imply the peak is replaced. FA3 [Shah2024FA3] deeply integrates Hopper TMA/warp specialization and FP8.

contestedc-1f24bec62c
Counter to c-f4696a7b06 / c-ff80f3a405 / c-bf5e97f8c9: complexity advantages do not automatically translate into a universal default. Softmax attention has a clear expressiveness edge [Deng2023SoftmaxBeatsLinear], and structural changes transfer unevenly across implementations and tasks [Narang2021Transfer]. Long context can also be scaled via distributed blockwise attention [Liu2023RingAttention], which does not require abandoning attention.
Source papers · 3 [Deng2023SoftmaxBeatsLinear][Narang2021Transfer][Liu2023RingAttention]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.positions[2].counter · gpt-5.2

    [counter to Camp C: attention should be replaced by SSM/linear/sparse structures] Counter to c-f4696a7b06 / c-ff80f3a405 / c-bf5e97f8c9: complexity advantages do not automatically translate into a universal default. Softmax attention has a clear expressiveness edge [Deng2023SoftmaxBeatsLinear].

contestedc-ec767ef289
Refinement of c-574ffb3b22 / c-0d086b357f: lock-in concentrates on the “FP8 + Hopper async data movement” peak path, not the whole attention ecosystem. FA2 [Dao2023FA2] and FlexAttention [Dong2024Flex] are easier to adopt as cross-architecture defaults; meanwhile, where hardware is fixed and training cost dominates, FA3’s 5–15% throughput edge may well cover the migration risk.
Source papers · 2 [Dao2023FA2][Dong2024Flex]
1 observation · Flashattention Kernels
Evidence (1)
  • topic_report flashattention-kernels · report.positions[3].counter · gpt-5.2

    [counter to Camp D: FA3 implies generation lock-in; critical paths should avoid it] Refinement of c-574ffb3b22 / c-0d086b357f: lock-in concentrates on the peak path of “FP8 + Hopper async movement,” not the entire attention ecosystem. FA2 [Dao2023FA2] and FlexAttention [Dong2024Flex] are easier cross-architecture defaults.

consensusc-4e05fc7838
The practically feasible regime for offline w* search is usually “a fleet of proxy models + regression/scaling-law extrapolation,” not complex utility estimators: in several settings, LLM-utility mixing is less stable than token-count or heuristic warm starts [Held2025UtilityMix][RegMix2024].
Source papers · 2 [Held2025UtilityMix][RegMix2024]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.headline_claims · gpt-5.2

    The practical engineering regime for offline w* search is often “proxy model ladders + regression/law extrapolation,” not complex utility estimators: in several settings, LLM-utility mixing is less stable than token-count / heuristic warm starts.
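A minimal sketch of that regime, RegMix-flavored and entirely illustrative: sample mixture weights, train small proxies (here stubbed out by a synthetic loss function), regress proxy validation loss on the weights, then pick the predicted argmin over a dense candidate set.

```python
import numpy as np

def fit_mixture_regression(weights, losses):
    # weights: (n_runs, n_domains) sampled mixtures; losses: (n_runs,) proxy val losses
    X = np.hstack([weights, np.ones((len(weights), 1))])     # linear model + bias
    coef, *_ = np.linalg.lstsq(X, losses, rcond=None)
    return coef

rng = np.random.default_rng(0)
W = rng.dirichlet(np.ones(4), size=32)                       # 32 proxy runs, 4 domains
# stand-in for train_proxy(w): a noisy synthetic loss surface
losses = np.array([2.0 + 0.3 * w[0] - 0.2 * w[2] + rng.normal(0, 0.01) for w in W])

coef = fit_mixture_regression(W, losses)
cands = rng.dirichlet(np.ones(4), size=100_000)
preds = np.hstack([cands, np.ones((len(cands), 1))]) @ coef
w_star = cands[np.argmin(preds)]                             # predicted best mixture
```

The claim's point is that this boring pipeline, plus a handful of confirmation runs at the next scale up, tends to be more stable than learned utility estimators over the same budget.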

contestedc-0dc93859ca
Heuristics tend to overfit “the current evaluation suite” into the recipe: when bucket definitions change (e.g., new multilingual or long-tail topics) or the objective shifts from average performance to worst-bucket, the lack of a formal objective makes the trade-offs hard to explain [Paloma2023][Michel2021RobustMTL].
Source papers · 2 [Paloma2023][Michel2021RobustMTL]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Heuristics + curricula are more robust; 2–3 ablations suffice] Heuristics can overfit to the current evaluation suite; when bucket definitions change (e.g., adding multilingual/long-tail topics) or objectives shift from average to worst-bucket, trade-offs become hard to justify.

contestedc-42c7f319c2
Online methods add extra channels (per-domain loss, sampler state, delayed statistics) that amplify the risk of unstable accounting; when bucket definitions, dedup thresholds, or filter versions iterate frequently, weight oscillation and irreproducibility become more common, and rollback costs more [Held2025UtilityMix][OrganizeWeb2025].
Source papers · 2 [Held2025UtilityMix][OrganizeWeb2025]
1 observation · Data Mixture
Evidence (1)
  • topic_report data-mixture · report.positions[2].counter · gpt-5.2

    [counter to Camp C: Online/adaptive mixing beats one-shot offline ratios] Online methods add extra channels (per-domain loss, sampler state, delayed stats) that amplify risks under unstable accounting; when bucket definitions, dedup thresholds, or filter versions iterate, oscillation and irreproducibility follow.

consensusc-190bd266b9
Fine-grained routed experts (≥64) plus one shared expert more reliably yields stable zero-shot gains at matched active-parameter budgets (reported range 1.8–3.4 pp); mechanistically, it explicitly separates the “common component” out of the routed experts, reducing identifiability conflicts and the risk of experts collapsing into similarity. [Dai2024DeepSeekMoE][DeepSeekAI2024V2][Nguyen2025SharedExperts]
Source papers · 3 [Dai2024DeepSeekMoE][DeepSeekAI2024V2][Nguyen2025SharedExperts]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    Fine-grained (≥64) + one shared expert more reliably yields zero-shot gains at matched active-parameter budgets (reported 1.8–3.4 pp), because it explicitly separates common components from routed experts, reducing identifiability conflicts.
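A compact PyTorch sketch of the structure the claim describes (DeepSeekMoE-style separation; sizes, top-k, and the dense routing loop are illustrative, not a production dispatch):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SharedExpertMoE(nn.Module):
    def __init__(self, d=512, n_experts=64, top_k=6):
        super().__init__()
        self.router = nn.Linear(d, n_experts, bias=False)
        ffn = lambda: nn.Sequential(nn.Linear(d, 2 * d), nn.SiLU(), nn.Linear(2 * d, d))
        self.experts = nn.ModuleList(ffn() for _ in range(n_experts))
        self.shared = ffn()                       # always-on expert for common components
        self.top_k = top_k

    def forward(self, x):                         # x: (tokens, d)
        gates = F.softmax(self.router(x), dim=-1)
        topv, topi = gates.topk(self.top_k, dim=-1)
        out = self.shared(x)                      # every token pays the shared expert
        for slot in range(self.top_k):            # routed experts add their weighted share
            for e in topi[:, slot].unique().tolist():
                sel = topi[:, slot] == e
                out[sel] = out[sel] + topv[sel, slot, None] * self.experts[e](x[sel])
        return out
```

The design choice the claim highlights is the self.shared branch: because common features no longer compete for routed capacity, the router's job reduces to allocating genuinely specialized signal, which is where the reported 1.8–3.4 pp comes from.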

consensusc-15b909240f
Dense→MoE upcycling has quantifiable ceilings and saturation regimes: it is more likely to pay off on token-rich checkpoints, with token-equivalence factors around 0.4–0.6×; ignoring this constraint produces ROI misreads where experts are added without quality gains. [Liew2025Upcycling][Komatsuzaki2022SparseUpcycling][He2024UpcyclingLLMtoMoE]
Source papers · 3 [Liew2025Upcycling][Komatsuzaki2022SparseUpcycling][He2024UpcyclingLLMtoMoE]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    Dense→MoE upcycling has quantifiable ceilings and saturation regimes: it is more likely to pay off on token-rich checkpoints, with token-equivalent factors around 0.4–0.6×; ignoring this constraint leads to ROI misreads where adding experts adds no quality.

consensusc-0238eb850a
Systems implementation decides whether sparsity is realized: dispatch/all-to-all, token packing, and kernel scheduling can eat theoretical sparsity gains down to near zero; so “is MoE cheaper” must be compared on the same training/serving stack, not judged from training FLOPs alone. [Tan2024ScatterMoE][Kim2022ElephantsProduction]
Source papers · 2 [Tan2024ScatterMoE][Kim2022ElephantsProduction]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.headline_claims · gpt-5.2

    Systems implementation determines whether sparsity pays off: dispatch/all-to-all, token packing, and kernel scheduling can erase theoretical gains to near zero; therefore “is MoE cheaper” must be evaluated on the same training/serving stack [Tan2024ScatterMoE][Kim2022ElephantsProduction].

contestedc-79200485ae
Systems realization and full-lifecycle cost must be counted: dispatch/all-to-all can swallow the sparsity gains, and in real organizations the upcycling path is more common, where gains have a saturation regime (refines c-a25fb78820, c-364cf0aacb). [Tan2024ScatterMoE][Kim2022ElephantsProduction][Liew2025Upcycling]
Source papers · 3 [Tan2024ScatterMoE][Kim2022ElephantsProduction][Liew2025Upcycling]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[0].counter · gpt-5.2

    [counter to Camp A: MoE becomes the default backbone for frontier pretraining] Systems realization and full-lifecycle costs must be counted: dispatch/all-to-all can erase sparsity gains; and upcycling is more common in real orgs, with saturation regimes.

contestedc-e190a19dae
This argument mostly arises on the upcycling path; when the goal is frontier-scale from-scratch pretraining and DeepSeek’s monitoring/balancing template can be reproduced, MoE’s active/total structure can still open up a capacity gap (preserves c-fcaf30ab3). [DeepSeekAI2024V2][DeepSeekAI2024V3][Wang2024AuxFree]
Source papers · 3 [DeepSeekAI2024V2][DeepSeekAI2024V3][Wang2024AuxFree]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[1].counter · gpt-5.2

    [counter to Camp B: dense wins on full-lifecycle ROI; upcycling makes MoE marginal] This argument is mostly about the upcycling path; when the goal is frontier from-scratch pretraining and the DeepSeek monitoring/balancing template is reproducible, MoE’s active/total structure still opens capacity.

contestedc-6f5a3ed10d
Conflating “is the router smart” with “is congestion controllable” misleads engineering decisions: DeepSeek’s key is not a more complex router but bias EMA plus early hard thresholds that keep token drops and dead experts in a controllable range; via throughput and stability, this directly determines whether training runs to completion. [DeepSeekAI2024V3][Wang2024AuxFree][Qiu2025DemonsLBL]
Source papers · 3 [DeepSeekAI2024V3][Wang2024AuxFree][Qiu2025DemonsLBL]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[2].counter · gpt-5.2

    [counter to Camp C: learned routing/balancing is overrated; random/frozen routers suffice] Conflating “router intelligence” with “congestion controllability” misleads engineering decisions: DeepSeek’s key is not a more complex router but bias EMA plus early hard thresholds.

contestedc-85d837dd94
DeepSeek-V3 reports completing the alignment stage while keeping the 256-expert structure, with frontier-tier results, which suggests “MoE post-training is infeasible” is an engineering hurdle rather than a structural impossibility (refines c-d7ca5574a2). [DeepSeekAI2024V3]
Source papers · 1 [DeepSeekAI2024V3]
1 observation · MOE Landscape
Evidence (1)
  • topic_report moe-landscape · report.positions[3].counter · gpt-5.2

    [counter to Camp D: MoE is mainly for pretraining; post-training should go dense] DeepSeek-V3 reports completing alignment while keeping the 256-expert structure with strong results, suggesting “MoE post-training is impossible” is more an engineering hurdle than a structural limit.

consensusc-e91aa6590a
Aligning the RoPE base from 10k directly up to the target-window scale (commonly ~500k for 128K) raises the ceiling on learnable effective context length; otherwise the low-frequency dimensions get insufficient phase coverage at target lengths, distant positions become nearly indistinguishable, and later retrofits can only move around inside that ceiling. [Xu2024RoPEBaseBounds][Dubey2024Llama3]
Source papers · 2 [Xu2024RoPEBaseBounds][Dubey2024Llama3]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.headline_claims · gpt-5.2

    Setting RoPE base from 10k directly to the target-window scale (often ~500k for 128K) moves the learnable effective-context ceiling upward; otherwise low-frequency phase coverage is insufficient at target lengths and far positions become less distinguishable.
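The mechanism is easy to check numerically: the slowest rotary dimension's wavelength must cover the target window, or far positions alias. A minimal sketch (standard RoPE inverse-frequency formula; d_head=128 is illustrative):

```python
import numpy as np

def max_wavelength(base, d_head=128):
    # standard RoPE inverse frequencies; the last (slowest) dim bounds coverage
    inv_freq = base ** (-np.arange(0, d_head, 2) / d_head)
    return 2 * np.pi / inv_freq[-1]

for base in (10_000, 500_000):
    print(f"base={base:>7,} -> longest wavelength ~{max_wavelength(base):,.0f} tokens")
# base= 10,000 -> ~54,000 tokens   (< 128K target: low-freq dims under-covered)
# base=500,000 -> ~2,560,000 tokens (> 128K target: full phase coverage)
```

This is why the claim treats base=10k at a 128K target as structurally insufficient rather than merely suboptimal: the slowest dimensions never complete a cycle over the window they are supposed to distinguish.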

consensusc-afb92e0cb0
The gap between “advertised 128K” and “usable ~32K” is the norm on recall/aggregation/multi-hop suites like RULER; reporting only PPL or a single needle probe systematically overestimates usable context. [Hsieh2024RULER][Yuan2024LVEval][Zhang2024InfinityBench]
Source papers · 4 [Hsieh2024RULER][Yuan2024LVEval][Zhang2024InfinityBench] · arXiv 2404.06654
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · arXiv 2404.06654 · report.headline_claims · gpt-5.2

    The gap between “advertised 128K” and “usable ~32K” is common under recall/aggregation/multi-hop suites like RULER; reporting only PPL or a single needle systematically overestimates usable context.[Hsieh2024RULER][Yuan2024LVEval][Zhang2024InfinityBench]

consensusc-92d116a1c1
Once targets enter 512K–2M, the dominant error shifts from “phase overflow” to per-dim mismatch: different frequency bands need different amounts of extrapolation, so a single global scale simultaneously leaves low frequencies under-covered and high frequencies over-compressed; LongRoPE fits the error per dimension via searched/learned non-uniform per-dim scales plus a longer finetuning budget. [Ding2024LongRoPE]
Source papers · 1 [Ding2024LongRoPE]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.headline_claims · gpt-5.2

    At 512K–2M targets, the dominant error shifts from “phase overflow” to per-dim mismatch: different frequency bands need different extrapolation, and a single global scaling causes both under-covered low frequencies and over-compressed high frequencies.

consensusc-b1332ba8bc
Bypass routes (retrieval/memory/SSM/compression) can pull complexity from O(n^2) toward linear or subquadratic, but they often trail attention baselines on recall-heavy tasks; without same-baseline comparisons against native long context, it is easy to mistake “system throughput gains” for “effective context gains.” [Arora2023Zoology][Goldman2024IsItRetrieval][Dao2023FlashAttention2]
Source papers · 3 [Arora2023Zoology][Goldman2024IsItRetrieval][Dao2023FlashAttention2]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.headline_claims · gpt-5.2

    Bypass routes (retrieval/memory/SSM/compression) can reduce complexity from O(n^2) toward linear or subquadratic, but often lag attention baselines on recall-heavy tasks; without head-to-head native-long-context controls, teams can mistake throughput gains for context gains.

contestedc-93adca56a8
Counter to c-7094e4510a / c-8b90a3a5a5: for most teams, the continued-pretraining token budget and data-engineering barriers are genuinely higher; at ≤128K, retrofits can reach similar effective-context curves at lower cost [Peng2023YaRN][Xiong2023LongLlama][Young2024Yi]. The steadier stance is to position native long context as the default for new models or major version iterations, not to force retraining on every existing model.
Source papers · 3 [Peng2023YaRN][Xiong2023LongLlama][Young2024Yi]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[0].counter · gpt-5.2

    [counter to Camp A: native long context (base-aligned + curriculum/ABF) as default] Counter to c-7094e4510a / c-8b90a3a5a5: token budget and data-engineering requirements are real; for ≤128K, retrofits can reach similar effective-context curves at lower cost.

contestedc-d4b8d9c05e
Refinement to c-457e01b168: even if YaRN is more stable, that does not mean “changing RoPE alone is enough.” Without long-sequence finetuning and distribution alignment, RULER can still show a break between the advertised window and effective length [Hsieh2024RULER][Fu2024DataEngineering128K]. YaRN’s default posture should therefore be bundled with a long-sequence finetuning budget plus task-curve acceptance.
Source papers · 2 [Hsieh2024RULER][Fu2024DataEngineering128K]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[1].counter · gpt-5.2

    [counter to Camp B: for 32K–128K retrofits, default to YaRN, not PI] Refinement to c-457e01b168: even if YaRN is more stable, it does not mean “RoPE-only is enough.” Without long-sequence finetuning and distribution alignment, RULER can still show a gap between claimed window and effective length.

contestedc-ced5422927
Counter to c-802b6e8f2c: the search cost and engineering complexity are real, and the advantage may be unstable below 512K; per-dim search is better treated as a “threshold action for 512K+” rather than pulling every model into the search loop. [Ding2024LongRoPE][Yuan2024LVEval]
Source papers · 2 [Ding2024LongRoPE][Yuan2024LVEval]
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · report.positions[2].counter · gpt-5.2

    [counter to Camp C: ≥512K needs LongRoPE-style per-dim search/learning] Counter to c-802b6e8f2c: search cost and engineering complexity are real, and below 512K the advantage may be unstable; per-dim search is better treated as a 512K+ threshold action.

contestedc-3800d740d0
Counter to c-9e220ff288: complexity advantages do not equal recall-heavy reliability. Zoology [Arora2023Zoology] shows many efficient models still trail attention on recall-sensitive tasks, and Goldman et al. [Goldman2024IsItRetrieval] point out that many “long-context tasks” can be solved by retrieval shortcuts, which inflates the apparent generality of bypass schemes. If the target tasks require cross-span aggregation and multi-hop tracking, bypass routes must report same-baseline comparisons against native long context. [Hsieh2024RULER]
Source papers · 3 [Arora2023Zoology][Goldman2024IsItRetrieval] · arXiv 2312.04927
1 observation · Long Context Rope Ntk
Evidence (1)
  • topic_report long-context-rope-ntk · arXiv 2312.04927 · report.positions[3].counter · gpt-5.2

    [counter to Camp D: bypass RoPE/attention (SSM/external memory/retrieval)] Counter to c-9e220ff288: complexity wins are not recall-heavy reliability. Zoology [Arora2023Zoology] shows many efficient models still lag attention on recall-sensitive tasks.

consensusc-f4a8f87736
In the 64K–256K regime, perplexity/NIAH and task-based long-context evaluations are frequently out of sync: a model can be near-perfect at NIAH needle-finding yet still show mid-context evidence under-use and non-monotonic degradation on RULER’s stressors and LongBench’s cross-span tasks [RULER2024][LostInTheMiddle2023][LongBench2023][Gao2024EffectiveLongCtx].
Source papers · 4 [RULER2024][LostInTheMiddle2023][LongBench2023][Gao2024EffectiveLongCtx]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.headline_claims · gpt-5.2

    In the 64K–256K regime, perplexity/NIAH often fail to track task-based long-context evaluations: a model can be near-saturated on NIAH “needle” retrieval yet still fail RULER stressors and LongBench cross-span tasks with mid-context under-use.

consensusc-2eb0105e50
When extending the window from 32K to 128K, the dominant uncertainty is usually not the RoPE extrapolation formula but whether long documents actually appear in the training distribution: with a low long-document ratio, piling on more long-sequence tokens tends to improve only retrieval-style proxies, not cross-span tasks [Fu2024DE128K][Xiong2023EffectiveLongContext][Gao2024EffectiveLongCtx].
Source papers · 3 [Fu2024DE128K][Xiong2023EffectiveLongContext][Gao2024EffectiveLongCtx]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.headline_claims · gpt-5.2

    When extending 32K→128K, the dominant uncertainty is often not the RoPE extrapolation formula but whether long documents actually appear in the training distribution: with low long-doc ratio, adding more long-sequence tokens tends to boost retrieval-style proxies only.

consensusc-68e7d05ec1
Ultra-long window extension (≥1M) behaves like a staged rollout rather than a one-shot jump: Xu et al.’s 128K→4M recipe binds continual pretraining and long-dependency SFT stage by stage, returning to a task-evaluation loop at each stage to avoid regressions in short-window ability and instruction following [Xu2025UltraLong][Gao2024EffectiveLongCtx].
Source papers · 2 [Xu2025UltraLong][Gao2024EffectiveLongCtx]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.headline_claims · gpt-5.2

    Ultra-long scaling (≥1M) behaves like staged rollout rather than a one-shot jump: Xu et al.’s 128K→4M recipe binds continual pretraining and long-dependency SFT per stage and returns to task evaluation loops each time to avoid short-context regressions.

consensusc-dc748f7513
The claim “full-attention O(L^2) is unsustainable, so sparse/memory/linear architectures must replace it” still lacks unified task-level wins: Zoology shows recall remains a hard constraint for efficient architectures; the more realistic division of labor is dense long windows for cross-span integration, with retrieval/memory handling sparse evidence and knowledge updates [Zoology2023][RetrievalMeetsLongContext2023][REALM2020][Goldman2024GenuinelyDifficult].
Source papers · 4 [Zoology2023][RetrievalMeetsLongContext2023][REALM2020][Goldman2024GenuinelyDifficult]
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · report.headline_claims · gpt-5.2

    The claim “dense O(L^2) attention is unsustainable, so sparse/memory/linear architectures must replace it” still lacks consistent task-level dominance: Zoology shows recall remains a hard constraint for efficient models; a more realistic division of labor is dense windows plus retrieval/memory.

contestedc-6293b98bc0
RULER [RULER2024] and Lost in the Middle [LostInTheMiddle2023] show that even when everything “fits,” mid-context evidence can still be systematically under-used; Gao et al. [Gao2024EffectiveLongCtx] further show that proxies and task gains are not equivalent. Fu et al. [Fu2024DE128K] and Xiong et al. [Xiong2023EffectiveLongContext] point the key variable at the training distribution: with an insufficient long-document ratio, positional extrapolation alone does not close the gap.
Source papers · 6 [RULER2024][LostInTheMiddle2023][Gao2024EffectiveLongCtx][Fu2024DE128K][Xiong2023EffectiveLongContext] · arXiv 2404.06654
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2404.06654 · report.positions[0].counter · gpt-5.2

    [counter to Camp A: positional extrapolation is the main line; evaluation is secondary] RULER [RULER2024] and Lost in the Middle [LostInTheMiddle2023] show that even if the model “fits,” it can systematically underuse mid-context evidence; Gao et al. [Gao2024EffectiveLongCtx] show proxies and task gains diverge.

contestedc-7a361a4ddd
Systems solve “throughput and memory,” but not “task utilization” automatically. Gao et al. [Gao2024EffectiveLongCtx] show that even when continued training lowers ppl, long-task gains can stall or regress; RULER [RULER2024] and LV-Eval [LVEval2024] detect mid-context under-use even in settings the systems stack can already run. Xu et al. [Xu2025UltraLong] write “staged evaluation gates + long-dependency SFT” into the recipe, which amounts to conceding that training signal and an evaluation loop are still needed once the systems work is done.
Source papers · 5 [Gao2024EffectiveLongCtx][RULER2024][LVEval2024][Xu2025UltraLong] · arXiv 2410.02660
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2410.02660 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: long-sequence training is mainly a systems-parallelism problem] Systems solve throughput and memory, but not task utility. Gao et al. [Gao2024EffectiveLongCtx] shows that even when continued training reduces ppl, long-task gains can stall or regress.

contestedc-9c569882a8
RULER’s [RULER2024] core argument is that a single NIAH probe systematically overestimates truly usable context; Lost in the Middle [LostInTheMiddle2023] gives a reproducible mid-context under-use phenomenon, showing “retrieved” and “used for reasoning” are two different things. Gao et al. [Gao2024EffectiveLongCtx] compare proxies and tasks directly, find weak correlation, and report that long-dependency SFT brings task gains faster than piling on long pretraining tokens, which conflicts with the assumption that “ppl going down naturally brings long-context ability.”
Source papers · 4 [RULER2024][LostInTheMiddle2023][Gao2024EffectiveLongCtx] · arXiv 2404.06654
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2404.06654 · report.positions[2].counter · gpt-5.2

    [counter to Camp C: perplexity/NIAH are sufficient; task benchmarks are noise] RULER [RULER2024] argues a single NIAH probe systematically overestimates usable context; Lost in the Middle [LostInTheMiddle2023] shows reproducible mid-context under-use.

contestedc-55f5f49fee
Alternative routes offer cost advantages, but the task-level price typically shows up in recall and information fidelity. Zoology [Zoology2023] shows through systematic measurement that efficient models face hard recall constraints; and Goldman et al. [Goldman2024GenuinelyDifficult] warn that many “long-context tasks” are really retrieval problems, so an alternative architecture winning there does not imply winning on cross-span compositional tasks. The steadier engineering split is a hybrid of dense long windows plus retrieval/memory: dense handles cross-span integration, retrieval handles sparse evidence.
Source papers · 3 [Zoology2023][Goldman2024GenuinelyDifficult] · arXiv 2312.04927
1 observation · Length Scaling Pretraining
Evidence (1)
  • topic_report length-scaling-pretraining · arXiv 2312.04927 · report.positions[3].counter · gpt-5.2

    [counter to Camp D: dense O(L^2) attention is unsustainable; alternatives must replace it] Alternatives offer cost advantages, but task-level costs often show up as recall and fidelity. Zoology [Zoology2023] measures hard recall constraints in efficient models.

consensusc-2364a1b974
Making coord-check / pre-activation-RMS overlap a merge gate moves the debugging of “LR transfer failed” from post-hoc large-scale runs to pre-hoc small-scale diagnostics; with modern Transformer components present, skipping this diagnostic often lets module-level scale mismatches break µP’s width-transfer conclusions [Yang2022muP][Cerebras2024CompleteP][Lingle2024EmpiricalMuP].
Source papers · 3 [Yang2022muP][Cerebras2024CompleteP][Lingle2024EmpiricalMuP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.headline_claims · gpt-5.2

    Making coord check / pre-activation RMS overlap a merge gate shifts LR-transfer debugging from post-hoc large runs to pre-hoc small diagnostics; with modern Transformer components, skipping this check often lets module-scale mismatches break width transfer.
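A minimal coord-check sketch of the gate the claim describes: train a few steps at several widths, record per-layer pre-activation RMS, and require the curves to roughly overlap across widths before merging. make_model(width) is a placeholder factory and the loss is a stand-in; only the measurement pattern is the point.

```python
import torch

def coord_check(make_model, batch, widths=(256, 512, 1024), steps=3, lr=1e-2):
    stats = {}
    for w in widths:
        model, acts = make_model(w), {}
        hooks = [
            m.register_forward_hook(
                lambda m, i, o, name=name: acts.__setitem__(
                    name, o.detach().pow(2).mean().sqrt().item()))  # activation RMS
            for name, m in model.named_modules()
            if isinstance(m, torch.nn.Linear)
        ]
        opt = torch.optim.SGD(model.parameters(), lr=lr)
        for _ in range(steps):                       # a few optimizer steps
            loss = model(batch).square().mean()      # stand-in training loss
            opt.zero_grad(); loss.backward(); opt.step()
        model(batch)                                 # record post-update RMS
        stats[w] = dict(acts)
        for h in hooks:
            h.remove()
    return stats  # merge gate: per-layer RMS should overlap across widths
```

Under a correct parameterization these RMS values stay O(1) and width-independent after steps; a module whose curve fans out with width is exactly the "module-scale mismatch" the claim says breaks LR transfer.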

consensusc-b19c92246c
“LR transfers across width” does not extrapolate directly to depth: under residual dynamics the optimal LR can still drift with depth, and whether the width/depth limits commute depends on residual scaling; depth changes therefore need at least their own validation axis [Bordelon2023DepthwiseTransfer][HayouYang2023WidthDepthCommute][Jelassi2023DepthDependenceMuP].
Source papers · 3 [Bordelon2023DepthwiseTransfer][HayouYang2023WidthDepthCommute][Jelassi2023DepthDependenceMuP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.headline_claims · gpt-5.2

    “LR transfers across width” does not extrapolate to depth: under residual dynamics, the optimal LR can drift with depth, and whether width/depth limits commute depends on residual scaling; depth changes therefore require an explicit validation axis.

consensusc-8e793c95cb
Under a fixed SP recipe, empirical formulas / joint scaling laws can often hold the initial-guess error for target-scale LR and batch size to roughly the 10% level; but once aspect ratio, precision, or schedule changes, the optimum can drift by multiples, requiring a refit rather than reusing constants [Bi2024DeepSeekLLM][Dey2023CerebrasGPT][McLeish2025Gemstones].
Source papers · 3 [Bi2024DeepSeekLLM][Dey2023CerebrasGPT][McLeish2025Gemstones]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.headline_claims · gpt-5.2

    Under a fixed SP recipe, empirical formulas/joint scaling laws often keep target-scale LR and batch starting-point error around ~10%; but once aspect ratio, precision, or schedule changes, the optimum can drift multiplicatively, requiring a refit.

consensusc-083531743a
Under AdamW, weight decay is an independent axis: changing wd moves both the stability boundary and the optimal LR, so “transfer LR only” frequently misattributes errors; the more reliable practice treats (LR, wd, β₂) as a coupled surface and patches it with 10–20 cost-aware local-search trials [Loshchilov2017AdamW][Kosson2025WDMoreThanMuP][Wang2024AdamWWeightDecayScaling][CARBS2024].
Source papers · 3 [Loshchilov2017AdamW][Kosson2025WDMoreThanMuP][CARBS2024]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.headline_claims · gpt-5.2

    Under AdamW, wd is an independent axis: changing wd shifts stability boundaries and the optimal LR, so “transfer LR only” often misattributes errors; a more reliable approach treats (LR, wd, β₂) as a coupled surface and patches it with ~10–20 cost-aware trials.

contestedc-ccedd29e9b
Correcting c-7d1c22d4b6: even with Complete-P/u-µP patching modules and low precision, it does not follow that “all pretraining should switch to µP.” Lingle [Lingle2024EmpiricalMuP] shows architecture/implementation dependence remains in real stacks; moreover, variables like wd/β₂ sit outside the parameterization’s closure and will keep dominating transfer error [Kosson2025WDMoreThanMuP][Loshchilov2017AdamW].
Source papers · 3 [Lingle2024EmpiricalMuP][Kosson2025WDMoreThanMuP][Loshchilov2017AdamW]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[0].counter · gpt-5.2

    [counter to Camp A: Complete-P as the default; formulas are a stopgap] Correcting c-7d1c22d4b6: even with Complete-P/u-µP patching modules and low precision, it does not follow that “all pretraining should switch to µP.” Lingle [Lingle2024EmpiricalMuP] shows stack dependence remains.

contestedc-718637c9e5
Countering c-b807a6f58d: formulas are not a universal “accurate prediction without touching the stack.” Gemstones shows sensitivity to aspect ratio, schedules, and HP combinations, with the optimum drifting by multiples [McLeish2025Gemstones]; and when modern modules or precision changes introduce scale mismatches, formulas offer no failure attribution, so coord-check-style diagnostics are still needed [Cerebras2024CompleteP][Blake2024UMUP].
Source papers · 3 [McLeish2025Gemstones][Cerebras2024CompleteP][Blake2024UMUP]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[1].counter · gpt-5.2

    [counter to Camp B: empirical formulas + small sweeps are enough; µP is optional] Countering c-b807a6f58d: formulas are not a universal “accurate prediction without touching the stack.” Gemstones shows sensitivity to aspect ratio, schedules, and HP combinations.

contestedc-01ba78b4ab
Countering c-bec6705c6f: end-to-end search cannot easily sidestep proxy fidelity and failure attribution. Small proxies can miss large-scale instabilities [Wortsman2023ProxiesInstabilities]; and without a parameterization or diagnostics, a failed search cannot be attributed to a wrong model-family definition, wrong module scales, or a hyperparameter local optimum. The workable division of labor is to pin the transferable part down with parameterization/formulas, then let search fill blind spots like wd/β₂ [CARBS2024][Kosson2025WDMoreThanMuP].
Source papers · 2 [Wortsman2023ProxiesInstabilities][CARBS2024]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[2].counter · gpt-5.2

    [counter to Camp C: end-to-end automation (BO/auto-optimizers) will replace both] Countering c-bec6705c6f: end-to-end search cannot easily bypass proxy fidelity and failure attribution. Small proxies can miss large-scale instabilities [Wortsman2023ProxiesInstabilities].

contestedc-58c8221ef0
Refining c-05156d3ad8 / c-bdb7a65e9e: wd/β₂ matter, but that does not make parameterization worthless. Complete-P’s module-level patches and coord checks reduce “structural drift,” shrinking the search space from global to a few coupled axes [Cerebras2024CompleteP][Yang2022muP]; otherwise the wd/β₂ search is polluted by scale-mismatch noise and struggles to converge stably [Wortsman2023ProxiesInstabilities].
Source papers · 3 [Cerebras2024CompleteP][Yang2022muP][Wortsman2023ProxiesInstabilities]
1 observation · Mup Hp Transfer
Evidence (1)
  • topic_report mup-hp-transfer · report.positions[3].counter · gpt-5.2

    [counter to Camp D: transfer error is dominated by non-transferable HPs] Refining c-05156d3ad8 / c-bdb7a65e9e: wd/β₂ matter, but that does not make parameterization useless. Complete-P’s module-wise patches and coord checks reduce structural drift.

consensusc-0ae2710831
Under protocols that fix the HP-search budget and align schedule families, many reported optimizer advantages shrink to roughly half the original gap, and rank flips occur; any A/B that does not report trial count, the tunable HP set, and the schedule family must not be read as an algorithmic conclusion. [Dahl2023AlgoPerf][Agarwal2020LRConfound]
Source papers · 2 [Dahl2023AlgoPerf][Agarwal2020LRConfound]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.headline_claims · gpt-5.2

    Under protocols that fix HP-search budgets and align schedule families, many reported optimizer advantages shrink to roughly half and rank flips occur; any A/B that does not report trial count, tunable HP set, and schedule family should not be treated as an algorithmic result.

consensusc-ede14f3825
For ≥70B dense training, AdamW’s engineering edge comes mainly from µP/LR transfer: it turns cross-width LR search from “retune from scratch every time” into “light calibration,” typically keeping the sweep budget to ≤10% of total compute. [Lingle2024muPTransfer][Noci2024SuperConsistency]
Source papers · 3 [Lingle2024muPTransfer][Noci2024SuperConsistency] · arXiv 2404.05728
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · arXiv 2404.05728 · report.headline_claims · gpt-5.2

    For ≥70B dense training, AdamW’s practical edge is largely µP/LR transfer: it turns width scaling from “retune from scratch” into “light calibration,” often keeping sweep budget to ≤10% of total compute [Lingle2024muPTransfer][Noci2024SuperConsistency].

consensusc-40e47eb260
SOAP cuts Shampoo’s extra hyperparameters from about 4 to 1 and reports near-AdamW wall-clock with lower loss at 360M–1.3B; the strong rebuttal “second-order is always slower/harder to tune” no longer holds at mid-scale, but public ≥7B head-to-heads are still missing. [Vyas2024SOAP][Gupta2018Shampoo]
Source papers · 2 [Vyas2024SOAP][Gupta2018Shampoo]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.headline_claims · gpt-5.2

    SOAP reduces Shampoo’s extra HPs from ~4 to 1 and reports near-AdamW wall-clock with lower loss at 360M–1.3B; the strong claim “second-order is always slower/harder to tune” no longer holds at mid-scale, but public ≥7B head-to-heads are still missing.

consensusc-24ec0b6240
For VRAM-limited full-parameter training, prefer low-state routes that do not change the training loop: APOLLO replaces per-parameter second moments with per-tensor scalars and stays close to AdamW at 7B/13B; GaLore saves state via low-rank projection but introduces extra operators and numeric paths, so A/Bs must report throughput and stability together. [Zhu2024Apollo][Zhao2024GaLore]
Source papers · 2 [Zhu2024Apollo][Zhao2024GaLore]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.headline_claims · gpt-5.2

    For VRAM-limited full-parameter training, prefer low-state methods that do not change the training loop: APOLLO replaces per-parameter second moments with per-tensor scalars and stays close to AdamW on 7B/13B; GaLore saves state via low-rank projection at extra systems cost.

contestedc-479c4f2b3e
Counter to c-6cf8d6c199: under matched budgets, Muon/SOAP can still deliver visible wall-clock or loss gains; extrapolating “default” into “irreplaceable” blocks low-friction wins at ≤30B or mid-scale. [Jordan2024Muon][Vyas2024SOAP]
Source papers · 2 [Jordan2024Muon][Vyas2024SOAP]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.positions[0].counter · gpt-5.2

    [counter to Camp A: AdamW won’t be retired (highest default priority)] Counter to c-6cf8d6c199: under matched budgets, Muon/SOAP can still deliver visible gains in wall-clock or loss; extrapolating “default” into “irreplaceable” can block low-friction wins.

contestedc-029ed1008f
Revision to c-c7ea2fc38c: public evidence for Muon on ≥7B–70B dense LLMs is still sparse, and matched-schedule + matched-budget head-to-heads are missing; until the evidence fills in, treating it as a controlled experiment for ≤30B new training runs is the safer posture. [Dahl2023AlgoPerf][Agarwal2020LRConfound]
Source papers · 2 [Dahl2023AlgoPerf][Agarwal2020LRConfound]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Muon is the next default (but only as a hybrid)] Revision to c-c7ea2fc38c: public evidence for Muon on ≥7B–70B dense LLMs is still sparse, and matched-schedule + matched-budget head-to-heads are missing; until evidence catches up, treat it as a controlled trial.

contestedc-71499ea2f7
Counter to over-extrapolating c-9c69cc911f: SOAP’s near-AdamW wall-clock at 360M–1.3B does not automatically extend to ≥7B; distributed communication, preconditioner update frequency, and block choices may let systems costs dominate again. [Shi2023DistShampoo][Defazio2024RoadLessScheduled]
Source papers · 2 [Shi2023DistShampoo][Defazio2024RoadLessScheduled]
1 observation · Optimizer Landscape
Evidence (1)
  • topic_report optimizer-landscape · report.positions[2].counter · gpt-5.2

    [counter to Camp C: Shampoo/SOAP is the canonical endgame (second-order wins)] Counter to over-extrapolating c-9c69cc911f: SOAP’s near-AdamW wall-clock at 360M–1.3B does not automatically extend to ≥7B; distributed communication, preconditioner update frequency, and block choices can re-dominate.

consensusc-fd2b2fdec4
In scraped web pools, document-level exact hashing alone misses “long repeated substrings” and near-duplicates; Lee et al.’s [Lee2021Dedup] substring + MinHash near-dedup simultaneously lowers PPL, verbatim memorization, and train-test leakage, implying repetition governance must reach at least the fragment/near-duplicate layer.
Source papers · 2 [Lee2021Dedup] · arXiv 2107.06499
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2107.06499 · report.headline_claims · gpt-5.2

    In scraped web pools, doc-level exact hashing misses long repeated substrings and near-duplicates; Lee et al. [Lee2021Dedup] show substring + MinHash near-dedup reduces PPL, verbatim memorization, and train-test leakage, implying repetition governance must go deeper.
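A self-contained MinHash sketch of the near-duplicate layer: estimate Jaccard similarity over character shingles with min-hashes, the piece that exact document hashing cannot provide. Shingle size, permutation count, and the 0.8 threshold are illustrative; production pipelines add LSH banding on top.

```python
import hashlib

def minhash_signature(text, num_perm=64, shingle=13):
    shingles = {text[i:i + shingle] for i in range(max(1, len(text) - shingle + 1))}
    # one cheap "permutation" per seed: min over seeded md5 values of all shingles
    return [min(int(hashlib.md5(f"{seed}:{s}".encode()).hexdigest(), 16)
                for s in shingles)
            for seed in range(num_perm)]

def est_jaccard(sig_a, sig_b):
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

a = " ".join(f"sentence number {i} about data dedup" for i in range(200))
b = a.replace("number 100", "number one hundred")   # near-duplicate of a

if est_jaccard(minhash_signature(a), minhash_signature(b)) > 0.8:
    print("near-duplicate: keep one copy")
```

Exact hashing treats a and b as unrelated documents; the signature comparison recovers their ~99% shingle overlap, which is exactly the leakage path the claim says doc-level hashing misses.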

consensusc-3a4221a208
“Uniform repetition ≤4 epochs is basically free” holds only when the exposure distribution is near-uniform: Muennighoff et al. [Muennighoff2023DataConstrained] give a ~2–4-pass window of fresh-token-equivalent value; Hernandez et al. [Hernandez2022RepeatedData] show that over-exposed hot subsets degrade further and leave fingerprints in induction heads.
Source papers · 3 [Muennighoff2023DataConstrained][Hernandez2022RepeatedData] · arXiv 2305.16264
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2305.16264 · report.headline_claims · gpt-5.2

    “Uniform repetition ≤4 epochs is close to free” holds only under near-uniform exposure: Muennighoff et al. [Muennighoff2023DataConstrained] give a ~2–4 pass fresh-token-equivalent window; Hernandez et al. [Hernandez2022RepeatedData] show hot-subset over-exposure degrades further.

consensusc-7f28275426
Per-corpus dedup cannot bound total exposure: Elazar et al. [Elazar2023WhatsInMyBigData] observe overlap across public pools such as C4, RedPajama, and Dolma, so a cross-corpus fingerprint ledger (hash/MinHash/embedding) is needed to count “how many times the same document has been seen.”
Source papers · 2 [Elazar2023WhatsInMyBigData] · arXiv 2310.20707
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2310.20707 · report.headline_claims · gpt-5.2

    Per-corpus dedup cannot bound total exposure: Elazar et al. [Elazar2023WhatsInMyBigData] observe overlap across public pools (C4, RedPajama, Dolma), so a cross-corpus fingerprint ledger (hash/MinHash/embedding) is needed to count how many times a document is seen.

consensusc-5c576a2a87
For benchmark/PII/copyrighted text, managing by “mixture ratio” is inferior to managing by “exposure-count cap”: Carlini et al. [Carlini2022Memorization] show extractable-memorization risk rises with repetition, model scale, and longer context; Deng et al. [Deng2023BenchmarkContamination] show contamination inflates scores. The steadier engineering default is 0–1 exposures plus prefix/suffix dedup plus an isolated evaluation pipeline.
Source papers · 2 [Carlini2022Memorization] · arXiv 2202.07646
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2202.07646 · report.headline_claims · gpt-5.2

    For benchmarks/PII/copyrighted text, exposure caps dominate mixture ratios: Carlini et al. [Carlini2022Memorization] show extractable memorization risk rises with repetition, model size, and longer context; Deng et al. [Deng2023BenchmarkContamination] show contamination inflates scores.
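A hypothetical shape for that default, as a sketch rather than any production system: a fingerprint ledger that enforces per-document exposure caps across corpora and epochs, with a hard 0–1 cap for risky documents and the ≤4-pass window for ordinary ones.

```python
import hashlib
from collections import Counter

class ExposureLedger:
    """Counts per-document exposures by content fingerprint (illustrative)."""

    def __init__(self, risky_cap=1, normal_cap=4):
        self.risky_cap, self.normal_cap = risky_cap, normal_cap
        self.seen = Counter()

    @staticmethod
    def fingerprint(doc: str) -> str:
        return hashlib.sha256(doc.encode("utf-8")).hexdigest()

    def admit(self, doc: str, risky: bool) -> bool:
        fp = self.fingerprint(doc)
        cap = self.risky_cap if risky else self.normal_cap
        if self.seen[fp] >= cap:
            return False          # cap reached: drop from this training pass
        self.seen[fp] += 1
        return True

ledger = ExposureLedger()
assert ledger.admit("GSM8K problem 17 ...", risky=True)
assert not ledger.admit("GSM8K problem 17 ...", risky=True)  # second exposure blocked
```

The design point is that the cap is attached to the fingerprint, not the corpus: the same benchmark problem arriving via two different pools hits the same counter, which is what "ratio management" cannot guarantee.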

contestedc-d71460c7dc
Counter to c-841441e807 / c-bce7a179d3: in data-constrained high-quality pools, “dedup to the extreme” is not always feasible; Muennighoff et al. [Muennighoff2023DataConstrained] show the first 2–4 uniform passes still deliver close to fresh-token value. The other boundary is over-deletion: semantic dedup can mark “same-topic but complementary” samples as redundant, public negative results are scarce, and matched-budget controls for setting thresholds are missing [Abbas2023SemDeDup].
Source papers · 2 [Muennighoff2023DataConstrained][Abbas2023SemDeDup]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.positions[0].counter · gpt-5.2

    [counter to Camp A: Dedup-first and aggressive (exact → near → semantic)] Counter to c-841441e807 / c-bce7a179d3: in data-constrained high-quality pools, “dedup to the extreme” is not always feasible; Muennighoff et al. [Muennighoff2023DataConstrained] show early uniform passes remain near-free.

contestedc-f96670e3ee
Counter to c-766e70d1a1 / c-2c9c18bcfc: average epochs cannot substitute for exposure-distribution control. Hernandez et al. [Hernandez2022RepeatedData] show hot-subset over-exposure adds degradation and leaves induction-head fingerprints; for sensitive/benchmark/copyrighted text, repeated exposure drives up extractable memorization and contamination risk, so “4 epochs on average” cannot be used to argue safety [Carlini2022Memorization][Deng2023BenchmarkContamination].
Source papers · 3 [Hernandez2022RepeatedData][Carlini2022Memorization] · arXiv 2205.10487
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · arXiv 2205.10487 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Epochs-first (fill compute under data constraints)] Counter to c-766e70d1a1 / c-2c9c18bcfc: average epochs cannot substitute exposure-distribution control. Hernandez et al. [Hernandez2022RepeatedData] show hot-subset over-exposure adds degradation and fingerprints.

contestedc-6751a0a141
Refinement to c-8720d900a9 / c-2a25c7719f: semantic dedup does not settle two key ledgers: long repeated substrings (fragment-level debt) and cross-corpus exposure (the same document reused across pools). The former needs a substring/MinHash pipeline [Lee2021Dedup]; the latter needs a cross-corpus fingerprint ledger with provenance [Elazar2023WhatsInMyBigData]. Moreover, semantic-dedup threshold selection lacks a unified “over-deletion cost” evaluation, which is hardest to validate via public benchmarks for long-tail capabilities (small languages, specialist domains).
Source papers · 2 [Lee2021Dedup][Elazar2023WhatsInMyBigData]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.positions[2].counter · gpt-5.2

    [counter to Camp C: Semantic dedup is the main battleground (exact/MinHash solved)] Refinement to c-8720d900a9 / c-2a25c7719f: semantic dedup does not address two key ledgers: long repeated substrings (fragment-level debt) and cross-corpus exposure (documents reused across pools).

contestedc-1b6b84d54c
Refinement to c-b0ed1be495 / c-45889dd38a: the engineering reality is that “absolute zero” is usually non-falsifiable: risky text can enter the training recipe as multiple versions reused across multiple pools. The overlap phenomenon in Elazar et al. [Elazar2023WhatsInMyBigData] means the target should move from “absolute zero” to “auditable exposure caps + reproducible filtering evidence + a cross-corpus fingerprint ledger.”
Source papers · 1 [Elazar2023WhatsInMyBigData]
1 observation · Pretrain Data Repetition
Evidence (1)
  • topic_report pretrain-data-repetition · report.positions[3].counter · gpt-5.2

    [counter to Camp D: Zero exposure (or 0–1 exposure) for risky data overrides all] Refinement to c-b0ed1be495 / c-45889dd38a: “absolute zero” is often non-falsifiable operationally—risky text can enter via multiple versions and reused pools. The overlap result motivates auditable caps instead.

consensusc-47851b00d5
Inside a training loop with the same tokenizer, objective, and model family, validation cross-entropy (equivalently log PPL) carries the highest decision-information density at the lowest cost for “keep training / is this run anomalous / is this data mix better” calls; using it for external model selection folds tokenization and post-training variables into the same scalar. [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Mielke2021TokenizationHistory]
Source papers · 3 [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Mielke2021TokenizationHistory]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Within a training loop with the same tokenizer/objective/model family, validation cross-entropy (log PPL) is the highest signal-per-cost metric for keep-training/anomaly/data-mix decisions; using it for external model selection mixes tokenization and post-training into one scalar.

consensusc-a752b93e83
Compute-optimality is not a “single-point optimum” but an interval decision conditioned on training-duration assumptions: the difference between Kaplan-style and Chinchilla-style optimal allocations can be explained by regime assumptions such as overtraining, so budget fits must report both the regime and the fit uncertainty. [Porian2024ResolvingDiscrepancies][Hoffmann2022Chinchilla][Kaplan2020ScalingLaws]
Source papers · 2 [Hoffmann2022Chinchilla][Kaplan2020ScalingLaws]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Compute-optimality is an interval decision conditioned on training-duration assumptions: Kaplan-style vs Chinchilla-style optima can be explained by regime assumptions such as overtraining, so budgeting fits must report regime and fit uncertainty.

consensusc-734beb26ca
Raw PPL is not a common unit across tokenizers: segmentation granularity changes the information content per token, so a “10% PPL drop” is not synonymous under different tokenizers; external comparisons should at least use BPB/information-normalized metrics and validate on a language-balanced task panel. [Mielke2021TokenizationHistory][BigScience2022BLOOM][Liang2022HELM]
Source papers · 3 [Mielke2021TokenizationHistory][BigScience2022BLOOM][Liang2022HELM]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Raw PPL is not a common unit across tokenizers: segmentation changes information per token, so a “10% PPL drop” is not semantically comparable; external comparisons should at least use BPB/information-normalized metrics and validate on a language-balanced panel.
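The normalization itself is one line: convert total nats to bits and divide by the byte count of the underlying text, so the unit no longer depends on how the tokenizer segments it.

```python
import math

def bits_per_byte(total_nll_nats: float, text: str) -> float:
    """Tokenizer-independent: total cross-entropy in nats over the whole text,
    converted to bits, divided by the UTF-8 byte count of that text."""
    n_bytes = len(text.encode("utf-8"))
    return total_nll_nats / math.log(2) / n_bytes

# Example: same text, two tokenizers. A coarser tokenizer yields fewer tokens,
# hence higher per-token PPL can coexist with *lower* BPB; comparing BPB charges
# both models per byte of text, which is the comparable quantity.
text = "an example paragraph used for evaluation ..."
print(bits_per_byte(total_nll_nats=120.0, text=text))
```

This is why "10% lower PPL" across tokenizers is not meaningful: per-token perplexity moves whenever token granularity moves, while bits per byte does not.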

consensusc-a0b6fe62ce
Stage-2 work (post-training/compression) systematically breaks “lower PPL ⇒ better tasks”: the same pretraining loss can still yield different downstream outcomes [HongLiu2022SameLossBetterDownstream]; pruning/sparsification can leave PPL nearly unchanged while task scores drop [KhanalCapone2024CompressionTasks]; and RLHF/DPO optimization objectives are not centered on LM loss either. [Bai2022RLHF][Rafailov2023DPO]
Source papers · 3 [KhanalCapone2024CompressionTasks][Bai2022RLHF][Rafailov2023DPO]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    Stage-2 changes (post-training/compression) systematically break “lower PPL ⇒ better tasks”: near-identical pretraining loss can yield different downstream outcomes,[HongLiu2022SameLossBetterDownstream] and pruning/sparsification can keep PPL nearly flat while task scores fall.

consensusc-2f3ce5ba1b
A more actionable signal for “keep training / ship” decisions is per-task scaling laws plus extrapolation error: task metrics can be fit stably even in overtrained regimes [Gadre2024OvertrainingDownstream], and model ladders lower the cost of building them. [Bhagia2024TaskScalingLadders][Isik2024DownstreamScalingLaws]
Source papers · 1 [Bhagia2024TaskScalingLadders]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.headline_claims · gpt-5.2

    A more actionable signal for keep-training/ship decisions is per-task scaling laws plus extrapolation error: downstream metrics can be reliably fit in overtrained regimes,[Gadre2024OvertrainingDownstream] and model ladders reduce the cost of building the fits.

contestedc-ee2691ed85
The pushback concentrates on cross-stage/cross-setting non-comparability: Hong Liu et al. [HongLiu2022SameLossBetterDownstream] show the same loss can still give different downstream results; McKenzie et al. [McKenzie2023InverseScaling] and Wei et al. [Wei2022UShapedInverseScaling] show task curves can be non-monotonic; and tokenization makes raw PPL non-comparable across models. [Mielke2021TokenizationHistory]
Source papers · 1 [McKenzie2023InverseScaling]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[0].counter · gpt-5.2

    [counter to Camp A: PPL remains the primary variable (at least for training)] The pushback is concentrated on non-comparability across stages/settings: Hong Liu et al. [HongLiu2022SameLossBetterDownstream] shows same loss can still yield different downstream outcomes.

contestedc-99fa26246e
The main cost lies in evaluation and statistical stability: per-task curves need dense enough model points and a consistent evaluation protocol; if the task panel changes frequently, extrapolation error is amplified by “metric drift” [Liang2022HELM]. The other practical issue: stage 1 still needs loss for training-stability control and data-mixture tuning, which per-task evaluation cannot replace [Kaplan2020ScalingLaws].
Source papers · 2 [Liang2022HELM][Kaplan2020ScalingLaws]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[1].counter · gpt-5.2

    [counter to Camp B: PPL is stage-1 only; stage-2 must use per-task scaling] The main cost is evaluation and statistical stability: per-task curves need dense model checkpoints and consistent protocols; if panels change frequently, extrapolation error grows with metric drift.

contestedc-37f59c9379
Panels do not automatically solve the “decision” problem: with limited budgets you still need ranking and stopping criteria; without per-task extrapolation or explicit business weights, multi-panel setups degrade into “many metrics watched, nobody accountable” [Isik2024DownstreamScalingLaws]. And the broader the panel coverage, the higher the evaluation noise and protocol-maintenance cost, requiring versioning and change control [Liang2022HELM].
Source papers · 1 [Liang2022HELM]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[2].counter · gpt-5.2

    [counter to Camp C: stop searching for a scalar; define quality via standing panels] Panels do not automatically solve decision-making: under limited budgets, ranking and stopping rules are still needed; without per-task extrapolation or explicit business weights, panels drift.

contestedc-659de7ec24
The pushback is engineering controllability: even if the final target is not next-token loss, stage 1 still needs a stable, differentiable, low-cost training-monitoring quantity, and PPL remains more usable for training stability and budget fitting [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla]. A second risk is that over-optimizing a preference proxy damages true preference quality, so “alignment metrics” also need multi-panel treatment with anti-overfitting constraints [Gao2022RewardOveroptimization][Liang2022HELM].
Source papers · 3 [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla][Liang2022HELM]
1 observation · Perplexity Downstream Performance
Evidence (1)
  • topic_report perplexity-downstream-performance · report.positions[3].counter · gpt-5.2

    [counter to Camp D: the issue is ontological—next-token loss is not the target] The pushback is engineering controllability: even if the final target is not next-token loss, stage 1 still needs a stable, differentiable, low-cost monitoring signal.

consensusc-24f9a4d9e9
In public refits, compute-optimal tokens/param does not converge to a single constant: even among Transformer LMs, the optimal ratio can slide across 5–100 and is sensitive to batch-size schedules and data recipes [DeepSeek2024LLM].
Source papers · 2 [DeepSeek2024LLM] · arXiv 2401.02954
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · arXiv 2401.02954 · report.headline_claims · gpt-5.2

    In public refits, compute-optimal tokens/param does not converge to a single constant: for Transformer LMs the optimum can slide across 5–100 and is sensitive to batch-size schedules and data recipes [DeepSeek2024LLM].

consensusc-62c2f29640
Kaplan’s “fixed compute favors larger models” and Chinchilla’s “favor more tokens” are not mutually exclusive truths: once undertraining is modeled explicitly and corrected via learning-curve extrapolation, the fixed-compute optimum moves from Kaplan’s local region into the tokens/param ≈ 20 range [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla].
Source papers · 2 [Kaplan2020ScalingLaws][Hoffmann2022Chinchilla]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.headline_claims · gpt-5.2

    Kaplan’s “fixed-compute favors larger models” and Chinchilla’s “favor more tokens” are not mutually exclusive truths: when undertraining is modeled explicitly and corrected via learning-curve extrapolation, the fixed-compute optimum shifts to tokens/param ≈ 20.

consensusc-5fc50e7cef
At fixed model and token budgets, data filtering and mixture choices alone can create ≥7 pp downstream gaps, a magnitude large enough to swamp a small tokens/param adjustment [Li2024DCLM].
Source papers · 1 [Li2024DCLM]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.headline_claims · gpt-5.2

    At fixed model and token budgets, data filtering and mixtures alone can create ≥7-point downstream gaps—large enough to dominate small shifts in tokens/param [Li2024DCLM].

consensusc-35a291eb9d
Loss scaling still extrapolates smoothly in the over-training regime, but single-task benchmark scores wobble first and only stabilize after crossing thresholds; pushing single-task prediction error to ~1.9% requires a two-step mapping (task perplexity→accuracy), not direct loss→accuracy [Gadre2024OverTraining][Bhagia2024TaskLadders].
Source papers · 2 [Gadre2024OverTraining][Bhagia2024TaskLadders]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.headline_claims · gpt-5.2

    Loss scaling can extrapolate smoothly in over-training, but per-task benchmark scores wobble and only stabilize after thresholds; reaching ~1.9% per-task prediction error requires a two-step mapping (task perplexity→accuracy), not loss→accuracy.
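A sketch of the two-step mapping under stated assumptions (synthetic placeholder data; functional forms are the commonly used power law and sigmoid, not the papers' exact fits): step 1 maps compute to task cross-entropy, step 2 maps task cross-entropy to accuracy, and only the composition is extrapolated.

```python
import numpy as np
from scipy.optimize import curve_fit

def step1(logC, a, b, c):               # task CE vs log10(compute): shifted power law
    return a * np.exp(-b * logC) + c

def step2(ce, lo, hi, mid, slope):      # accuracy vs task CE: sigmoid with floor/ceiling
    return lo + (hi - lo) / (1 + np.exp(slope * (ce - mid)))

# placeholder measurements from a model ladder
C = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
task_ce = np.array([1.90, 1.60, 1.35, 1.20, 1.10])   # CE on the task's gold targets
acc = np.array([0.28, 0.34, 0.45, 0.55, 0.62])

p1, _ = curve_fit(step1, np.log10(C), task_ce, p0=(1e3, 0.4, 1.0), maxfev=20000)
p2, _ = curve_fit(step2, task_ce, acc, p0=(0.25, 0.9, 1.4, 5.0), maxfev=20000)

predict_acc = lambda C_new: step2(step1(np.log10(C_new), *p1), *p2)
print(predict_acc(1e21))                # extrapolated accuracy at a larger budget
```

The reason for the intermediate variable is the claim's wobble: accuracy vs loss is thresholded and noisy, while task perplexity moves smoothly with compute, so factoring the map through it is what stabilizes the extrapolation.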

contestedc-5459b81c5b
Countering c-861b5bafc8 / c-1f9ceebe32 / c-a34e28d5d3: once undertraining is modeled explicitly and corrected with IsoFLOP + extrapolation, the fixed-compute optimum shifts toward more tokens [Hoffmann2022Chinchilla]; open-source refits further show the optimal tokens/param drifts across 5–100 with the recipe, so “stack parameters” cannot be the default optimum across settings [DeepSeek2024LLM].
Source papers · 2 [Hoffmann2022Chinchilla][DeepSeek2024LLM]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.positions[0].counter · gpt-5.2

    [counter to Camp A: Kaplan-style — portable exponents; under fixed compute, go bigger] Countering c-861b5bafc8 / c-1f9ceebe32 / c-a34e28d5d3: once undertraining is modeled explicitly and corrected via IsoFLOP+extrapolation, the fixed-compute optimum shifts toward more tokens.

contestedc-eefd17d100
Countering c-0f12d82e0e / c-6669a9cdef / c-e2361a4007: tokens/param ≈ 20 is a good starting point in many settings but not a portable constant. DeepSeek-AI’s [DeepSeek2024LLM] public refit widens the optimum to 5–100 and shows sensitivity to batch schedules and data recipes; under data constraints, the repeated-epoch boundary (≤4) directly changes “effective tokens,” distorting any fixed ratio [Muennighoff2023DataConstrained].
Source papers · 3 [DeepSeek2024LLM][Muennighoff2023DataConstrained] · arXiv 2401.02954
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · arXiv 2401.02954 · report.positions[1].counter · gpt-5.2

    [counter to Camp B: Chinchilla-style — tokens/param≈20 as the default recipe] Countering c-0f12d82e0e / c-6669a9cdef / c-e2361a4007: tokens/param≈20 is often a good starting point but not a portable constant. DeepSeek-AI [DeepSeek2024LLM] widens the optimum to 5–100.

contestedc-ce32644b61
Refining c-8a6b54a19e / c-b54170330d / c-f7286a59b8: data recipes can indeed dominate, but “more curated is better” is not stable. RefinedWeb’s web-only counterexample shows that, at some scales and on some tasks, filtered web data can beat curated mixtures [Penedo2023RefinedWeb]; the move is therefore “do the data Pareto first,” not presupposing one corpus type is inherently superior [Li2024DCLM][Penedo2023RefinedWeb].
Source papers · 2 [Penedo2023RefinedWeb][Li2024DCLM]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.positions[2].counter · gpt-5.2

    [counter to Camp C: Data-mixture pragmatists — get the data recipe right first] Refining c-8a6b54a19e / c-b54170330d / c-f7286a59b8: data recipes can dominate, but “more curated is always better” is not stable. RefinedWeb provides a web-only counterexample.

contestedc-031dc93cb3
Countering the “explain everything away” version of c-b6d5738eb8 / c-f19d4b4475 / c-2017e034b2: not every threshold can be explained by the metric. The GPT-4 technical report shows step changes in which some capabilities are near-unusable in small models and usable in large ones [OpenAI2023GPT4]; math reasoning is also reported to have stronger threshold effects [Yuan2023Math]. The steadier practice splits thresholds into two kinds: metric thresholds (which can be de-thresholded) and task thresholds (which require mechanism or data changes).
Source papers · 2 [OpenAI2023GPT4][Yuan2023Math]
1 observation · Scaling Laws LLM
Evidence (1)
  • topic_report scaling-laws-llm · report.positions[3].counter · gpt-5.2

    [counter to Camp D: Against emergence-as-magic — many “emergent” effects are artifacts] Countering the “explain everything away” version of c-b6d5738eb8 / c-f19d4b4475 / c-2017e034b2: not all thresholds are metric artifacts. The GPT-4 technical report shows step-change capabilities.

consensusc-acf27c6d27
Per-doc causal-mask packing via the FlashAttention-2/3 varlen interface can, at packing ratios >98%, hold mask-computation overhead to <3% throughput loss while avoiding the 0.5–2% training-loss underestimate caused by cross-doc contamination [Krell2021Packing][FlashAttention2Varlen][FlashAttention32024].
Source papers · 2 [Krell2021Packing][FlashAttention32024]
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.headline_claims · ep-20260214160829-csjmc

    Per-doc causal mask packing via the FlashAttention2/3 varlen interface can achieve >98% packing ratio, reduce mask-computation overhead to <3% throughput loss, and avoid the 0.5–2% training-loss underestimation caused by cross-doc contamination [Krell2021Packing].
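Mechanically, the varlen interface expresses the per-doc mask as cumulative sequence lengths rather than a materialized mask. A sketch assuming the flash-attn Python package (flash_attn_varlen_func); document lengths and shapes are illustrative:

```python
import torch
from flash_attn import flash_attn_varlen_func  # assumption: flash-attn 2.x installed

doc_lens = [1700, 900, 1400]                    # three docs packed into one sequence
cu_seqlens = torch.zeros(len(doc_lens) + 1, dtype=torch.int32)
cu_seqlens[1:] = torch.tensor(doc_lens).cumsum(0)  # [0, 1700, 2600, 4000]
cu_seqlens = cu_seqlens.cuda()

total, heads, d = sum(doc_lens), 16, 64
q = k = v = torch.randn(total, heads, d, device="cuda", dtype=torch.bfloat16)

out = flash_attn_varlen_func(
    q, k, v,
    cu_seqlens_q=cu_seqlens, cu_seqlens_k=cu_seqlens,
    max_seqlen_q=max(doc_lens), max_seqlen_k=max(doc_lens),
    causal=True,   # causal *within* each document; no attention across boundaries
)
```

Because the kernel iterates per segment defined by cu_seqlens, no L×L cross-doc mask is ever built, which is where both the <3% overhead and the elimination of cross-doc contamination come from.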

consensusc-9a79e8f1cc
FIM is degradation-free only in code-pretraining settings; enabling 50% FIM on a pure-NL corpus causes a stable 1–3 pp regression on downstream NL tasks [StarCoder2023][CodeLlama2023].
Source papers · 3 [StarCoder2023][CodeLlama2023] · arXiv 2305.06161
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · arXiv 2305.06161 · report.headline_claims · ep-20260214160829-csjmc

    FIM has no performance degradation only in code pretraining scenarios; enabling 50% FIM on pure NL corpus leads to 1-3pp stable degradation in downstream NL tasks [StarCoder2023][CodeLlama2023].

consensusc-3490a24bfa
Handling over-length documents with split-then-pack rather than truncate-and-drop improves perplexity by 0.02–0.05 nats and long-document tasks by 3–6 pp [Ding2024FewerTruncations].
Source papers · 2 [Ding2024FewerTruncations] · arXiv 2404.10830
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · arXiv 2404.10830 · report.headline_claims · ep-20260214160829-csjmc

    Processing over-length documents with split-then-pack improves perplexity by 0.02-0.05 nats and long-document tasks by 3-6pp compared to truncate-and-drop [Ding2024FewerTruncations].

contestedc-f7e77c24d7
At equal compute, always-mixed lengths cost 20–40% more wall-clock than short-to-long; the Llama 3 and Qwen2.5 long-context evaluations show no performance advantage for mixed lengths, and mixing lengths raises packing-scheduler complexity.
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · report.positions[1].counter · ep-20260214160829-csjmc

    [counter to Camp B: always-mixed lengths (anti-curriculum)] Full-cycle mixed length has 20-40% higher wall-clock time than short-to-long under equal compute; long-context evaluations of Llama3 and Qwen2.5 show no performance advantage of mixing, which also complicates packing scheduling.

contestedc-acbc109e59
Code Llama [CodeLlama2023] observed a 1–3 pp downstream regression after enabling FIM on pure-NL corpora; existing FIM gains have only been reproduced in code settings, there is no NL-specific rate sweep demonstrating absence of side effects, and the mixed objective adds training complexity.
Source papers · 2 [CodeLlama2023] · arXiv 2308.12950
1 observation · Packing Masking Length
Evidence (1)
  • topic_report packing-masking-length · arXiv 2308.12950 · report.positions[3].counter · ep-20260214160829-csjmc

    [counter to Camp D: FIM/denoising-style objectives as default (infilling everywhere)] Code Llama [CodeLlama2023] observes 1-3pp downstream task degradation after enabling FIM on pure NL corpus. Existing FIM benefits are only reproduced in code scenarios.

consensusc-4040fe20fc
Compression ratio and mean token length are not enough to explain tokenizer quality differences [Goldman2024UnpackingTokenization][Schmidt2024MoreThanCompression]; the more stable explanatory variables are segmentation-induced inductive bias and how training signal is allocated across the long tail.
Source papers · 2 [Schmidt2024MoreThanCompression] · arXiv 2402.18376
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · arXiv 2402.18376 · report.headline_claims · gpt-5.2

    Compression ratio and mean token length do not fully explain tokenizer quality differences [Goldman2024UnpackingTokenization][Schmidt2024MoreThanCompression]; a more stable explanation is segmentation-induced inductive bias and long-tail training-signal allocation.

consensusc-362b639b0a
Local merges over digits/dates break compositional structure: 10–20 pp gaps appear on 3–5-digit carry-sensitive arithmetic and temporal reasoning [Singh2024TokenizationCounts][Bhatia2025DateFragments][Nogueira2021Arithmetic], and one should not expect them to “vanish naturally with scale.”
Source papers · 4 [Singh2024TokenizationCounts][Bhatia2025DateFragments][Nogueira2021Arithmetic] · arXiv 2402.14903
2 observations · Tokenizer Scaling
Evidence (2)
  • topic_report tokenizer-scaling · arXiv 2402.14903 · report.headline_claims · gpt-5.2

    Local merges for digits/dates break compositional structure: 10–20 pp gaps appear on 3–5 digit carry-sensitive arithmetic and temporal reasoning [Singh2024TokenizationCounts][Bhatia2025DateFragments][Nogueira2021Arithmetic], and should not be expected to vanish with scale.

  • topic_report tokenizer-scaling · arXiv 2402.14903 · report.headline_claims · gpt-5.2

    Local merges for digits/dates create reproducible reasoning gaps: 10–20 pp differences on 3–5 digit carry-sensitive arithmetic and temporal reasoning [Singh2024TokenizationCounts][Bhatia2025DateFragments], so vocabulary structure can dominate capability differences.

consensusc-bdab2d201e
Under-trained tokens in the vocabulary tail are not occasional blemishes: 3k–10k+ are observable across multiple models and require scan-based localization plus repair via short continued pretraining [LandBartolo2024Magikarp]; this makes tokenizer versioning and post-training governance a deliverable.
Source papers · 2 [LandBartolo2024Magikarp] · arXiv 2405.05417
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · arXiv 2405.05417 · report.headline_claims · gpt-5.2

    Under-trained tail tokens are not rare edge cases: 3k–10k+ are observable across models and require scan-and-repair with short continued pretraining [LandBartolo2024Magikarp]; this makes tokenizer versioning and post-training governance a shipping requirement.
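A Magikarp-flavored first-pass scan, as a sketch rather than the paper's exact method: tokens whose input-embedding norm sits in the bottom tail (near initialization) are candidates for the under-trained set, to be verified by prompting and repaired with short continued pretraining. The quantile threshold is illustrative.

```python
import torch

@torch.no_grad()
def undertrained_candidates(embedding_weight: torch.Tensor, quantile: float = 0.02):
    """embedding_weight: (vocab, d) input-embedding matrix.
    Returns token ids whose embedding norm falls in the bottom tail."""
    norms = embedding_weight.float().norm(dim=-1)        # (vocab,)
    cutoff = norms.quantile(quantile)
    return (norms < cutoff).nonzero(as_tuple=True)[0]

# usage sketch against a HF-style model object:
#   ids = undertrained_candidates(model.get_input_embeddings().weight)
#   then decode `ids`, probe the model with them, and schedule repair data.
```

The governance point in the claim is that this scan must rerun per tokenizer/model version: the candidate set is a property of the (vocabulary, training-mix) pair, not of the tokenizer alone.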

contestedc-65c38e463c
Refinement to c-f70d90a418 / c-83bd08f6d0: public evidence solidly supports 128K vocabularies yielding a 0.02–0.04 nats loss drop [Dubey2024Llama3], but the fixed-compute 32K/64K/128K/256K+ curves needed to establish monotonicity are missing. The sharper counterexample is structural: local merges over digits/dates create 10–20 pp gaps [Singh2024TokenizationCounts][Bhatia2025DateFragments], and these do not disappear as the vocabulary grows.
Source papers · 3 [Dubey2024Llama3][Singh2024TokenizationCounts][Bhatia2025DateFragments]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · report.positions[1].counter · gpt-5.2

    [counter to Camp B: bigger vocab is near-monotonic; default to 256K+] Refinement to c-f70d90a418 / c-83bd08f6d0: public evidence supports 128K yielding 0.02–0.04 nats lower loss [Dubey2024Llama3], but controlled 32K/64K/128K/256K+ fixed-compute curves are missing.

contestedc-656a099340
Counter to c-2b8f288a97 / c-f7fdd5ced2: the tokenizer-free direction is mechanistically clean, but shipping it demands stronger controlled comparisons, especially fixed-compute parity against strong 64K–128K BPE baselines. BLT has already pulled “sequences too long” into a comparable range [Pagnoni2024BLT], but that does not automatically remove the engineering complexity (patching, throughput, KV-cache behavior) or the evaluation-protocol issues (byte-level alignment is still required) [Vieira2024Characters][Ph
Source papers · 2 [Pagnoni2024BLT][Vieira2024Characters]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · report.positions[2].counter · gpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] Counter to c-2b8f288a97 / c-f7fdd5ced2: tokenizer-free is mechanistically clean, but shipping requires stronger controlled comparisons—especially fixed-compute parity against 64K–128K BPE baselines.

contestedc-0ef341a9a0
Refinement to c-d35e1355ac: the strongest public evidence today is that “the tail exists and is scannable/repairable” [LandBartolo2024Magikarp], not that “shrinking the vocabulary directly improves RLHF stability.” Yu et al. [Yu2021RareTokensDegenerate] provide a mechanistic clue that rare tokens degrade representations, but a controlled public curve from “prune vocabulary → alignment gains” is still missing.
Source papers · 3 [LandBartolo2024Magikarp][Yu2021RareTokensDegenerate] · arXiv 2405.05417
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_report tokenizer-scaling · arXiv 2405.05417 · report.positions[3].counter · gpt-5.2

    [counter to Camp E: shrink/prune the vocabulary to buy alignment and RL stability] Refinement to c-d35e1355ac: the strongest public evidence today is that tail issues exist and are scannable/repairable [LandBartolo2024Magikarp], not that shrinking vocabularies yields alignment gains.

consensusc-93d79bd2d6
HumanEval/MBPP pass@k must not be the primary SFT-ready metric: EvalPlus proves weak tests produce structural false positives; external comparisons should therefore at least report EvalPlus numbers, demoting raw HumanEval/MBPP to a regression item. [EvalPlus2023][Chen2021Codex][Austin2021MBPP]
Source papers · 3 [EvalPlus2023][Chen2021Codex][Austin2021MBPP]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.headline_claims · gpt-5.2

    HumanEval/MBPP pass@k should not be the primary SFT-ready metric: EvalPlus shows weak tests create structural false positives; external comparisons should at least report EvalPlus, with raw HumanEval/MBPP demoted to a regression signal.[EvalPlus2023]
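For reference, the pass@k number itself is the unbiased estimator from the Codex paper [Chen2021Codex]: with n samples per problem of which c pass, pass@k = 1 − C(n−c, k)/C(n, k). The sketch below is a direct implementation; whether a sample "passes" should then be decided under EvalPlus's stronger tests per the claim.

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k: n samples drawn, c of them correct, estimate for k draws."""
    if n - c < k:          # not enough failures to fill k draws: success guaranteed
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

print(pass_at_k(n=200, c=42, k=1))    # 0.21
print(pass_at_k(n=200, c=42, k=10))   # much higher: any of 10 draws may pass
```

The naive estimate c/n for k=1 coincides with this formula, but for k>1 the combinatorial form avoids the upward bias of simply resampling, which matters when models are compared at small score gaps.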

contestedc-8971846042
Counter to c-3c1d1582fe / c-fc0eb62ec3 / c-90f8f3ed32: weak tests systematically inflate scores, and EvalPlus shows the same solution can fail under stronger tests [EvalPlus2023]; repo-level settings and library ecosystems rewrite the rankings, since RepoBench/RepoCoder and BigCodeBench/ODEX turn cross-file dependencies and multi-library calls into structural variables that function-level problems never reach. [RepoBench2023][RepoCoder2023][BigCodeBench2024][ODEX
Source papers · 5 [EvalPlus2023][RepoBench2023][RepoCoder2023][BigCodeBench2024] · arXiv 2305.01210
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · arXiv 2305.01210 · report.positions[0].counter · gpt-5.2

    [counter to Camp A: HumanEval/MBPP is sufficient—cheap, stable, reproducible] Counter to c-3c1d1582fe / c-fc0eb62ec3 / c-90f8f3ed32: weak tests systematically inflate scores—EvalPlus shows solutions can fail under stronger tests;[EvalPlus2023] repo-level settings rewrite rankings.

contestedc-5208790fbe
Correction to c-23dce50372 / c-867a4775a4 / c-b6d1926f11: existing public evidence does not establish a systematic correlation strong enough for “patch-PPL to replace subsequent benchmarks”; CRUXEval isolates execution semantics as its own measurement [CRUXEval2024], and RepoBench/RepoCoder make cross-file dependencies and retrieval settings structural variables [RepoBench2023][RepoCoder2023], all of which shows token-level likelihood does not cover the key bottlenecks. The steadier move is to treat BPB/patch-PPL as a training-side monitoring signal rather than a substitute for downstream evaluation.
Source papers · 3 [CRUXEval2024][RepoBench2023][RepoCoder2023]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_report swe-agent-evaluation · report.positions[2].counter · gpt-5.2

    [counter to Camp C: pretrain BPB/patch-PPL best predicts SWE; downstream is noise] Correction to c-23dce50372 / c-867a4775a4 / c-b6d1926f11: public evidence does not establish a systematic correlation strong enough to let patch-PPL replace downstream benchmarks.

contestedc-e8aecdd1c2
Counter to the strong version of c-c980753890 / c-88b722216c: UX/cost alone loses the success-rate anchor: a system can "look more stable" via more conservative or more expensive policies while its actual resolve rate drops; Verified still has to serve as a hard constraint at the outcome layer (corresponding to c-e1f5138355). [SWEbenchVerified2024][SelfRepair2023]
Source papers · 2 [SWEbenchVerified2024][SelfRepair2023]
1 observation · Swe Agent Evaluation
Evidence (1)
  • topic_reportswe-agent-evaluation· report.positions[3].countergpt-5.2

    [counter to Camp D: deployment UX/cost metrics reflect value better than] Counter to the strong version of c-c980753890 / c-88b722216c: UX/cost alone loses a success anchor—systems can appear more stable by being more conservative or more e

contestedc-d0edda7be3
Counter to c-b93955c99f: public recipes have not shown that synthetic data can replace real-world coverage. When synthetic becomes a replacement rather than an incremental layer, the distribution tail tends to die first, and the teacher's stylistic bias is copied faithfully. [MAD2023][CollapseInevitable2024] The steadier reading: synthetic-first suits data-constrained and verifiable subdomains, but still needs real anchors and pool-policy constraints.
Source papers · 2 [MAD2023][CollapseInevitable2024]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[0].countergpt-5.2

    [counter to Camp A: synthetic-first (a primary route under data constrai] Counter to c-b93955c99f: public recipes do not show “synthetic can replace real-world coverage.” When synthetic becomes replacement rather than an incremental layer,

contestedc-abdf16218c
Correction to c-958f96c37a: gains from high-density synthetic data are easily confounded with an undertrained backbone. If the backbone is not trained far enough, any late-stage recipe will look effective; so first bring the backbone close to saturation using compute-optimal budgets and training-trajectory signals, then assess the net gain of mid-training. [Chinchilla2022][ScalingLaws2020]
Source papers · 2 [Chinchilla2022][ScalingLaws2020]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[1].countergpt-5.2

    [counter to Camp B: web-heavy backbone + (real/synthetic) mid-train (a m] Correction to c-958f96c37a: gains from high-density synthetic are easily confounded with an undertrained backbone. If the backbone is not trained sufficiently, any la

contestedc-f3fbf1840b
Counter to c-1da7d58c52: filtering can only select within an existing candidate pool; it cannot actively expand scarce distributions. For target domains such as math, code, and long context, public experience looks more like the combination "filtering + synthesis/rewriting + mid-training," not filtering alone. [DeepSeekMath2024][WRAP2024][LongContextScaling2023]
Source papers · 3 [DeepSeekMath2024][WRAP2024][LongContextScaling2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[2].countergpt-5.2

    [counter to Camp C: minimize synthetic; stronger filtering + more real d] Counter to c-1da7d58c52: filtering can only select from existing pools and cannot actively expand scarce distributions. For math/code/long context, public evidence lo

contestedc-76f88ddc54
Counter to c-8768256855 / c-fea6da20b0 / c-588cd52729: existing public evidence only supports "under an accumulate-real pool policy, degradation is not inevitable," which is not equivalent to "synthetic can replace real data without bound." MAD [MAD2023] still supplies the mechanism by which the tail dies first once real data is replaced; moreover, verifier experience from code/math does not extrapolate directly to natural-language reasoning, because the correctness signal takes a different form. [MAD2023][Le
Source papers · 1 [MAD2023]
1 observation · Synthetic Data Midtrain
Evidence (1)
  • topic_reportsynthetic-data-midtrain· report.positions[3].countergpt-5.2

    [counter to Camp D: synthetic scales almost without bound; collapse is m] Counter to c-8768256855 / c-fea6da20b0 / c-588cd52729: public evidence supports only “with accumulate-real pool policy, degradation is not inevitable,” which is not e

contestedc-f695b17b22
Public evidence shows that intra-Transformer "KV compression + locality" already covers a large share of the serving bill: at ≤70B, defaulting to GQA compresses the cache by the ratio h_kv/h [Ainslie2023GQA]; at 128K, interleaving local/global attention cuts roughly another 4× of memory [Gemma3Report]; at MoE/very large scale, MLA can compress the cache to about 1/7 [DeepSeekV2]. Meanwhile, SSM length extension also has failure conditions, so "train short, infer long" cannot be assumed by default
Source papers · 1 [Ainslie2023GQA]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvements· report.positions[1].countergpt-5.2

    [counter to Camp B: Transformer state cost is near its limit; move to re] Public evidence shows intra-Transformer “KV reduction + locality” already covers a large fraction of serving bills: ≤70B with default GQA reduces cache by h_kv/h [Ain
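
A back-of-the-envelope calculator for the regimes above; the 70B-class shapes (80 layers, 64 heads, head dim 128) and the fp16 cache are illustrative assumptions, not any specific model's config.

```python
def kv_cache_gib(layers, n_kv_heads, head_dim, seq_len,
                 local_fraction=0.0, local_window=4096, bytes_per=2):
    """K+V bytes across layers; "local" layers keep only a sliding
    window of local_window tokens (Gemma-3-style interleaving)."""
    per_tok_layer = 2 * n_kv_heads * head_dim * bytes_per
    full = layers * (1 - local_fraction) * seq_len
    local = layers * local_fraction * min(seq_len, local_window)
    return per_tok_layer * (full + local) / 2**30

mha = kv_cache_gib(80, 64, 128, 131072)                    # full MHA cache
gqa = kv_cache_gib(80, 8, 128, 131072)                     # h_kv/h = 1/8
mix = kv_cache_gib(80, 8, 128, 131072, local_fraction=5/6)
print(f"MHA {mha:.0f} GiB | GQA {gqa:.0f} GiB | GQA+local {mix:.1f} GiB")
# ~320 -> ~40 -> ~7.7 GiB: the h_kv/h ratio, then a further ~4-5x from locality.
```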

contestedc-a2b6948179
Public negative results and boundary conditions are scarce: when depth-growth or block-insertion breaks general capability, makes alignment harder, or introduces new instabilities remains underreported, with few same-budget controls or failure-mode reports. For genuinely new data distributions and long-term roadmaps, from-scratch training is still easier to plan and reproduce against scaling laws [Kaplan2020ScalingLaws][PaLM2022].
Source papers · 2 [Kaplan2020ScalingLaws][PaLM2022]
1 observation · Transformer Arch Improvements
Evidence (1)
  • topic_reporttransformer-arch-improvements· report.positions[2].countergpt-5.2

    [counter to Camp C: default a second scaling path—grow first, then decid] Public negative results and boundary conditions are scarce: when growth breaks generality, makes alignment harder, or introduces new instabilities is underreported, w

consensusc-5f4d24a7c9
Introducing process data such as commits, PRs, and issues during pretraining can raise RL-stage sample efficiency by more than 3× and double the convergence speed of reward signals [SWERL2025][LingmaSWEGPT2024].
Source papers · 3 [SWERL2025][LingmaSWEGPT2024] arXiv 2502.18449
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretrainingarXiv 2502.18449· report.headline_claimsep-20260214160829-csjmc

    Introducing process data such as commits/PRs/issues during pretraining can increase the sample efficiency of the RL stage by more than 3 times, and the convergence speed of reward signals by 2 times [SWERL2025][LingmaSWEGPT2024].

consensusc-264a3b7d72
With the same base model and post-training configuration, the SWE-bench gain from optimizing training-data shape is 1.8–2.5× that of pure scaffolding optimization [Agentless2024][SWEagent2024].
Source papers · 3 [Agentless2024][SWEagent2024] arXiv 2407.01489
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretrainingarXiv 2407.01489· report.headline_claimsep-20260214160829-csjmc

    With the same base model and post-training configuration, the SWE-bench gain from training data shape optimization is 1.8-2.5 times that of pure scaffolding optimization [Agentless2024][SWEagent2024].

contestedc-76e57fce62
Agentless [Agentless2024] reaches over 85% of complex agents' performance with a simple pipeline; when the base model lacks cross-file priors, even the most elaborate scaffold can only locate files by random trial and error, capping the repair rate below 35%.
Source papers · 2 [Agentless2024] arXiv 2407.01489
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretrainingarXiv 2407.01489· report.positions[0].counterep-20260214160829-csjmc

    [counter to Camp A: Inference-time scaffolding and test-time compute are] Agentless [Agentless2024] achieves more than 85% of the performance of complex agents with a simple pipeline. When the base model lacks cross-file priors, no matter h

contestedc-6d1b137da9
Experiments in SWERL [SWERL2025] show that when pretraining has seen no commit-shaped data, RL's exploration space expands by more than 10×, sample efficiency drops by 70%, and training fails to converge to a usable level.
Source papers · 2 [SWERL2025] arXiv 2502.18449
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretrainingarXiv 2502.18449· report.positions[1].counterep-20260214160829-csjmc

    [counter to Camp B: RL and verifiers are the true drivers] Experiments from SWERL [SWERL2025] show that when pretraining does not include commit-shaped data, the exploration space of RL expands by more than 10 times, sample efficiency drops

contestedc-48ded21559
Comparative experiments from StarCoder2 / The Stack v2 [StarCoder2TheStackV22024] show that at the same code-token scale, repo-level-packed models beat file-level-shuffled models by 11 pp on cross-file repair; pure code corpora cannot cover SWE-specific token shapes such as issues, diffs, and CI logs.
Source papers · 2 [StarCoder2TheStackV22024] arXiv 2402.19173
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretrainingarXiv 2402.19173· report.positions[2].counterep-20260214160829-csjmc

    [counter to Camp C: just mix more code (code is all you need)] Comparative experiments from StarCoder2TheStackV22024 [StarCoder2TheStackV22024] show that under the same code token scale, the cross-file repair rate of models trained with rep

contestedc-b15da7cb69
Strictly controlled head-to-head ablations are still missing, so the individual gain of each data component cannot be quantified; for some components (e.g., CI logs) there is still no public validation of a pretraining benefit.
1 observation · Swe Agent Pretraining
Evidence (1)
  • topic_reportswe-agent-pretraining· report.positions[3].counterep-20260214160829-csjmc

    [counter to Camp D: data shape first (repo/patch/process/execution first] There is currently a lack of strictly controlled head-to-head ablations, making it impossible to quantify the individual gain of each data component. The pretraining

consensusc-0723e51131
Treating "bigger vocab = better compression = better model" as the default is misleading: compression correlates with downstream quality but is not sufficient [Goldman2024UnpackingTokenization], and there is room for non-compression mechanisms [Schmidt2024MoreThanCompression].
Source papers · 1 [Schmidt2024MoreThanCompression]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.headline_claimsgpt-5.2

    Treating “bigger vocab = better compression = better model” as default is misleading: compression correlates with quality but is insufficient [Goldman2024UnpackingTokenization], leaving room for non-compression mechanisms [Schmidt2024MoreTh

contestedc-ffa3e3206d
Two lines of evidence push "monotonically better" toward "structure first, scale second." First, compression correlates with quality but is insufficient [Goldman2024UnpackingTokenization], leaving room for non-compression mechanisms and counterexamples [Schmidt2024MoreThanCompression]. Second, local merges can open 10–20 pp reasoning gaps: bad digit/date merges are independent of vocab size and, if anything, are more easily introduced while growing the vocabulary [Singh2024TokenizationCounts][Bhatia2025
Source papers · 2 [Schmidt2024MoreThanCompression][Singh2024TokenizationCounts]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[1].countergpt-5.2

    [counter to Camp B: bigger vocab is near-monotonic; default to 256K+] Two evidence lines push “monotonic” toward “structure first, size second”. (1) Compression correlates with quality but is insufficient [Goldman2024UnpackingTokenization],

contestedc-6c192fc99f
The key bar for tokenizer-free is not "can it be trained" but "is it a better deal at matched FLOPs and matched latency budgets." BLT fills in part of that evidence [Pagnoni2024BLT], but the industrial mainline still needs a side-by-side systems ledger against BPE (KV cache, throughput, the replacement cost of tokenize latency) [PagedAttention2023][FastWordPiece2020]. Meanwhile, tokenized LMs are also exposing brittleness on non-canonical representations [Ayoobi2026SayAny
Source papers · 4 [Pagnoni2024BLT][PagedAttention2023][FastWordPiece2020] arXiv 2412.09871
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scalingarXiv 2412.09871· report.positions[2].countergpt-5.2

    [counter to Camp C: tokenizer-free is the endgame; abandon BPE ASAP] The key bar for tokenizer-free is not “trainability” but “cost-effectiveness under matched FLOPs and latency budgets”. BLT fills part of the evidence gap [Pagnoni2024BLT],

contestedc-8467560955
Controlled evidence that directly pins down "shrink the vocabulary → more stable alignment" is still thin; most of it is mechanistic inference plus systems-level plausibility. An actionable middle version: first scan for and repair the under-trained tail with Magikarp [LandBartolo2024Magikarp], then evaluate pruning's net effect on RL stability and downstream quality; meanwhile, use string-aligned metrics so that token-denominator differences are not mistaken for alignment gains [Vieira2024Characters].
Source papers · 2 [LandBartolo2024Magikarp][Vieira2024Characters]
1 observation · Tokenizer Scaling
Evidence (1)
  • topic_reporttokenizer-scaling· report.positions[3].countergpt-5.2

    [counter to Camp E: shrink/prune the vocabulary to buy alignment and RL ] Controlled evidence that directly pins down “smaller vocab → more stable alignment” is still limited; much of it is mechanistic reasoning and systems plausibility. An
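
A minimal sketch of the "scan the under-trained tail" step, assuming a HuggingFace-style checkpoint; the embedding-norm cutoff is a simplified stand-in heuristic (Magikarp's actual indicators are richer), and "gpt2" is only a placeholder model name.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

name = "gpt2"  # placeholder checkpoint, not a recommendation
tok = AutoTokenizer.from_pretrained(name)
model = AutoModelForCausalLM.from_pretrained(name)

with torch.no_grad():
    emb = model.get_input_embeddings().weight      # (vocab, dim)
    norms = emb.norm(dim=-1)

cutoff = norms.quantile(0.01)                      # assumed 1% tail
suspects = (norms < cutoff).nonzero().flatten().tolist()
print([tok.convert_ids_to_tokens(i) for i in suspects[:20]])
```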

consensusc-79860b8797
On compositional long-context tasks, many models with a nominal 128K window stall at roughly 32K effective context; RULER's conclusion does not contradict high NIAH hit rates, because the two measure different abilities: "finding" versus "composing and using" [Hsieh2024RULER].
Source papers · 2 [Hsieh2024RULER] arXiv 2404.06654
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrainarXiv 2404.06654· report.headline_claimsgpt-5.4

    On compositional long-context tasks, many nominal-128K models plateau at roughly 32K effective context; this does not contradict high NIAH hit rates because RULER measures “can compose and use,” not just “can find” [Hsieh2024RULER].

consensusc-474ece4cac
The 20+ pp U-shape in Lost in the Middle shows that effective context is not a monotonic function of window length; when the evidence sits in the middle of the sequence, attention allocation and training-distribution bias jointly amplify the degradation [Liu2023LostInMiddle].
Source papers · 1 [Liu2023LostInMiddle]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.headline_claimsgpt-5.4

    The 20+ pp U-shaped drop in Lost in the Middle shows that effective context is not a monotonic function of window size; when evidence sits in the middle, attention allocation and training-distribution bias jointly amplify degradation [Liu20

consensusc-cc7fe8614b
Related-document packing is not a pure throughput optimization; by raising the density of cross-span repetition, alignment, and reference events, it provides weak supervision for ICL and long-context use, consistent with the mechanistic accounts of Chan et al. [Chan2022DataDist] and Olsson et al. [Olsson2022InductionHeads].
Source papers · 2 [Chan2022DataDist][Olsson2022InductionHeads]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.headline_claimsgpt-5.4

    Related-document packing is not merely a throughput optimization; by increasing the density of cross-span repetition, alignment, and reference events, it provides weak supervision for ICL and long-context use, consistent with the mechanisms
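
A minimal sketch of the packing step itself, assuming documents arrive already clustered (by link graph, embedding similarity, or source); the clustering method, the 4096-token budget, and the separator id are illustrative choices.

```python
from typing import Iterable, Iterator

def pack_related(clusters: Iterable[list[list[int]]],
                 budget: int = 4096, sep_id: int = 0) -> Iterator[list[int]]:
    """Fill each context window with documents from one cluster, so
    cross-span repetition and reference events land inside a single
    attention window instead of being scattered across sequences."""
    for docs in clusters:
        seq: list[int] = []
        for doc in docs:
            if seq and len(seq) + 1 + len(doc) > budget:
                yield seq
                seq = []
            if seq:
                seq.append(sep_id)          # document separator
            seq.extend(doc[:budget])        # truncate oversized docs
        if seq:
            yield seq
```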

consensusc-4ba23b50d8
Retrieval, external memory, and architecture switches are often more compute-efficient on retrieval-heavy tasks, but public evidence is still insufficient to show they have displaced strong long-context Transformers on effective-context evaluations such as RULER, NoCha, and BABILong [Xu2023RetrievalMeetsLongContext][Karpinska2024NoCha][Kuratov2024BABILong].
Source papers · 2 [Karpinska2024NoCha][Kuratov2024BABILong]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.headline_claimsgpt-5.4

    Retrieval, external memory, and architecture switches are often more compute-efficient on retrieval-heavy tasks, but public evidence is still insufficient to show that they have replaced strong Transformer long-context models on effective-c

contestedc-2d04e8df19
The key rebuttal to c-2218c6a6ff / c-0e06feed14 is not that PE is useless, but that PE mainly solves "can run long." Fu et al. [Fu2024DataEngineering] and Xiong et al. [Xiong2023EffectiveLongCtx] show on the same base model that long-document upsampling with a preserved domain mixture lifts long-task performance more directly; Hsieh et al. [Hsieh2024RULER] further show that PE-only style nominal 128K
Source papers · 3 [Fu2024DataEngineering][Xiong2023EffectiveLongCtx][Hsieh2024RULER]
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[0].countergpt-5.4

    [counter to Camp A: PE / extrapolation is enough; long context is mainly] The rebuttal to c-2218c6a6ff / c-0e06feed14 is not that PE is useless, but that PE mainly solves “can run long.” Fu et al. [Fu2024DataEngineering] and Xiong et al. [X

contestedc-e8bc05d08a
What needs correcting in c-acb79e4a69 is that data is not the whole story. Shi et al. [Shi2023InContextPretraining] and Staniszewski et al. [Staniszewski2023StructuredPacking] show that the same data pool yields very different gains under different packing structures; this supports c-51ca2c6ff5. Another limitation: the public literature still lacks more same-base, same-token-budget comparisons of PE-only vs
Source papers · 3 [Shi2023InContextPretraining][Staniszewski2023StructuredPacking] arXiv 2309.16039
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrainarXiv 2309.16039· report.positions[1].countergpt-5.4

    [counter to Camp B: the data recipe is the main variable; long-document ] What needs correction in c-acb79e4a69 is that data is not the whole story. Shi et al. [Shi2023InContextPretraining] and Staniszewski et al. [Staniszewski2023Structure

contestedc-eddb63c6e6
c-b51a8309a9 needs a restrained correction: the mechanism is plausible, but direct causal evidence is still limited. The crawler's open question states this explicitly: public pretraining ablations of packing / related-document concatenation remain scarce, and whether packing directly induces induction-head growth still lacks a controlled experiment.
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[2].countergpt-5.4

    [counter to Camp C: packing / concatenation is underestimated; sequence ] c-b51a8309a9 needs a careful correction: the mechanism is plausible, but direct causal evidence is still limited. The crawler’s open question states this explicitly:

contestedc-525a259180
The key rebuttal to c-083d546514 / c-f5078308ed is the evaluation lens. Public evidence is stronger on maximum length, complexity, or retrieval-heavy tasks than on effective-context benchmarks like RULER, NoCha, and BABILong. The gap that claim c-5b64164d84 points to still stands: whether same-size SSM / hybrid models match strong Transformers on hard compositional long context has no public head-to-head yet
1 observation · Context Scaling Pretrain
Evidence (1)
  • topic_reportcontext-scaling-pretrain· report.positions[3].countergpt-5.4

    [counter to Camp D: changing the architecture or system boundary is more] The key rebuttal to c-083d546514 / c-f5078308ed is the evaluation lens. Public evidence is stronger on maximum length, complexity, or retrieval-heavy tasks than on ef

consensusc-189faa2b71
Writing out-of-doc evidence Z explicitly back into the training context C, with a 100% loss mask on C (evidence contributes no loss), attacks the prior-filling of p(y|S) more directly than merely lengthening the window; otherwise the model treats the "evidence text" as a target y to memorize, and repeated-data effects amplify templating rather than transferability [HernandezRepeatedData2022][Rho12024].
Source papers · 2 [HernandezRepeatedData2022][Rho12024]
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_reportagent-context-scaling-hyperdoc· report.headline_claimsgpt-5.2

    Writing out-of-doc Z back into training context C and masking C from loss (100% evidence masked) targets prior-filling more directly than merely increasing window length; otherwise the model treats evidence text as target y, and repeated-da
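
A minimal sketch of the 100% evidence mask, assuming a causal-LM batch where a boolean `evidence_mask` marks the positions of the written-back context C; the field names are illustrative. Masked positions take ignore_index=-100, so gradient flows only through the target y.

```python
import torch
import torch.nn.functional as F

def masked_lm_loss(logits: torch.Tensor,          # (B, T, V)
                   input_ids: torch.Tensor,       # (B, T)
                   evidence_mask: torch.Tensor):  # (B, T) bool, True on C
    labels = input_ids.clone()
    labels[evidence_mask] = -100               # evidence contributes no loss
    return F.cross_entropy(
        logits[:, :-1].flatten(0, 1),          # predict token t+1 from t
        labels[:, 1:].flatten(),
        ignore_index=-100,
    )
```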

contestedc-c7350674f2
This route explains "average metrics improve," but under-explains "the window is big enough, yet the model still confabulates": long-form factual errors accumulate with length [LongFormFactuality2024], and closed-loop agents need environment feedback to converge [RepairAgent2024]. Filtering/dedup alone cannot turn the missing Z into a conditioning variable visible at training time; prior-filling will still take over the output wherever information runs short.
Source papers · 2 [LongFormFactuality2024][RepairAgent2024]
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_reportagent-context-scaling-hyperdoc· report.positions[0].countergpt-5.2

    [counter to Camp A: keep Classic NTP + scale; hallucination is mostly so] This explains average metric gains but under-explains “window is enough yet hallucination persists”: long-form factual errors accumulate with length [LongFormFactuali

contestedc-acf21cb134
The main risks are cost and unclear controls: training-time retrieval adds systems complexity and lacks a direct same-budget comparison against inference-time RAG; and if the mask is not strict, the model treats the evidence as a target y to memorize, with repeated-data effects amplifying templating [HernandezRepeatedData2022].
Source papers · 1 [HernandezRepeatedData2022]
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_reportagent-context-scaling-hyperdoc· report.positions[1].countergpt-5.2

    [counter to Camp B: HDP / retrieval-aware pretraining—make retrieval and] Main risks are cost and unclear controls: training-time retrieval adds system complexity and lacks apples-to-apples comparisons against inference-time RAG under match

contestedc-61643f9ef7
The negative-transfer risks of rewriting/synthesis are subtler: distribution drift, templating, and rewrites that "look cleaner but drop key details." Moreover, when tasks require citable evidence (citations, dependency files, tool outputs), purely generative back-derivation cannot substitute for real retrieval and environment feedback [Toolformer2023].
Source papers · 1 [Toolformer2023]
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_reportagent-context-scaling-hyperdoc· report.positions[2].countergpt-5.2

    [counter to Camp C: Method-2 rewriting / reverse prompt-plan—structured ] Negative transfer risks are subtle: distribution shift, templating, and “cleaner-looking but detail-losing” rewrites. Also, when tasks require citable evidence (citat

contestedc-73d0dcaffd
Two negative results should be treated as default constraints: imitating a stronger model's outputs does not equal transferred capability [FalsePromise2023]; CoT does not necessarily transfer reasoning ability in distillation settings [CoTDistillEffectiveness2025]. Without writing verifiable Z (evidence / tool feedback) into the context and masking it, trajectories read more like self-consistent narratives than executable plans.
Source papers · 3 [FalsePromise2023][CoTDistillEffectiveness2025] arXiv 2305.15717
1 observation · Agent Context Scaling Hyperdoc
Evidence (1)
  • topic_reportagent-context-scaling-hyperdocarXiv 2305.15717· report.positions[3].countergpt-5.2

    [counter to Camp D: trajectory distillation / self-reflection first—CoT ] Two negative results should be treated as default constraints: imitating stronger-model outputs does not imply capability transfer [FalsePromise2023]; CoT may not tra

contestedc-49be07a466
This camp's experiments all rest on unstructured concatenated long context, with no structured augmentation. Stack v2 [2402.19173] shows a structured, 4D-augmented 32k-context model beating an unstructured 128k-context model by 12.7 pp on code completion at only 1/4 the compute.
1 observation · Context Scaling 4d
Evidence (1)
  • topic_reportcontext-scaling-4darXiv 2402.19173· report.positions[0].counterep-20260214160829-csjmc

    [counter to Camp A: Long context only requires engineering length scalin] Experiments from this camp are all based on unstructured stitched long context, without structured augmentation. Stack v2 [2402.19173] shows that a 32k context model

contestedc-6899f4c57f
This camp's methods are all single-cell implementations within the 2×3 matrix and cover only specific task scenarios. The 4D framework can integrate all existing methods and transfers across domains; ProX [2409.17115] validates the generality of the IVI×semantic approach across code, web, and scientific literature, with cross-domain deployment ROI improving by more than 2×.
1 observation · Context Scaling 4d
Evidence (1)
  • topic_reportcontext-scaling-4d· report.positions[1].counterep-20260214160829-csjmc

    [counter to Camp B: Hyper-Doc pretraining is a collection of scattered m] Methods from this camp are all single-cell implementations in the 2×3 matrix, only covering specific task scenarios. The 4D framework can integrate all existing metho

contestedc-744d0fc105
Inference-time RAG suffers from heavy retrieval overhead, loss of middle information, and context-length limits. Stack v2 [2402.19173] shows a training-time 4D-augmented code model beating a RAG-augmented one by 11.3 pp on completion, with 3× faster inference and no retrieval index to maintain.
1 observation · Context Scaling 4d
Evidence (1)
  • topic_reportcontext-scaling-4darXiv 2402.19173· report.positions[2].counterep-20260214160829-csjmc

    [counter to Camp C: Inference-time RAG can completely replace training-t] Inference-time RAG has problems such as high retrieval overhead, middle information loss, and context length limits. Stack v2 [2402.19173] shows that a code model wit

consensusc-fd4d0c1dc7
After swapping discrete metrics like exact match for continuous surrogates, many of the "jumps" in Wei et al. [Wei2022Emergent] collapse into monotonic curves; only a bend that survives the swap is worth treating as a genuine threshold [Schaeffer2023Mirage].
Source papers · 2 [Wei2022Emergent][Schaeffer2023Mirage]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.headline_claimsgpt-5.4

    After replacing exact-match-style discrete metrics with continuous surrogates, many “jumps” from Wei et al. [Wei2022Emergent] collapse into monotonic curves; only bends that survive this swap should be treated as candidate real thresholds [
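
A minimal sketch of the swap itself on synthetic data: the underlying per-example log-likelihood drifts up smoothly with scale, yet thresholding it into exact match manufactures an apparent jump. The trend shape, noise level, and threshold are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
scale = np.logspace(0, 4, 20)                          # pseudo model-size axis
loglik = -8 / scale**0.3                               # smooth underlying trend
loglik = loglik + rng.normal(0, 0.05, size=(100, 20))  # 100 examples per scale

continuous = loglik.mean(axis=0)                       # surrogate: mean log-lik
exact_match = (loglik > -1.0).mean(axis=0)             # discrete: thresholded

# `continuous` rises smoothly; `exact_match` hugs 0 and then "emerges".
for s, c, e in zip(scale, continuous, exact_match):
    print(f"scale={s:9.1f}  loglik={c:7.3f}  EM={e:4.2f}")
```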

consensusc-68c0dd7fbe
For extrapolation to downstream tasks, pretraining loss is closer to a transferable state variable than compute; it aligns trajectories more consistently across architectures, token budgets, and dense/sparse settings [Du2024LossPerspective].
Source papers · 1 [Du2024LossPerspective]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.headline_claimsgpt-5.4

    For downstream extrapolation, pretraining loss is a more transferable state variable than compute; it aligns trajectories more consistently across architectures, token budgets, and dense/sparse settings [Du2024LossPerspective].

consensusc-744c44b8d9
If the decision at hand is a first-pass data-mix screen, observational scaling is often already cheap enough; if it is an architecture or mid-train recipe, training an in-house model ladder must come first, otherwise extrapolation error is dominated by recipe mismatch [Ruan2024Observational][Bhagia2024ModelLadders].
Source papers · 2 [Ruan2024Observational][Bhagia2024ModelLadders]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.headline_claimsgpt-5.4

    If the decision is data-mix screening, observational scaling is often cheap enough; if the decision is architecture or mid-train recipe, an in-house model ladder should take priority, otherwise extrapolation error is dominated by recipe mis

consensusc-f8f2b5946f
A single power law is not the default truth but the default approximation; when a task curve shows multiple phases, slope reversals, or clear threshold regions, a broken law is the more honest fit and the better basis for ship/no-ship risk bands [Caballero2023Broken].
Source papers · 1 [Caballero2023Broken]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.headline_claimsgpt-5.4

    A single power law is not the default truth but the default approximation; when task curves are multi-phase, slope-reversing, or thresholded, a broken law is the more honest fit and the better basis for ship/no-ship risk bands [Caballero202
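
A minimal sketch of fitting a one-break smoothly broken power law, a simplified single-break member of the BNSL family in [Caballero2023Broken]; the synthetic data, noise, and initial guesses are illustrative only.

```python
import numpy as np
from scipy.optimize import curve_fit

def broken_power(x, a, b, c0, c1, d, f):
    """One smooth break at scale d with sharpness f; the decay exponent
    goes from c0 before the break to c0 + c1 after it."""
    return a + b * x**(-c0) * (1 + (x / d)**(1 / f))**(-c1 * f)

x = np.logspace(0, 4, 40)
y = broken_power(x, 0.1, 2.0, 0.2, 0.5, 300.0, 0.3)
y = y * np.exp(np.random.default_rng(1).normal(0, 0.02, x.size))

p0 = [0.1, 1.0, 0.1, 0.1, 100.0, 0.5]               # rough initial guess
popt, _ = curve_fit(broken_power, x, y, p0=p0, maxfev=20000)
print(dict(zip(["a", "b", "c0", "c1", "d", "f"], popt.round(3))))
```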

consensusc-5a49c85222
Pretraining curves cannot be equated with final product curves; once SFT, preference optimization, or distillation enters, the loss→task-score mapping can reorder by task family, so final acceptance needs at least one separately calibrated transfer layer [Isik2024DownstreamScaling][OLMo2024].
Source papers · 2 [Isik2024DownstreamScaling][OLMo2024]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.headline_claimsgpt-5.4

    Pretraining curves are not the same as product curves; once SFT, preference optimization, or distillation enters, the loss-to-task mapping can reorder by task family, so final acceptance needs at least one extra transfer calibration layer [

contestedc-7720bdf54f
The rebuttal has two layers. First, Schaeffer et al. [Schaeffer2023Mirage] show that many jumps come from discrete metrics. Second, Du et al. [Du2024LossPerspective] show that the compute axis creates misalignment. In other words, the "sudden emergence" in the original plots often mixes metric nonlinearity with x-axis selection error.
Source papers · 2 [Schaeffer2023Mirage][Du2024LossPerspective]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.positions[0].countergpt-5.4

    [counter to Camp A: Downstream abilities are fundamentally threshold-eme] The rebuttal has two layers. First, Schaeffer et al. [Schaeffer2023Mirage] show that many jumps come from discrete metrics. Second, Du et al. [Du2024LossPerspective]

contestedc-79bd24f2b5
Du et al. [Du2024LossPerspective] correct this view: compute is closer to resource accounting than to the learning state itself. When architecture, token budget, or sparsity changes, the same compute can map to different losses, so the compute axis suits coarse budgeting rather than cross-recipe alignment.
Source papers · 1 [Du2024LossPerspective]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.positions[1].countergpt-5.4

    [counter to Camp B: The compute axis is already sufficient; there is no ] Du et al. [Du2024LossPerspective] revise this view: compute is closer to resource accounting than to learning state. Under architecture, token-budget, or sparsity cha

contestedc-c973674085
The point of Bhagia et al. [Bhagia2024ModelLadders] is not that observational methods are useless, but that they are more fragile under recipe shift. If the public model pool lacks samples close to your target architecture, data cleaning, and mid-train strategy, regression error becomes dominated by systematic bias.
Source papers · 1 [Bhagia2024ModelLadders]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.positions[2].countergpt-5.4

    [counter to Camp C: Observational scaling over public models is sufficie] Bhagia et al. [Bhagia2024ModelLadders] do not argue that observational methods are useless, but that they are more fragile under recipe shift. If the public pool lack

contestedc-2ec7791deb
The correction from Caballero et al. [Caballero2023Broken]: when slope changes come from real mechanism switches, the bias of a single power law is not random noise but systematic misdirection. Pythia's dense checkpoints [Biderman2023Pythia] likewise show that multi-phase behavior needs explicit modeling, not averaging away.
Source papers · 2 [Caballero2023Broken][Biderman2023Pythia]
1 observation · Scaling Laws Downstream Tasks
Evidence (1)
  • topic_reportscaling-laws-downstream-tasks· report.positions[3].countergpt-5.4

    [counter to Camp D: A single power law covers most tasks; broken laws ar] The correction from Caballero et al. [Caballero2023Broken] is that when slope changes come from real mechanism shifts, the error of a single power law is not random n

consensusc-2b8e5b1ed6
The core gain from V2 to V3 comes not from MLA or MoE alone but from the coupled package "MLA + fine-grained MoE + shared experts + a more aggressive training stack"; replicating any single component in isolation usually fails to recover the same cost curve [DeepSeekAI2024V2][DeepSeekAI2024V3][Dai2024DeepSeekMoE].
Source papers · 3 [DeepSeekAI2024V2][DeepSeekAI2024V3][Dai2024DeepSeekMoE]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.headline_claimsgpt-5.4

    The main gain from V2 to V3 does not come from MLA or MoE alone, but from the coupled package of MLA, fine-grained MoE, shared experts, and a more aggressive training stack; reproducing any one component in isolation usually does not recove

consensusc-0edca3de7a
MLA is not an unconditional replacement for GQA: it pays off most in long-context regimes where decode is KV-bound; at larger batch sizes and shorter contexts, the extra projections and the latent path eat part of the gain [DeepSeekAI2024V2][Mistral7B2023].
Source papers · 2 [DeepSeekAI2024V2][Mistral7B2023]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.headline_claimsgpt-5.4

    MLA is not an unconditional replacement for GQA: it pays off most in long-context, KV-limited decode regimes; at larger batch and shorter context, the extra projections and latent path eat into the gain [DeepSeekAI2024V2][Mistral7B2023].

consensusc-f4d8d73f87
DeepSeekMoE's steady-state default is closer to "finer-grained experts plus 1–2 shared experts" than to blindly raising the shared ratio; too much shared capacity degenerates toward dense behavior, while too little invites training instability and generalization regressions [Dai2024DeepSeekMoE][DeepSeekAI2024V2][DeepSeekAI2024V3].
Source papers · 3 [Dai2024DeepSeekMoE][DeepSeekAI2024V2][DeepSeekAI2024V3]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.headline_claimsgpt-5.4

    The stable default in DeepSeekMoE is closer to “finer experts plus 1–2 shared experts” than to increasing the shared ratio indiscriminately; too many shared experts collapse toward dense behavior, while too few make training and generalizat

consensusc-f33ee880eb
Aux-loss-free balancing is not actually free: it reduces direct perturbation of the language-modeling loss, but shifts part of the complexity into bias updates, a sequence-level backstop term, and harder-to-reproduce training dynamics [DeepSeekAI2024V3].
Source papers · 1 [DeepSeekAI2024V3]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.headline_claimsgpt-5.4

    Aux-loss-free balancing is not truly free: it reduces direct interference with the language-model loss, but moves part of the complexity into bias updates, a sequence-level backstop term, and harder-to-reproduce training dynamics [DeepSeekA
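
A minimal sketch of the bias-update mechanism in the spirit of [DeepSeekAI2024V3]: a per-expert bias steers top-k selection but never enters the gating weights, so the LM loss itself is untouched. Shapes, nonnegative affinity scores, and the step size gamma are illustrative assumptions.

```python
import torch

def route(scores: torch.Tensor,   # (tokens, experts), nonnegative affinities
          bias: torch.Tensor,     # (experts,) persistent buffer
          k: int = 2, gamma: float = 1e-3):
    topk = torch.topk(scores + bias, k, dim=-1).indices   # bias only here
    gate = torch.zeros_like(scores).scatter(
        -1, topk, scores.gather(-1, topk))                # weights use raw
    gate = gate / gate.sum(-1, keepdim=True).clamp_min(1e-9)  # scores only
    load = torch.zeros(scores.size(1)).scatter_add_(
        0, topk.flatten(), torch.ones(topk.numel()))      # tokens per expert
    bias += gamma * torch.sign(load.mean() - load)        # cool overloaded,
    return gate, bias                                     # warm underloaded
```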

consensusc-29e4c5e146
R1's novelty is not the slogan "RL beats SFT" but an executable RL-first pipeline: DeepSeekMath's GRPO removes the critic, R1-Zero shows pure RL can bootstrap, and R1 then uses cold-start data plus multi-stage RL to repair readability and stability [DeepSeekMath2024][DeepSeekR12025].
Source papers · 2 [DeepSeekMath2024][DeepSeekR12025]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.headline_claimsgpt-5.4

    The novelty of R1 is not the slogan that “RL beats SFT,” but an executable RL-first pipeline: DeepSeekMath’s GRPO removes the critic, R1-Zero shows pure RL can start, and R1 then uses cold-start data and multi-stage RL to repair readability
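
A minimal sketch of the critic-free step named here: GRPO samples a group of completions per prompt and normalizes rewards within the group, so the group statistics replace a learned value baseline. Group size and rewards below are illustrative.

```python
import numpy as np

def grpo_advantages(rewards: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """rewards: (groups, group_size) scalar reward per completion.
    A_i = (r_i - mean(group)) / std(group), used in place of a critic."""
    mu = rewards.mean(axis=1, keepdims=True)
    sigma = rewards.std(axis=1, keepdims=True)
    return (rewards - mu) / (sigma + eps)

# Two prompts, four completions each, binary verifier rewards.
print(grpo_advantages(np.array([[1.0, 0.0, 0.0, 1.0],
                                [0.0, 0.0, 0.0, 1.0]])))
```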

contestedc-5a851a92f0
The opposing side points to the GQA route: Mistral 7B [Mistral7B2023] shows that simpler KV sharing already suffices in many short-context, high-batch scenarios, and MLA's extra projections do not always buy back an equivalent gain.
Source papers · 1 [Mistral7B2023]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.positions[0].countergpt-5.4

    [counter to Camp A: MLA will become a general replacement for GQA] The counterpoint uses the GQA route: Mistral 7B [Mistral7B2023] shows that simpler KV sharing is already sufficient in many short-context, high-batch regimes, and MLA’s extr

contestedc-3181e7fd12
The opposing side makes two points: Mixtral [Mixtral2024] delivers strong results with coarser-grained experts, and Lu et al. [NotAllExpertsEqual2024] further show that expert utilization is highly uneven, so a many-expert design does not automatically mean higher effective capacity. This rebuts the implicit premise that "more experts means more stability."
Source papers · 2 [Mixtral2024][NotAllExpertsEqual2024]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.positions[1].countergpt-5.4

    [counter to Camp B: many-expert plus shared experts is the stable endgam] The counterpoint has two parts: Mixtral [Mixtral2024] achieves strong results with coarser experts, and Lu et al. [NotAllExpertsEqual2024] further show that expert ut

contestedc-1ea677abf4
The opposition draws on three kinds of evidence: RefinedWeb [RefinedWeb2023] argues that high-quality web-only data can match curated corpora; Dolma [Dolma2024] argues that a transparent data ledger is itself the key variable; DsDm [DsDm2024] argues that model-aware selection can beat hand-built quality heuristics. Together they correct the default assumption that a hand-curated mixture is necessarily better.
Source papers · 3 [RefinedWeb2023][Dolma2024][DsDm2024]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.positions[2].countergpt-5.4

    [counter to Camp C: data quality mainly comes from curated mixtures, not] The counterpoint comes from three directions: RefinedWeb [RefinedWeb2023] argues that high-quality web-only data can match curated corpora; Dolma [Dolma2024] argues t

contestedc-eb796137b6
The opposing side notes that process supervision and conventional alignment have not stopped working. Lightman et al. [LetsVerify2023], Math-Shepherd [MathShepherd2023], and Bai et al. [RLHF2022] show that step-level supervision, reward modeling, and RLHF still provide more stable, more controllable behavioral constraints, especially when rewards are sparse. This rebuts the over-extrapolation that "pure RL is already enough."
Source papers · 3 [LetsVerify2023][MathShepherd2023][RLHF2022]
1 observation · Deepseek Engineering Evolution
Evidence (1)
  • topic_reportdeepseek-engineering-evolution· report.positions[3].countergpt-5.4

    [counter to Camp D: the main path for reasoning has shifted from SFT/RLH] The counterpoint is that process supervision and conventional alignment have not stopped working. Lightman et al. [LetsVerify2023], Math-Shepherd [MathShepherd2023],

consensustrainingc-43887bce13
Context-extension recipes often raise the nominal window without settling whether the model can effectively use information at arbitrary positions, reinforcing the nominal-vs-effective distinction.
1 observation · Window Extension · Effective Context · Fine Tuning
Evidence (1)
consensusc-93d06e8c6e
Above 32K, single-needle NIAH is near saturation for a batch of frontier models; RULER's multi-key, variable-tracking, and aggregation subtasks still open 20–30+ point gaps, so "can find one needle" is not enough to stand in for effective long-context ability [RULER2024][LongBench2023][LongBenchV22024].
Source papers · 3 [RULER2024][LongBench2023][LongBenchV22024]
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_reportlong-context-capacity-and-decay· report.headline_claimsgpt-5.4

    Beyond 32K, single-needle NIAH is near saturation for a range of frontier models; RULER’s multi-key, variable-tracking, and aggregation subtasks still open 20–30+ point gaps, so “can find one needle” is not a sufficient proxy for effective

consensusc-3b56a0394b
Only a small subset of retrieval heads carries the main long-range retrieval duty; on LLaMA-2, Mistral, and Yi, masking the top-5% retrieval heads drives NIAH-style performance to near random, showing that long-context factuality relies on sparse specialization rather than all heads sharing the load evenly [Wu2024RetrievalHead].
Source papers · 1 [Wu2024RetrievalHead]
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_reportlong-context-capacity-and-decay· report.headline_claimsgpt-5.4

    A small subset of retrieval heads carries most long-range retrieval; on LLaMA-2, Mistral, and Yi, masking the top 5% retrieval heads drives NIAH-style performance close to random, showing that long-context factuality depends on sparse speci
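
A minimal sketch of the masking intervention behind this claim, assuming a LLaMA-style module layout whose attention blocks expose an `o_proj`; the module path, head ids, and hook mechanics are illustrative, and published retrieval-head pipelines differ in detail.

```python
import torch

def mask_heads(model, layer_head_pairs, num_heads: int, head_dim: int):
    """Zero selected heads' contribution by intercepting the input of
    each attention block's output projection."""
    handles = []
    for layer_idx, head_idx in layer_head_pairs:
        attn = model.model.layers[layer_idx].self_attn  # LLaMA-style path

        def hook(module, args, h=head_idx):
            hidden = args[0].clone()                    # (B, T, H*D)
            hidden.view(*hidden.shape[:2], num_heads, head_dim)[..., h, :] = 0
            return (hidden,) + args[1:]

        handles.append(attn.o_proj.register_forward_pre_hook(hook))
    return handles  # call h.remove() on each handle to undo
```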

contestedc-4bcd78258e
The counter comes from [RULER2024], [LongBenchV22024], [BABILong2024], and [Goldman2024RetrievalOnly]: single-needle retrieval covers only one path, while tracking, aggregation, and reasoning keep diverging after NIAH saturates. The suggested correction: demote NIAH to an entry test, not the headline metric.
Source papers · 4 [RULER2024][LongBenchV22024][BABILong2024][Goldman2024RetrievalOnly]
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_reportlong-context-capacity-and-decay· report.positions[0].countergpt-5.4

    [counter to Camp A: NIAH can still serve as the primary long-context met] The counter comes from [RULER2024], [LongBenchV22024], [BABILong2024], and [Goldman2024RetrievalOnly]: single-needle retrieval covers only one path, while tracking, a

contestedc-6572db0a93
The counter comes from [LongBenchV22024], [BABILong2024], [LooGLE2023], and [ZeroSCROLLS2023]: once a task requires cross-span aggregation, comparison, timeline tracking, or multi-hop reasoning, retrieval can only shrink the search space; it cannot replace compositional computation inside a single context.
Source papers · 4 [LongBenchV22024][BABILong2024][LooGLE2023][ZeroSCROLLS2023]
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_reportlong-context-capacity-and-decay· report.positions[1].countergpt-5.4

    [counter to Camp B: Long context is mostly a retrieval problem, and RAG ] The counter comes from [LongBenchV22024], [BABILong2024], [LooGLE2023], and [ZeroSCROLLS2023]: once the task requires cross-span aggregation, comparison, timeline tra

contestedc-bcee19cba6
The counter comes from [LostMiddle2023], [StreamingLLM2023], [SinkEmergence2024], and [DataEngineering128K2024]: the U-shape mixes at least positional extrapolation limits, the training length distribution, and sink-induced priority bias. Changing only the PE often postpones the cliff without removing the mid-sequence bias.
Source papers · 4 [LostMiddle2023][StreamingLLM2023][SinkEmergence2024][DataEngineering128K2024]
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_reportlong-context-capacity-and-decay· report.positions[2].countergpt-5.4

    [counter to Camp C: Lost-in-the-middle is mainly a positional-encoding p] The counter comes from [LostMiddle2023], [StreamingLLM2023], [SinkEmergence2024], and [DataEngineering128K2024]: U-shape mixes at least positional extrapolation limit

contestedc-5eacef3d73
The counter comes from [Wu2024RetrievalHead] and [Olsson2022InductionHeads]: once masking a handful of retrieval heads sends NIAH back to near random, and induction heads can explain the ICL phase change, "evenly shared" becomes hard to sustain. The more realistic picture: the overall representation provides the substrate, while a few heads decide whether the critical path exists.
Source papers · 2 [Wu2024RetrievalHead][Olsson2022InductionHeads]
1 observation · Long Context Capacity and Decay
Evidence (1)
  • topic_reportlong-context-capacity-and-decay· report.positions[3].countergpt-5.4

    [counter to Camp D: Long-context capability is distributed across the wh] The counter comes from [Wu2024RetrievalHead] and [Olsson2022InductionHeads]: once masking a small set of retrieval heads can drive NIAH close to random, and induction

contestedc-b50fc2d2b4
What Programming Every Example [ProgrammingEveryExample2024] rebuts is not "filtering is useless" but "filtering suffices for high-value long-tail samples." Against local structural damage, a global rule usually faces a binary choice: delete by mistake or let it through.
Source papers · 1 [ProgrammingEveryExample2024]
1 observation · Programming Every Example Liftin
Evidence (1)
  • topic_reportprogramming-every-example-lifting-pre-tr· report.positions[0].countergpt-5.4

    [counter to Camp A: Global filtering and deduplication are already stron] Programming Every Example [ProgrammingEveryExample2024] does not dispute that filtering works; it disputes that filtering is sufficient for high-value tail samples. G

contestedc-8d6a95a410
Programming Every Example [ProgrammingEveryExample2024] corrects cB: not every low-quality sample should be repaired; only samples that are "high-value but locally damaged" are worth repairing. Pruning and rewriting are not mutually exclusive; they divide labor by sample type.
Source papers · 1 [ProgrammingEveryExample2024]
1 observation · Programming Every Example Liftin
Evidence (1)
  • topic_reportprogramming-every-example-lifting-pre-tr· report.positions[1].countergpt-5.4

    [counter to Camp B: Low-value data should be pruned directly; repair is ] Programming Every Example [ProgrammingEveryExample2024] revises this claim: not all low-quality samples should be repaired; only those that are high-value but locally

contestedc-bf506f3eab
Programming Every Example [ProgrammingEveryExample2024] needs stronger constraints in the pretraining setting: "more like the preferred answer" in alignment is not the same as "closer to the true data distribution" in pretraining. Without external validation, a closed loop writes the model's own style back into the corpus.
Source papers · 1 [ProgrammingEveryExample2024]
1 observation · Programming Every Example Liftin
Evidence (1)
  • topic_reportprogramming-every-example-lifting-pre-tr· report.positions[2].countergpt-5.4

    [counter to Camp C: A model-generated loop can directly take over qualit] Programming Every Example [ProgrammingEveryExample2024] needs stronger constraints in pre-training: in alignment, “more like the preferred answer” is not the same as

contestedc-dff798c6ff
The pushback from Programming Every Example [ProgrammingEveryExample2024] concerns coverage: synthetic data works better for narrow skills and strongly formatted tasks, while open-domain knowledge and long-tail expression still need real corpora. Per-example rewriting preserves the skeleton of the real distribution.
Source papers · 1 [ProgrammingEveryExample2024]
1 observation · Programming Every Example Liftin
Evidence (1)
  • topic_reportprogramming-every-example-lifting-pre-tr· report.positions[3].countergpt-5.4

    [counter to Camp D: High-density synthetic data is a more direct path th] Programming Every Example [ProgrammingEveryExample2024] pushes back on coverage: synthetic data works better for narrow skills and strongly formatted tasks, but open-

consensusc-5384adfb7d
In the public evidence, depth-looping has advanced from a small-model phenomenon to 3.5B × 8T-token pretraining; the judgment that "shared recurrence only works in toy experiments" is directly rebutted by Huginn [Geiping2025Huginn].
Source papers · 1 [Geiping2025Huginn]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.headline_claimsgpt-5.4

    In public evidence, depth-looping has moved beyond toy models to 3.5B-scale, 8T-token pretraining; the claim that shared recurrence only works in toy settings is directly contradicted by Huginn [Geiping2025Huginn].

consensusc-5491a8df2b
At matched FLOPs, loop gains look closer to "1.1–1.3× compute for roughly 1× parameter-equivalence" than to "parameter sharing replaces dense scaling almost for free"; Huginn, MoR, and the depth-vs-width counter-literature jointly support this reading [Geiping2025Huginn][Bae2025MoR][Levine2020DepthWidth].
Source papers · 3 [Geiping2025Huginn][Bae2025MoR][Levine2020DepthWidth]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.headline_claimsgpt-5.4

    At matched FLOPs, loop gains look closer to “1.1–1.3× compute for roughly 1× parameter-equivalent” than to “parameter sharing nearly replaces dense scaling for free”; Huginn, MoR, and depth-vs-width counter-evidence jointly support this rea

consensusc-7516132c48
Looping's clearest selling point is at inference time: raising the loop count on the same checkpoint yields monotonic gains on reasoning benchmarks, making it a test-time compute knob parallel to, but not equivalent to, explicit CoT [Yang2025LoopedLatentThoughts][Wu2024InferenceScaling].
Source papers · 2 [Yang2025LoopedLatentThoughts][Wu2024InferenceScaling]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.headline_claimsgpt-5.4

    The clearest value proposition of looping is at inference time: increasing loop count on the same checkpoint yields monotonic gains on reasoning benchmarks, making it a test-time compute knob parallel to, but not equivalent to, explicit CoT
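
A minimal sketch of the knob itself, assuming a prelude / shared-core / coda split in the spirit of Huginn [Geiping2025Huginn]; the layer shapes and default loop count are illustrative. The same checkpoint spends more compute simply by applying the shared core more times.

```python
import torch.nn as nn

class LoopedLM(nn.Module):
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        block = lambda: nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True, norm_first=True)
        self.prelude, self.core, self.coda = block(), block(), block()

    def forward(self, x, loop_count: int = 4):
        x = self.prelude(x)
        for _ in range(loop_count):   # same weights, more test-time compute
            x = self.core(x)
        return self.coda(x)
```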

consensusc-bfcdc99cde
Latent-looping and depth-looping are not the same proposition: the former mainly competes with explicit CoT token budgets, the latter with dense scaling or adaptive depth; discussing them together misattributes where the gains come from [Hao2024Coconut][Tack2025CoCoMix][Meta2024BLT][Bae2024RRT].
Source papers · 4 [Hao2024Coconut][Tack2025CoCoMix][Meta2024BLT][Bae2024RRT]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.headline_claimsgpt-5.4

    Latent-looping and depth-looping are not the same proposition: the former mainly competes with explicit CoT token budgets, while the latter mainly competes with dense scaling or adaptive depth; failing to separate them leads to wrong attrib

contestedc-c5cd1bc896
The counter comes from Levine et al. [Levine2020DepthWidth], Liu et al. [Liu2020VeryDeepTransformers], and Kaplan et al. [Kaplan2020ScalingLaws]: self-attention is not necessarily depth-efficient, and untied depth and width remain strong baselines. Public results read more like "close but not matched" than "shared recurrence is generally better."
Source papers · 3 [Levine2020DepthWidth][Liu2020VeryDeepTransformers][Kaplan2020ScalingLaws]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.positions[0].countergpt-5.4

    [counter to Camp A: Looping can largely substitute for adding parameters] The counter comes from Levine et al. [Levine2020DepthWidth], Liu et al. [Liu2020VeryDeepTransformers], and Kaplan et al. [Kaplan2020ScalingLaws]: self-attention may n

contestedc-be9c70a258
The counter comes from MoR [Bae2025MoR], ACT [Graves2016ACT], the Depth-Adaptive Transformer [Elbayad2019DepthAdaptive], LayerSkip [Elhoushi2024LayerSkip], and work on gate-training difficulty [Lin2026ConditionalDepthRouting]. The issue is not that adaptive depth lacks value, but that gates are hard to learn, auxiliary losses are sensitive, and whether the gain outweighs the complexity still needs an accounting.
Source papers · 4 [Bae2025MoR][Graves2016ACT][Elbayad2019DepthAdaptive][Elhoushi2024LayerSkip]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.positions[1].countergpt-5.4

    [counter to Camp B: Fixed loop counts are enough; adaptive routing only ] The counter comes from MoR [Bae2025MoR], ACT [Graves2016ACT], Depth-Adaptive Transformer [Elbayad2019DepthAdaptive], LayerSkip [Elhoushi2024LayerSkip], and gate-train

contestedc-233ac5c7cb
The counter is not that latent reasoning fails; it answers a different question. Huginn [Geiping2025Huginn], RRT [Bae2024RRT], and MoR [Bae2025MoR] care about the engineering value of shared-depth recurrence at matched FLOPs; BLT [Meta2024BLT] further hints that latent-loop gains may be confounded with a change of representation units.
Source papers · 4 [Geiping2025Huginn][Bae2024RRT][Bae2025MoR][Meta2024BLT]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.positions[2].countergpt-5.4

    [counter to Camp C: The real loop should live in latent state, not in th] The counter is not that latent reasoning fails, but that it answers a different question. Huginn [Geiping2025Huginn], RRT [Bae2024RRT], and MoR [Bae2025MoR] care abou

contestedc-cf3ae3ea19
The counter comes from Liu et al. [Liu2022AutomataShortcuts], Csordás et al. [Csordas2021SystematicGeneralization], and Furrer et al. [Furrer2020CompositionalGeneralization]: on many tasks that appear to require recurrence, Transformers can still catch up via shortcuts, training details, or pretraining, so extrapolation from synthetic tasks cannot be generalized directly to open-domain reasoning.
Source papers · 1 [Liu2022AutomataShortcuts]
1 observation · Looped Language Modeling
Evidence (1)
  • topic_reportlooped-language-modeling· report.positions[3].countergpt-5.4

    [counter to Camp D: Looping gains come from recurrent inductive bias, no] The counter comes from Liu et al. [Liu2022AutomataShortcuts], Csordás et al. [Csordas2021SystematicGeneralization], and Furrer et al. [Furrer2020CompositionalGenerali

contestedtrainingc-f177d985f2
The aha moments, length scaling, and entropy dynamics in RL reasoning may reflect an emergent hierarchical reasoning mechanism.
1 observation · Mechanisms · Hierarchy · Reasoning
Evidence (1)
consensusevaluationc-7d2a37585c
Software-engineering agent evaluation is constrained not only by model capability but also by real-data scarcity and contamination threats.
1 observation · Swe · Benchmark · Contamination
Evidence (1)
contestedc-21521ca488
Liu et al. [R1ZeroCritical2025] rebut the broad extrapolation: without matched base model, data, distillation, and inference compute, the gain cannot be credited to RL alone. From Reasoning to Agentic [CreditAssignment2026] likewise notes that agentic RL credit spans actions and environment states and cannot rest on the final answer alone.
Source papers · 1 [CreditAssignment2026]
1 observation · Bot Topic
Evidence (1)
  • topic_reportbot-topic· report.positions[0].countergpt-5.5

    [counter to Camp A: Outcome-only RLVR is enough] Liu et al. [R1ZeroCritical2025] refute the broad extrapolation: without matched base model, data, distillation, and inference compute, RL alone cannot be credited for the gain. From Reasoning

contestedc-66557fa993
PRMs introduce a second model's bias, and public experiments still lack direct comparisons of outcome, process, step, turn, and attribution rewards under identical models, tasks, and budgets.
1 observation · Bot Topic
Evidence (1)
  • topic_reportbot-topic· report.positions[1].countergpt-5.5

    [counter to Camp B: Process and step rewards] PRMs introduce bias from a second model, and public experiments still lack direct comparisons of outcome, process, step, turn, and attribution rewards under identical models, tasks, and budgets.

contestedc-6323bd855c
Attribution scores do not automatically equal causal contribution. Tree-structured CA [TreeCA2025] and CARL [CARL2025] suit structured or replayable trajectories; in open web environments, external state changes contaminate counterfactual estimates.
Source papers · 2 [TreeCA2025][CARL2025]
1 observation · Bot Topic
Evidence (1)
  • topic_reportbot-topic· report.positions[2].countergpt-5.5

    [counter to Camp C: Attribution and causal credit] Attribution scores do not automatically equal causal contribution. Tree-structured CA [TreeCA2025] and CARL [CARL2025] are better suited to structured or replayable trajectories; in open we

contestedc-a843a1a593
ReAct [ReAct2022], Toolformer [Toolformer2023], and WebGPT [WebGPT2021] show that prompting, self-supervised tool traces, imitation, and human feedback remain strong controls. Without fair-budget comparisons, the RL gain may be nothing more than extra interaction sampling.
Source papers · 3 [ReAct2022][Toolformer2023][WebGPT2021]
1 observation · Bot Topic
Evidence (1)
  • topic_reportbot-topic· report.positions[3].countergpt-5.5

    [counter to Camp D: Agentic RL versus non-RL agent baselines] ReAct [ReAct2022], Toolformer [Toolformer2023], and WebGPT [WebGPT2021] show that prompting, self-supervised tool traces, imitation, and human feedback remain strong controls. Wi

consensustrainingc-ae929747e2
General-purpose agentic RL frameworks are decoupling RL training from agent execution, suggesting post-training can be modularized over arbitrary agents.
1 observation · Framework · Modularity · Agents
Evidence (1)
consensusarchc-c67f4420cd
Recursive parameter reuse is expanding from text-only LMs to multimodal models, suggesting this architectural idea may generalize across domains before the LLM evidence is settled.
1 observation · Recursive · Multimodal · Parameter Reuse
Evidence (1)