Yekun Chai

My research focuses on pretraining for foundation models: how data recipes, tokenization, and scale govern the emergence of reasoning, code, and agentic capabilities.

I study which training factors remain predictive as models scale, which capability bottlenecks persist under scaling, and how early training decisions set the ceiling for what later training can amplify.

I have contributed to ERNIE, ERNIE-Code, and StarCoder 2.

News

Aug 21, 2025 Four papers accepted to EMNLP 2025.
May 16, 2025 Curiosity-driven RLHF accepted to ACL 2025. [code]
Jan 23, 2025 MA-RLHF accepted to ICLR 2025. [paper] [code]

Selected Publications

  1. ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
    Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu
    In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023
  2. On Training Data Influence of GPT Models (EMNLP Oral)
    Yekun Chai, Qingyi Liu^, Shuohuan Wang, Yu Sun, Qiwei Peng, and Hua Wu
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
  3. Tool-Augmented Reward Modeling (ICLR Spotlight)
    Lei Li*^, Yekun Chai*†, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, and 1 more author
    In The Twelfth International Conference on Learning Representations (top 5%), 2024