publications | Yekun Chai

2026

Tech Report

ERNIE 5.0 Technical Report

Baidu ERNIE

arXiv preprint arXiv:2602.04705, 2026

2025

ACL

Curiosity-Driven Reinforcement Learning from Human Feedback

Haoran Sun*^, Yekun Chai*†, Shuohuan Wang, Yu Sun, Hua Wu, and Haifeng Wang

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025

Intrinsic-reward exploration for RLHF — preserves output diversity without sacrificing alignment.

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We will make our code publicly available.
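
The reward structure described in the abstract (an intrinsic bonus for novelty plus a sparse extrinsic reward) can be illustrated with a minimal sketch. It assumes a count-based novelty bonus as a stand-in for the curiosity signal, which is not necessarily the estimator used in the paper; the function name `combine_rewards` and the weight `eta` are hypothetical.

```python
from collections import Counter


def combine_rewards(tokens, extrinsic_reward, eta=0.1):
    """Return one reward per token.

    Assumption: novelty is approximated by a count-based bonus
    eta / sqrt(n), where n is how often the token has appeared so far.
    The extrinsic (reward-model) score is sparse and lands on the last token.
    """
    counts = Counter()
    rewards = []
    for tok in tokens:
        counts[tok] += 1
        # Repeated tokens earn a smaller bonus than first-time tokens.
        rewards.append(eta / counts[tok] ** 0.5)
    if rewards:
        rewards[-1] += extrinsic_reward
    return rewards


rewards = combine_rewards(["a", "b", "a", "c"], extrinsic_reward=1.0, eta=0.1)
print(rewards)
```

In this toy run the second occurrence of "a" receives a smaller bonus than the first, and only the final token carries the extrinsic score.
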
ICLR

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Yekun Chai*†, Haoran Sun*^, Huang Fang, Shuohuan Wang, Yu Sun, and Hua Wu

In The Thirteenth International Conference on Learning Representations, Apr 2025

Temporal abstraction for RLHF — treating token sequences as macro-actions improves credit assignment in long-horizon generation.
COLING-Industry

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, Jason T. Stillerman, Felix Friedrich, and 35 more authors

In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, Jan 2025

Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as BLOOM and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.

2024

Tech Report

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, Joel Lamy-Poirier, Nouamane Tazi, and 60 more authors

Feb 2024

EMNLP

Autoregressive Pre-Training on Pixels and Texts

Yekun Chai, Qingyi Liu^, Jingwu Xiao^, Shuohuan Wang, Yu Sun, and Hua Wu

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Unifies autoregressive pre-training across document pixels and text; next-patch prediction alone scales on pixels.

The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language—both visual and textual—within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.
EMNLP Oral

On Training Data Influence of GPT Models

Yekun Chai, Qingyi Liu^, Shuohuan Wang, Yu Sun, Qiwei Peng, and Hua Wu

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Training-side attribution for GPT models — simulating how data examples shape loss trajectories and downstream behavior.

Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available at https://github.com/ernie-research/gptfluence.
EMNLP-Findings

Tokenization Falling Short: On Subword Robustness in Large Language Models

Yekun Chai, Yewei Fang, Qiwei Peng, and Xuhong Li

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

Shows how tokenization and subword structure constrain robustness, problem solving, and capability formation in LLMs.

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens, issues we term *the curse of tokenization*. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. We systematically investigate these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We release our evaluation code and data at https://github.com/FloatAI/TKEval.
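
The abstract names BPE-dropout as a mitigation. A minimal sketch of that subword regularization idea follows, assuming a toy merge table; `bpe_dropout_encode` and `toy_merges` are hypothetical names, and the code is not taken from the TKEval repository. Each applicable merge is skipped with probability `p`, so the same word can be segmented differently across calls.

```python
import random


def bpe_dropout_encode(word, merges, p, rng):
    """Segment `word` with BPE merges, dropping each candidate merge with probability p.

    `merges` is an ordered list of symbol pairs (earlier means higher priority).
    p = 0.0 gives plain BPE; p = 1.0 leaves the word split into characters.
    """
    ranks = {pair: rank for rank, pair in enumerate(merges)}
    symbols = list(word)
    while True:
        candidates = []
        for i in range(len(symbols) - 1):
            pair = (symbols[i], symbols[i + 1])
            # Keep a known merge only if it survives the random drop.
            if pair in ranks and rng.random() >= p:
                candidates.append((ranks[pair], i))
        if not candidates:
            break
        _, i = min(candidates)  # highest-priority surviving merge, leftmost on ties
        symbols[i:i + 2] = [symbols[i] + symbols[i + 1]]
    return symbols


toy_merges = [("l", "o"), ("lo", "w")]
print(bpe_dropout_encode("low", toy_merges, 0.0, random.Random(0)))  # plain BPE
print(bpe_dropout_encode("low", toy_merges, 0.5, random.Random(0)))  # stochastic segmentation
```

Training on such varied segmentations is what exposes a model to more than one tokenization of the same surface string.
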
ICML

GiLOT: Interpreting Generative Language Models via Optimal Transport

Xuhong Li*, Jiamin Chen*, Yekun Chai*, and Haoyi Xiong

In Forty-first International Conference on Machine Learning, Jul 2024

LREC-COLING

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Qiwei Peng*, Yekun Chai*, and Xuhong Li

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024

Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
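
As a quick consistency check on the figures in the abstract, the snippet below divides the prompt total by the number of NL-PL pairs. It assumes the prompts are spread evenly across pairs, which the "parallel data" wording suggests but the abstract does not state outright.

```python
natural_languages = 23
programming_languages = 12
total_prompts = 22_080

# Number of (natural language, programming language) combinations.
pairs = natural_languages * programming_languages

# Assumption: every pair holds the same number of prompts.
prompts_per_pair, remainder = divmod(total_prompts, pairs)

print(pairs, prompts_per_pair, remainder)  # prints: 276 80 0
```
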
ICLR Spotlight

Tool-Augmented Reward Modeling

Lei Li*^, Yekun Chai*†, Shuohuan Wang, Yu Sun, Hao Tian, Ningyu Zhang, and 1 more author

In The Twelfth International Conference on Learning Representations (top 5%), May 2024

Early effort on agentic reward models — extending scalar preference scoring toward tool-use verification.

2023

ACL-Findings

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, and Hua Wu

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

Early effort on code pretraining with broad multilingual support across 116 natural languages and 6 programming languages.
ICASSP Oral

Improved Training of Mixture-of-Experts Language GANs

Yekun Chai, Qiyue Yin, and Junge Zhang

In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

2022

EMNLP-Findings

Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards

Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, Hua Wu, and Haifeng Wang

In Findings of the Association for Computational Linguistics: EMNLP 2022, Dec 2022

Derivative-free prompt learning, which requires only model inference to optimize the prompts, has emerged as a lightweight alternative to prompt tuning. However, existing work has not taken full advantage of the over-parameterized characteristics of large pre-trained language models (PLMs). In this paper, we propose Clip-Tuning, a simple yet effective method that adopts diverse frozen “thinned” networks of PLMs to obtain *a mixture of rewards* and thus advance derivative-free prompt learning. The thinned networks consist of all the hidden units that survive a stationary dropout strategy, whose inference predictions reflect an ensemble of partial views over prompted training samples. Our method outperforms previous gradient-free prompt learning methods and achieves parity with gradient-based counterparts on seven language understanding benchmarks under few-shot settings.
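
The idea of scoring a prompt with several frozen "thinned" networks can be sketched on a toy linear scorer. This is an illustration under stated assumptions, not the paper's implementation: the dropout masks are sampled once and then held fixed, the per-network scores are averaged (the aggregation rule is an assumption), and all names (`make_masks`, `thinned_score`, `mixture_of_rewards`) are hypothetical.

```python
import random


def make_masks(num_networks, hidden_size, keep_prob, seed=0):
    """Sample one fixed binary mask per thinned network (sampled once, then frozen)."""
    rng = random.Random(seed)
    return [
        [1.0 if rng.random() < keep_prob else 0.0 for _ in range(hidden_size)]
        for _ in range(num_networks)
    ]


def thinned_score(hidden, weights, mask, keep_prob):
    """Score from the hidden units that survive the mask, with inverted-dropout scaling."""
    return sum(h * w * m for h, w, m in zip(hidden, weights, mask)) / keep_prob


def mixture_of_rewards(hidden, weights, masks, keep_prob):
    """Average the scores of all thinned networks (assumed aggregation)."""
    scores = [thinned_score(hidden, weights, m, keep_prob) for m in masks]
    return sum(scores) / len(scores)


# Toy hidden activations for one prompted sample and a toy output layer.
hidden = [0.2, -0.1, 0.4, 0.3, -0.5, 0.1, 0.0, 0.6]
weights = [1.0, 0.5, -0.3, 0.8, 0.2, -0.7, 0.9, 0.4]
masks = make_masks(num_networks=4, hidden_size=8, keep_prob=0.5)
print(mixture_of_rewards(hidden, weights, masks, keep_prob=0.5))
```

A derivative-free optimizer would call `mixture_of_rewards` on each candidate prompt and keep the candidates that score best; that search loop is omitted here.
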

2021

EMNLP-Findings

Counter-Contrastive Learning for Language GANs

Yekun Chai, Haidong Zhang, Qiyue Yin, and Junge Zhang

In Findings of the Association for Computational Linguistics: EMNLP 2021, Nov 2021

Generative Adversarial Networks (GANs) have achieved great success in image synthesis, but have proven difficult to apply to natural language generation. Challenges arise from the uninformative learning signals passed from the discriminator. In other words, the poor learning signals limit the learning capacity for generating languages with rich structures and semantics. In this paper, we propose to adopt the counter-contrastive learning (CCL) method to support the generator’s training in language GANs. In contrast to standard GANs that adopt a simple binary classifier to discriminate whether a sample is real or fake, we employ a counter-contrastive learning signal that advances the training of language synthesizers by (1) pulling the language representations of generated and real samples together and (2) pushing apart representations of real samples to compete with the discriminator and thus prevent the discriminator from being overtrained. We evaluate our method on both synthetic and real benchmarks and achieve competitive performance compared to previous GANs for adversarial sequence generation.

2020

ACL

Highway Transformer: Self-Gating Enhanced Self-Attentive Networks

Yekun Chai, Shuo Jin, and Xinwen Hou

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul 2020

Self-attention mechanisms have made striking state-of-the-art (SOTA) progress in various sequence learning tasks, building on multi-headed dot product attention that attends to all the global contexts at different locations. Through a pseudo information highway, we introduce a gated component, self-dependency units (SDU), that incorporates LSTM-styled gating units to replenish internal semantic importance within the multi-dimensional latent space of individual representations. The subsidiary content-based SDU gates allow for the information flow of modulated latent embeddings through skipped connections, leading to clearly faster convergence with gradient descent algorithms. We examine how the gating mechanism aids the context-based Transformer modules, hypothesizing that SDU gates, especially in shallow layers, push the model faster towards suboptimal points during the optimization process.
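
A minimal sketch of an LSTM-styled gate on a single token representation is given below. It assumes a sigmoid gate multiplied elementwise with a tanh-transformed content vector, added to the sublayer output alongside the residual; the exact form and placement in the paper may differ, and the names `sdu` and `gated_skip` are hypothetical.

```python
import math


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sdu(x, w_gate, b_gate, w_content, b_content):
    """Self-gating on one vector: sigmoid gate times tanh content (assumed form)."""
    gate = [1.0 / (1.0 + math.exp(-(z + b))) for z, b in zip(matvec(w_gate, x), b_gate)]
    content = [math.tanh(z + b) for z, b in zip(matvec(w_content, x), b_content)]
    return [g * c for g, c in zip(gate, content)]


def gated_skip(x, sublayer_out, sdu_out):
    """Residual connection carrying both the sublayer output and the SDU output."""
    return [xi + si + gi for xi, si, gi in zip(x, sublayer_out, sdu_out)]


x = [1.0, 2.0]
w_gate = [[0.5, -0.5], [0.1, 0.3]]
w_content = [[0.2, 0.4], [-0.3, 0.6]]
zeros = [0.0, 0.0]
attention_out = [0.5, 0.5]  # placeholder for a self-attention sublayer output
print(gated_skip(x, attention_out, sdu(x, w_gate, zeros, w_content, zeros)))
```

Because the gate depends only on the token's own representation, it adds a content-based path that does not attend to other positions.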