publications | Yekun Chai

2025

EMNLP-Findings

EvolKV: Evolutionary KV Cache Compression for LLM Inference

Bohan Yu^{^} and Yekun Chai^†

EMNLP Findings, 2025
EMNLP

CodeMixBench: Evaluating Code-Mixing Capabilities of LLMs Across 18 Languages

Yilun Yang^{^} and Yekun Chai^†

EMNLP, 2025
EMNLP

Understanding Subword Compositionality of Large Language Models

Qiwei Peng, Yekun Chai, and Anders Søgaard

EMNLP, 2025
EMNLP

Debiasing Multilingual LLMs in Cross-lingual Latent Space

Qiwei Peng, Guimin Hu, Yekun Chai, and Anders Søgaard

EMNLP, 2025
ACL

Curiosity-Driven Reinforcement Learning from Human Feedback

Haoran Sun^{*^}, Yekun Chai^*†, Shuohuan Wang, Yu Sun, and 2 more authors

In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul 2025

First to introduce curiosity-driven exploration into RLHF to improve output diversity.

Reinforcement learning from human feedback (RLHF) has proven effective in aligning large language models (LLMs) with human preferences, but often at the cost of reduced output diversity. This trade-off between diversity and alignment quality remains a significant challenge. Drawing inspiration from curiosity-driven exploration in reinforcement learning, we introduce curiosity-driven RLHF (CD-RLHF), a framework that incorporates intrinsic rewards for novel states, alongside traditional sparse extrinsic rewards, to optimize both output diversity and alignment quality. We demonstrate the effectiveness of CD-RLHF through extensive experiments on a range of tasks, including text summarization and instruction following. Our approach achieves significant gains in diversity on multiple diversity-oriented metrics while maintaining alignment with human preferences comparable to standard RLHF. We will make our code publicly available.
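
The reward combination described in this abstract can be illustrated with a minimal sketch. It assumes the per-token novelty scores are already computed (how CD-RLHF estimates them is not stated here), and the function name `combine_rewards` and the weight `eta` are hypothetical names introduced only for illustration.

```python
from typing import List


def combine_rewards(extrinsic_final: float,
                    intrinsic: List[float],
                    eta: float = 0.1) -> List[float]:
    """Build a per-token reward sequence for one generated response.

    Assumption: every token gets a scaled intrinsic (novelty) bonus, and the
    sparse extrinsic reward from the reward model is added on the last token.
    """
    if not intrinsic:
        return []
    rewards = [eta * bonus for bonus in intrinsic]  # dense curiosity signal
    rewards[-1] += extrinsic_final                  # sparse preference signal
    return rewards


# Three generated tokens with hypothetical novelty scores.
per_token = combine_rewards(extrinsic_final=1.0, intrinsic=[0.5, 0.0, 2.0], eta=0.1)
```

With `eta = 0` the sketch reduces to the usual sparse-reward setup, which is one way to see the weight as a dial between diversity and alignment.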
ICLR

MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions

Yekun Chai^*†, Haoran Sun^{*^}, Huang Fang, Shuohuan Wang, and 2 more authors

In The Thirteenth International Conference on Learning Representations, Apr 2025

First to extend RLHF with temporal abstraction of tokens via macro-actions (options / Semi-MDP) to improve credit assignment for long-horizon generation.
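
As a rough illustration of temporal abstraction over tokens, the sketch below groups a token sequence into fixed-length macro-actions and aggregates token-level rewards per macro-action. Fixed-length grouping and reward summation are assumptions made for this example, not necessarily the termination rule or aggregation used in MA-RLHF; the function names are hypothetical.

```python
from typing import List, Sequence, Tuple


def to_macro_actions(tokens: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Split a token sequence into consecutive chunks of at most n tokens."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [tuple(tokens[i:i + n]) for i in range(0, len(tokens), n)]


def macro_rewards(token_rewards: Sequence[float], n: int) -> List[float]:
    """Sum token-level rewards inside each macro-action (an assumed aggregation)."""
    if n < 1:
        raise ValueError("n must be a positive integer")
    return [sum(token_rewards[i:i + n]) for i in range(0, len(token_rewards), n)]


tokens = ["The", "cat", "sat", "down", "."]
macros = to_macro_actions(tokens, n=2)
rewards = macro_rewards([0.0, 0.0, 1.0, 0.0, 2.0], n=2)
```

Five token-level decisions become three macro-level decisions here, so a reward has fewer steps to be propagated across.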
COLING-Industry

Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code

Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, and 37 more authors

In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, Jan 2025

Pretrained language models are an integral part of AI applications, but their high computational cost for training limits accessibility. Initiatives such as Bloom and StarCoder aim to democratize access to pretrained models for collaborative community development. Despite these efforts, such models encounter challenges such as limited multilingual capabilities, risks of catastrophic forgetting during continual pretraining, and the high costs of training models from scratch, alongside the need to align with AI safety standards and regulatory frameworks. This paper presents Aurora-M, a 15B parameter multilingual open-source model trained on English, Finnish, Hindi, Japanese, Vietnamese, and code. Continually pretrained from StarCoderPlus on 435B additional tokens, Aurora-M surpasses 2T tokens in total training token count. It is the first open-source multilingual model fine-tuned on human-reviewed safety instructions, thus aligning its development not only with conventional red-teaming considerations, but also with the specific concerns articulated in the Biden-Harris Executive Order on the Safe, Secure, and Trustworthy Development and Use of Artificial Intelligence. We evaluate Aurora-M across a wide range of tasks and languages, showcasing its robustness against catastrophic forgetting and its superior performance in multilingual settings, particularly in safety evaluations. We open-source Aurora-M and its variants to encourage responsible open-source development of large language models at https://huggingface.co/aurora-m.
COLING-Industry

Graph-Augmented Open-Domain Multi-Document Summarization

Xiaoping Shen^† and Yekun Chai

In Proceedings of the 31st International Conference on Computational Linguistics: Industry Track, Jan 2025

In the open-domain multi-document summarization (ODMDS) task, retrieving relevant documents from large repositories and generating coherent summaries are crucial. However, existing methods often treat retrieval and summarization as separate tasks, neglecting the relationships among documents. To address these limitations, we propose an integrated retrieval-summarization framework that captures global document relationships through graph-based clustering, guiding the re-ranking of retrieved documents. This cluster-level thematic information is then used to guide large language models (LLMs) in refining the retrieved documents and generating more accurate, coherent summaries. Experimental results on the ODSUM benchmark demonstrate that our method significantly improves retrieval accuracy and produces summaries that surpass those derived from the oracle documents. These findings highlight the potential of our framework to improve both retrieval and summarization tasks in ODMDS.

2024

Technical Report

StarCoder 2 and The Stack v2: The Next Generation

Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, and 62 more authors

Jan 2024

EMNLP

Autoregressive Pre-Training on Pixels and Texts

Yekun Chai, Qingyi Liu^{^}, Jingwu Xiao^{^}, Shuohuan Wang, and 2 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

First to unify causal autoregressive pretraining on rendered document pixels and plain text, and to show scalable next-patch prediction on pixels.

The integration of visual and textual information represents a promising direction in the advancement of language models. In this paper, we explore the dual modality of language—both visual and textual—within an autoregressive framework, pre-trained on both document images and texts. Our method employs a multimodal training strategy, utilizing visual data through next patch prediction with a regression head and/or textual data through next token prediction with a classification head. We focus on understanding the interaction between these two modalities and their combined impact on model performance. Our extensive evaluation across a wide range of benchmarks shows that incorporating both visual and textual data significantly improves the performance of pixel-based language models. Remarkably, we find that a unidirectional pixel-based model trained solely on visual data can achieve comparable results to state-of-the-art bidirectional models on several language understanding tasks. This work uncovers the untapped potential of integrating visual and textual modalities for more effective language modeling. We release our code, data, and model checkpoints at https://github.com/ernie-research/pixelgpt.
EMNLP Oral

On Training Data Influence of GPT Models

Yekun Chai, Qingyi Liu^{^}, Shuohuan Wang, Yu Sun, and 2 more authors

In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024

Amidst the rapid advancements in generative language models, the investigation of how training data shapes the performance of GPT models is still emerging. This paper presents GPTfluence, a novel approach that leverages a featurized simulation to assess the impact of training examples on the training dynamics of GPT models. Our approach not only traces the influence of individual training instances on performance trajectories, such as loss and other key metrics, on targeted test points but also enables a comprehensive comparison with existing methods across various training scenarios in GPT models, ranging from 14 million to 2.8 billion parameters, across a range of downstream tasks. Contrary to earlier methods that struggle with generalization to new data, GPTfluence introduces a parameterized simulation of training dynamics, demonstrating robust generalization capabilities to unseen training data. This adaptability is evident across both fine-tuning and instruction-tuning scenarios, spanning tasks in natural language understanding and generation. We make our code and data publicly available at https://github.com/ernie-research/gptfluence.
EMNLP-Findings

Tokenization Falling Short: On Subword Robustness in Large Language Models

Yekun Chai, Yewei Fang, Qiwei Peng, and Xuhong Li

In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024

Language models typically tokenize raw text into sequences of subword identifiers from a predefined vocabulary, a process that is inherently sensitive to typographical errors and length variations and largely oblivious to the internal structure of tokens, issues we term *the curse of tokenization*. In this study, we delve into these drawbacks and demonstrate that large language models (LLMs) remain susceptible to these problems. We systematically investigate these challenges and their impact on LLMs through three critical research questions: (1) complex problem solving, (2) token structure probing, and (3) resilience to typographical variation. Our findings reveal that scaling model parameters can mitigate the issue of tokenization; however, LLMs still suffer from biases induced by typos and other text format variations. Our experiments show that subword regularization such as BPE-dropout can mitigate this issue. We release our evaluation code and data at https://github.com/FloatAI/TKEval.
ICML

GiLOT: Interpreting Generative Language Models via Optimal Transport

Xuhong Li^*, Jiamin Chen^*, Yekun Chai^*, and Haoyi Xiong

In Forty-first International Conference on Machine Learning, Jul 2024

LREC-COLING

HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization

Qiwei Peng^*, Yekun Chai^*, and Xuhong Li

In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024

Large language models (LLMs) have made significant progress in generating code from textual prompts. However, existing benchmarks have mainly concentrated on translating English prompts to multilingual code or have been constrained to very limited natural languages (NLs). These benchmarks have overlooked the vast landscape of massively multilingual NL to multilingual code, leaving a critical gap in the evaluation of multilingual LLMs. In response, we introduce HumanEval-XL, a massively multilingual code generation benchmark specifically crafted to address this deficiency. HumanEval-XL establishes connections between 23 NLs and 12 programming languages (PLs), and comprises a collection of 22,080 prompts with an average of 8.33 test cases. By ensuring parallel data across multiple NLs and PLs, HumanEval-XL offers a comprehensive evaluation platform for multilingual LLMs, allowing the assessment of the understanding of different NLs. Our work serves as a pioneering step towards filling the void in evaluating NL generalization in the area of multilingual code generation. We make our evaluation code and data publicly available at https://github.com/FloatAI/HumanEval-XL.
ICLR Spotlight

Tool-Augmented Reward Modeling

Lei Li^{*^}, Yekun Chai^*†, Shuohuan Wang, Yu Sun, and 3 more authors

In The Twelfth International Conference on Learning Representations (top 5%), May 2024

First agentic reward model with tool-calling capabilities for RL-tuning.

2023

ACL-Findings

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, and 2 more authors

In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023

First to build a unified multilingual code LLM bridging many natural languages and many programming languages.

Software engineers working with the same programming language (PL) may speak different natural languages (NLs) and vice versa, erecting huge barriers to communication and working efficiency. Recent studies have demonstrated the effectiveness of generative pre-training in computer programs, yet they are always English-centric. In this work, we step towards bridging the gap between multilingual NLs and multilingual PLs for large language models (LLMs). We release ERNIE-Code, a unified pre-trained language model for 116 NLs and 6 PLs. We employ two methods for universal cross-lingual pre-training: span-corruption language modeling that learns patterns from monolingual NL or PL; and pivot-based translation language modeling that relies on parallel data of many NLs and PLs. Extensive results show that ERNIE-Code outperforms previous multilingual LLMs for PL or NL across a wide range of end tasks of code intelligence, including multilingual code-to-text, text-to-code, code-to-code, and text-to-text generation. We further show its advantage of zero-shot prompting on multilingual code summarization and text-to-text translation. We release our code and pre-trained checkpoints.
NeurIPS Datasets and Benchmarks

M⁴: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models

Xuhong Li, Mengnan Du, Jiamin Chen, Yekun Chai, and 2 more authors

In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Dec 2023

IJCNLP-AACL Demos

ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models

Pengfei Zhu^{^}, Chao Pang, Yekun Chai, Lei Li^{^}, and 4 more authors

In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations, Nov 2023

2022

EMNLP-Findings

Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards

Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, and 2 more authors

In Findings of the Association for Computational Linguistics: EMNLP 2022, Dec 2022

Derivative-free prompt learning has emerged as a lightweight alternative to prompt tuning, which only requires model inference to optimize the prompts. However, existing work did not take full advantage of the over-parameterized characteristics of large pre-trained language models (PLMs). In this paper, we propose Clip-Tuning, a simple yet effective method that adopts diverse frozen “thinned” networks of PLMs to obtain *a mixture of rewards* and thus advance the derivative-free prompt learning. The thinned networks consist of all the hidden units that survive a stationary dropout strategy, whose inference predictions reflect an ensemble of partial views over prompted training samples. Our method outperforms previous gradient-free prompt learning methods and achieves parity with gradient-based counterparts on seven language understanding benchmarks under few-shot settings.
ACL Oral

Predicate-Argument Based Bi-Encoder for Paraphrase Identification

Qiwei Peng, David Weir, Julie Weeds, and Yekun Chai

In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022

Paraphrase identification involves identifying whether a pair of sentences express the same or similar meanings. While cross-encoders have achieved high performance across several benchmarks, bi-encoders such as SBERT have been widely applied to sentence pair tasks. They exhibit substantially lower computational complexity and are better suited to symmetric tasks. In this work, we adopt a bi-encoder approach to the paraphrase identification task, and investigate the impact of explicitly incorporating predicate-argument information into SBERT through weighted aggregation. Experiments on six paraphrase identification datasets demonstrate that, with a minimal increase in parameters, the proposed model is able to outperform SBERT/SRoBERTa significantly. Further, ablation studies reveal that the predicate-argument based component plays a significant role in the performance gain.
SemEval

X-PuDu at SemEval-2022 Task 6: Multilingual Learning for English and Arabic Sarcasm Detection

Yaqian Han, Yekun Chai, Shuohuan Wang, Yu Sun, and 4 more authors

In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Jul 2022

Detecting sarcasm and verbal irony from people’s subjective statements is crucial to understanding their intended meanings and real sentiments and positions in social scenarios. This paper describes the X-PuDu system that participated in SemEval-2022 Task 6, iSarcasmEval - Intended Sarcasm Detection in English and Arabic, which aims at detecting intended sarcasm in various settings of natural language understanding. Our solution finetunes pre-trained language models, such as ERNIE-M and DeBERTa, under the multilingual settings to recognize the irony from Arabic and English texts. Our system ranked second out of 43, and ninth out of 32 in Task A: one-sentence detection in English and Arabic; fifth out of 22 in Task B: binary multi-label classification in English; first out of 16, and fifth out of 13 in Task C: sentence-pair detection in English and Arabic.
ICASSP Oral

Improved Training of Mixture-of-Experts Language GANs

Yekun Chai, Qiyue Yin, and Junge Zhang

In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jul 2022

2021

EMNLP-Findings

Counter-Contrastive Learning for Language GANs

Yekun Chai, Haidong Zhang, Qiyue Yin, and Junge Zhang

In Findings of the Association for Computational Linguistics: EMNLP 2021, Nov 2021

Generative Adversarial Networks (GANs) have achieved great success in image synthesis, but have proven difficult to apply to natural language generation. Challenges arise from the uninformative learning signals passed from the discriminator. In other words, the poor learning signals limit the learning capacity for generating languages with rich structures and semantics. In this paper, we propose to adopt the counter-contrastive learning (CCL) method to support the generator’s training in language GANs. In contrast to standard GANs that adopt a simple binary classifier to discriminate whether a sample is real or fake, we employ a counter-contrastive learning signal that advances the training of language synthesizers by (1) pulling the language representations of generated and real samples together and (2) pushing apart representations of real samples to compete with the discriminator and thus prevent the discriminator from being overtrained. We evaluate our method on both synthetic and real benchmarks and obtain competitive performance compared to previous GANs for adversarial sequence generation.
NAACL Workshop

RefineCap: Concept-Aware Refinement for Image Captioning

Yekun Chai, Shuo Jin, and Junliang Xing

NAACL Workshop on Visually Grounded Interaction and Language (ViGiL), Jun 2021

RL-tunes transformers with visual concept guidance for semantically detailed caption generation.
NAACL Workshop

COIN: Conversational Interactive Networks for Emotion Recognition in Conversation

Haidong Zhang^* and Yekun Chai^*

In Proceedings of the Third Workshop on Multimodal Artificial Intelligence, Jun 2021

IJCNN Oral

Neural Text Classification by Jointly Learning to Cluster and Align

Yekun Chai, Haidong Zhang, Qiyue Yin, and Junge Zhang

In International Joint Conference on Neural Networks, 2021

2020

ACL

Highway Transformer: Self-Gating Enhanced Self-Attentive Networks

Yekun Chai, Shuo Jin, and Xinwen Hou

In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul 2020

Self-attention mechanisms have made striking state-of-the-art (SOTA) progress in various sequence learning tasks, building on multi-headed dot-product attention that attends to all global contexts at different locations. Through a pseudo information highway, we introduce gated self-dependency units (SDU) that incorporate LSTM-styled gating to replenish internal semantic importance within the multi-dimensional latent space of individual representations. The subsidiary content-based SDU gates allow modulated latent embeddings to flow through skip connections, leading to a clear margin in convergence speed with gradient descent algorithms. We examine the role of the gating mechanism in aiding the context-based Transformer modules, and hypothesize that SDU gates, especially in shallow layers, help the model step toward suboptimal points faster during optimization.
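
The gating idea in this abstract can be sketched for a single token representation. The form below (a sigmoid gate multiplied elementwise with a tanh transform, then added back through a skip connection) is an assumed reading of "LSTM-styled gating", and the names `sdu`, `with_skip`, and the weight arguments are hypothetical; the paper's exact parameterization may differ.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def affine(x: Vector, w: Matrix, b: Vector) -> Vector:
    """Compute w @ x + b for a small dense layer stored as nested lists."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(w, b)]


def sdu(x: Vector, w_gate: Matrix, b_gate: Vector,
        w_cand: Matrix, b_cand: Vector) -> Vector:
    """Gate a tanh-transformed copy of x with a sigmoid gate computed from x itself."""
    gate = [1.0 / (1.0 + math.exp(-z)) for z in affine(x, w_gate, b_gate)]
    cand = [math.tanh(z) for z in affine(x, w_cand, b_cand)]
    return [g * c for g, c in zip(gate, cand)]


def with_skip(x: Vector, update: Vector) -> Vector:
    """Add the gated update back onto the input (skip connection)."""
    return [xi + ui for xi, ui in zip(x, update)]


x = [1.0, -2.0]
zeros = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
bias = [0.0, 0.0]
y = with_skip(x, sdu(x, zeros, bias, identity, bias))
```

Because the gate depends only on the token's own representation, the unit adds a content-based path alongside attention rather than another way of mixing positions.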