publications

* indicates equal contribution; ^ denotes a student/researcher I mentored.

2025

  1. COLING-Industry
    Aurora-M: Open Source Continual Pre-training for Multilingual Language and Code
    Taishi Nakamura, Mayank Mishra, Simone Tedeschi, Yekun Chai, and 41 more authors
    2025

2024

  1. preprint
    MA-RLHF: Reinforcement Learning from Human Feedback with Macro Actions
    Yekun Chai*†, Haoran Sun*^, Huang Fang, Shuohuan Wang, and 2 more authors
    2024
  2. preprint
    StarCoder 2 and The Stack v2: The Next Generation
    Anton Lozhkov, Raymond Li, Loubna Ben Allal, Federico Cassano, and 62 more authors
    2024
  3. EMNLP
    Autoregressive Pre-Training on Pixels and Texts
    Yekun Chai, Qingyi Liu^, Jingwu Xiao^, Shuohuan Wang, and 2 more authors
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
  4. EMNLP Oral
    On Training Data Influence of GPT Models
    Yekun Chai, Qingyi Liu^, Shuohuan Wang, Yu Sun, and 2 more authors
    In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, Nov 2024
  5. EMNLP-Findings
    Tokenization Falling Short: On Subword Robustness in Large Language Models
    Yekun Chai, Yewei Fang, Qiwei Peng, and Xuhong Li
    In Findings of the Association for Computational Linguistics: EMNLP 2024, Nov 2024
  6. ICML
    GiLOT: Interpreting Generative Language Models via Optimal Transport
    Xuhong Li*, Jiamin Chen*, Yekun Chai*, and Haoyi Xiong
    In Forty-first International Conference on Machine Learning, Jul 2024
  7. LREC-COLING
    HumanEval-XL: A Multilingual Code Generation Benchmark for Cross-lingual Natural Language Generalization
    Qiwei Peng*, Yekun Chai*, and Xuhong Li
    In Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), May 2024
  8. ICLR Spotlight
    Tool-Augmented Reward Modeling
    Lei Li*^, Yekun Chai*†, Shuohuan Wang, Yu Sun, and 3 more authors
    In The Twelfth International Conference on Learning Representations, May 2024

2023

  1. ACL-Findings
    ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages
    Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, and 2 more authors
    In Findings of the Association for Computational Linguistics: ACL 2023, Jul 2023
  2. NeurIPS Datasets and Benchmarks
    M4: A Unified XAI Benchmark for Faithfulness Evaluation of Feature Attribution Methods across Metrics, Modalities and Models
    Xuhong Li, Mengnan Du, Jiamin Chen, Yekun Chai, and 2 more authors
    In Thirty-seventh Conference on Neural Information Processing Systems Datasets and Benchmarks Track, Dec 2023
  3. IJCNLP-AACL Demos
    ERNIE-Music: Text-to-Waveform Music Generation with Diffusion Models
    Pengfei Zhu^, Chao Pang, Yekun Chai, Lei Li^, and 4 more authors
    In Proceedings of the 13th International Joint Conference on Natural Language Processing and the 3rd Conference of the Asia-Pacific Chapter of the Association for Computational Linguistics: System Demonstrations, Nov 2023
  4. ICASSP Oral
    Improved Training of Mixture-of-Experts Language GANs
    Yekun Chai, Qiyue Yin, and Junge Zhang
    In 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Jun 2023

2022

  1. EMNLP-Findings
    Clip-Tuning: Towards Derivative-free Prompt Learning with a Mixture of Rewards
    Yekun Chai, Shuohuan Wang, Yu Sun, Hao Tian, and 2 more authors
    In Findings of the Association for Computational Linguistics: EMNLP 2022, Dec 2022
  2. ACL Oral
    Predicate-Argument Based Bi-Encoder for Paraphrase Identification
    Qiwei Peng, David Weir, Julie Weeds, and Yekun Chai
    In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), May 2022
  3. SemEval
    X-PuDu at SemEval-2022 Task 6: Multilingual Learning for English and Arabic Sarcasm Detection
    Yaqian Han, Yekun Chai, Shuohuan Wang, Yu Sun, and 4 more authors
    In Proceedings of the 16th International Workshop on Semantic Evaluation (SemEval-2022), Jul 2022

2021

  1. EMNLP-Findings
    Counter-Contrastive Learning for Language GANs
    Yekun Chai, Haidong Zhang, Qiyue Yin, and Junge Zhang
    In Findings of the Association for Computational Linguistics: EMNLP 2021, Nov 2021

2020

  1. ACL
    Highway Transformer: Self-Gating Enhanced Self-Attentive Networks
    Yekun Chai, Shuo Jin, and Xinwen Hou
    In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, Jul 2020