📚Papers

ACL 2023

Annual Meeting of the Association for Computational Linguistics

Conference website
424 / 2876 related papers
1
Findings · ACL 2023

Masked Latent Semantic Modeling: an Efficient Pre-training Alternative to Masked Language Modeling

Standard MLM predicts the exact identity of each masked token over a huge vocabulary, which is computationally expensive; this work instead trains the model to recover the latent semantic profile of masked words (obtained via sparse coding of contextual embeddings), offering a more sample-efficient pre-training alternative. (The visual word sense disambiguation note that previously appeared here belongs to the SzegedAI SemEval entry below.)

Gábor Berend
pretraining-objective · masked-language-modeling · latent-semantics · DOI · DBLP
7
Skim · Long · ACL 2023

Dissecting Transformer Length Extrapolation via the Lens of Receptive Field Analysis

The internal mechanism behind Transformer length extrapolation lacks a clear explanation: existing parameter-free positional encoding schemes cannot truly exploit information from sequences longer than those seen in training, and ALiBi, currently the most widely used extrapolating positional encoding, has never been fully dissected.

Ta-Chung Chi, Ting-Han Fan, Alexander Rudnicky, Peter J. Ramadge
long-context · length-extrapolation · positional-encoding · DOI · arXiv · DBLP
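ALiBi, mentioned in the note above, biases attention logits with a linear distance penalty instead of learned position embeddings. A minimal sketch of the causal bias matrix (the per-head slope value here is an arbitrary assumption):

```python
def alibi_bias(seq_len, slope):
    # bias[i][j] = -slope * (i - j): the further key j lies behind
    # query i, the larger the penalty added to the attention logit.
    # The diagonal (j == i) gets zero bias.
    return [[-slope * (i - j) for j in range(i + 1)] for i in range(seq_len)]
```

Because the bias depends only on relative distance, the same matrix construction applies to sequence lengths never seen in training, which is what makes the scheme a candidate for extrapolation.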
8
Deep read · Findings · ACL 2023

Why Can GPT Learn In-Context? Language Models Secretly Perform Gradient Descent as Meta-Optimizers

Damai Dai,Yutao Sun,Li Dong,Yaru Hao,Shuming Ma,Zhifang Sui,Furu Wei
icl · meta-learning · gradient-descent · DOI · DBLP
8
Deep read · Long · ACL 2023

Pre-Training to Learn in Context

The core question: can in-context learning ability be acquired by explicitly constructing "learn from the context" tasks during pre-training, rather than treating ICL as a by-product that naturally emerges with model scale? Much prior work analyzes the mechanism of ICL, but little designs the pre-training objective itself to cultivate this ability.

Yuxian Gu, Li Dong, Furu Wei, Minlie Huang
in-context-learning · pretraining · meta-learning · DOI · DBLP
8
Deep read · Long · ACL 2023

SSD-LM: Semi-autoregressive Simplex-based Diffusion Language Model for Text Generation and Modular Control

The problem: diffusion LMs over discrete text have lagged behind autoregressive models, due both to generation efficiency and to unnatural control interfaces. The authors argue that as long as one insists on fully parallel, latent-space diffusion, text diffusion can hardly achieve quality, length flexibility, and controllability at the same time.

Xiaochuang Han, Sachin Kumar, Yulia Tsvetkov
Carnegie Mellon University · Allen Institute for AI · diffusion-lm · semi-autoregressive · text-generation · DOI · arXiv · DBLP
8
Deep read · Long · ACL 2023

Understanding In-Context Learning via Supportive Pretraining Data

Core finding: ICL ability does not come uniformly from the whole pre-training corpus; a small amount of "supportive pretraining data" plays a disproportionate role. Prior ICL studies focused on model architecture and implicit algorithms; this work shifts the focus to corpus subsets, asking which data actually teach the model to induce from demonstrations.

Xiaochuang Han, Daniel Simig, Todor Mihaylov, Yulia Tsvetkov, Asli Celikyilmaz, Tianlu Wang
Allen Institute for AI · in-context-learning · pretraining-data · mechanism-analysis · DOI · arXiv · DBLP
8
Deep read · Long · ACL 2023

DiffusionBERT: Improving Generative Masked Language Models with Diffusion Models

The problem: generative masked language models are usually harder to train to high generation quality than autoregressive models; discrete diffusion offers an alternative generative path, but an effective combination with existing pre-trained denoising LMs was missing. BERT-style models excel at understanding and denoising pre-training yet generate poorly, while pure discrete diffusion LMs often train slowly and are sensitive to design details.

Zhengfu He, Tianxiang Sun, Qiong Tang, Kuanning Wang, Xuanjing Huang, Xipeng Qiu
diffusion-lm · masked-language-model · discrete-diffusion · DOI · arXiv · DBLP
8
Deep read · Long · ACL 2023

Backpack Language Models

The problem: standard Transformer LMs are strong but their internal representations are hard to interpret or control, while switching to simpler, more interpretable models usually costs substantial performance. Most prior interpretability methods are post-hoc analyses that do not change the model interface, so they rarely yield actionable, intervenable control points.

John Hewitt, John Thickstun, Christopher D. Manning, Percy Liang
Stanford University · architecture · interpretability · backpack-lm · DOI · arXiv · DBLP
8
Deep read · Long · ACL 2023

Unnatural Instructions: Tuning Language Models with (Almost) No Human Labor

The problem: instruction tuning depends heavily on human annotation, and that dependence becomes the main bottleneck for scaling it. High-quality instruction-following data mostly came from crowdsourcing or user interactions, which is costly and limited in coverage; the authors test whether noisier, model-synthesized instruction data is sufficient for effective instruction tuning.

Or Honovich, Thomas Scialom, Omer Levy, Timo Schick
instruction-tuning · synthetic-data · data-generation · DOI · arXiv · DBLP
8
Deep read · Long · ACL 2023

Glot500: Scaling Multilingual Corpora and Language Models to 500 Languages

Pushes multilingual LM coverage from the mainstream ~100 languages to 500+, including many low-resource languages. Directly scaling mBERT/XLM-R is blocked by vocabulary and capacity limits, and low-resource corpora are themselves noisy and extremely imbalanced.

Ayyoob Imani, Peiqin Lin, Amir Hossein Kargaran, Silvia Severini, Masoud Jalili Sabet, Nora Kassner ... 1 author omitted ..., Helmut Schmid, André F. T. Martins, François Yvon, Hinrich Schütze
LMU Munich · multilingual · scaling · data-mixture · DOI · DBLP
8
Deep read · Long · ACL 2023

ContraCLM: Contrastive Learning For Causal Language Model

Nihal Jain, Dejiao Zhang, Wasi Uddin Ahmad, Zijian Wang, Feng Nan, Xiaopeng Li ... 2 authors omitted ..., Baishakhi Ray, Parminder Bhatia, Xiaofei Ma, Bing Xiang
contrastive-learning · causal-lm · objective · DOI · DBLP
8
Deep read · Long · ACL 2023

Downstream Datasets Make Surprisingly Good Pretraining Corpora

Revisits pre-training data selection, addressing the inefficiency of blindly chasing massive general-purpose corpora when compute is limited.

Kundan Krishna, Saurabh Garg, Jeffrey P. Bigham, Zachary C. Lipton
Carnegie Mellon University · pretrain-data · data-mixture · analysis · DOI · DBLP
8
Deep read · Long · ACL 2023

Sequence Parallelism: Long Sequence Training from System Perspective

Addresses the systems bottleneck of long-sequence training: even with efficient attention algorithms, many implementations require a single device to hold the entire sequence, making memory a hard ceiling. Prior work focused on reducing attention's quadratic complexity algorithmically, but system-level parallelism did not keep pace, so long-sequence training remained stuck on device memory layout.

Shenggui Li, Fuzhao Xue, Chaitanya Baranwal, Yongbin Li, Yang You
sequence-parallelism · long-context · training-efficiency · DOI · arXiv · DBLP
8
Deep read · Findings · ACL 2023

Tokenization Impacts Multilingual Language Modeling: Assessing Vocabulary Allocation and Overlap Across Languages

Addresses the long-underrated role of tokenizer design in multilingual LMs: a larger shared subword vocabulary is often assumed to be uniformly good for cross-lingual transfer, yet different tasks have different needs for vocabulary overlap versus language-specific token coverage. The issue was previously reduced to "vocabulary size" or "sharing helps transfer", without finer-grained criteria.

Tomasz Limisiewicz, Jirí Balhar, David Marecek
tokenizer · multilingual · vocabulary-allocation · DOI · arXiv · DBLP
8
精读LongACL 2023

CAME: Confidence-guided Adaptive Memory Efficient Optimization

Yang Luo,Xiaozhe Ren,Zangwei Zheng,Zhuo Jiang,Xin Jiang,Yang You
optimizer · memory-efficient · training · DOI · DBLP
8
精读FindingsACL 2023

Nonparametric Masked Language Modeling

Sewon Min, Weijia Shi, Mike Lewis, Xilun Chen, Wen-tau Yih, Hannaneh Hajishirzi, Luke Zettlemoyer
nonparametric · masked-lm · retrieval · DOI · DBLP
8
Deep read · Long · ACL 2023

Crosslingual Generalization through Multitask Finetuning

Building high-quality multilingual instruction-tuning datasets is extremely expensive; can a model zero-shot-generalize instruction-following ability learned on high-resource languages (such as English) to other languages?

Niklas Muennighoff, Thomas Wang, Lintang Sutawika, Adam Roberts, Stella Biderman, Teven Le Scao ... 9 authors omitted ..., Zaid Alyafeai, Albert Webson, Edward Raff, Colin Raffel
Hugging Face · BigScience · multitask-finetuning · multilingual · bloom · DOI · DBLP
8
Deep read · Long · ACL 2023

Reward Gaming in Conditional Text Generation

In conditional text generation, reward models (or automatic metrics) get gamed: generators find outputs that score high but are low quality. The authors set out to answer systematically what reward gaming looks like in summarization, translation, and dialogue, and where it comes from.

Richard Yuanzhe Pang, Vishakh Padmakumar, Thibault Sellam, Ankur P. Parikh, He He
NYU · Google · reward-hacking · rlhf · text-generation · DOI · DBLP
8
Deep read · Findings · ACL 2023

Discovering Language Model Behaviors with Model-Written Evaluations

Systematic evaluation of model behaviors is costly with poor coverage: hand-written eval questions are slow to produce and cannot cover subtle tendencies such as sycophancy, power-seeking, or self-preservation. The authors ask: can LMs write eval questions at scale to probe LM behavior?

Ethan Perez, Sam Ringer, Kamile Lukosiute, Karina Nguyen, Edwin Chen, Scott Heiner ... 20 authors omitted ..., Jamie Kerr, Jared Mueller, Jeeyoon Hyun, Joshua Landau
Anthropic · model-evaluation · alignment · sycophancy · DOI · DBLP
8
Deep read · Short · ACL 2023

Randomized Positional Encodings Boost Length Generalization of Transformers

Anian Ruoss, Grégoire Delétang, Tim Genewein, Jordi Grau-Moya, Róbert Csordás, Mehdi Bennani, Shane Legg, Joel Veness
positional-encoding · length-generalization · transformer · DOI · DBLP
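As I understand the title, the core trick is to sample ordered positions from a range much larger than the training length, so the model already sees large position values during training. A minimal sketch (the `max_len` value is an arbitrary assumption):

```python
import random

def randomized_positions(seq_len, max_len=2048):
    # Draw seq_len distinct positions from [0, max_len) and sort them,
    # exposing the model to large position indices while preserving
    # left-to-right order.
    return sorted(random.sample(range(max_len), seq_len))
```

At inference on longer sequences, positions up to `max_len` are then in-distribution even though training sequences were short.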
8
Deep read · Findings · ACL 2023

Tokenization with Factorized Subword Encoding

David Samuel, Lilja Øvrelid
tokenizer · subword · factorized · DOI · DBLP
8
精读LongACL 2023

A Length-Extrapolatable Transformer

Addresses Transformer length extrapolation: why models trained on short contexts fail once the context grows, and how to design the architecture so it remains usable on longer sequences. Many prior methods work well within the training length, but beyond the training distribution the attention positional encoding degrades, sharply worsening perplexity and task accuracy.

Yutao Sun, Li Dong, Barun Patra, Shuming Ma, Shaohan Huang, Alon Benhaim, Vishrav Chaudhary, Xia Song, Furu Wei
length-extrapolation · positional-encoding · long-context · DOI · DBLP
8
Deep read · Findings · ACL 2023

Can Diffusion Model Achieve Better Performance in Text Generation ? Bridging the Gap between Training and Inference !

Zecheng Tang, Pinzheng Wang, Keyan Zhou, Juntao Li, Ziqiang Cao, Min Zhang
diffusion-lm · text-generation · training-inference-gap · DOI · DBLP
8
精读FindingsACL 2023

Better Language Models of Code through Self-Improvement

Hung Quoc To, Nghi D. Q. Bui, Jin L. C. Guo, Tien N. Nguyen
code-lm · self-improvement · synthetic-data · DOI · DBLP
8
精读LongACL 2023

Self-Instruct: Aligning Language Models with Self-Generated Instructions

Instruction tuning relies on human-written instruction data, whose quantity, diversity, and creativity are all bounded by human effort, limiting the zero-shot generalization of instruction-tuned models; no mature scheme previously existed for models to bootstrap their own instruction data.

Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, Hannaneh Hajishirzi
instruction-tuning · synthetic-data · self-training · DOI · arXiv · DBLP
8
精读FindingsACL 2023

Farewell to Aimless Large-scale Pretraining: Influential Subset Selection for Language Model

Xiao Wang, Weikang Zhou, Qi Zhang, Jie Zhou, Songyang Gao, Junzhe Wang, Menghan Zhang, Xiang Gao, Yunwen Chen, Tao Gui
data-selection · pretraining · data-quality · DOI · DBLP
8
Deep read · Long · ACL 2023

Training Trajectories of Language Models Across Scales

Reveals the order in which capabilities are acquired during LM pre-training, addressing prior work's focus on final checkpoints at the expense of training dynamics.

Mengzhou Xia, Mikel Artetxe, Chunting Zhou, Xi Victoria Lin, Ramakanth Pasunuru, Danqi Chen, Luke Zettlemoyer, Veselin Stoyanov
Princeton University · Meta AI · University of Washington · training-dynamics · scaling · trajectories · DOI · DBLP
8
Deep read · Long · ACL 2023

MixCE: Training Autoregressive Language Models by Mixing Forward and Reverse Cross-Entropies

Autoregressive LMs are trained only with the forward cross-entropy (maximizing the model's likelihood of real data), which makes the model assign excess probability mass to low-probability regions (recall-oriented but poor precision), so generation is prone to incoherent or repetitive text. This paper mixes forward and reverse cross-entropies to optimize both directions at once.

Shiyue Zhang, Shijie Wu, Ozan Irsoy, Steven Lu, Mohit Bansal, Mark Dredze, David S. Rosenberg
UNC Chapel Hill · Bloomberg · language-modeling · loss-function · autoregressive · DOI · DBLP
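On a toy categorical distribution the mixed objective can be written directly. This is a sketch of the loss shape only: the paper works with token sequences and must approximate the reverse term, since the data distribution is not available in closed form in practice.

```python
import math

def cross_entropy(p, q):
    # H(p, q) = -sum_x p(x) * log q(x); asymmetric in its arguments.
    return -sum(pi * math.log(qi) for pi, qi in zip(p, q))

def mixce(p_data, p_model, lam=0.5):
    # Convex mix of forward CE (data -> model, recall-oriented) and
    # reverse CE (model -> data, precision-oriented).
    return lam * cross_entropy(p_data, p_model) + (1 - lam) * cross_entropy(p_model, p_data)
```

With `lam = 1.0` this reduces to standard maximum-likelihood training; lowering `lam` increasingly penalizes the model for putting mass where the data distribution has little.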
8
Skim · Demo · ACL 2023

TencentPretrain: A Scalable and Flexible Toolkit for Pre-training Models of Different Modalities

Existing pre-training frameworks (e.g., Hugging Face Transformers, Megatron-LM) typically target a single modality or specific model architectures; a unified, extensible multi-modal pre-training toolkit was missing. TencentPretrain provides a flexible framework supporting text, image, speech, and other modalities.

Zhe Zhao, Yudong Li, Cheng Hou, Jing Zhao, Rong Tian, Weijie Liu ... 16 authors omitted ..., Zhanhui Kang, Xiaoyong Du, Linlin Shen, Kimmo Yan
Tencent · pretraining · toolkit · multimodal · DOI · DBLP
1
ACL 2023

SzegedAI at SemEval-2023 Task 1: Applying Quasi-Symbolic Representations in Visual Word Sense Disambiguation

Visual word sense disambiguation requires fusing textual and visual semantics; existing methods use the hidden representations of multimodal encoders directly, without explicit semantic-symbol constraints, so generalization is unstable.

Gábor Berend
pretraining-objective · masked-language-modeling · latent-semantics · DOI · DBLP
8
Deep read · Long · ACL 2023

Cross-View Language Modeling: Towards Unified Cross-Lingual Cross-Modal Pre-training

Cross-lingual and cross-modal pre-training were previously separate research lines with different architectures and objectives, requiring separately trained models; this is parameter-inefficient and ignores the shared goal both tasks have of aligning semantic spaces.

Yan Zeng, Wangchunshu Zhou, Ao Luo, Ziming Cheng, Xinsong Zhang
cross-lingual · cross-modal · unified-pretraining · DOI · arXiv · DBLP
7
Skim · Long · ACL 2023

On-the-fly Cross-lingual Masking for Multilingual Pre-training

Multilingual pre-training with monolingual MLM objectives lacks explicit cross-lingual forward passes; cross-lingual alignment is learned only implicitly through the overlap of different language spaces, which is inefficient and works poorly for low-resource languages.

Xi Ai, Bin Fang
multilingual · pretraining · cross-lingual · DOI · DBLP
7
Skim · Long · ACL 2023

Rethinking the Role of Scale for In-Context Learning: An Interpretability-based Case Study at 66 Billion Scale

Large models' in-context learning ability is assumed by default to arise from all parameters acting together; without localizing the ICL-relevant parameters, models remain highly redundant and costly to deploy.

Hritik Bansal, Karthik Gopalakrishnan, Saket Dingliwal, Sravan Bodapati, Katrin Kirchhoff, Dan Roth
in-context-learning · scaling · interpretability · DOI · arXiv · DBLP
6
Skim · Findings · ACL 2023

DMLM: Descriptive Masked Language Modeling

Conventional MLM pre-training lacks explicit semantic grounding: predictions at a masked position carry no explicit sense constraints, which weakens the model's handling of polysemous words and limits downstream sense-related tasks.

Edoardo Barba, Niccolò Campolungo, Roberto Navigli
masked-language-modeling · pretraining-objective · semantic-grounding · DOI · DBLP
3
Long · ACL 2023

DiffusEmp: A Diffusion Model-Based Framework with Multi-Grained Control for Empathetic Response Generation

Existing approaches to empathetic response generation in open-domain dialogue produce monotonous, generic expressions of empathy; prior work guided empathy only implicitly through context, without explicit multi-grained control signals constraining the output.

Guanqun Bi, Lei Shen, Yanan Cao, Meng Chen, Yuqiang Xie, Zheng Lin, Xiaodong He
diffusion-lm · text-generation · controlled-generation · DOI · arXiv · DBLP
7
Skim · Long · ACL 2023

Searching for Needles in a Haystack: On the Role of Incidental Bilingualism in PaLM's Translation Capability

Eleftheria Briakou, Colin Cherry, George F. Foster
pretraining-data · data-mixture · multilingual · DOI · DBLP
7
泛读LongACL 2023

What is the best recipe for character-level encoder-only modelling?

The core question: what is the best recipe for character-level encoder-only LMs, and is architecture or pre-training objective the main driver of performance? This has never been settled, because character-level models change token granularity, network structure, and training objective all at once, making much prior work incomparable.

Kris Cao
character-level-lm · tokenizer-free · pretraining-objective · DOI · arXiv · DBLP
7
Skim · Findings · ACL 2023

ERNIE-Code: Beyond English-Centric Cross-lingual Pretraining for Programming Languages

The problem: existing code pre-trained models are English-centric and fail to connect the real development setting of many natural languages × many programming languages. Code LLMs typically assume comments, documentation, and interaction are in English, putting an extra language barrier between non-English developers and code knowledge.

Yekun Chai, Shuohuan Wang, Chao Pang, Yu Sun, Hao Tian, Hua Wu
Baidu · code-pretraining · multilingual · nl-pl-alignment · DOI · arXiv · DBLP
7
Skim · Findings · ACL 2023

Pre-training Language Model as a Multi-perspective Course Learner

The problem: ELECTRA is efficient, but its generator only does MLM while the discriminator passively receives replacement results, yielding a single training signal and weak interaction between the two. ELECTRA's success has partly masked the fact that the generator-discriminator framework is underused; in particular, generator bias propagates into label imbalance and inefficient learning for the discriminator.

Beiduo Chen, Shaohan Huang, Zihan Zhang, Wu Guo, Zhenhua Ling, Haizhen Huang, Furu Wei, Weiwei Deng, Qi Zhang
electra · pretraining-objective · mlm · DOI · arXiv · DBLP
7
Skim · Findings · ACL 2023

AltCLIP: Altering the Language Encoder in CLIP for Extended Language Capabilities

Native CLIP supports English only; existing multilingual extensions require training multimodal models from scratch at high compute cost, with no simple, low-cost extension path.

Zhongzhi Chen, Guang Liu, Bo-Wen Zhang, Qinghong Yang, Ledell Wu
clip · text-encoder · multilingual · DOI · arXiv · DBLP
6
Skim · Findings · ACL 2023

Structural Contrastive Pretraining for Cross-Lingual Comprehension

Existing multilingual pre-training objectives align only at the semantic level, ignoring the structural knowledge in cross-lingual parallel corpora; the result is cross-lingual semantic misalignment and poor downstream performance for low-resource languages.

Nuo Chen, Linjun Shou, Tengtao Song, Ming Gong, Jian Pei, Jianhui Chang, Daxin Jiang, Jia Li
cross-lingual · pretraining-objective · contrastive-learning · DOI · DBLP
7
Skim · Short · ACL 2023

Latent Positional Information is in the Self-Attention Variance of Transformer Language Models Without Positional Embeddings

It was previously assumed that Transformers must add explicit positional embeddings to encode sequence position; recent work has questioned the necessity of positional embeddings, but without offering a clear mechanistic explanation.

Ta-Chung Chi, Ting-Han Fan, Li-Wei Chen, Alexander Rudnicky, Peter J. Ramadge
positional-encoding · attention · transformer · DOI · arXiv · DBLP
6
Skim · Findings · ACL 2023

Honey, I Shrunk the Language: Language Model Behavior at Reduced Scale

Existing scaling-law studies concentrate on the million-to-trillion parameter range and do not cover models below 1M parameters; how pre-training behaves and what emerges at such tiny scales is unknown.

Vijeta Deshpande, Dan Pechi, Shree Thatte, Vladislav Lialin, Anna Rumshisky
scaling-law · small-models · training-dynamics · DOI · arXiv · DBLP
7
Skim · Findings · ACL 2023

Fixing MoE Over-Fitting on Low-Resource Languages in Multilingual Machine Translation

Addresses severe overfitting of sparsely gated mixture-of-experts (MoE) models on low-resource languages in multilingual MT. MoE has mostly been used to add capacity for high-resource tasks, but under extreme data imbalance the low-resource branches collapse easily.

Maha Elbayad, Anna Y. Sun, Shruti Bhosale
Meta AI · moe · multilingual · regularization · DOI · arXiv · DBLP
7
Skim · Long · ACL 2023

Explaining How Transformers Use Context to Build Predictions

Explains how Transformers actually use contextual information when forming predictions (mechanistic interpretability).

Javier Ferrando, Gerard I. Gállego, Ioannis Tsiamas, Marta R. Costa-jussà
UPC · interpretability · attention · context-attribution · DOI · DBLP
7
Skim · SRW · ACL 2023

How do different tokenizers perform on downstream tasks in scriptio continua languages?: A case study in Japanese

In scriptio continua languages written without spaces (such as Japanese), what is the concrete downstream impact of different tokenizers (BPE, Unigram, word-level)?

Takuro Fujii, Koki Shibata, Atsuki Yamaguchi, Terufumi Morishita, Yasuhiro Sogawa
tokenizer · multilingual · japanese · DOI · DBLP
7
Skim · Long · ACL 2023

RARR: Researching and Revising What Language Models Say, Using Language Models

Addresses factual errors (hallucinations) in LLM outputs, with the requirement that corrections be made post hoc, without retraining the model.

Luyu Gao, Zhuyun Dai, Panupong Pasupat, Anthony Chen, Arun Tejasvi Chaganty, Yicheng Fan ... 1 author omitted ..., Ni Lao, Hongrae Lee, Da-Cheng Juan, Kelvin Guu
CMU · Google · attribution · factuality · self-refine · DOI · DBLP
7
Deep read · Long · ACL 2023

Model-Generated Pretraining Signals Improves Zero-Shot Generalization of Text-to-Text Transformers

The problem: standard text-to-text pre-training signals are not fully aligned with zero-shot generalization; models learn strong fitting ability but not necessarily a more general task interface. Zero-shot ability has mostly been patched via prompt engineering or instruction tuning; this work injects the missing signal directly at the pre-training stage.

Linyuan Gong, Chenyan Xiong, Xiaodong Liu, Payal Bajaj, Yiqing Xie, Alvin Cheung, Jianfeng Gao, Xia Song
pretraining-signals · zero-shot · text-to-text · DOI · DBLP
7
Skim · Findings · ACL 2023

PEER: Pre-training ELECTRA Extended by Ranking

The concern: ELECTRA-style replaced-token detection pre-training is efficient, but its signal is a local binary classification that may not suffice to learn fine-grained relative preferences among tokens; can an additional ranking signal fill the gap? Discriminative pre-training has emphasized sample efficiency while underusing intra-sequence ranking relations.

Ru He, Wei Wang, Songfang Huang, Fei Huang
pretraining-objective · electra · ranking-loss · DOI · DBLP
7
Skim · Findings · ACL 2023

Fourier Transformer: Fast Long Range Modeling by Removing Sequence Redundancy with FFT Operator

The problem: the Transformer's quadratic cost on long sequences is expensive, yet many efficient attention substitutes break compatibility with existing large-scale pre-trained weights. Prior long-sequence models rewrite the attention kernel or introduce new parameters to cut complexity, which makes it hard to inherit pre-trained models directly and raises engineering migration cost.

Ziwei He, Meng Yang, Minwei Feng, Jingcheng Yin, Xinbing Wang, Jingwen Leng, Zhouhan Lin
long-context · attention-alternative · fft · DOI · arXiv · DBLP
7
Skim · Findings · ACL 2023

Distilling Step-by-Step! Outperforming Larger Language Models with Less Training Data and Smaller Model Sizes

From the title, the question: can step-by-step distillation let smaller models with less training data outperform larger language models? The common view has been that reasoning ability is driven mainly by model scale, and that small models cannot close the scale gap even with distillation.

Cheng-Yu Hsieh, Chun-Liang Li, Chih-Kuan Yeh, Hootan Nakhost, Yasuhisa Fujii, Alex Ratner, Ranjay Krishna, Chen-Yu Lee, Tomas Pfister
distillation · chain-of-thought · reasoning · DOI · DBLP
7
Skim · Findings · ACL 2023

Inducing Character-level Structure in Subword-based Language Models with Type-level Interchange Intervention Training

Subword-based LMs are insensitive to character-level structure and weak at tasks requiring character manipulation (spelling, morphology, word games). Previous fixes either used character-level models (slow, long sequences) or byte fallback (which does not fix the representation problem).

Jing Huang, Zhengxuan Wu, Kyle Mahowald, Christopher Potts
Stanford University · UT Austin · tokenizer · subword · character-level · DOI · DBLP
7
Skim · Long · ACL 2023

UnitY: Two-pass Direct Speech-to-speech Translation with Discrete Units

Direct speech-to-speech translation (S2ST) either routes through text (cascade: high latency, error accumulation) or maps end-to-end to waveforms (hard to train). The question: how can discrete speech units give S2ST an LM-like structure while retaining the text supervision signal?

Hirofumi Inaguma, Sravya Popuri, Ilia Kulikov, Peng-Jen Chen, Changhan Wang, Yu-An Chung, Yun Tang, Ann Lee, Shinji Watanabe, Juan Pino
Meta AI · speech-to-speech · discrete-units · tokenizer · DOI · DBLP
7
泛读LongACL 2023

Knowledge Unlearning for Mitigating Privacy Risks in Language Models

Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, Minjoon Seo
unlearning · privacy · language-models · DOI · DBLP
7
Skim · Findings · ACL 2023

AutoMoE: Heterogeneous Mixture-of-Experts with Adaptive Computation for Efficient Neural Machine Translation

Ganesh Jawahar, Subhabrata Mukherjee, Xiaodong Liu, Young Jin Kim, Muhammad Abdul-Mageed, Laks V. S. Lakshmanan, Ahmed Hassan Awadallah, Sébastien Bubeck, Jianfeng Gao
moe · adaptive-computation · efficient-training · DOI · DBLP
7
Skim · Findings · ACL 2023

Transformer Language Models Handle Word Frequency in Prediction Head

Language models systematically underperform when predicting low-frequency (rare) words. This has usually been blamed on deep layers failing to learn good contextual representations for rare words, but exactly where the bottleneck sits was unclear.

Goro Kobayashi, Tatsuki Kuribayashi, Sho Yokoi, Kentaro Inui
Tohoku University · RIKEN · interpretability · frequency · analysis · DOI · DBLP
7
Skim · Long · ACL 2023

PaCE: Unified Multi-modal Dialogue Pre-training with Progressive and Compositional Experts

Addresses negative transfer in multimodal dialogue pre-training: interference among modalities (image, text, audio) and across dialogue tasks makes it hard for a unified model to serve all capabilities at once.

Yunshui Li, Binyuan Hui, Zhichao Yin, Min Yang, Fei Huang, Yongbin Li
Alibaba · multimodal-pretraining · dialogue · moe · DOI · DBLP
7
Deep read · Long · ACL 2023

Making Language Models Better Reasoners with Step-Aware Verifier

The problem: chain-of-thought prompting markedly improves arithmetic and reasoning tasks, but models still generate reasoning traces that are superficially fluent yet step-wise wrong, with unstable final answers. Prior fixes mostly checked only the final answer or coarsely reranked whole rationales, without pinpointing which step went wrong.

Yifei Li, Zeqi Lin, Shizhuo Zhang, Qiang Fu, Bei Chen, Jian-Guang Lou, Weizhu Chen
reasoning · verifier · prm · DOI · arXiv · DBLP
7
Deep read · Findings · ACL 2023

Language Modeling with Latent Situations

Addresses language models frequently generating content inconsistent with the contextual world state, such as conflicting entity states, event causality, and temporal progression. Such issues were mostly mitigated indirectly by larger models or stronger decoding constraints, but the model was never explicitly required to represent the current situation state, so coherence gains stayed limited.

Belinda Z. Li, Maxwell I. Nye, Jacob Andreas
language-modeling · latent-states · coherence · DOI · arXiv · DBLP
7
Deep read · Findings · ACL 2023

Large Language Models with Controllable Working Memory

The problem: when contextual facts conflict with the model's pre-trained memory, LLMs do not reliably defer to the current context. This used to be treated as a detail of prompt engineering or retrieval augmentation, but without an internal mechanism to arbitrate the priority of "memory" versus "context", simply feeding more context is unreliable.

Daliang Li, Ankit Singh Rawat, Manzil Zaheer, Xin Wang, Michal Lukasik, Andreas Veit, Felix X. Yu, Sanjiv Kumar
working-memory · architecture · long-context · DOI · arXiv · DBLP
7
Deep read · Long · ACL 2023

Open-ended Long Text Generation via Masked Language Modeling

Addresses open-ended long text generation: AR models decode too slowly, while taking a pre-trained MLM and applying iterative non-autoregressive decoding collapses on long text. Non-autoregressive generation has worked for short, strongly aligned tasks such as machine translation, but open-ended long text lacks strong alignment targets, and errors compound quickly across iterative infilling.

Xiaobo Liang, Zecheng Tang, Juntao Li, Min Zhang
masked-lm · non-ar-generation · long-text-generation · DOI · DBLP
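Iterative non-autoregressive decoding with an MLM is commonly sketched as a mask-predict loop. The following is a schematic only; `predict_fn` stands in for the actual model, returning filled tokens together with per-position confidences:

```python
def mask_predict(length, predict_fn, n_iters=10, mask="[MASK]"):
    # Start fully masked; each round, let the model fill every position,
    # then re-mask the least confident positions and repeat, keeping
    # fewer masks each round until none remain.
    tokens = [mask] * length
    for t in range(n_iters):
        tokens, confidences = predict_fn(tokens)
        n_remask = int(length * (n_iters - t - 1) / n_iters)
        if n_remask == 0:
            break
        worst = sorted(range(length), key=lambda i: confidences[i])[:n_remask]
        for i in worst:
            tokens[i] = mask
    return tokens
```

The entry's point is that on long open-ended text this loop tends to amplify early mistakes, since re-masked positions are re-predicted conditioned on possibly wrong neighbors.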
7
Deep read · Long · ACL 2023

RetroMAE-2: Duplex Masked Auto-Encoder For Pre-Training Retrieval-Oriented Language Models

Zheng Liu, Shitao Xiao, Yingxia Shao, Zhao Cao
masked-lm · retrieval · pretraining · DOI · DBLP
7
泛读LongACL 2023

Z-ICL: Zero-Shot In-Context Learning with Pseudo-Demonstrations

Xinxi Lyu, Sewon Min, Iz Beltagy, Luke Zettlemoyer, Hannaneh Hajishirzi
icl · zero-shot · pseudo-demo · DOI · DBLP
7
泛读ShortACL 2023

Teaching Small Language Models to Reason

Lucie Charlotte Magister, Jonathan Mallinson, Jakub Adámek, Eric Malmi, Aliaksei Severyn
distillation · reasoning · cot · DOI · DBLP
7
泛读LongACL 2023

When Not to Trust Language Models: Investigating Effectiveness of Parametric and Non-Parametric Memories

Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, Hannaneh Hajishirzi
parametric-memory · retrieval · factuality · DOI · DBLP
7
Skim · Findings · ACL 2023

Mini-Model Adaptation: Efficiently Extending Pretrained Models to New Languages via Aligned Shallow Training

Kelly Marchisio, Patrick S. H. Lewis, Yihong Chen, Mikel Artetxe
multilingual · continual-pretrain · adaptation · DOI · DBLP
7
泛读FindingsACL 2023

Few-shot Fine-tuning vs. In-context Learning: A Fair Comparison and Evaluation

Since GPT-3, the community has widely held that in few-shot settings in-context learning (ICL) is more stable and generalizes better than few-shot fine-tuning, but this often rests on unfair comparisons against fine-tuning with under-tuned hyperparameters.

Marius Mosbach, Tiago Pimentel, Shauli Ravfogel, Dietrich Klakow, Yanai Elazar
Saarland University · Bar-Ilan University · few-shot · in-context-learning · fine-tuning · DOI · DBLP
7
泛读LongACL 2023

How to Plant Trees in Language Models: Data and Architectural Effects on the Emergence of Syntactic Inductive Biases

How language models acquire hierarchical syntactic inductive biases during pre-training is not fully understood; quantitative analysis of the respective roles of data-distribution properties and architectural choices has been lacking.

Aaron Mueller, Tal Linzen
Johns Hopkins University · inductive-bias · syntax · data-effects · DOI · DBLP
7
泛读ShortACL 2023

Grokking of Hierarchical Structure in Vanilla Transformers

Early in training, Transformers tend to learn surface heuristics (such as lexical co-occurrence); this shortcut learning causes out-of-distribution generalization failures. Can they, and how do they, eventually learn genuine hierarchical structure?

Shikhar Murty, Pratyusha Sharma, Jacob Andreas, Christopher D. Manning
Stanford University · MIT · grokking · hierarchical-structure · transformer · DOI · DBLP
7
Skim · Long · ACL 2023

Efficient Transformers with Dynamic Token Pooling

Standard Transformers keep the full sequence length at every layer, so compute and memory grow quadratically with sequence length, limiting the efficiency of long-context pre-training.

Piotr Nawrot, Jan Chorowski, Adrian Lancucki, Edoardo Maria Ponti
University of Wrocław · dynamic-pooling · token-efficiency · transformer-architecture · DOI · DBLP
7
Deep read · Long · ACL 2023

On "Scientific Debt" in NLP: A Case for More Rigour in Language Model Pre-Training Research

The core question: has LM pre-training research in NLP accumulated large amounts of "scientific debt", with many conclusions unreliable due to lax experimental design? The community's pursuit of fast iteration has tolerated single runs, weak controls, missing random seeds, and unmatched training budgets, so many claims that a method "works" are hard to reproduce or attribute.

Made Nindyatama Nityasya, Haryo Akbarianto Wibowo, Alham Fikri Aji, Genta Indra Winata, Radityo Eko Prasojo, Phil Blunsom, Adhiguna Kuncoro
scientific-rigor · pretrain-methodology · reproducibility · DOI · DBLP
7
Skim · Findings · ACL 2023

What In-Context Learning "Learns" In-Context: Disentangling Task Recognition and Task Learning

Addresses a confound in ICL research: when a model improves given in-context examples, is it actually learning new task rules from the demonstrations, or merely recognizing the task type and invoking abilities already acquired in pre-training? Many ICL papers conflate the two, making it hard to judge how much genuine on-the-fly learning large models really have.

Jane Pan, Tianyu Gao, Howard Chen, Danqi Chen
in-context-learning · mechanism-analysis · task-recognition · DOI · DBLP
7
Skim · Long · ACL 2023

Beyond English-Centric Bitexts for Better Multilingual Language Representation Learning

Multilingual representation learning over-relies on English-centric bitext (X-En), yielding poor alignment between non-English pairs and detours in cross-lingual transfer. The authors test whether adding non-English-centric bitext directly improves multilingual encoders.

Barun Patra, Saksham Singhal, Shaohan Huang, Zewen Chi, Li Dong, Furu Wei, Vishrav Chaudhary, Xia Song
Microsoft · multilingual-pretraining · data-mixture · representation-learning · DOI · DBLP
7
Deep read · Findings · ACL 2023

Recyclable Tuning for Continual Pre-training

Yujia Qin, Cheng Qian, Xu Han, Yankai Lin, Huadong Wang, Ruobing Xie, Zhiyuan Liu, Maosong Sun, Jie Zhou
continual-pretraining · parameter-efficient · transfer-learning · DOI · DBLP
7
Deep read · Long · ACL 2023

Parallel Context Windows for Large Language Models

Nir Ratner, Yoav Levine, Yonatan Belinkov, Ori Ram, Inbal Magar, Omri Abend, Ehud Karpas, Amnon Shashua, Kevin Leyton-Brown, Yoav Shoham
long-context · inference · parallelism · DOI · DBLP
7
Skim · Findings · ACL 2023

Distilling Reasoning Capabilities into Smaller Language Models

Small models (<1B parameters) do not natively exhibit chain-of-thought (CoT) reasoning, while deploying large models for inference is too costly.

Kumar Shridhar, Alessandro Stolfo, Mrinmaya Sachan
ETH Zurich · distillation · reasoning · cot · DOI · DBLP
7
Skim · Long · ACL 2023

Measuring Inductive Biases of In-Context Learning with Underspecified Demonstrations

When in-context demonstrations are ambiguous (underspecified), which rules do LLMs follow by default? The answer reveals the inductive biases pre-training endows.

Chenglei Si, Dan Friedman, Nitish Joshi, Shi Feng, Danqi Chen, He He
Princeton University · icl · inductive-bias · DOI · DBLP
7
Skim · Findings · ACL 2023

Recipes for Sequential Pre-training of Multilingual Encoder and Seq2Seq Models

Continual pre-training of an existing multilingual model on new languages typically causes catastrophic forgetting of the original languages.

Saleh Soltan, Andy Rosenbaum, Tobias Falke, Qin Lu, Anna Rumshisky, Wael Hamza
Amazon Alexa AI · multilingual · sequential-pretrain · seq2seq · DOI · DBLP
7
Skim · Long · ACL 2023

A Causal Framework to Quantify the Robustness of Mathematical Reasoning with Language Models

The goal: move the question of whether models can do mathematical reasoning from surface accuracy to quantifiable causal robustness. Math-reasoning evaluations mostly check final answers or a handful of adversarial samples, conflating genuine reasoning with template memorization, phrasing sensitivity, and data leakage; the authors use a causal framework to disentangle these factors.

Alessandro Stolfo, Zhijing Jin, Kumar Shridhar, Bernhard Schölkopf, Mrinmaya Sachan
reasoning · causal · robustness · DOI · DBLP
7
Skim · Long · ACL 2023

From Characters to Words: Hierarchical Pre-trained Language Model for Open-vocabulary Language Understanding

Addresses the granularity dilemma in open-vocabulary understanding: word-level tokens alone lose out-of-vocabulary words and morphological detail, while character-level tokens alone make sequences long and semantic composition hard. Neither line is ideal on its own; the authors combine character-level compositionality with word-level semantic compression.

Li Sun, Florian Luisier, Kayhan Batmanghelich, Dinei A. F. Florêncio, Cha Zhang
tokenizer · hierarchical · character-level · DOI · DBLP
7
Deep read · Findings · ACL 2023

Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

The questions: what exactly makes the genuinely hard BIG-Bench tasks hard, and can chain-of-thought systematically solve them? CoT is often described as a universal reasoning booster, but many such conclusions were drawn on tasks that prompting could already activate; the authors instead look at the still-hard tasks to test CoT's limits.

Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung ... 1 author omitted ..., Quoc V. Le, Ed H. Chi, Denny Zhou, Jason Wei
chain-of-thought · benchmark · big-bench · DOI · DBLP
7
Skim · Findings · ACL 2023

B2T Connection: Serving Stability and Performance in Deep Transformers

Addresses the tension in deep Transformer training between stability and performance: more depth usually adds expressivity but also brings gradient-propagation trouble, training instability, and saturating gains, forcing many models to stop at conservative depths.

Sho Takase, Shun Kiyono, Sosuke Kobayashi, Jun Suzuki
deep-transformer · training-stability · residual-connection · DOI · DBLP
7
Skim · Findings · ACL 2023

Not Enough Data to Pre-train Your Language Model? MT to the Rescue!

Small languages lack the large native corpora needed for Transformer LM pre-training; prior work either trained poorly on scarce native data or relied on costly cross-lingual transfer and alignment, without systematically validating MT-synthesized corpora for direct pre-training.

Gorka Urbizu, Iñaki San Vicente, Xabier Saralegi, Ander Corral
pretraining-data · synthetic-data · machine-translation · DOI · DBLP
7
Skim · Findings · ACL 2023

Multi-armed bandits for resource efficient, online optimization of language model pre-training: the use case of dynamic masking

Transformer LM pre-training hyperparameters (such as the MLM mask probability) are usually hand-tuned and fixed, which is compute-expensive and cannot track the optimal value as training progresses; prior practice either used fixed, suboptimal values or burned compute on repeated tuning runs.

Iñigo Urteaga, Moulay-Zaïdane Draïdia, Tomer Lancewicki, Shahram Khadivi
pretraining-optimization · hyperparameter-tuning · dynamic-masking · DOI · arXiv · DBLP
7
Skim · Long · ACL 2023

Towards Understanding Chain-of-Thought Prompting: An Empirical Study of What Matters

The mechanism behind CoT prompting's effectiveness was unclear: correct reasoning steps in the demonstrations were widely assumed to be the key ingredient, but the contributions of CoT's individual components had never been systematically ablated.

Boshi Wang, Sewon Min, Xiang Deng, Jiaming Shen, You Wu, Luke Zettlemoyer, Huan Sun
chain-of-thought · prompting · reasoning · DOI · arXiv · DBLP
7
Deep read · Findings · ACL 2023

Zemi: Learning Zero-Shot Semi-Parametric Language Models from Multiple Tasks

Zhenhailong Wang, Xiaoman Pan, Dian Yu, Dong Yu, Jianshu Chen, Heng Ji
semi-parametric · retrieval · zero-shot · DOI · DBLP
7
Deep read · Long · ACL 2023

SimLM: Pre-training with Representation Bottleneck for Dense Passage Retrieval

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, Furu Wei
retrieval · pretraining · representation-learning · DOI · DBLP
7
Skim · Findings · ACL 2023

Duplex Diffusion Models Improve Speech-to-Speech Translation

Xianchao Wu
diffusion · speech · translation · DOI · DBLP
7
Skim · Findings · ACL 2023

How does the task complexity of masked pretraining objectives affect downstream performance?

A direct question: does MLM's downstream benefit come from a "just hard enough" prediction difficulty rather than from the mask format itself? Simpler masked objectives (e.g., predicting only the first character) are cheaper to train but usually slightly worse; the authors flip the question and test whether more complex masked objectives can beat or at least approach standard MLM, and where the effective complexity range lies.

Atsuki Yamaguchi, Hiroaki Ozaki, Terufumi Morishita, Gaku Morio, Yasuhiro Sogawa
mlm · pretrain-objective · analysis · DOI · arXiv · DBLP
7
Skim · Long · ACL 2023

Learning Better Masking for Better Language Model Pre-training

The question: are MLM's fixed masking ratio and fixed masked-content distribution too rigid, feeding every stage of pre-training the same difficulty and the same supervision signal? Conventional practice uses one mask strategy throughout training, yet the model's learning state differs markedly between early and late stages, so static masking is likely suboptimal.

Dongjie Yang, Zhuosheng Zhang, Hai Zhao
mlm · masking · pretrain-objective · DOI · arXiv · DBLP
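One way to make the masking ratio a function of training progress is a simple linear schedule. This is a hypothetical schedule for illustration only; the paper's actual scheduling and masked-content selection strategies differ.

```python
import random

def dynamic_mask(tokens, step, total_steps, start=0.30, end=0.15, mask="[MASK]"):
    # Anneal the masking ratio from `start` to `end` over training,
    # then mask a random subset of positions at the current ratio.
    ratio = start + (end - start) * (step / total_steps)
    n = max(1, round(len(tokens) * ratio))
    idx = set(random.sample(range(len(tokens)), n))
    return [mask if i in idx else t for i, t in enumerate(tokens)]
```

Early steps then present a harder (more heavily masked) objective, late steps an easier one; the `start`/`end` values are arbitrary placeholders.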
7
Skim · Long · ACL 2023

GanLM: Encoder-Decoder Pre-training with an Auxiliary Discriminator

The problem: existing encoder-decoder pre-training does not feed "understanding" ability back into "generation", keeping generative and discriminative pre-training apart. Prior work typically picks one of, or naively concatenates, denoising generation, MLM, and RTD objectives without properly unifying their synergy.

Jian Yang, Shuming Ma, Li Dong, Shaohan Huang, Haoyang Huang, Yuwei Yin, Dongdong Zhang, Liqun Yang, Furu Wei, Zhoujun Li
encoder-decoder · pretraining · gan · DOI · arXiv · DBLP
7
泛读LongACL 2023

Soft Language Clustering for Multilingual Model Pre-training

In multilingual pre-training, the conventional choices either mix all languages together (diluting language information) or supply explicit language IDs (hard bucketing, unfriendly to low-resource languages). The goal is a middle ground that exploits language-family affinity without hard partitioning.

Jiali Zeng, Yufan Jiang, Yongjing Yin, Yi Jing, Fandong Meng, Binghuai Lin, Yunbo Cao, Jie Zhou
Tencent WeChat AI · multilingual · pre-training · language-clustering · DOI · DBLP
7
Skim · Findings · ACL 2023

Adaptive Attention for Sparse-based Long-sequence Transformer

Xuanyu Zhang, Zhepeng Lv, Qing Yang
sparse-attention · long-context · transformer-architecture · DOI · DBLP
7
Skim · Findings · ACL 2023

Beyond Positive Scaling: How Negation Impacts Scaling Trends of Language Models

Scaling-law studies usually track models' positive trends on standard tasks, but whether negation understanding also scales positively was unknown. This paper systematically studies scaling behavior on negation-bearing tasks and finds that negation understanding can exhibit inverse scaling (larger models doing worse).

Yuhui Zhang, Michihiro Yasunaga, Zhengping Zhou, Jeff Z. HaoChen, James Zou, Percy Liang, Serena Yeung
Stanford University · scaling-law · negation · generalization · DOI · DBLP
7
Skim · Findings · ACL 2023

Emergent Modularity in Pre-trained Transformers

Do pre-trained Transformers have internal modular structure — are different functions (syntax, semantics, factual knowledge) handled by different parameter subsets? Prior interpretability work mostly examined individual attention heads or neurons, lacking systematic analysis at the "module" level of organization.

Zhengyan Zhang, Zhiyuan Zeng, Yankai Lin, Chaojun Xiao, Xiaozhi Wang, Xu Han, Zhiyuan Liu, Ruobing Xie, Maosong Sun, Jie Zhou
Tsinghua University · transformer · modularity · interpretability · DOI · DBLP
7
Skim · Findings · ACL 2023

Self-Evolution Learning for Discriminative Language Model Pretraining

In ELECTRA-style discriminative pre-training, a generator that is fixed or poorly coordinated with the discriminator during joint training fails to supply replacement tokens of suitable difficulty, making the discriminator learn inefficiently.

Qihuang Zhong, Liang Ding, Juhua Liu, Bo Du, Dacheng Tao
University of Sydney · Wuhan University · pretrain · masked-lm · self-training · DOI · DBLP
7
Skim · Long · ACL 2023

Revisiting Token Dropping Strategy in Efficient BERT Pretraining

In efficient pre-training, dropping unimportant tokens (token dropping) speeds up computation but breaks sequence integrity, badly hurting the bidirectional contextual representations of the retained tokens.

Qihuang Zhong, Liang Ding, Juhua Liu, Xuebo Liu, Min Zhang, Bo Du, Dacheng Tao
University of Sydney · Wuhan University · pretrain · token-dropping · efficient-pretrain · DOI · DBLP
7
Skim · Long · ACL 2023

Tokenization and the Noiseless Channel

The claim: tokenization is not a neutral preprocessing step; it systematically changes encoding and decoding costs in a noiseless-channel model. Tokenization has usually been treated as an engineering detail before discussing language modeling or decoding over fixed tokens; the authors show that tokenization itself enters the probabilistic modeling objective and affects optimal decisions.

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Mrinmaya Sachan, Ryan Cotterell
tokenizer · information-theory · DOI · DBLP
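A metric in this spirit is the Rényi efficiency of a tokenizer's unigram token distribution: Rényi entropy normalized by the maximum log |V|, so a perfectly uniform token distribution scores 1.0. A minimal sketch (the order `alpha` is a free parameter here):

```python
import math

def renyi_efficiency(counts, alpha=2.5):
    # Rényi entropy H_a(p) = log(sum_i p_i^a) / (1 - a), normalized
    # by log |V|; skewed token distributions score lower.
    total = sum(counts)
    probs = [c / total for c in counts]
    h = math.log(sum(p ** alpha for p in probs)) / (1.0 - alpha)
    return h / math.log(len(counts))
```

Given two tokenizers over the same corpus, comparing this number is one cheap proxy for how evenly each spends its vocabulary.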
7
Skim · Findings · ACL 2023

A Formal Perspective on Byte-Pair Encoding

The core question: BPE is ubiquitous, yet there is no rigorous formal account of what it actually optimizes, when it works, and when it distorts. BPE has mostly been treated as a handy heuristic compression/tokenization algorithm; the authors give a formal perspective clarifying what its objective approximates and where that approximation breaks down.

Vilém Zouhar, Clara Meister, Juan Luis Gastaldi, Li Du, Tim Vieira, Mrinmaya Sachan, Ryan Cotterell
tokenizer · bpe · theory · DOI · DBLP
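For reference, the greedy procedure being formalized fits in a few lines — a standard textbook sketch of BPE training, not the paper's formal construction:

```python
from collections import Counter

def bpe_train(words, num_merges):
    # Greedy BPE: repeatedly merge the most frequent adjacent symbol pair.
    vocab = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for seq, freq in vocab.items():
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = Counter()
        for seq, freq in vocab.items():
            out, i = [], 0
            while i < len(seq):
                if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                    out.append(seq[i] + seq[i + 1])
                    i += 2
                else:
                    out.append(seq[i])
                    i += 1
            new_vocab[tuple(out)] += freq
        vocab = new_vocab
    return merges
```

Each iteration is a locally optimal merge; the paper's point is to pin down what global objective this greedy loop approximates.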
7
Skim · ACL 2023

Generating Text from Language Models

The core concern: "generating text from a language model" is usually treated as a sampling-engineering matter, yet generation algorithms implicitly intervene strongly on the model distribution. Temperature, top-k, and top-p have been assumed to merely adjust style; this tutorial asks systematically what these generation strategies actually optimize, and how far they distort the model's distribution.

Afra Amini, Ryan Cotterell, John Hewitt, Clara Meister, Tiago Pimentel
decoding · sampling · tutorial · DOI · DBLP
7
Skim · ACL 2023

Retrieval-based Language Models and Applications

Rather than one technical problem, this tutorial systematically maps the design space, strengths, and application boundaries of retrieval-based language models. As parametric memory grows ever more expensive, external retrieval is repeatedly used to add knowledge, reduce hallucination, and cut training cost, but the methods are scattered across tasks and names without a unified view.

Akari Asai, Sewon Min, Zexuan Zhong, Danqi Chen
retrieval-lm · rag · tutorial · DOI · DBLP
7
Skim · ACL 2023

Everything you need to know about Multilingual LLMs: Towards fair, performant and reliable models for languages of the world

The core issue: discussion of multilingual LLMs has long been dominated by high-resource languages, leaving fairness, performance, and reliability across the world's languages without systematic evaluation or methodology. "Multilingual" has often been approximated as "a few more major languages", while genuinely low-resource settings, writing-system differences, cultural bias, and evaluation gaps went under-addressed.

Sunayana Sitaram, Monojit Choudhury, Barun Patra, Vishrav Chaudhary, Kabir Ahuja, Kalika Bali
multilingual · llm · tutorial · DOI · DBLP
7
Deep read · ACL 2023

Augmentation Invariant Discrete Representation for Generative Spoken Language Modeling

The core problem: generative spoken language models rely on discrete speech units, but existing discrete representations are not robust to semantically irrelevant acoustic perturbations, so the model also learns variation it should not. Prior work asked whether discrete units support high-quality generation, while rarely examining whether they are invariant to transformations like time-stretch that leave the spoken content unchanged.

Itai Gat, Felix Kreuk, Tu Anh Nguyen, Ann Lee, Jade Copet, Gabriel Synnaeve, Emmanuel Dupoux, Yossi Adi
speech-lm · discrete-representation · audio-tokenization · DOI · arXiv · DBLP
7
Skim · ACL 2023

Fine-grained Text Style Transfer with Diffusion-Based Language Models

Yiwei Lyu,Tiange Luo,Jiacheng Shi,Todd C. Hollon,Honglak Lee
diffusion-lmstyle-transfernon-autoregressiveDOIDBLP
7
泛读ACL 2023

MUX-PLMs: Pre-training Language Models with Data Multiplexing

LLM inference has low throughput and high cost; existing data-multiplexing (MIMO) methods can serve multiple requests at the cost of one input but lose too much accuracy to deploy, and no multiplexing scheme had been adapted to the pretraining pipeline.

Vishvak Murahari,Ameet Deshpande,Carlos E. Jimenez,Izhak Shafran,Mingqiu Wang,Yuan Cao,Karthik Narasimhan
Google DeepMindPrinceton Universitypretrainingefficiencydata-mixingDOIarXivDBLP
7
泛读ACL 2023

Arithmetic-Based Pretraining Improving Numeracy of Pretrained Language Models

Dominic Petrak,Nafise Sadat Moosavi,Iryna Gurevych
pretrainingnumeracysynthetic-dataDOIDBLP
7
泛读LongACL 2023

MatCha: Enhancing Visual Language Pretraining with Math Reasoning and Chart Derendering

Existing vision-language models reason poorly over structured visual data such as charts and plots; prior pretraining objectives focus on aligning natural images with text and lack targeted objectives for chart derendering and numerical reasoning.

Fangyu Liu,Francesco Piccinno,Syrine Krichene,Chenxi Pang,Kenton Lee,Mandar Joshi,Yasemin Altun,Nigel Collier,Julian Martin Eisenschlos
Google DeepMindchartvlm-pretrainmath-reasoningDOIarXivDBLP
4
LongACL 2023

Causes and Cures for Interference in Multilingual Translation

Language pairs interfere with each other in multilingual translation models, degrading low-resource performance; prior work mostly proposed ways to remove interference, with little systematic, quantitative analysis of what actually causes it.

Uri Shaham,Maha Elbayad,Vedanuj Goswami,Omer Levy,Shruti Bhosale
Meta AIUniversity of Washingtonmultilingualinterferencedata-mixtureDOIarXivDBLP
6
泛读LongACL 2023

Self-Adaptive In-Context Learning: An Information Compression Perspective for In-Context Example Selection and Ordering

ICL performance depends heavily on the selection and ordering of demonstrations; the common random-sampling practice is high-variance and weak on average, and there has been no unified principle for organizing examples.

Zhiyong Wu,Yaoxiang Wang,Jiacheng Ye,Lingpeng Kong
The University of Hong Kongin-context-learningexample-selectioninformation-compressionDOIarXivDBLP
4
FindingsACL 2023

Towards Reasoning in Large Language Models: A Survey

The boundaries of LLM reasoning ability, the methods for eliciting it, and its evaluation lack a systematic survey; existing studies are scattered across subfields without a unified framework.

Jie Huang,Kevin Chen-Chuan Chang
University of Illinois Urbana-ChampaignreasoningllmsurveyDOIarXivDBLP
7
泛读FindingsACL 2023

SERENGETI: Massively Multilingual Language Models for Africa

Existing multilingual pretrained models cover only about 31 African languages; most of Africa's 2,000+ languages lack usable pretraining resources, making NLP applications for low-resource African languages hard to deploy.

Ife Adebara,AbdelRahim A. Elmadany,Muhammad Abdul-Mageed,Alcides Alcoba Inciarte
University of British Columbiamultilingualpretraininglow-resourceDOIarXivDBLP
6
泛读FindingsACL 2023

Multilingual Pre-training with Self-supervision from Global Co-occurrence Information

Existing multilingual MLM pretraining objectives rely only on local context and lack global cross-lingual structural signals, bottlenecking cross-lingual transfer.

Xi Ai,Bin Fang
multilingualpretrainingco-occurrenceDOIDBLP
6
泛读LongACL 2023

How Do In-Context Examples Affect Compositional Generalization?

What drives compositional generalization under in-context learning is unclear; prior compositional-generalization research mostly targets fine-tuning, with no systematic analysis of the ICL setting.

Shengnan An,Zeqi Lin,Qiang Fu,Bei Chen,Nanning Zheng,Jian-Guang Lou,Dongmei Zhang
Microsoft Research AsiaXi'an Jiaotong Universityin-context-learningcompositional-generalizationllmDOIarXivDBLP
6
泛读FindingsACL 2023

Learning from Children: Improving Image-Caption Pretraining via Curriculum

Image-caption pretraining must align the multiple concepts in a caption with their counterparts in the image; existing methods train on image-text pairs of all difficulty levels at once, which is inefficient and noise-prone.

Hammad A. Ayyubi,Rahul Lokesh,Alireza Zareian,Bo Wu,Shih-Fu Chang
Columbia Universityvlm-pretrainingcurriculum-learningimage-text-alignmentDOIarXivDBLP
5
泛读FindingsACL 2023

The Larger they are, the Harder they Fail: Language Models do not Recognize Identifier Swaps in Python

LLMs do well on code generation, but whether they grasp the semantic invariances of programming languages (e.g. near-invariance under identifier renaming) had not been verified; standard evaluations do not cover this kind of semantic robustness.

Antonio Valerio Miceli Barone,Fazl Barez,Shay B. Cohen,Ioannis Konstas
University of Edinburghcode-llmrobustnesstokenizationDOIarXivDBLP
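The invariance being probed can be made concrete: swapping identifiers through the AST leaves program behavior unchanged, even though the token sequence a code LLM sees changes completely. A small sketch (assumes Python 3.9+ for `ast.unparse`; the paper's actual evaluation protocol may differ):

```python
import ast

def rename_identifiers(src, mapping):
    """Rename variable/function identifiers in Python source via the AST.
    Semantics are preserved as long as the mapping is a bijection that
    avoids builtins and keywords."""
    class Renamer(ast.NodeTransformer):
        def visit_Name(self, node):
            node.id = mapping.get(node.id, node.id)
            return node
        def visit_FunctionDef(self, node):
            node.name = mapping.get(node.name, node.name)
            self.generic_visit(node)   # recurse into args and body
            return node
        def visit_arg(self, node):
            node.arg = mapping.get(node.arg, node.arg)
            return node
    return ast.unparse(Renamer().visit(ast.parse(src)))

src = "def area(width, height):\n    return width * height"
swapped = rename_identifiers(src, {"width": "height", "height": "width"})

ns1, ns2 = {}, {}
exec(src, ns1)
exec(swapped, ns2)
# Both programs compute the same function, but a model relying on the
# *names* width/height may now misread the swapped version.
```

The paper's finding is that models increasingly trust the misleading names as they scale, even though the two programs are equivalent.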
5
泛读LongACL 2023

I2D2: Inductive Knowledge Distillation with NeuroLogic and Self-Imitation

Gains in commonsense ability have generally relied on scaling up; small models, limited by parameter count, cannot match large models' commonsense generation, and there was no commonsense distillation method that does not depend on scale.

Chandra Bhagavatula,Jena D. Hwang,Doug Downey,Ronan Le Bras,Ximing Lu,Lianhui Qin,Keisuke Sakaguchi,Swabha Swayamdipta,Peter West,Yejin Choi
Allen Institute for AI (AI2)knowledge-distillationsynthetic-datacommonsense-reasoningDOIarXivDBLP
5
泛读LongACL 2023

Simplicity Bias in Transformers and their Ability to Learn Sparse Boolean Functions

Prior work finds that Transformers underperform recurrent models at modeling formal languages, yet cannot explain why Transformers do better on real NLP tasks; a quantitative study of the Transformer's inductive bias was missing.

Satwik Bhattamishra,Arkil Patel,Varun Kanade,Phil Blunsom
University of Oxfordtransformer-mechanismsinductive-biaslearning-dynamicsDOIarXivDBLP
4
DemoACL 2023

Petals: Collaborative Inference and Fine-tuning of Large Models

Inference and fine-tuning of 100B+ parameter LLMs require high-end hardware; existing memory-offloading schemes are too slow for interactive use, and commercial APIs expose neither weights nor intermediates such as attention, which blocks research needs.

Alexander Borzunov,Dmitry Baranchuk,Tim Dettmers,Maksim Riabinin,Younes Belkada,Artem Chumachenko,Pavel Samygin,Colin Raffel
Hugging FaceUniversity of Washingtondistributed-inferenceparameter-efficient-finetuningsystem-optimizationDOIarXivDBLP
6
泛读FindingsACL 2023

On the Expressivity Role of LayerNorm in Transformers' Attention

Shaked Brody,Uri Alon,Eran Yahav
transformer-architecturelayernormattention-mechanismDOIDBLP
6
泛读LongACL 2023

Peek Across: Improving Multi-Document Modeling via Cross-Document Question-Answering

Existing multi-document pretraining objectives mostly learn semantic representations within single documents, lacking pretraining supervision for cross-document alignment and reasoning, which limits downstream multi-document tasks.

Avi Caciularu,Matthew E. Peters,Jacob Goldberger,Ido Dagan,Arman Cohan
Allen Institute for AI (AI2)pretraining-objectivelong-contextmulti-documentDOIarXivDBLP
6
泛读LongACL 2023

PuMer: Pruning and Merging Tokens for Efficient Vision Language Models

The problem: the cross-modal layers of vision-language Transformers are expensive — joint attention over image and text tokens makes the quadratic cost dominate inference latency and memory. Common fixes either compress only the image tokens or prune tokens uniformly, without exploiting the fact that the text query determines which visual tokens actually matter.

Qingqing Cao,Bhargavi Paranjape,Hannaneh Hajishirzi
University of Washingtonvlmtoken-mergingtoken-pruningDOIarXivDBLP
6
泛读LongACL 2023

Data Curation Alone Can Stabilize In-context Learning

The problem: why is in-context learning so unstable with respect to example choice, and can the variance be suppressed purely by data curation, without changing the ICL algorithm itself? Mainstream fixes rely on retrieval, ranking, calibration, or elaborate prompt engineering, implicitly assuming the instability comes from inference-time strategy rather than quality differences within the candidate example pool.

Ting-Yun Chang,Robin Jia
University of Southern Californiaicldata-qualitypromptingDOIarXivDBLP
6
泛读FindingsACL 2023

On the Difference of BERT-style and CLIP-style Text Encoders

The problem: what do BERT-style and CLIP-style text encoders actually learn differently — in particular, is the CLIP text tower merely a by-product of image contrastive learning? The community has focused on CLIP's visual encoder; its text encoder is typically used off the shelf, with no systematic analysis of how its capabilities compare to MLM text encoders.

Zhihong Chen,Guiming Chen,Shizhe Diao,Xiang Wan,Benyou Wang
cliptext-encodermultimodal-pretrainingDOIarXivDBLP
6
泛读LongACL 2023

DISCO: Distilling Counterfactuals with Large Language Models

The problem: high-quality counterfactual data is scarce, making it hard for models to learn more causal, robust task representations, while manual construction is slow and expensive. Prior options — small crowdsourced counterfactual sets, or supervised generators trained per intervention dimension — do not scale to new tasks and new intervention types.

Zeming Chen,Qiyue Gao,Antoine Bosselut,Ashish Sabharwal,Kyle Richardson
llmdata-synthesiscounterfactualDOIarXivDBLP
6
泛读LongACL 2023

mCLIP: Multilingual CLIP via Cross-lingual Transfer

Judging from the title, the paper extends CLIP into a multilingual image-text alignment model via cross-lingual transfer. The motivation is clear: CLIP's text side is heavily English-centric, limiting cross-lingual image-text retrieval and multilingual multimodal understanding; without the abstract, however, whether the method is teacher-student distillation, translation-based distillation, or alignment transfer cannot be reliably confirmed.

Guanhua Chen,Lu Hou,Yun Chen,Wenliang Dai,Lifeng Shang,Xin Jiang,Qun Liu,Jia Pan,Wenping Wang
clipmultilingualcross-lingualDOIDBLP
6
泛读LongACL 2023

Weakly Supervised Vision-and-Language Pre-training with Relative Representations

Chi Chen,Peng Li,Maosong Sun,Yang Liu
vl-pretrainingweak-supervisioncontrastive-learningDOIDBLP
5
泛读FindingsACL 2023

Decouple knowledge from paramters for plug-and-play language modeling

Existing pretrained LMs encode knowledge implicitly in their parameters; once training ends, the knowledge is neither editable nor extensible and lacks interpretability, unsuited to settings where knowledge changes rapidly.

Xin Cheng,Yankai Lin,Xiuying Chen,Dongyan Zhao,Rui Yan
Peking Universityknowledgeretrievallanguage-modelingDOIDBLP
6
泛读LongACL 2023

Analyzing Transformers in Embedding Space

Most Transformer interpretability methods require forward/backward passes; zero-pass (input-free) methods cover only part of the parameters or shallow layers, and a zero-pass interpretability framework covering all parameters was missing.

Guy Dar,Mor Geva,Ankit Gupta,Jonathan Berant
Tel Aviv Universityinterpretabilitytransformerembedding-spaceDOIarXivDBLP
6
泛读LongACL 2023

Mixture-of-Domain-Adapters: Decoupling and Injecting Domain Knowledge to Pre-trained Language Models' Memories

Shizhe Diao,Tianyang Xu,Ruijia Xu,Jiawei Wang,Tong Zhang
moedomain-adaptationcontinual-pretrainDOIDBLP
6
泛读ShortACL 2023

When to Use Efficient Self Attention? Profiling Text, Speech and Image Transformer Variants

Efficiency thresholds of efficient self-attention variants were previously measured per modality in isolation, with no unified comparison across text, speech, and vision; practitioners often assume switching to efficient attention pays off even at short sequence lengths, causing needless performance or efficiency loss.

Anuj Diwan,Eunsol Choi,David Harwath
University of Texas at AustinMeta AIattentionefficiencyprofilingDOIarXivDBLP
6
泛读LongACL 2023

Towards Leaving No Indic Language Behind: Building Monolingual Corpora, Benchmark and Models for Indic Languages

Previous Indic pretraining corpora covered few languages at small scale, and there was no unified multi-task NLU benchmark, leaving LLM capability for low-resource Indic languages far behind high-resource ones — the languages of a billion users uncovered.

Sumanth Doddapaneni,Rahul Aralikatte,Gowtham Ramesh,Shreya Goyal,Mitesh M. Khapra,Anoop Kunchukuttan,Pratyush Kumar
Indian Institute of Technology MadrasmultilingualindiccorpusDOIarXivDBLP
5
泛读LongACL 2023

ColD Fusion: Collaborative Descent for Distributed Multitask Finetuning

Large-scale multitask fine-tuning previously required simultaneous access to all training datasets and heavy compute — feasible only for well-resourced teams — and could not reuse fine-tuning work done by other groups, keeping the bar for multitask learning high.

Shachar Don-Yehiya,Elad Venezian,Colin Raffel,Noam Slonim,Leshem Choshen
GoogleUniversity of North Carolina at Chapel Hillmultitaskdistributed-trainingfinetuningDOIarXivDBLP
4
LongACL 2023

Generalizing Backpropagation for Gradient-Based Interpretability

Gradient-based attribution methods only yield input-to-output contributions and cannot explain the model's internal reasoning paths; interpretability granularity stops at the input layer.

Kevin Du,Lucas Torroba Hennigen,Niklas Stoehr,Alex Warstadt,Ryan Cotterell
ETH ZurichJohns Hopkins UniversityinterpretabilitygradientattributionDOIarXivDBLP
5
泛读FindingsACL 2023

LMentry: A Language Model Benchmark of Elementary Language Tasks

LLM benchmarks keep getting more complex, in an arms race with model capability, and cannot quickly expose deficits in basic skills; many large models still fail tasks that are trivially easy for humans.

Avia Efrat,Or Honovich,Omer Levy
Technion – Israel Institute of Technologybenchmarkevaluationelementary-tasksDOIarXivDBLP
6
泛读LongACL 2023

Mitigating Label Biases for In-context Learning

Addresses label bias in in-context learning: models tend to predict certain labels regardless of the input content.

Yu Fei,Yifan Hou,Zeming Chen,Antoine Bosselut
EPFLiclbiascalibrationDOIDBLP
6
泛读LongACL 2023

From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models

Traces how political biases propagate step by step from pretraining corpora into language-model parameters and ultimately affect the fairness of downstream NLP tasks.

Shangbin Feng,Chan Young Park,Yuhan Liu,Yulia Tsvetkov
University of Washingtonbiaspretrain-datadownstreamDOIDBLP
6
泛读FindingsACL 2023

DiffuDetox: A Mixed Diffusion Model for Text Detoxification

Explores replacing conventional autoregressive models with a mixed diffusion model for text detoxification.

Griffin Floto,Mohammad Mahdi Abdollah Pour,Parsa Farinneya,Zhenwei Tang,Ali Pesaranghader,Manasa Bharadwaj,Scott Sanner
diffusion-lmdetoxificationcontrollable-generationDOIDBLP
6
泛读IndustryACL 2023

pNLP-Mixer: an Efficient all-MLP Architecture for Language

Tackles the quadratic computational complexity of self-attention in Transformers and explores the viability of an all-MLP architecture for language modeling.

Francesco Fusco,Damian Pascual,Peter W. J. Staar,Diego Antognini
IBM ResearchefficientmlparchitectureDOIDBLP
6
泛读LongACL 2023

Precise Zero-Shot Dense Retrieval without Relevance Labels

How to achieve precise zero-shot dense retrieval without relevance-labeled data; unsupervised retrieval has typically been limited by the semantic gap between queries and documents.

Luyu Gao,Xueguang Ma,Jimmy Lin,Jamie Callan
CMUUniversity of Waterlooretrievalhydezero-shotDOIDBLP
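The HyDE idea can be sketched end to end with toy components: embed an LLM-generated *hypothetical answer document* instead of the terse question, then rank real documents by similarity. The bag-of-words encoder and `fake_llm` below are stand-ins for a real dense encoder and an instruction-following LLM:

```python
def embed(text):
    """Toy bag-of-words embedding standing in for a dense encoder."""
    words = text.lower().split()
    return {w: words.count(w) / len(words) for w in words}

def cosine(u, v):
    dot = sum(u.get(w, 0) * v.get(w, 0) for w in set(u) | set(v))
    nu = sum(x * x for x in u.values()) ** 0.5
    nv = sum(x * x for x in v.values()) ** 0.5
    return dot / (nu * nv) if nu and nv else 0.0

def hyde_retrieve(question, generate_hypothesis, docs):
    """HyDE-style retrieval: the hypothetical document lives in the same
    'answer space' as real documents, bridging the query-document gap."""
    hvec = embed(generate_hypothesis(question))
    return max(docs, key=lambda d: cosine(hvec, embed(d)))

docs = ["paris is the capital of france",
        "the mitochondria is the powerhouse of the cell"]
fake_llm = lambda q: "the capital of france is paris a city in france"
best = hyde_retrieve("what is the capital of france", fake_llm, docs)
```

Even if the hypothesis contains hallucinated details, it shares surface and semantic features with relevant documents — which is why the retrieval step tolerates generation errors.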
6
泛读LongACL 2023

Small Pre-trained Language Models Can be Fine-tuned as Large Models via Over-Parameterization

Small pretrained LMs have limited capacity at fine-tuning time and struggle to fit complex downstream task data.

Ze-Feng Gao,Kun Zhou,Peiyu Liu,Wayne Xin Zhao,Ji-Rong Wen
Renmin University of Chinafine-tuningover-parameterizationsmall-modelsDOIDBLP
6
泛读LongACL 2023

The Benefits of Bad Advice: Autocontrastive Decoding across Model Layers

The problem: how to mitigate generation degradation at decoding time — especially degeneration from over-confident top-layer representations — without changing parameters or retraining. Conventional fixes operate on the output distribution (temperature, contrastive decoding, reranking) and rarely exploit the structural contrast between the "good advice" and "bad advice" offered by different layers of the same model.

Ariel Gera,Roni Friedman,Ofir Arviv,R. Chulaka Gunasekara,Benjamin Sznajder,Noam Slonim,Eyal Shnarch
decodingcontrastive-decodinglayer-analysisDOIDBLP
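A minimal sketch of the layer-contrast idea: score tokens by the final layer's log-probabilities minus those of an early "immature" layer, so predictions the shallow layer already makes confidently are discounted. This is a generic contrastive-decoding form under stated assumptions, not necessarily the paper's exact scoring rule:

```python
import math

def autocontrastive_logprobs(final_logits, early_logits, alpha=1.0):
    """Contrast the mature (final-layer) distribution against an immature
    (early-layer) one: tokens the early layer already favors are
    down-weighted, rewarding what the deep layers add."""
    def log_softmax(logits):
        m = max(logits)
        z = math.log(sum(math.exp(x - m) for x in logits)) + m
        return [x - z for x in logits]
    lp_final = log_softmax(final_logits)
    lp_early = log_softmax(early_logits)
    return [f - alpha * e for f, e in zip(lp_final, lp_early)]

# Token 0 is favored by both layers; token 1 only emerges in the final layer.
scores = autocontrastive_logprobs([2.0, 1.0, 0.1], [2.0, -1.0, 0.1])
best = max(range(3), key=lambda i: scores[i])
```

In this toy case the contrastive score prefers token 1 — the prediction the deep layers contributed — over token 0, which the shallow layer already "knew".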
6
泛读FindingsACL 2023

PreQuant: A Task-agnostic Quantization Approach for Pre-trained Language Models

The problem: quantizing pretrained LMs usually depends on task-specific calibration or fine-tuning, raising deployment cost and making transfer brittle — can quantization be made task-agnostic? Many PTQ/QAT methods work in classification or single-task settings, but for general-purpose PLMs the cross-task accuracy drop is often uncontrolled.

Zhuocheng Gong,Jiahao Liu,Qifan Wang,Yang Yang,Jingang Wang,Wei Wu,Yunsen Xian,Dongyan Zhao,Rui Yan
quantizationtask-agnosticpretrained-lmDOIDBLP
6
泛读FindingsACL 2023

Mitigating the Learning Bias towards Repetition by Self-Contrastive Training for Open-Ended Generation

The paper's conclusion: repetition in open-ended generation is not only a decoding problem — part of the root cause is that MLE training teaches the model simple repetitive patterns early and systematically over-estimates the probability of repeated tokens. Most prior work penalizes or reranks at decoding time; this work pushes the question one level back, into training dynamics.

Jian Guan,Minlie Huang
Tsinghua Universityrepetitiontraining-biascontrastive-trainingDOIarXivDBLP
6
泛读LongACL 2023

PAD-Net: An Efficient Framework for Dynamic Networks

The problem: dynamic networks need not be "fully dynamic per layer" — that brings clear parameter redundancy and deployment cost. DY-Conv and MoE-style methods typically make every parameter in a layer input-dependent; the authors argue that most of the dynamism only needs to live in a few key parameters, with the rest kept static at little loss.

Shwai He,Liang Ding,Daize Dong,Boan Liu,Fuqiang Yu,Dacheng Tao
moedynamic-networksefficiencyDOIarXivDBLP
6
泛读LongACL 2023

Z-Code++: A Pre-trained Language Model Optimized for Abstractive Summarization

The problem: general-purpose encoder-decoder pretrained models are not designed for abstractive summarization, with clear weaknesses in low-resource and long-document settings. The usual route fine-tunes a generic Seq2Seq model directly, under-exploiting task-specific structure, long-context organization, and the grounded generation present in summarization corpora.

Pengcheng He,Baolin Peng,Song Wang,Yang Liu,Ruochen Xu,Hany Hassan ... 2 authors omitted ... ,Wayne Xiong,Michael Zeng,Jianfeng Gao,Xuedong Huang
continual-pretrainingencoder-decodersummarizationDOIarXivDBLP
6
泛读LongACL 2023

Large Language Models Are Reasoning Teachers

From the title, the core question: can large language models act as "reasoning teachers", converting their own reasoning ability into supervision other models can learn from? Few-shot CoT has mostly meant letting the large model reason at test time, not systematically distilling those reasoning traces into smaller models or lower-data settings.

Namgyu Ho,Laura Schmid,Se-Young Yun
reasoningdistillationsynthetic-dataDOIDBLP
6
泛读FindingsACL 2023

The Diminishing Returns of Masked Language Models to Science

The problem: in science domains, do masked language models still follow the usual "bigger model, more data, longer training" scaling pattern? That scaling narrative has worked well in general NLP, but scientific tasks differ in data distribution, label needs, and terminology density, so simply importing general-domain experience may not hold.

Zhi Hong,Aswathy Ajith,J. Gregory Pauloski,Eamon Duede,Kyle Chard,Ian T. Foster
scaling-lawmasked-language-modeldomain-adaptationDOIarXivDBLP
6
泛读LongACL 2023

Faithful Question Answering with Monte-Carlo Planning

The problem: a language model may answer correctly while the intermediate reasoning it displays is not actually what produced the answer — "interpretable" is not the same as "faithful". Chain-of-thought usually lets the model free-generate a reasoning trace, which may be post-hoc rationalization with no guaranteed link to the answer-generating mechanism.

Ruixin Hong,Hongming Zhang,Hong Zhao,Dong Yu,Changshui Zhang
monte-carlo-planningreasoninginference-time-computeDOIarXivDBLP
6
泛读LongACL 2023

In-Context Analogical Reasoning with Pre-Trained Language Models

Can LLMs do Raven-style visuospatial analogical reasoning (matrix completion) in the in-context setting? Prior analogical-reasoning research mostly stayed in text-only semantic analogy (A:B::C:?); structured visual reasoning with LLMs was nearly untouched.

Xiaoyang Hu,Shane Storks,Richard L. Lewis,Joyce Chai
University of Michiganin-context-learningreasoninganalogyDOIDBLP
6
泛读IndustryACL 2023

MathPrompter: Mathematical Reasoning using Large Language Models

LLMs make arithmetic mistakes on math problems and give inconsistent answers across different generation paths for the same question; plain CoT provides a reasoning chain but no mechanism to verify that the arithmetic is correct.

Shima Imani,Liang Du,Harsh Shrivastava
Microsoft ResearchreasoningpromptingmathematicsDOIDBLP
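The verification mechanism can be sketched as cross-checking several independently generated solution routes and accepting an answer only on full agreement; `algebraic`, `stepwise`, and `buggy` below are stand-ins for the LLM-generated algebraic templates and code snippets the method actually uses:

```python
from collections import Counter

def consistent_answer(solvers, problem, ndigits=6):
    """MathPrompter-style check: solve the same problem along several
    independent paths and only trust an answer on which all paths agree."""
    answers = [round(s(problem), ndigits) for s in solvers]
    value, votes = Counter(answers).most_common(1)[0]
    return value if votes == len(solvers) else None

problem = {"price": 20.0, "discount": 0.25}
algebraic = lambda p: p["price"] * (1 - p["discount"])          # closed form
stepwise = lambda p: p["price"] - p["price"] * p["discount"]    # step-by-step
buggy = lambda p: p["price"] - p["discount"]                    # faulty route

answer = consistent_answer([algebraic, stepwise], problem)       # paths agree
rejected = consistent_answer([algebraic, buggy], problem)        # paths disagree
```

Agreement across routes catches exactly the inconsistency the entry describes: when any route diverges, no answer is emitted rather than a confidently wrong one.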
6
泛读FindingsACL 2023

Differentiable Instruction Optimization for Cross-Task Generalization

Instructions in instruction tuning are human-written discrete text and not necessarily optimal; can the instructions themselves be treated as parameters and optimized for better cross-task generalization?

Masaru Isonuma,Junichiro Mori,Ichiro Sakata
University of Tokyoinstruction-tuningoptimizationcross-taskDOIDBLP
6
泛读LongACL 2023

HINT: Hypernetwork Instruction Tuning for Efficient Zero- and Few-Shot Generalisation

Instruction-tuned models must stuff the long instruction into the context for every prompt at inference time — expensive and redundant. Can instructions be "compiled" into parameter-level model variants so that inference no longer depends on the prompt?

Hamish Ivison,Akshita Bhagia,Yizhong Wang,Hannaneh Hajishirzi,Matthew E. Peters
Allen Institute for AIUniversity of Washingtoninstruction-tuninghypernetworkparameter-efficientDOIDBLP
6
泛读LongACL 2023

LLM-Blender: Ensembling Large Language Models with Pairwise Ranking and Generative Fusion

Dongfu Jiang,Xiang Ren,Bill Yuchen Lin
llm-ensemblerankingfusionDOIDBLP
6
泛读LongACL 2023

Pruning Pre-trained Language Models Without Fine-Tuning

Ting Jiang,Deqing Wang,Fuzhen Zhuang,Ruobing Xie,Feng Xia
pruningplmcompressionDOIDBLP
6
泛读LongACL 2023

Patton: Language Model Pretraining on Text-Rich Networks

Standard LM pretraining treats documents as independent sequences, ignoring the rich graph structure of real data (citation networks, product co-occurrence graphs), so models cannot internalize the relational context between nodes.

Bowen Jin,Wentao Zhang,Yu Zhang,Yu Meng,Xinyang Zhang,Qi Zhu,Jiawei Han
UIUCTencentpretraingraphtext-richDOIDBLP
6
泛读FindingsACL 2023

FiDO: Fusion-in-Decoder optimized for stronger performance and faster inference

In Fusion-in-Decoder (FiD), the decoder's cross-attention cost grows linearly with the number of retrieved documents, making inference extremely slow and hard to scale in production.

Michiel de Jong,Yury Zemlyanskiy,Joshua Ainslie,Nicholas FitzGerald,Sumit Sanghai,Fei Sha,William W. Cohen
Google ResearchfiddecoderinferenceDOIDBLP
6
泛读FindingsACL 2023

Feature Interactions Reveal Linguistic Structure in Language Models

Language models such as Transformers are architecturally purely sequential, with no explicit tree-structural inductive bias, yet they show deep command of complex syntax (e.g. constituency trees); the internal mechanism remains unclear.

Jaap Jumelet,Willem H. Zuidema
University of Amsterdaminterpretabilityfeature-interactionlinguistic-structureDOIDBLP
6
泛读LongACL 2023

Evaluating Open-Domain Question Answering in the Era of Large Language Models

When evaluating LLMs' open-domain QA (ODQA) ability, the traditional exact-match (EM) metric heavily penalizes verbose or conversational but correct answers (e.g. "The answer is Paris" instead of "Paris"), severely underestimating models' real capability.

Ehsan Kamalloo,Nouha Dziri,Charles L. A. Clarke,Davood Rafiei
University of WaterlooCohereevaluationqallmDOIDBLP
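The scoring gap is easy to reproduce: under SQuAD-style exact match the verbose answer gets zero credit, while a containment-style lenient check (illustrative, not the paper's exact protocol) accepts it:

```python
import re
import string

def normalize(text):
    """SQuAD-style answer normalization: lowercase, drop articles,
    strip punctuation, collapse whitespace."""
    text = text.lower()
    text = re.sub(r'\b(a|an|the)\b', ' ', text)
    text = ''.join(ch for ch in text if ch not in string.punctuation)
    return ' '.join(text.split())

def exact_match(prediction, gold):
    return normalize(prediction) == normalize(gold)

def lenient_match(prediction, gold):
    """Credit verbose but correct answers: the gold answer only needs to
    appear inside the normalized prediction."""
    return normalize(gold) in normalize(prediction)

pred = "The answer is Paris."
gold = "Paris"
# exact_match(pred, gold) is False; lenient_match(pred, gold) is True.
```

The same correct answer flips from 0 to 1 depending on the matcher — the measurement artifact the paper quantifies at scale.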
6
泛读FindingsACL 2023

Uncovering Hidden Consequences of Pre-training Objectives in Sequence-to-Sequence Models

In Seq2Seq pretraining, different objectives (T5's span corruption vs. BART's denoising) are assumed to be mere routes to generic representations, yet they actually implant generation biases in the model (length preferences, copying tendencies) that fine-tuning struggles to remove.

Tannon Kew,Rico Sennrich
University of Zurichpretrain-objectiveseq2seqanalysisDOIDBLP
6
泛读LongACL 2023

Entity Tracking in Language Models

Autoregressive LMs are good at memorizing static knowledge, but whether they can truly maintain a dynamically updated internal "world model" over long text (e.g. tracking an entity's location or ownership as a story unfolds) remains in doubt.

Najoung Kim,Sebastian Schuster
Boston UniversityRutgers Universityentity-trackingprobinganalysisDOIDBLP
6
泛读ShortACL 2023

Do Models Really Learn to Follow Instructions? An Empirical Study of Instruction Tuning

Investigates whether instruction-tuned models truly "understand instruction semantics" or merely "fit surface patterns of inputs and outputs".

Po-Nien Kung,Nanyun Peng
UCLAinstruction-tuningsftempirical-studyDOIDBLP
6
泛读FindingsACL 2023

A Study on Knowledge Distillation from Weak Teacher for Scaling Up Pre-trained Language Models

Tackles the slow convergence and extreme compute cost of pretraining large LMs from scratch, exploring how existing smaller models can be used to accelerate large-model pretraining.

Hayeon Lee,Rui Hou,Jongpil Kim,Davis Liang,Sung Ju Hwang,Alexander Min
KakaoBrainKAISTknowledge-distillationscalingpretrainDOIDBLP
6
泛读LongACL 2023

Symbolic Chain-of-Thought Distillation: Small Models Can Also "Think" Step-by-Step

Tackles the problem that small models (e.g. sub-1B-parameter T5/BART) lack multi-step (chain-of-thought) reasoning ability, while large-model inference is too expensive.

Liunian Harold Li,Jack Hessel,Youngjae Yu,Xiang Ren,Kai-Wei Chang,Yejin Choi
University of WashingtonAllen Institute for AI (AI2)cotdistillationreasoningDOIDBLP
6
泛读ShortACL 2023

NarrowBERT: Accelerating Masked Language Model Pretraining and Inference

Tackles the inefficiency of masked language models (e.g. BERT/RoBERTa): at pretraining and inference, the deep Transformer layers run over all tokens — including the unmasked, redundant ones — wasting compute.

Haoxin Li,Phillip Keung,Daniel Cheng,Jungo Kasai,Noah A. Smith
University of Washingtonmlmpretraining-accelerationefficiencyDOIDBLP
6
泛读FindingsACL 2023

Recurrent Attention Networks for Long-text Modeling

This paper targets the fact that Transformers are context-limited and memory-hungry on long text, while classic recurrent models lack comparable modeling power. Long text is usually handled via truncation, sparse attention, or external memory — losing information or complicating training and implementation — so the authors bring recurrence back into attention networks.

Xianming Li,Zongxi Li,Xiaotian Luo,Haoran Xie,Xing Lee,Yingbin Zhao,Fu Lee Wang,Qing Li
long-contextrecurrent-attentionarchitectureDOIDBLP
6
泛读LongACL 2023

Unified Demonstration Retriever for In-Context Learning

The problem: demonstration retrieval for in-context learning is too task-specific, hurting transfer, scalability, and deployment cost. Training one retriever per task works in the short term but brings parameter redundancy in multi-task settings and no guarantee of cross-task generalization.

Xiaonan Li,Kai Lv,Hang Yan,Tianyang Lin,Wei Zhu,Yuan Ni,Guotong Xie,Xiaoling Wang,Xipeng Qiu
icldemonstration-retrievalin-context-learningDOIarXivDBLP
6
泛读FindingsACL 2023

The Web Can Be Your Oyster for Improving Language Models

The title signals the question of how to exploit the open web to improve language models. Web data has long powered pretraining, but the common patterns are one-off large-scale offline crawls or inference-time retrieval augmentation; the title suggests treating the Web as an active, continual resource for model improvement rather than a static corpus warehouse.

Junyi Li,Tianyi Tang,Wayne Xin Zhao,Jingyuan Wang,Jian-Yun Nie,Ji-Rong Wen
web-datadata-qualitypretrainingDOIDBLP
6
泛读FindingsACL 2023

Mind the Biases: Quantifying Cognitive Biases in Language Model Prompting

Ruixi Lin,Hwee Tou Ng
biaspromptingevaluationDOIDBLP
3
FindingsACL 2023

Transferring General Multimodal Pretrained Models to Text Recognition

Previous OCR models required pretraining on large amounts of labeled or synthetic OCR-specific data — costly to collect and slow to transfer to new languages or scenarios.

Junyang Lin,Xuancheng Ren,Yichang Zhang,Gao Liu,Peng Wang,An Yang,Chang Zhou
Alibaba DAMO Academyvlmtransfer-learningocrDOIarXivDBLP
6
泛读LongACL 2023

On Improving Summarization Factual Consistency from Natural Language Feedback

Yixin Liu,Budhaditya Deb,Milagro Teruel,Aaron Halfaker,Dragomir Radev,Ahmed Hassan Awadallah
feedback-learningsummarizationfactualityDOIDBLP
6
泛读ShortACL 2023

BOLT: Fast Energy-based Controlled Text Generation with Tunable Biases

Energy-based controllable text generation previously needed many rounds of iterative sampling: slow to converge, impractical, and prone to degrading fluency.

Xin Liu,Muhammad Khalifa,Lu Wang
University of Michiganenergy-basedcontrolled-generationdecodingDOIarXivDBLP
6
泛读FindingsACL 2023

Code Execution with Pre-trained Language Models

Previous code pretraining attends only to the syntactic structure of source code, not the dynamic semantics of execution, leading to low code-generation accuracy and no ability to judge what code actually does at runtime.

Chenxiao Liu,Shuai Lu,Weizhu Chen,Daxin Jiang,Alexey Svyatkovskiy,Shengyu Fu,Neel Sundaresan,Nan Duan
Microsoftcodeexecutiondata-augmentationDOIarXivDBLP
6
泛读LongACL 2023

Binary and Ternary Natural Language Generation

Binarized/ternarized Transformer generative models are extremely hard to optimize: conventional quantization loses too much precision in attention layers and accumulates errors over autoregressive decoding, so low-precision generative models had never reached usable quality.

Zechun Liu,Barlas Oguz,Aasish Pappu,Yangyang Shi,Raghuraman Krishnamoorthi
quantizationbinaryternaryDOIarXivDBLP
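For reference, a generic TWN-style ternarization sketch that shows why the precision loss is so aggressive — every weight collapses onto {-s, 0, +s}; the paper's actual scheme for attention and decoding is considerably more involved:

```python
def ternarize(weights):
    """Ternary weight quantization sketch: zero out small weights via a
    threshold t = 0.7 * mean(|w|), keep only the signs of the rest, and
    use one shared scale s (mean magnitude of survivors) per tensor."""
    mean_abs = sum(abs(w) for w in weights) / len(weights)
    t = 0.7 * mean_abs
    mask = [1 if w > t else -1 if w < -t else 0 for w in weights]
    survivors = [abs(w) for w, m in zip(weights, mask) if m != 0]
    s = sum(survivors) / len(survivors) if survivors else 0.0
    return [m * s for m in mask]

q = ternarize([0.9, -0.8, 0.05, -0.1, 0.6])
# Five distinct weights collapse to just two magnitudes: 0 and s.
```

A whole tensor is now described by one float plus 2-bit codes — the memory win — but all intra-level variation is gone, which is exactly the representational cliff the entry describes for attention and autoregressive decoding.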
6
泛读FindingsACL 2023

The Magic of IF: Investigating Causal Reasoning Abilities in Large Language Models of Code

Xiao Liu,Da Yin,Chen Zhang,Yansong Feng,Dongyan Zhao
code-llmcausal-reasoningevaluationDOIDBLP
6
泛读LongACL 2023

What Makes Pre-trained Language Models Better Zero-shot Learners?

Jinghui Lu,Dongsheng Zhu,Weidong Han,Rui Zhao,Brian Mac Namee,Fei Tan
zero-shotpretrain-analysisiclDOIDBLP
6
泛读LongACL 2023

HyperMixer: An MLP-based Low Cost Alternative to Transformers

Florian Mai,Arnaud Pannatier,Fabio Fehr,Haolin Chen,François Marelli,François Fleuret,James Henderson
mlparchitecturetransformer-altDOIDBLP
6
泛读LongACL 2023

Benchmarking Large Language Model Capabilities for Conditional Generation

Joshua Maynez,Priyanka Agrawal,Sebastian Gehrmann
benchmarkconditional-generationllm-evalDOIDBLP
6
泛读LongACL 2023

On the Efficacy of Sampling Adapters

Clara Meister,Tiago Pimentel,Luca Malagutti,Ethan Wilcox,Ryan Cotterell
samplingdecodinggenerationDOIDBLP
6
泛读ShortACL 2023

A Natural Bias for Language Generation Models

Clara Meister,Wojciech Stokowiec,Tiago Pimentel,Lei Yu,Laura Rimell,Adhiguna Kuncoro
inductive-biasgenerationtraining-dynamicsDOIDBLP
6
泛读FindingsACL 2023

Rarely a problem? Language models exhibit inverse scaling in their predictions following few-type quantifiers

James A. Michaelov,Benjamin K. Bergen
inverse-scalingquantifiersscalingDOIDBLP
6
泛读LongACL 2023

DecompX: Explaining Transformers Decisions by Propagating Token Decomposition

Existing Transformer attribution methods (e.g. Attention Rollout or gradient-based methods) often produce distorted or unfaithful explanations when handling non-linear activations and the complex information flow of attention.

Ali Modarressi,Mohsen Fayyaz,Ehsan Aghazadeh,Yadollah Yaghoobzadeh,Mohammad Taher Pilehvar
Tehran UniversityinterpretabilityattributiontransformerDOIDBLP
6
泛读LongACL 2023

Large-scale Lifelong Learning of In-context Instructions and How to Tackle It

Continually injecting new instruction-tuning data over a model's lifetime (lifelong learning) causes catastrophic forgetting of earlier instructions, while re-fine-tuning from scratch on all data mixed together each time is computationally unacceptable.

Jisoo Mok,Jaeyoung Do,Sungjin Lee,Tara Taghavi,Seunghak Yu,Sungroh Yoon
Amazoncontinual-learningin-contextinstruction-tuningDOIDBLP
6
泛读ShortACL 2023

MetaVL: Transferring In-Context Learning Ability From Language Models to Vision-Language Models

Vision-language models (VLMs) usually lack the strong zero- and few-shot in-context learning of text-only LLMs, and pretraining a multimodal model with strong ICL from scratch is prohibitively expensive.

Masoud Monajatipoor,Liunian Harold Li,Mozhdeh Rouhsedaghat,Lin Yang,Kai-Wei Chang
UCLAvlmin-context-learningmeta-learningDOIDBLP
6
泛读FindingsACL 2023

Meta-training with Demonstration Retrieval for Efficient Few-shot Learning

Retrieval-augmented ICL must retrieve and concatenate many demonstrations at inference time, inflating context length and compute cost.

Aaron Mueller,Kanika Narang,Lambert Mathias,Qifan Wang,Hamed Firooz
Meta AImeta-trainingfew-shotretrievalDOIDBLP
6
泛读LongACL 2023

Finding the Pillars of Strength for Multi-Head Attention

The core question: which heads or structural units in multi-head attention actually carry the main function, rather than treating all heads as equally important by default. Prior work used averaged attention maps or coarse pruning — enough to see redundancy, but not to locate the "key pillars" — offering little guidance for compression, interpretation, or stable training.

Jinjie Ni,Rui Mao,Zonglin Yang,Han Lei,Erik Cambria
multi-head-attentionattention-analysistransformerDOIDBLP
6
泛读ShortACL 2023

Self-Distilled Quantization: Achieving High Compression Rates in Transformer-Based Language Models

This paper tackles the steep accuracy drop of Transformer LMs under high-compression quantization. Post-training quantization often falls off a representational cliff at very low bit-widths, and quantization-aware training adds training cost; the authors use self-distillation to push compression further without introducing a heavy teacher.

James O'Neill,Sourav Dutta
quantizationself-distillationcompressionDOIDBLP
6
泛读LongACL 2023

Token-wise Decomposition of Autoregressive Language Model Hidden States for Analyzing Model Predictions

The problem: when an autoregressive LM predicts a token at some step, how much of the information in its hidden state comes from each specific token in the history? Common analyses stop at attention weights or aggregate activations, neither of which directly tells you how strongly each past token shaped the current prediction.

Byung-Doh Oh,William Schuler
interpretabilityhidden-statesautoregressiveDOIDBLP
6
泛读LongACL 2023

Can LMs Learn New Entities from Descriptions? Challenges in Propagating Injected Knowledge

The core question: after injecting a new entity and its description into a language model, has the model truly learned the entity — can it propagate the knowledge to different phrasings and reasoning settings? Much prior work only checks whether the model can restate the trained description sentences, dodging the harder question of whether new knowledge enters a generalizable parametric representation.

Yasumasa Onoe,Michael J. Q. Zhang,Shankar Padmanabhan,Greg Durrett,Eunsol Choi
knowledge-injectionentity-learningknowledge-propagationDOIDBLP
6
泛读ShortACL 2023

Controlling the Extraction of Memorized Data from Large Language Models via Prompt-Tuning

The core question: can prompt-tuning actively control the extraction of memorized data from LLMs, rather than only passively measuring leakage risk? Membership-inference and extraction attacks mostly ask "will the model leak?"; few ask "can lightweight tuning systematically amplify or suppress that extraction behavior?".

Mustafa Özdayi,Charith Peris,Jack FitzGerald,Christophe Dupuy,Jimit Majmudar,Haidar Khan,Rahil Parikh,Rahul Gupta
memorizationprivacyprompt-tuningDOIDBLP
6
泛读LongACL 2023

Socratic Pretraining: Question-Driven Pretraining for Controllable Summarization

The core question: can summarization models learn more controllable behavior during pretraining, rather than patching it in at fine-tuning with scarce supervision? Summarization pretraining is usually a generic LM objective, and control attributes such as focus, angle, or question-orientation get bolted on downstream via labels or prompts, often unstably.

Artidoro Pagnoni,Alexander R. Fabbri,Wojciech Kryscinski,Chien-Sheng Wu
pretraining-objectivecontrollable-generationsummarizationDOIDBLP
6
泛读LongACL 2023

Cross-modal Attention Congruence Regularization for Vision-Language Relation Alignment

VLMs' cross-modal attention does not truly align the correct image objects with the corresponding text tokens, especially when relational semantics like "subject-predicate-object" must be distinguished. CLIP-style contrastive pretraining learns mostly coarse-grained matching and is insensitive to fine-grained relations.

Rohan Pandey,Rulin Shao,Paul Pu Liang,Ruslan Salakhutdinov,Louis-Philippe Morency
CMUvision-language-alignmentcross-modal-attentionregularizationDOIDBLP
6
泛读IndustryACL 2023

HyperT5: Towards Compute-Efficient Korean Language Modeling

T5-style Korean pretraining is inefficient at small/medium compute: Korean's rich morphology and severe BPE/word fragmentation mean more training tokens are needed to reach the same downstream performance. The goal is to train Korean LMs more cheaply under a limited compute budget.

Dongju Park,Soonwon Ka,Kang Min Yoo,Gichang Lee,Jaewook Kang
NAVERcompute-efficiencylanguage-modelingt5DOIDBLP
6
泛读ShortACL 2023

Token-Level Self-Evolution Training for Sequence-to-Sequence Learning

In seq2seq training, teacher forcing treats all tokens alike, but token difficulty varies widely (frequent function words vs. rare entities); under-training the hard tokens slows convergence and hurts long-tail performance.

Keqin Peng,Liang Ding,Qihuang Zhong,Yuanxin Ouyang,Wenge Rong,Zhang Xiong,Dacheng Tao
ByteDanceBeihang Universityself-trainingseq2seqtraining-dynamicsDOIDBLP
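The core intuition — upweight tokens the model still gets wrong — can be sketched as a focal-style reweighting of per-token cross-entropy. This is illustrative only; the paper's self-evolution objective differs in how it identifies and re-teaches hard tokens:

```python
import math

def weighted_token_loss(token_logprobs, gamma=2.0):
    """Re-weight per-token cross-entropy so hard tokens (low probability
    under the current model) dominate the update, focal-loss style."""
    losses = []
    for lp in token_logprobs:
        p = math.exp(lp)                 # model probability of the gold token
        weight = (1.0 - p) ** gamma      # ~0 for easy tokens, ~1 for hard ones
        losses.append(-weight * lp)
    return sum(losses) / len(losses)

easy = weighted_token_loss([-0.05, -0.02, -0.01])   # confident tokens
hard = weighted_token_loss([-2.0, -1.5, -3.0])      # low-confidence tokens
```

Easy (high-probability) tokens contribute almost nothing to the loss, so the gradient budget concentrates on exactly the long-tail tokens the entry says teacher forcing under-trains.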
6
泛读FindingsACL 2023

GeoDRL: A Self-Learning Framework for Geometry Problem Solving using Reinforcement Learning in Deductive Reasoning

Geometry problem solving requires long deductive chains and strict symbolic reasoning; end-to-end neural methods are error-prone, while existing symbolic solvers depend on hand-written rules and generalize poorly.

Shuai Peng,Di Fu,Yijun Liang,Liangcai Gao,Zhi Tang
Peking Universityreinforcement-learningdeductive-reasoningmath-reasoningDOIDBLP
6
泛读LongACL 2023

Towards a Common Understanding of Contributing Factors for Cross-Lingual Transfer in Multilingual Language Models: A Review

Why do multilingual LMs transfer across languages? The community has offered a pile of factors (vocabulary sharing, language similarity, data size, typology, anchor tokens), but the explanations conflict, the conclusions depend on experimental setup, and a systematic synthesis was missing.

Fred Philippy,Siwen Guo,Shohreh Haddadan
University of LuxembourgZortifymultilingual-lmscross-lingual-transferreviewDOIDBLP
6
泛读DemoACL 2023

The ROOTS Search Tool: Data Transparency for LLMs

Aleksandra Piktus,Christopher Akiki,Paulo Villegas,Hugo Laurençon,Gérard Dupont,Sasha Luccioni,Yacine Jernite,Anna Rogers
pretraining-datadata-transparencycorpus-analysisDOIDBLP
6
泛读LongACL 2023

Limitations of Language Models in Arithmetic and Symbolic Induction

Jing Qian,Hong Wang,Zekun Li,Shiyang Li,Xifeng Yan
llmreasoningarithmeticDOIDBLP
6
泛读FindingsACL 2023

Conformal Nucleus Sampling

Shauli Ravfogel,Yoav Goldberg,Jacob Goldberger
decodingsamplinguncertaintyDOIDBLP
6
泛读FindingsACL 2023

Pruning Pre-trained Language Models with Principled Importance and Self-regularization

Siyu Ren,Kenny Q. Zhu
pruningpretrained-modelscompressionDOIDBLP
6
泛读LongACL 2023

Accelerating Transformer Inference for Translation via Parallel Decoding

Andrea Santilli,Silvio Severino,Emilian Postolache,Valentino Maiorca,Michele Mancusi,Riccardo Marin,Emanuele Rodolà
parallel-decodinginferencemtDOIDBLP
6
泛读FindingsACL 2023

Curating Datasets for Better Performance with Example Training Dynamics

Aviad Sar-Shalom,Roy Schwartz
data-selectiontraining-dynamicsDOIDBLP
6
泛读LongACL 2023

Free Lunch: Robust Cross-Lingual Transfer via Model Checkpoint Averaging

Fabian David Schmidt,Ivan Vulic,Goran Glavas
checkpoint-averagingcross-lingualtransferDOIDBLP
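This entry carries no summary, but the title's technique is simple enough to sketch directly: element-wise averaging of fine-tuning checkpoints, which the paper applies to make cross-lingual transfer more robust. Parameter dicts here hold plain lists standing in for tensors:

```python
def average_checkpoints(checkpoints):
    """Element-wise average of parameter dicts from several checkpoints.
    All checkpoints must share the same parameter names and shapes;
    the averaged model often smooths out run-specific noise."""
    n = len(checkpoints)
    return {k: [sum(vals) / n
                for vals in zip(*(c[k] for c in checkpoints))]
            for k in checkpoints[0]}

ckpts = [{"w": [1.0, 2.0]}, {"w": [3.0, 4.0]}]
avg = average_checkpoints(ckpts)   # {"w": [2.0, 3.0]}
```

The appeal ("free lunch") is that the averaged model costs nothing extra at training time — the checkpoints already exist.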
6
泛读SRWACL 2023

Data Selection for Fine-tuning Large Language Models Using Transferred Shapley Values

Computing data Shapley values to select high-quality fine-tuning data is prohibitively expensive on LLMs, keeping data valuation out of reach for the large-model SFT stage.

Stephanie Schoch,Ritwick Mishra,Yangfeng Ji
University of Virginiadata-selectionshapleyfine-tuningDOIDBLP
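The cost the paper targets is visible even in a tiny permutation-based Monte Carlo Shapley estimator: every sampled permutation needs |D| utility evaluations, and in the fine-tuning setting each utility evaluation means training and evaluating a model. A toy sketch with a cheap stand-in utility (names illustrative):

```python
import random

def mc_shapley(points, utility, rounds=200, seed=0):
    """Permutation-based Monte Carlo Shapley estimate: average each
    point's marginal contribution to `utility` over random orderings."""
    rng = random.Random(seed)
    values = {p: 0.0 for p in points}
    for _ in range(rounds):
        order = points[:]
        rng.shuffle(order)
        subset, prev = [], utility([])
        for p in order:
            subset.append(p)
            cur = utility(subset)        # one "train + eval" per step
            values[p] += cur - prev
            prev = cur
    return {p: v / rounds for p, v in values.items()}

# Toy utility: each distinct clean point adds +1; the noisy point 'x' hurts.
def utility(subset):
    return len(set(subset) - {'x'}) - (1.0 if 'x' in subset else 0.0)

vals = mc_shapley(['a', 'b', 'x'], utility)   # 'x' gets a negative value
```

Transferring values estimated on a cheap proxy model — the paper's direction — is what makes this loop affordable at LLM scale.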
6
泛读LongACL 2023

Training-free Neural Architecture Search for RNNs and Transformers

Conventional neural architecture search (NAS) requires training many candidate models, which is extremely costly for sequence models such as Transformers and RNNs.

Aaron Serianni,Jugal Kalita
nasarchitecturetraining-freeDOIDBLP
6
泛读LongACL 2023

On Second Thought, Let's Not Think Step by Step! Bias and Toxicity in Zero-Shot Reasoning

Zero-shot CoT (e.g. "Let's think step by step") markedly improves reasoning, but do the intermediate reasoning steps inadvertently elicit harmful biases and toxicity from the pretraining data?

Omar Shaikh,Hongxin Zhang,William Barr Held,Michael S. Bernstein,Diyi Yang
Stanford Universitycotbiaszero-shotDOIDBLP
6
泛读LongACL 2023

Language model acceptability judgements are not always robust to context

Language models show human-like syntactic acceptability judgments on isolated sentences, but is that syntactic knowledge still robust inside realistic contexts?

Koustuv Sinha,Jon Gauthier,Aaron Mueller,Kanishka Misra,Keren Fuentes,Roger Levy,Adina Williams
lm-evaluationcontextrobustnessDOIDBLP
6
泛读LongACL 2023

Local Byte Fusion for Neural Machine Translation

Pure byte-level models produce over-long sequences with very slow attention, while subword models suffer from OOV issues and tokenizer artifacts.

Makesh Narsimhan Sreedhar,Xiangpeng Wan,Yu Cheng,Junjie Hu
tokenizerbytemtDOIDBLP
6
泛读FindingsACL 2023

One Embedder, Any Task: Instruction-Finetuned Text Embeddings

The question: can text embeddings escape the "one task, one model" fragmentation and become a single encoder that adapts to many tasks via instructions? Retrieval, clustering, classification, and matching have typically each trained their own embeddings, with weak transfer and heavy deployment; the authors want embedding models with task conditioning, analogous to instruction-tuned generative models.

Hongjin Su,Weijia Shi,Jungo Kasai,Yizhong Wang,Yushi Hu,Mari Ostendorf,Wen-tau Yih,Noah A. Smith,Luke Zettlemoyer,Tao Yu
text-embeddingsinstruction-tuningmulti-taskDOIDBLP
6
泛读FindingsACL 2023

Fusion or Defusion? Flexible Vision-and-Language Pre-Training

This work targets the rigidity of fusion strategies in vision-language pretraining: early fusion buys strong cross-modal interaction at high cost, while late fusion preserves unimodal ability but sacrifices fine-grained alignment. The question: can fusion and decoupling be switched flexibly by task and stage, rather than chosen once and for all?

Rongyi Sun,Ziran Li,Yifeng Ding,Qifan Wang,Jingang Wang,Haitao Zheng,Wei Wu,Yunsen Xian
vlm-pretrainvision-languageflexible-fusionDOIDBLP
6
泛读FindingsACL 2023

Large Language Models Can be Lazy Learners: Analyze Shortcuts in In-Context Learning

This work tackles an overlooked ICL issue: LLMs may be "lazy learners" that latch onto the cheapest shortcut in the demonstrations rather than the rule the task actually requires. Many ICL success stories assume the model performs rapid induction; the authors suspect it often just exploits surface patterns, label distributions, or position biases.

Ruixiang Tang,Dehan Kong,Longtao Huang,Hui Xue
in-context-learningshortcutsllm-analysisDOIDBLP
6
泛读FindingsACL 2023

MVP: Multi-task Supervised Pre-training for Natural Language Generation

Tianyi Tang,Junyi Li,Wayne Xin Zhao,Ji-Rong Wen
multi-task-pretrainnlgsupervised-pretrainDOIDBLP
6
泛读LongACL 2023

Multilingual LLMs are Better Cross-lingual In-context Learners with Alignment

Eshaan Tanwar,Subhabrata Dutta,Manish Borthakur,Tanmoy Chakraborty
multilingualin-context-learningalignmentDOIDBLP
6
泛读FindingsACL 2023

Structured Pruning for Efficient Generative Pre-trained Language Models

Chaofan Tao,Lu Hou,Haoli Bai,Jiansheng Wei,Xin Jiang,Qun Liu,Ping Luo,Ngai Wong
structured-pruningmodel-compressionefficiencyDOIDBLP
6
泛读LongACL 2023

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Harsh Trivedi,Niranjan Balasubramanian,Tushar Khot,Ashish Sabharwal
retrieval-augmented-generationchain-of-thoughtreasoningDOIDBLP
6
泛读FindingsACL 2023

Scaling Laws for BERT in Low-Resource Settings

Gorka Urbizu,Iñaki San Vicente,Xabier Saralegi,Rodrigo Agerri,Aitor Soroa
scaling-lawslow-resourcebertDOIDBLP
5
泛读FindingsACL 2023

Better Zero-Shot Reasoning with Self-Adaptive Prompting

Few-shot reasoning with LLMs depends heavily on hand-crafted demonstrations, is sensitive to their selection, and annotation is costly; zero-shot reasoning relies on fixed trigger phrases, generalizes poorly, and has a low performance ceiling.

Xingchen Wan,Ruoxi Sun,Hanjun Dai,Sercan Ö. Arik,Tomas Pfister
Google DeepMindzero-shot-reasoningpromptingchain-of-thoughtDOIarXivDBLP
6
泛读FindingsACL 2023

Logical Transformers: Infusing Logical Structures into Pre-Trained Language Models

Borui Wang,Qiuyuan Huang,Budhaditya Deb,Aaron Halfaker,Liqun Shao,Daniel McDuff,Ahmed Hassan Awadallah,Dragomir Radev,Jianfeng Gao
plmlogical-reasoningstructure-biasDOIDBLP
6
泛读FindingsACL 2023

Rethinking Dictionaries and Glyphs for Chinese Language Pre-training

Yuxuan Wang,Jianghui Wang,Dongyan Zhao,Zilong Zheng
pretrainingchinesetokenizerDOIDBLP
6
泛读LongACL 2023

SCOTT: Self-Consistent Chain-of-Thought Distillation

Peifeng Wang,Zhengyang Wang,Zheng Li,Yifan Gao,Bing Yin,Xiang Ren
chain-of-thoughtdistillationself-consistencyDOIDBLP
6
泛读FindingsACL 2023

Revisiting Non-Autoregressive Translation at Scale

Zhihao Wang,Longyue Wang,Jinsong Su,Junfeng Yao,Zhaopeng Tu
non-autoregressivesequence-modelingscalingDOIDBLP
6
泛读LongACL 2023

Plan-and-Solve Prompting: Improving Zero-Shot Chain-of-Thought Reasoning by Large Language Models

Lei Wang,Wanyu Xu,Yihuai Lan,Zhiqiang Hu,Yunshi Lan,Roy Ka-Wei Lee,Ee-Peng Lim
chain-of-thoughtpromptingplanningDOIDBLP
6
泛读FindingsACL 2023

Overcoming Catastrophic Forgetting in Massively Multilingual Continual Learning

Genta Indra Winata,Lingjue Xie,Karthik Radhakrishnan,Shijie Wu,Xisen Jin,Pengxiang Cheng,Mayank Kulkarni,Daniel Preotiuc-Pietro
continual-pretrainmultilingualforgettingDOIDBLP
6
泛读DemoACL 2023

OpenICL: An Open-Source Framework for In-context Learning

Addresses the fragmentation of evaluation scripts, poor reproducibility, and lack of a unified interface in in-context learning (ICL) research on large models.

Zhenyu Wu,Yaoxiang Wang,Jiacheng Ye,Zhiyong Wu,Jiangtao Feng,Jingjing Xu,Yu Qiao
Shanghai Jiao Tong UniversityFudan UniversityShanghai AI Laboratoryiclframeworkopen-sourceDOIDBLP
6
泛读LongACL 2023

Rethinking Masked Language Modeling for Chinese Spelling Correction

Addresses the problem that a standard masked language model (MLM), applied directly to Chinese spelling correction (CSC), over-relies on context and ignores the phonetic and glyph information of the original misspelled character.

Hongqiu Wu,Shaohua Zhang,Yuchen Zhang,Hai Zhao
Shanghai Jiao Tong UniversitymlmobjectivechineseDOIDBLP
6
泛读FindingsACL 2023

On Isotropy, Contextualization and Learning Dynamics of Contrastive-based Sentence Representation Learning

Investigates the actual roles of isotropy and contextualization in the training dynamics of contrastive sentence representation learning (e.g., SimCSE), challenging the standing assumption that more isotropy is always better.

Chenghao Xiao,Yang Long,Noura Al Moubayed
Durham UniversityisotropycontrastiverepresentationDOIDBLP
6
泛读LongACL 2023

Introducing Semantics into Speech Encoders

Addresses the problem that self-supervised speech pre-trained models (e.g., HuBERT, wav2vec 2.0) are biased toward low-level acoustic features and struggle to capture high-level semantics without large amounts of labeled data.

Derek Xu,Shuyan Dong,Changhan Wang,Suyoun Kim,Zhaojiang Lin,Bing Liu ... 4 authors omitted ... ,Alexei Baevski,Hung-yi Lee,Yizhou Sun,Wei Wang
Meta AIUT AustinspeechsemanticencoderDOIDBLP
6
泛读LongACL 2023

KILM: Knowledge Injection into Encoder-Decoder Language Models

Addresses factual hallucination in encoder-decoder language models at generation time, exploring how to explicitly inject structured facts from knowledge graphs into the model parameters.

Yan Xu,Mahdi Namazifar,Devamanyu Hazarika,Aishwarya Padmakumar,Yang Liu,Dilek Hakkani-Tür
Amazon Alexa AIUniversity of Illinois Chicagoknowledge-injectionencoder-decoderpretrainDOIDBLP
6
泛读LongACL 2023

MultiInstruct: Improving Multi-Modal Zero-Shot Learning via Instruction Tuning

Extends the instruction-tuning paradigm from the text-only setting to the multimodal domain, tackling the weak zero-shot generalization of vision-language models (VLMs) on unseen multimodal tasks.

Zhiyang Xu,Ying Shen,Lifu Huang
Virginia Techinstruction-tuningmultimodalzero-shotDOIDBLP
6
泛读FindingsACL 2023

Causal interventions expose implicit situation models for commonsense language understanding

This paper asks whether Transformers, in commonsense understanding, really form an implicit state akin to a "situation model" rather than merely matching surface lexical or syntactic cues. Previously this was inferred indirectly from behavioral tests, making it hard to locate which internal pathway carries such unstated world knowledge.

Takateru Yamakoshi,James L. McClelland,Adele Goldberg,Robert D. Hawkins
interpretabilitycausal-interventioncommonsenseDOIarXivDBLP
6
泛读LongACL 2023

Local Interpretation of Transformer Based on Linear Decomposition

This paper asks whether Transformers can be given local, instance-level explanations at controllable cost, rather than only global importance scores or expensive approximate attributions. Existing explanation methods visibly trade off faithfulness, efficiency, and practicality, and struggle to balance all three on deep networks.

Sen Yang,Shujian Huang,Wei Zou,Jianbing Zhang,Xinyu Dai,Jiajun Chen
interpretabilitytransformerlinear-decompositionDOIDBLP
6
泛读FindingsACL 2023

Complementary Explanations for Effective In-Context Learning

This paper asks why explanation-style demonstrations work in ICL and which information the model actually uses. Prior work found that adding explanations improves scores but did not disentangle whether the computation trace matters, the natural-language phrasing matters, or both.

Xi Ye,Srinivasan Iyer,Asli Celikyilmaz,Veselin Stoyanov,Greg Durrett,Ramakanth Pasunuru
in-context-learningexplanationspromptingDOIarXivDBLP
6
泛读LongACL 2023

BLOOM+1: Adding Language Support to BLOOM for Zero-Shot Prompting

How to extend a multilingual pre-trained model like BLOOM to low-resource languages entirely unseen during pre-training while preserving zero-shot prompting ability. Context: BLOOM officially covers only 46 languages; for others, training from scratch or full continued pre-training is costly and prone to catastrophic forgetting.

Zheng Xin Yong,Hailey Schoelkopf,Niklas Muennighoff,Alham Fikri Aji,David Ifeoluwa Adelani,Khalid Almubarak ... 5 authors omitted ... ,Stella Biderman,Edward Raff,Dragomir Radev,Vassilina Nikoulina
Brown UniversityEleutherAIBigSciencebloommultilingualcontinual-pretrainDOIDBLP
6
泛读FindingsACL 2023

Boost Transformer-based Language Models with GPU-Friendly Sparsity and Quantization

How to make Transformer LM inference actually reap the hardware benefits of sparsity plus low-bit computation on GPUs. Prior unstructured pruning and plain quantization either fail to run fast on GPUs or lose too much accuracy, limiting real deployment gains.
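
GPU-friendly sparsity usually means a fixed structured pattern such as NVIDIA's 2:4 sparsity, which Sparse Tensor Cores can accelerate. The sketch below illustrates that general pattern (an assumption about the family of techniques involved, not this paper's exact recipe): in every group of four weights, keep the two largest magnitudes and zero the rest.

```python
# Minimal sketch of 2:4 structured magnitude pruning (illustrative only):
# each contiguous group of 4 weights keeps its 2 largest-|w| entries.
def prune_2_of_4(weights):
    pruned = []
    for i in range(0, len(weights), 4):
        group = weights[i:i + 4]
        # Indices of the two largest-magnitude weights in this group.
        keep = sorted(range(len(group)), key=lambda j: abs(group[j]))[-2:]
        pruned.extend(w if j in keep else 0.0 for j, w in enumerate(group))
    return pruned

w = [0.9, -0.1, 0.4, 0.05, -0.7, 0.2, 0.01, 0.6]
sparse = prune_2_of_4(w)
```

Because the 50% zero pattern is regular, the hardware can skip the zeros deterministically, which is what makes this form of sparsity "GPU-friendly" compared with unstructured pruning.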

Chong Yu,Tao Chen,Zhongxue Gan
Fudan UniversityNVIDIAsparsityquantizationgpuDOIDBLP
6
泛读LongACL 2023

Speech-Text Pre-training for Spoken Dialog Understanding with Explicit Cross-Modal Alignment

How to build a truly general, dialog-context-aware speech-text pre-trained model. Prior speech-text pre-training mostly targeted ASR or a single class of downstream tasks, without explicit modeling of temporal order and cross-modal alignment within dialog.

Tianshu Yu,Haoyu Gao,Ting-En Lin,Min Yang,Yuchuan Wu,Wentao Ma,Chao Wang,Fei Huang,Yongbin Li
Alibaba DAMO Academyspeech-text-pretrainingcross-modal-alignmentdialogDOIarXivDBLP
6
泛读LongACL 2023

Hints on the data for language modeling of synthetic languages with transformers

For morphologically complex synthetic languages (e.g., Quechua, Finnish), LM training works markedly worse than English at equal data sizes. The paper asks whether this is a data-quantity issue, a tokenizer issue, or whether rich morphology inherently needs more data to learn well.

Rodolfo Zevallos,Núria Bel
Universitat Pompeu Fabrasynthetic-datalanguage-modelingtransformersDOIDBLP
6
泛读LongACL 2023

Contrastive Learning with Adversarial Examples for Alleviating Pathology of Language Model

Language models exhibit a "pathology": given semantically equivalent but superficially different inputs, their representations or predictions drift noticeably, hurting both robustness and consistency. Prior contrastive learning built positives with random augmentation, a weak signal, while adversarial training only injects negative perturbations.

Pengwei Zhan,Jing Yang,Xiao Huang,Chunlei Jing,Jingying Li,Liming Wang
contrastive-learningadversarial-examplestext-degenerationDOIDBLP
6
泛读ShortACL 2023

Understanding Demonstration-based Learning from a Causal Perspective

How demonstrations actually affect predictions in ICL: does the content of the examples carry information, or do the examples merely trigger knowledge the model already has? Prior studies stopped at correlations (swapping labels or orders and watching performance change), lacking a causal characterization.

Ruiyi Zhang,Tong Yu
Adobe Researchin-context-learningcausal-inferencedemonstrationDOIDBLP
6
泛读FindingsACL 2023

Diffusion Theory as a Scalpel: Detecting and Purifying Poisonous Dimensions in Pre-trained Language Models Caused by Backdoor or Bias

Zhiyuan Zhang,Deli Chen,Hao Zhou,Fandong Meng,Jie Zhou,Xu Sun
backdoor-detectionbias-purificationdiffusion-theoryDOIDBLP
6
泛读FindingsACL 2023

Improved Logical Reasoning of Language Models via Differentiable Symbolic Programming

Hanlin Zhang,Jiani Huang,Ziyang Li,Mayur Naik,Eric P. Xing
logical-reasoningneuro-symbolicdifferentiable-programmingDOIDBLP
6
泛读LongACL 2023

Interpreting Positional Information in Perspective of Word Order

Xilong Zhang,Ruochen Liu,Jin Liu,Xuefeng Liang
positional-encodinginterpretabilityword-orderDOIDBLP
6
泛读LongACL 2023

Fine-tuning Happens in Tiny Subspaces: Exploring Intrinsic Task-specific Subspaces of Pre-trained Language Models

Zhong Zhang,Bang Liu,Junming Shao
fine-tuningintrinsic-dimensionsubspace-learningDOIDBLP
6
泛读LongACL 2023

Dialog-Post: Multi-Level Self-Supervised Objectives and Hierarchical Model for Dialogue Post-Training

Zhenyu Zhang,Lei Shen,Yuming Zhao,Meng Chen,Xiaodong He
dialoguepost-trainingself-supervisedDOIDBLP
6
泛读ShortACL 2023

Towards Adaptive Prefix Tuning for Parameter-Efficient Language Model Fine-tuning

Zhenru Zhang,Chuanqi Tan,Haiyang Xu,Chengyu Wang,Jun Huang,Songfang Huang
prefix-tuningpeftadaptationDOIDBLP
6
泛读LongACL 2023

ETHICIST: Targeted Training Data Extraction Through Loss Smoothed Soft Prompting and Calibrated Confidence Estimation

Prior training-data extraction attacks on LLMs relied on hand-crafted prompts or simple prefix search, with low success rates and no systematic methodology. This paper studies how to extract memorized training data from pre-trained language models more efficiently.

Zhexin Zhang,Jiaxin Wen,Minlie Huang
Tsinghua Universitydata-extractionprivacymemorizationDOIDBLP
6
泛读LongACL 2023

Lifting the Curse of Capacity Gap in Distilling Language Models

In knowledge distillation, when the capacity gap between teacher and student is large, the student actually learns worse, the so-called capacity gap curse. Earlier remedies (e.g., intermediate-size teacher assistants) add training cost and are unstable. This paper tackles the problem directly from the distillation objective.

Chen Zhang,Yang Yang,Jiahao Liu,Jingang Wang,Yunsen Xian,Benyou Wang,Dawei Song
distillationcapacity-gapcompressionDOIDBLP
6
泛读LongACL 2023

Pre-trained Language Models Can be Fully Zero-Shot Learners

Zero-shot evaluation depends heavily on hand-designed prompt templates, yielding very high variance and making it hard to tell whether the model itself is capable or the prompt engineering is simply good.

Xuandong Zhao,Siqi Ouyang,Zhiguo Yu,Ming Wu,Lei Li
UCSBzero-shotpretrained-modelgeneralizationDOIDBLP
6
泛读FindingsACL 2023

Deep Equilibrium Non-Autoregressive Sequence Learning

Non-autoregressive (NAT) sequence generation models are constrained by fixed network depth and struggle to model complex target-side dependencies (the multi-modality problem), leaving generation quality behind autoregressive models.

Zaixiang Zheng,Yi Zhou,Hao Zhou
ByteDanceTsinghua Universitynon-autoregressivedeep-equilibriumsequence-modelingDOIDBLP
6
泛读FindingsACL 2023

Modular Transformers: Compressing Transformers into Modularized Layers for Flexible Efficient Inference

In real deployment, different scenarios impose different latency requirements. The conventional approach trains and deploys a separate fixed-size model for each latency budget, incurring enormous training and maintenance cost.

Wangchunshu Zhou,Ronan Le Bras,Yejin Choi
University of WashingtonAllen Institute for AI (AI2)transformercompressionmodularDOIDBLP
6
泛读FindingsACL 2023

Large Language Models are Built-in Autoregressive Search Engines

The core question: when a large language model generates, is it "reasoning token by token," or doing something closer to autoregressive retrieval over parametric memory? Prior work attributed LLM knowledge ability to memorization or generalization without mechanistically characterizing the generation trajectory itself; the authors frame the LLM as a built-in search engine to explain why it can "retrieve" relevant document fragments along locally high-probability paths.

Noah Ziems,Wenhao Yu,Zhihan Zhang,Meng Jiang
retrievalgenerative-searchllmDOIDBLP
6
泛读ACL 2023

Complex Reasoning in Natural Language

The core question: how complex reasoning in natural language should be defined, decomposed, and evaluated. Much prior work approximated "reasoning" with a few math or commonsense benchmarks, but genuine complex reasoning involves multi-step composition, knowledge invocation, and intermediate-state management, all of which a single accuracy metric masks.

Wenting Zhao,Mor Geva,Bill Yuchen Lin,Michihiro Yasunaga,Aman Madaan,Tao Yu
reasoningcottutorialDOIDBLP
6
泛读ACL 2023

DePA: Improving Non-autoregressive Translation with Dependency-Aware Decoder

This paper tackles a long-standing NAT machine translation problem: fully non-autoregressive decoding is fast, but because the decoder input contains no prior target tokens, target-side dependencies are under-modeled and quality usually trails autoregressive models. Common fixes, iterative refinement, fertility, or stronger distillation, never directly restore the missing target dependencies.

Jiaao Zhan,Qian Chen,Boxing Chen,Wen Wang,Yu Bai,Yang Gao
non-autoregressivetranslationdependency-modelingDOIarXivDBLP
6
泛读ACL 2023

Token-level Fitting Issues of Seq2seq Models

The core finding: under early stopping, seq2seq models are not uniformly "just-right fitted"; at the vocabulary level, some tokens are overfit while others are simultaneously underfit. Training is usually stopped by an aggregate validation metric, which implicitly assumes errors are roughly uniform across tokens; the authors show this assumption fails, and the issue persists across architectures, tasks, and fine-tuning of pre-trained models.
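
The diagnosis above can be sketched as a per-token comparison of train and validation loss at the early-stopping checkpoint. The numbers and thresholds below are illustrative assumptions, not values from the paper:

```python
# Toy per-token losses at the early-stopping checkpoint (illustrative).
train_loss = {"the": 0.10, "cat": 0.50, "zygote": 2.00}
val_loss   = {"the": 0.12, "cat": 1.40, "zygote": 2.10}

# A large train-val gap suggests the token is memorized (overfit);
# high loss on both splits suggests the token is still underfit.
overfit  = [t for t in train_loss if val_loss[t] - train_loss[t] > 0.5]
underfit = [t for t in train_loss if min(train_loss[t], val_loss[t]) > 1.5]
```

An aggregate validation loss would average these groups together, which is exactly why uniform early stopping hides token-level misfitting.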

Guangsheng Bao,Zhiyang Teng,Yue Zhang
training-dynamicsoverfittingseq2seqDOIarXivDBLP
6
泛读ACL 2023

Do not Mask Randomly: Effective Domain-adaptive Pre-training by Masking In-domain Keywords

Shahriar Golchin,Mihai Surdeanu,Nazgol Tavabi,Ata M. Kiapour
domain-adaptationmasked-lmmaskingDOIDBLP
5
泛读ACL 2023

Retrieval-Augmented Domain Adaptation of Language Models

General-domain pre-trained LMs degrade markedly when transferred to a specific domain. Existing domain-adaptation schemes, training an in-domain LM from scratch or full-parameter fine-tuning, are resource-intensive and cannot cover domain needs at different granularities.

Benfeng Xu,Chunxu Zhao,Wenbin Jiang,Pengfei Zhu,Songtai Dai,Chao Pang,Zhuo Sun,Shuohuan Wang,Yu Sun
retrievaldomain-adaptationlanguage-modelDOIDBLP
6
泛读ACL 2023

Scalable Performance Analysis for Vision-Language Models

Santiago Castro,Oana Ignat,Rada Mihalcea
vlmscalingevaluationDOIDBLP
6
泛读ACL 2023

Can Pretrained Language Models Derive Correct Semantics from Corrupt Subwords under Noise?

Xinzhe Li,Ming Liu,Shang Gao
tokenizersubwordnoise-robustnessDOIDBLP
4
LongACL 2023

Gradient-based Intra-attention Pruning on Pre-trained Language Models

Pre-trained LMs are highly over-parameterized; existing structured pruning mostly uses whole attention heads as the minimum granularity, giving a small search space, and combining pruning with knowledge distillation causes optimization interference, so compressed models lose substantial accuracy.

Ziqing Yang,Yiming Cui,Xin Yao,Shijin Wang
HIT-iFLYTEK Joint Lab (HFL)pruningdistillationattentionDOIarXivDBLP
6
泛读FindingsACL 2023

DePlot: One-shot visual language reasoning by plot-to-table translation

Chart-based visual-language reasoning previously required tens of thousands of annotated examples to train end-to-end multimodal models, which reason poorly on complex queries and cannot be adapted few-shot.

Fangyu Liu,Julian Martin Eisenschlos,Francesco Piccinno,Syrine Krichene,Chenxi Pang,Kenton Lee,Mandar Joshi,Wenhu Chen,Nigel Collier,Yasemin Altun
Google DeepMindchartvlmreasoningDOIarXivDBLP
2
LongACL 2023

DiffusionNER: Boundary Diffusion for Named Entity Recognition

Traditional NER follows a sequence-labeling paradigm tied to a fixed entity label space, performing poorly on nested and low-resource entities and lacking flexibility.

Yongliang Shen,Kaitao Song,Xu Tan,Dongsheng Li,Weiming Lu,Yueting Zhuang
Microsoft Research Asia (MSRA)Zhejiang UniversitydiffusionnergenerativeDOIarXivDBLP
6
泛读LongACL 2023

APOLLO: A Simple Approach for Adaptive Pretraining of Language Models for Logical Reasoning

Existing pre-training schemes for improving LM logical reasoning require complex training-data processing (e.g., aligning symbolic knowledge with text), cannot be applied to general corpora, and are costly to deploy.

Soumya Sanyal,Yichong Xu,Shuohang Wang,Ziyi Yang,Reid Pryzant,Wenhao Yu,Chenguang Zhu,Xiang Ren
Microsoftcontinual-pretrainreasoningDOIarXivDBLP
6
泛读IndustryACL 2023

Federated Learning of Gboard Language Models with Differential Privacy

Training input-method language models requires private user data; centralized training risks privacy leakage, and existing federated learning plus differential privacy schemes have a poor privacy-utility trade-off in large-scale systems, preventing production use.

Zheng Xu,Yanxiang Zhang,Galen Andrew,Christopher A. Choquette-Choo,Peter Kairouz,H. Brendan McMahan,Jesse Rosenstock,Yuanbo Zhang
Googlefederated-learningdifferential-privacylanguage-modelDOIarXivDBLP
3
LongACL 2023

Revealing Single Frame Bias for Video-and-Language Learning

Existing video-language learning assumes training must consume multi-frame temporal input, which multiplies compute and memory cost; no prior work had tested whether single-frame input actually underperforms.

Jie Lei,Tamara L. Berg,Mohit Bansal
UNC Chapel Hillvideo-languagesingle-frameefficiencyDOIarXivDBLP
6
泛读LongACL 2023

ManagerTower: Aggregating the Insights of Uni-Modal Experts for Vision-Language Representation Learning

Cross-modal layers in existing two-tower vision-language models fuse only the top-layer features of the unimodal encoders, unable to flexibly exploit semantics at different layers of the unimodal experts, making cross-modal alignment inefficient.

Xiao Xu,Bei Li,Chenfei Wu,Shao-Yen Tseng,Anahita Bhiwandiwalla,Shachar Rosenman,Vasudev Lal,Wanxiang Che,Nan Duan
Microsoft Research Asiavision-languagetwo-towerrepresentation-learningDOIarXivDBLP
2
FindingsACL 2023

DiffuSum: Generation Enhanced Extractive Summarization with Diffusion

Existing extractive summarization is modeled as sequence labeling with independent per-sentence selection decisions, ignoring global semantic coherence and redundancy control across summary sentences, which caps performance.

Haopeng Zhang,Xiao Liu,Jiawei Zhang
diffusionextractive-summarizationnon-autoregressiveDOIarXivDBLP
7
泛读LongACL 2023

Towards Boosting the Open-Domain Chatbot with Human Feedback

Open-domain chatbots pre-trained on social-media comments produce adequately coherent but unengaging replies, mainly because high-quality human-human dialog annotations are scarce and the models are not aligned with human preferences.

Hua Lu,Siqi Bao,Huang He,Fan Wang,Hua Wu,Haifeng Wang
Baiduhuman-feedbackdialoguealignmentDOIarXivDBLP
4
FindingsACL 2023

In-context Examples Selection for Machine Translation

In-context examples for machine translation are usually sampled randomly from the dev set; no prior systematic study examined how example quality, domain relevance, and ordering affect translation performance, especially out of domain.

Sweta Agrawal,Chunting Zhou,Mike Lewis,Luke Zettlemoyer,Marjan Ghazvininejad
Meta AIUniversity of Washingtonin-context-learningmachine-translationexample-selectionDOIarXivDBLP
6
泛读LongACL 2023

RL4F: Generating Natural Language Feedback with Reinforcement Learning for Repairing Model Outputs

Existing methods that repair model outputs with natural-language feedback require fine-tuning the target model, which rules out black-box LLMs and produces multiple model copies, raising deployment cost.

Afra Feyza Akyürek,Ekin Akyürek,Ashwin Kalyan,Peter Clark,Derry Tanti Wijaya,Niket Tandon
Allen Institute for AIBoston Universityrlnatural-language-feedbackself-refinementDOIarXivDBLP
4
FindingsACL 2023

Distilling Efficient Language-Specific Models for Cross-Lingual Transfer

Massively multilingual Transformers (e.g., mBERT, XLM-R) cover a hundred-plus languages, but most users only deploy a single language; the models are highly redundant, with inference costs 5-10x those of dedicated monolingual models.

Alan Ansell,Edoardo Maria Ponti,Anna Korhonen,Ivan Vulic
University of Cambridgedistillationmultilingualcross-lingualDOIarXivDBLP
6
泛读FindingsACL 2023

CoMix: Guide Transformers to Code-Mix using POS structure and Phonetics

Existing multilingual Transformers are pre-trained on purely monolingual corpora, cannot handle the code-mixed data that is pervasive in multilingual societies, and are fragile to spelling variants.

Gaurav Arora,Srujana Merugu,Vivek Sembium
multilingualcode-mixingpretraining-objectiveDOIDBLP
5
泛读LongACL 2023

Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

Haoli Bai,Zhiguang Liu,Xiaojun Meng,Wentao Li,Shuang Liu,Yifeng Luo ... 3 authors omitted ... ,Lu Hou,Jiansheng Wei,Xin Jiang,Qun Liu
vlm-pretrainingdocument-understandingmultimodalDOIDBLP
5
泛读LongACL 2023

Measuring Progress in Fine-grained Vision-and-Language Understanding

Pre-trained image-text models broadly lack fine-grained understanding (e.g., recognizing relations, verbs, and numbers in images), yet progress on this front has no unified quantitative standard; prior studies proposed scattered benchmarks or models without horizontal comparison.

Emanuele Bugliarello,Laurent Sartran,Aishwarya Agrawal,Lisa Anne Hendricks,Aida Nematzadeh
vlm-evaluationfine-grained-understandingmultimodal-benchmarkDOIarXivDBLP
5
泛读LongACL 2023

A Systematic Study of Knowledge Distillation for Natural Language Generation with Pseudo-Target Training

This paper asks how pseudo-target data should best be used in knowledge distillation when compressing models for a specific NLG task and dataset. Much prior KD work either focused on classification or distilled only on limited labeled sets, sidestepping the more realistic setting where abundant unlabeled in-task data is available but generation training signals are more fragile.

Nitay Calderon,Subhabrata Mukherjee,Roi Reichart,Amir Kantor
knowledge-distillationmodel-compressionnlgDOIarXivDBLP
5
泛读LongACL 2023

Characterizing and Measuring Linguistic Dataset Drift

This paper asks how dataset drift in NLP can be linguistically decomposed and quantified, and whether such measures actually predict per-example performance changes. Prior drift metrics stopped at aggregate distribution differences, did not distinguish lexical, structural, and semantic drift, and were never shown to predict real error risk.

Tyler A. Chang,Kishaloy Halder,Neha Anna John,Yogarshi Vyas,Yassine Benajiba,Miguel Ballesteros,Dan Roth
data-qualitydistribution-shiftevaluationDOIarXivDBLP
5
泛读FindingsACL 2023

Revisiting the Architectures like Pointer Networks to Efficiently Improve the Next Word Distribution, Summarization Factuality, and Beyond

Judging from the title, the core question is most likely: how to use pointer-network-like architectures to directly improve the next-word distribution, and thereby summarization factuality and other generation quality, without redoing full language-model training. With no abstract available, the exact problem setting, task scope, and differences from prior methods cannot be reliably confirmed.

Haw-Shiuan Chang,Zonghai Yao,Alolika Gon,Hong Yu,Andrew McCallum
University of Massachusetts Amherstnext-tokenarchitecturesummarizationDOIDBLP
5
泛读LongACL 2023

Dynamic Transformers Provide a False Sense of Efficiency

This paper asks whether the claimed inference efficiency of dynamic Transformers, particularly multi-exit models, is actually robust. The community assumed early exit offers a flexible accuracy-efficiency trade-off, but rarely asked the more practical question: can the saved compute be easily wiped out by input perturbations, making it meaningless in deployment?

Yiming Chen,Simin Chen,Zexin Li,Wei Yang,Cong Liu,Robby T. Tan,Haizhou Li
efficient-inferenceearly-exitrobustnessDOIarXivDBLP
5
泛读LongACL 2023

Say What You Mean! Large Language Models Speak Too Positively about Negative Commonsense Knowledge

Existing evaluations of LLM knowledge focus on positive commonsense (knowledge asserting that things exist or hold); there has been almost no systematic probing of how LLMs model negative commonsense, which rarely appears explicitly in text and is a known blind spot of LLM knowledge.

Jiangjie Chen,Wei Shi,Ziquan Fu,Sijie Cheng,Lei Li,Yanghua Xiao
llmcommonsenseevaluationDOIarXivDBLP
5
泛读FindingsACL 2023

Controllable Conversation Generation with Conversation Structures via Diffusion Models

Jiaao Chen,Diyi Yang
diffusiondialogue-generationcontrollable-generationDOIDBLP
6
泛读LongACL 2023

A Close Look into the Calibration of Pre-trained Language Models

Uncertainty estimates from pre-trained language models (PLMs) are unreliable, yet prior work neither systematically probed how calibration evolves during training nor uniformly verified the effectiveness of existing calibration methods.

Yangyi Chen,Lifan Yuan,Ganqu Cui,Zhiyuan Liu,Heng Ji
Tsinghua UniversityUIUCcalibrationplmtraining-dynamicsDOIarXivDBLP
5
泛读FindingsACL 2023

Improving Contrastive Learning of Sentence Embeddings from AI Feedback

Qinyuan Cheng,Xiaogui Yang,Tianxiang Sun,Linyang Li,Xipeng Qiu
sentence-embeddingcontrastive-learningai-feedbackDOIDBLP
4
LongACL 2023

Can Large Language Models Be an Alternative to Human Evaluations?

Human evaluation of NLP tasks is poorly reproducible, unstable in quality, and expensive, hindering fair comparison across models; no prior work had systematically tested the feasibility and limits of using LLMs as evaluators in place of humans.

David Cheng-Han Chiang,Hung-yi Lee
National Taiwan Universityllm-as-judgeevaluationalignmentDOIarXivDBLP
6
泛读LongACL 2023

Increasing Diversity While Maintaining Accuracy: Text Data Generation with Large Language Models and Human Interventions

When generating training/evaluation text data with LLMs, existing diversity-boosting methods (high-temperature sampling, logit suppression) substantially reduce data accuracy, failing to deliver training data that is both diverse and accurate.

John Joon Young Chung,Ece Kamar,Saleema Amershi
data-synthesisdiversityllm-generated-dataDOIarXivDBLP
7
泛读LongACL 2023

Improving Pretraining Techniques for Code-Switched NLP

Unlabeled code-switched data (text mixing multiple languages within a sentence) is far scarcer than monolingual data; generic MLM pre-training performs poorly on code-switching tasks, and no pre-training objective had been designed around the characteristics of code-switching.

Richeek Das,Sahasra Ranjan,Shreya Pathak,Preethi Jyothi
multilingualcode-switchingpretrainingDOIDBLP
6
泛读FindingsACL 2023

Cost-effective Distillation of Large Language Models

Most existing LLM distillation methods are task- or architecture-specific, generalize poorly, require the teacher to be pre-trained on task-specific data at high cost, and are unstable on small datasets.

Sayantan Dasgupta,Trevor Cohn,Timothy Baldwin
distillationcompressionDOIDBLP
7
泛读FindingsACL 2023

CIF-PT: Bridging Speech and Text Representations for Spoken Language Understanding via Continuous Integrate-and-Fire Pre-Training

Existing pre-training for spoken language understanding (SLU) models speech and text modalities separately, leaving a representation gap between them, poor cross-modal alignment, and no unified pre-training paradigm bridging speech-frame and text-token representations.

Linhao Dong,Zhecheng An,Peihao Wu,Jun Zhang,Lu Lu,Zejun Ma
speechpretrainingalignmentDOIarXivDBLP
5
泛读FindingsACL 2023

Exploring the Relationship between Alignment and Cross-lingual Transfer in Multilingual Transformers

Investigates whether, in multilingual Transformers, the degree of representation alignment across languages truly correlates positively with cross-lingual transfer ability.

Félix Gaschi,Patricio Cerda,Parisa Rastin,Yannick Toussaint
cross-lingualalignmentmultilingualDOIDBLP
5
泛读FindingsACL 2023

Reasoning in Large Language Models Through Symbolic Math Word Problems

Evaluates and improves large language models' complex logical reasoning, specifically using symbolic math word problems to strip away natural-language distractions and target the essence of reasoning.

Vedant Gaur,Nikunj Saunshi
reasoningsymbolic-mathllm-evaluationDOIDBLP
5
泛读LongACL 2023

Controllable Text Generation via Probability Density Estimation in the Latent Space

Controllable text generation typically relies on attribute classifiers, prompts, or reinforcement learning, which either control unreliably or visibly hurt fluency; the paper asks whether directly modeling the attribute probability density in latent space enables smoother control. Prior latent-space control mostly stayed at simple linear directions or posterior search, underusing density structure.

Yuxuan Gu,Xiaocheng Feng,Sicheng Ma,Lingyuan Zhang,Heng Gong,Weihong Zhong,Bing Qin
controllable-generationlatent-spacedensity-estimationDOIDBLP
5
泛读FindingsACL 2023

Images in Language Space: Exploring the Suitability of Large Language Models for Vision & Language Tasks

This paper asks whether the language space of a large language model alone can carry visual information and solve vision-language tasks, or whether dedicated visual encoding and multimodal alignment training are indispensable. Multimodal systems usually assume visual representations must be modeled separately; this work probes that boundary.

Sherzod Hakimov,David Schlangen
vlmllmvision-languageDOIDBLP
5
泛读FindingsACL 2023

Exploring Anisotropy and Outliers in Multilingual Language Models for Cross-Lingual Semantic Sentence Similarity

Core finding: anisotropy (representations concentrated in a few directions) and outlier dimensions (a few abnormal dimensions) in multilingual language models both affect cross-lingual sentence similarity, and they are not entirely the same problem. Prior anisotropy research focused on monolingual settings, with little systematic analysis of multilingual sentence representations, especially cross-lingual STS.

Katharina Hämmerl,Alina Fastowski,Jindrich Libovický,Alexander Fraser
LMU MunichCharles Universityanisotropyoutlier-dimensionsmultilingualDOIarXivDBLP
5
泛读FindingsACL 2023

Synthetic Pre-Training Tasks for Neural Machine Translation

Core question: if pre-training should avoid absorbing the toxicity, copyright, and privacy risks of real-world corpora, can pre-training on purely synthetic tasks and synthetic data still be useful for machine translation? Conventional wisdom holds that pre-training gains hinge on large-scale natural text; this work tests a more constrained but safer alternative.

Zexue He,Graeme Blackwood,Rameswar Panda,Julian J. McAuley,Rogério Feris
AWS AIUniversity of California San Diegosynthetic-datapretrainingmachine-translationDOIarXivDBLP
5
泛读LongACL 2023

Targeted Data Generation: Finding and Fixing Model Weaknesses

This paper addresses the fact that high overall accuracy does not rule out systematic weaknesses: models keep failing on certain subgroups, and randomly adding data rarely fixes those failures. Prior augmentation targeted global average performance on the assumption that more data is always better, but without knowing where the model is fragile, new data is often useless or even harmful to overall performance.

Zexue He,Marco Túlio Ribeiro,Fereshte Khani
synthetic-datadata-generationrobustnessDOIarXivDBLP
5
泛读LongACL 2023

Instruction Induction: From Few Examples to Natural Language Task Descriptions

Judging from the title, this paper asks whether natural-language task descriptions can be induced automatically from a handful of examples, lifting example-level supervision to instruction-level supervision. Few-shot learning usually feeds examples directly to the model while task instructions are written by hand, limiting transferability and reuse.

Or Honovich,Uri Shaham,Samuel R. Bowman,Omer Levy
instruction-inductionin-context-learningsynthetic-dataDOIDBLP
5
泛读DemoACL 2023

OpenDelta: A Plug-and-play Library for Parameter-efficient Adaptation of Pre-trained Models

Turns parameter-efficient fine-tuning (PEFT) from scattered research code into an engineering-grade library. Previously LoRA/Adapter/Prefix-tuning were each implemented by hacking model code separately, so reuse was costly and switching methods essentially meant rewriting the model.

Shengding Hu,Ning Ding,Weilin Zhao,Xingtai Lv,Zhen Zhang,Zhiyuan Liu,Maosong Sun
Tsinghua University THUNLPOpenBMBpeftadaptationlibraryDOIDBLP
5
泛读LongACL 2023

Won't Get Fooled Again: Answering Questions with False Premises

When asked questions with false premises (e.g., "Why does the sun rise in the west?"), LMs tend to fabricate an answer that accepts the premise instead of pointing out that it is wrong. This is a characteristic subclass of hallucination that was previously lumped together with generic hallucination.

Shengding Hu,Yifan Luo,Huadong Wang,Xingyi Cheng,Zhiyuan Liu,Maosong Sun
Tsinghua University THUNLPhallucinationrobustnessquestion-answeringDOIDBLP
5
泛读LongACL 2023

Zero-shot Faithful Factual Error Correction

Factual error correction (FEC) requires editing only the erroneous facts while leaving correct parts untouched, and staying faithful (the corrected sentence must actually be supported by the evidence). Supervised FEC data is extremely scarce, and zero-shot methods are either unfaithful or over-edit.

Kung-Hsiang Huang,Hou Pong Chan,Heng Ji
UIUCfactualityzero-shoteditingDOIDBLP
5
泛读ShortACL 2023

MultiTool-CoT: GPT-3 Can Use Multiple External Tools with Chain of Thought Prompting

Lets an LLM call multiple external tools via CoT (calculator, chemistry knowledge base, molecular-weight lookup, etc.) rather than Toolformer-style single-tool calls. The key is coordinating multiple tools within one reasoning chain and handling errors and back-filling of results.
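
A minimal sketch of the multi-tool dispatch loop: scan the generated reasoning chain for tool-call markers and splice each tool's output back in. The `<<tool: arg>>` trigger syntax and the `TOOLS` registry are illustrative assumptions, not the paper's exact prompt protocol:

```python
import re

# Hypothetical tool registry keyed by tool name.
TOOLS = {
    "calculator": lambda arg: str(eval(arg, {"__builtins__": {}})),
    "atomic_weight": lambda arg: {"H": "1.008", "O": "15.999"}[arg],
}

def run_tools(cot_text):
    # Replace each "<<name: arg>>" call in the chain with the tool output,
    # so later reasoning steps can consume the back-filled result.
    def dispatch(match):
        name, arg = match.group(1), match.group(2).strip()
        return TOOLS[name](arg)
    return re.sub(r"<<(\w+):([^>]*)>>", dispatch, cot_text)

chain = "Total atoms: <<calculator: 2+1>>; oxygen weight: <<atomic_weight: O>>."
resolved = run_tools(chain)
```

In the actual setting, generation would pause at each trigger, the tool result would be appended, and decoding would resume, so one chain can interleave several tools.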

Tatsuro Inaba,Hirokazu Kiyomaru,Fei Cheng,Sadao Kurohashi
Kyoto Universitytool-usechain-of-thoughtpromptingDOIDBLP
5
泛读FindingsACL 2023

Data-Efficient Finetuning Using Cross-Task Nearest Neighbors

Given a target task with only a little labeled data, how do you pick the most useful subset of a huge multi-task instruction dataset to fine-tune on, instead of training on everything? Full multi-task training is expensive and noisy.
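
The selection idea can be sketched as embedding the few target examples and retrieving their nearest neighbors from the multi-task pool. The bag-of-words embedding below is a toy stand-in (the paper would use a pretrained encoder); the data is invented for illustration:

```python
from collections import Counter
import math

def embed(text):
    # Toy bag-of-words embedding; a real system uses a neural encoder.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def cross_task_neighbors(target_examples, pool, k=2):
    # Union of each target example's k nearest pool instances forms
    # the data-efficient finetuning subset.
    selected = set()
    for t in target_examples:
        te = embed(t)
        ranked = sorted(pool, key=lambda p: cosine(te, embed(p)), reverse=True)
        selected.update(ranked[:k])
    return selected

pool = [
    "translate this sentence to French",
    "summarize the following article",
    "what is the capital of France",
    "write a short story about a dragon",
]
target = ["what is the capital of Spain"]
subset = cross_task_neighbors(target, pool, k=2)
```

The point of the design is that relevance is measured per target example rather than per task, so useful instances can come from any task in the pool.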

Hamish Ivison,Noah A. Smith,Hannaneh Hajishirzi,Pradeep Dasigi
Allen Institute for AIUniversity of Washingtondata-efficiencyfinetuningnearest-neighborDOIDBLP
5
泛读FindingsACL 2023

Multi-Dimensional Evaluation of Text Summarization with In-Context Learning

Sameer Jain,Vaishakh Keshava,Swarnashree Mysore Sathyendra,Patrick Fernandes,Pengfei Liu,Graham Neubig,Chunting Zhou
evaluationsummarizationin-context-learningDOIDBLP
5
泛读FindingsACL 2023

RHO: Reducing Hallucination in Open-domain Dialogues with Knowledge Grounding

Ziwei Ji,Zihan Liu,Nayeon Lee,Tiezheng Yu,Bryan Wilie,Min Zeng,Pascale Fung
hallucinationdialogueknowledge-groundingDOIDBLP
5
泛读FindingsACL 2023

Early Exit with Disentangled Representation and Equiangular Tight Frame

Yixin Ji,Jikai Wang,Juntao Li,Qiang Chen,Wenliang Chen,Min Zhang
early-exitefficient-inferencerepresentation-learningDOIDBLP
5
泛读FindingsACL 2023

"Low-Resource" Text Classification: A Parameter-Free Classification Method with Compressors

Zhiying Jiang,Matthew Y. R. Yang,Mikhail Tsirlin,Raphael Tang,Yiqin Dai,Jimmy Lin
compressionclassificationparameter-freeDOIDBLP
5
泛读LongACL 2023

Vision Language Pre-training by Contrastive Learning with Cross-Modal Similarity Regulation

Chaoya Jiang,Wei Ye,Haiyang Xu,Songfang Huang,Fei Huang,Shikun Zhang
vlpcontrastiveDOIDBLP
5
泛读FindingsACL 2023

Leveraging Training Data in Few-Shot Prompting for Numerical Reasoning

Zhanming Jie,Wei Lu
few-shotpromptingreasoningDOIDBLP
5
泛读LongACL 2023

DarkBERT: A Language Model for the Dark Side of the Internet

Youngjin Jin,Eugene Jang,Jian Cui,Jin-Woo Chung,Yongjae Lee,Seungwon Shin
domain-lmpretrain-dataDOIDBLP
5
泛读LongACL 2023

Are Machine Rationales (Not) Useful to Humans? Measuring and Improving Human Utility of Free-text Rationales

It is widely assumed that model-generated natural-language explanations (rationales / CoT) help humans understand and complete tasks, but this assumption lacks rigorous empirical testing and often turns out to be wrong.

Brihi Joshi,Ziyi Liu,Sahana Ramnath,Aaron Chan,Zhewei Tong,Shaoliang Nie,Qifan Wang,Yejin Choi,Xiang Ren
USCAllen Institute for AI (AI2)rationalesexplanationDOIDBLP
5
泛读FindingsACL 2023

Is Continuous Prompt a Combination of Discrete Prompts? Towards a Novel View for Interpreting Continuous Prompts

Continuous prompts (soft prompts) work extremely well for parameter-efficient fine-tuning, yet they are black-box vectors in continuous space that cannot be mapped back to human-readable discrete tokens, making their working mechanism hard to interpret.

Tianjie Ju,Yubin Zheng,Hanyi Wang,Haodong Zhao,Gongshen Liu
Shanghai Jiao Tong Universityprompt-tuninginterpretabilityDOIDBLP
5
泛读IndustryACL 2023

Evaluating Embedding APIs for Information Retrieval

Commercial embedding APIs (e.g., OpenAI, Cohere) are widely used in retrieval-augmented generation (RAG), but they are closed-source black boxes; their true generalization across data distributions, compared with open-source models, lacks systematic evaluation.

Ehsan Kamalloo,Xinyu Zhang,Odunayo Ogundepo,Nandan Thakur,David Alfonso-Hermelo,Mehdi Rezagholizadeh,Jimmy Lin
University of WaterlooembeddingretrievalevaluationDOIDBLP
5
泛读ShortACL 2023

A Better Way to Do Masked Language Model Scoring

When scoring sentence naturalness with a masked language model (MLM, e.g., BERT), the standard pseudo-log-likelihood (PLL) masks each token in turn, requiring O(N) forward passes; this is computationally expensive, and leakage through the bidirectional context makes the scores inaccurate.
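
The conventional PLL loop can be sketched as follows. The `masked_logprob` scorer here is a uniform-distribution stand-in for a real BERT forward pass (an assumption for illustration); the structural point is the one-model-call-per-token cost:

```python
import math

def masked_logprob(tokens, i, vocab):
    # Stand-in for log P(tokens[i] | tokens with position i masked).
    # A real implementation would run one MLM forward pass per mask.
    return math.log(1.0 / len(vocab))

def pseudo_log_likelihood(tokens, vocab):
    # Conventional PLL: mask each position in turn -> O(N) model calls.
    return sum(masked_logprob(tokens, i, vocab) for i in range(len(tokens)))

vocab = ["the", "cat", "sat", "on", "mat"]
sentence = ["the", "cat", "sat"]
score = pseudo_log_likelihood(sentence, vocab)
```

The leakage issue arises because, when only one subword of a multi-token word is masked, the remaining subwords reveal the word's identity and inflate the score; the paper's fix changes what gets masked per pass, not this overall loop shape.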

Carina Kauf,Anna A. Ivanova
MITmasked-lmscoringevaluationDOIDBLP
5
泛读LongACL 2023

infoVerse: A Universal Framework for Dataset Characterization with Multidimensional Meta-information

In both pre-training and fine-tuning, data selection often relies on heuristics or a single metric, lacking a unified multidimensional framework to systematically characterize a dataset's intrinsic properties (e.g., linguistic complexity, diversity, toxicity).

Jaehyung Kim,Yekyung Kim,Karin de Langis,Jinwoo Shin,Dongyeop Kang
KAISTUniversity of Minnesotadata-qualitydataset-analysisDOIDBLP
5
泛读LongACL 2023

Memory-efficient NLLB-200: Language-specific Expert Pruning of a Massively Multilingual Machine Translation Model

Addresses the excessive memory footprint and heavily redundant computation of massive multilingual MoE models (e.g., NLLB-200) when deployed for translation of specific languages.

Yeskendir Koishekenov,Alexandre Berard,Vassilina Nikoulina
Naver Labs EuropemoepruningmultilingualDOIDBLP
5
泛读FindingsACL 2023

Neural Architecture Search for Parameter-Efficient Fine-tuning of Large Pre-trained Language Models

Addresses the problem that hand-designed parameter-efficient tuning (PET, e.g., Adapter/LoRA) architectures are often suboptimal across downstream tasks, exploring how to automatically find the best parameter insertion points and budget allocation.

Neal Lawton,Anoop Kumar,Govind Thattai,Aram Galstyan,Greg Ver Steeg
USCAmazonparameter-efficientnasfine-tuningDOIarXivDBLP
5
泛读FindingsACL 2023

Recursion of Thought: A Divide-and-Conquer Approach to Multi-Context Reasoning with Language Models

Addresses the problem that in complex multi-step reasoning (e.g., long CoT chains), intermediate steps grow rapidly and easily exceed the model's maximum context window.
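
The divide-and-conquer idea can be sketched as: each subproblem is solved in a fresh, bounded "context," and only the subanswer returns to the parent. This is an assumption-level analogy (summing a long list under a fake context limit), not the paper's exact special-token protocol:

```python
MAX_CONTEXT = 8  # hypothetical context-window limit, in operand count

def solve(numbers):
    # Base case: the problem fits inside one context window.
    if len(numbers) <= MAX_CONTEXT:
        return sum(numbers)
    # Recursive case: spawn subproblems in separate contexts and
    # combine only their returned answers in the parent context.
    mid = len(numbers) // 2
    return solve(numbers[:mid]) + solve(numbers[mid:])

total = solve(list(range(1, 101)))  # 100 operands, far beyond MAX_CONTEXT
```

Because each call sees only its own slice plus the subanswers, no single context ever holds the full reasoning trace, which is the property that lets reasoning length exceed the window.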

Soochan Lee,Gunhee Kim
Seoul National Universitychain-of-thoughtreasoningdivide-and-conquerDOIarXivDBLP
5
泛读LongACL 2023

Diverse Demonstrations Improve In-context Compositional Generalization

Addresses the weak compositional generalization of large language models in in-context learning (ICL): models struggle to compose known simple concepts into complex unseen structures.

Itay Levy,Ben Bogin,Jonathan Berant
Tel Aviv Universityin-context-learningcompositional-generalizationdemonstration-diversityDOIDBLP
5
泛读LongACL 2023

Unifying Cross-Lingual and Cross-Modal Modeling Towards Weakly Supervised Multilingual Vision-Language Pre-training

Addresses the extreme scarcity of non-English image-text pairs in multilingual vision-language pre-training (VLP), which leaves models underperforming on non-English multimodal tasks.

Zejun Li,Zhihao Fan,Jingjing Chen,Qi Zhang,Xuanjing Huang,Zhongyu Wei
Fudan Universitymultilingualvision-language-pretrainingcross-lingualDOIDBLP
5
泛读LongACL 2023

Contrastive Decoding: Open-ended Text Generation as Optimization

Addresses the problem that in open-ended text generation, maximization-based decoding (greedy/beam search) from autoregressive LMs yields repetitive, bland text, while sampling (e.g., nucleus sampling) easily drifts off-topic or makes logical errors.
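
Contrastive decoding scores each candidate token by the difference between a large "expert" LM's and a small "amateur" LM's log-probabilities, subject to a plausibility constraint. The toy distributions below are invented to illustrate the mechanism:

```python
import math

def contrastive_pick(p_expert, p_amateur, alpha=0.1):
    # Plausibility constraint: keep only tokens whose expert probability
    # is within an alpha factor of the expert's best token.
    threshold = alpha * max(p_expert.values())
    scores = {}
    for tok, pe in p_expert.items():
        if pe >= threshold:
            # Reward what the expert likes much more than the amateur does.
            scores[tok] = math.log(pe) - math.log(p_amateur[tok])
    return max(scores, key=scores.get)

# Hypothetical next-token distributions: the amateur LM inflates the
# repetitive continuation "the", which contrastive scoring penalizes.
p_expert = {"the": 0.40, "ocean": 0.35, "a": 0.25}
p_amateur = {"the": 0.70, "ocean": 0.05, "a": 0.25}
best = contrastive_pick(p_expert, p_amateur)
```

The constraint matters: without it, implausible tokens the amateur assigns near-zero mass could win on the log-ratio alone.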

Xiang Lisa Li,Ari Holtzman,Daniel Fried,Percy Liang,Jason Eisner,Tatsunori Hashimoto,Luke Zettlemoyer,Mike Lewis
Stanford UniversityUniversity of Washingtondecodingtext-generationcontrastive-decodingDOIDBLP
5
泛读FindingsACL 2023

Structure-Aware Language Model Pretraining Improves Dense Retrieval on Structured Data

Existing dense retrieval supports structured data poorly, because pre-trained language models learn representations over natural text with no modeling of tables, entity relations, or field structure. The common workaround linearizes structured objects into text before retrieval, which flattens structural information and is unstable especially under missing fields, varied phrasings, and cross-modal alignment.

Xinze Li,Zhenghao Liu,Chenyan Xiong,Shi Yu,Yu Gu,Zhiyuan Liu,Ge Yu
pretrainingdense-retrievalstructured-dataDOIarXivDBLP
5
泛读ShortACL 2023

AutoConv: Automatically Generating Information-seeking Conversations with Large Language Models

Addresses the scarcity of information-seeking conversation data, which limits dialog-system training. The usual approaches, human annotation and template synthesis, are respectively expensive and distributionally rigid; as LLMs gain stronger few-shot generation ability, the authors revisit having the model itself generate high-quality training conversations.

Siheng Li,Cheng Yang,Yichun Yin,Xinyu Zhu,Zesen Cheng,Lifeng Shang,Xin Jiang,Qun Liu,Yujiu Yang
synthetic-dataconversation-generationDOIarXivDBLP
5
泛读LongACL 2023

Parameter-Efficient Fine-Tuning without Introducing New Latency

Addresses a practical shortcoming of PEFT: many methods are parameter-efficient, yet either add modules that increase inference latency or depend on task-specific sparse parameter selection, which hurts transfer and federated scenarios. The community has long accepted "a few new parameters for a bit of latency," but in strict online systems that cost is far from small.

Baohao Liao,Yan Meng,Christof Monz
peftfine-tuninglatencyDOIarXivDBLP
4
LongACL 2023

Revisiting the Gold Standard: Grounding Summarization Evaluation with Robust Human Evaluation

Existing human evaluation of summarization suffers from either low inter-annotator agreement or insufficient scale, and lacks in-depth analysis, making evaluations of automatic metrics and systems poorly comparable.

Yixin Liu,Alexander R. Fabbri,Pengfei Liu,Yilun Zhao,Linyong Nan,Ruilin Han ... 1 author omitted ... ,Shafiq Joty,Chien-Sheng Wu,Caiming Xiong,Dragomir Radev
evaluationsummarizationhuman-evalDOIarXivDBLP
5
泛读ShortACL 2023

Are Sample-Efficient NLP Models More Robust?

Nelson F. Liu,Ananya Kumar,Percy Liang,Robin Jia
sample-efficiencyrobustnessgeneralizationDOIDBLP
5
泛读LongACL 2023

Towards Robust Low-Resource Fine-Tuning with Multi-View Compressed Representations

Pre-trained language models are large, so low-resource fine-tuning easily overfits; existing parameter-efficient methods either add inference overhead or deliver limited gains.

Linlin Liu,Xingxuan Li,Megh Thakkar,Xin Li,Shafiq Joty,Luo Si,Lidong Bing
Nanyang Technological Universityfine-tuninglow-resourceregularizationDOIarXivDBLP
5
泛读FindingsACL 2023

Maximum Entropy Loss, the Silver Bullet Targeting Backdoor Attacks in Pre-trained Language Models

Backdoor attacks on pre-trained language models are mounted by fine-tuning on poisoned data to minimize cross-entropy loss; existing defenses can only detect and remove poisoned samples at pre-training or inference time, and cannot directly reverse the attack's effect.

Zhengxiao Liu,Bowen Shen,Zheng Lin,Fali Wang,Weiping Wang
securitybackdoorloss-functionDOIDBLP
5
泛读FindingsACL 2023

Deeply Coupled Cross-Modal Prompt Learning

Xuejing Liu,Wei Tang,Jinghui Lu,Rui Zhao,Zhaojun Guo,Fei Tan
multimodalprompt-learningcross-modalDOIDBLP
5
泛读ShortACL 2023

MolXPT: Wrapping Molecules with Text for Generative Pre-training

Zequn Liu,Wei Zhang,Yingce Xia,Lijun Wu,Shufang Xie,Tao Qin,Ming Zhang,Tie-Yan Liu
moleculargenerative-pretrainingtext-conditioningDOIDBLP
5
泛读LongACL 2023

Toward Human-Like Evaluation for Natural Language Generation with Error Analysis

Qingyu Lu,Liang Ding,Liping Xie,Kanjian Zhang,Derek F. Wong,Dacheng Tao
evaluationnlgerror-analysisDOIDBLP
5
泛读LongACL 2023

Explanation-based Finetuning Makes Models More Robust to Spurious Cues

Josh Magnus Ludan,Yixuan Meng,Tai Nguyen,Saurabh Shah,Qing Lyu,Marianna Apidianaki,Chris Callison-Burch
explanationfinetunerobustnessDOIDBLP
5
Skim · Short · ACL 2023

Parameter-efficient Weight Ensembling Facilitates Task-level Knowledge Transfer

Xingtai Lv,Ning Ding,Yujia Qin,Zhiyuan Liu,Maosong Sun
peftweight-ensembletransferDOIDBLP
5
Skim · Findings · ACL 2023

LightFormer: Light-weight Transformer Using SVD-based Weight Transfer and Parameter Sharing

Xiuqing Lv,Peng Zhang,Sunzhu Li,Guobing Gan,Yueheng Sun
svdparameter-sharingcompressionDOIDBLP
5
Skim · Long · ACL 2023

World-to-Words: Grounded Open Vocabulary Acquisition through Fast Mapping in Vision-Language Models

Ziqiao Ma,Jiayi Pan,Joyce Chai
vlmgroundingvocabularyDOIDBLP
5
Skim · Short · ACL 2023

Exploring the Impact of Layer Normalization for Zero-shot Neural Machine Translation

Zhuoyuan Mao,Raj Dabre,Qianying Liu,Haiyue Song,Chenhui Chu,Sadao Kurohashi
layernormmtzero-shotDOIDBLP
5
Skim · Findings · ACL 2023

Membership Inference Attacks against Language Models via Neighbourhood Comparison

Justus Mattern,Fatemehsadat Mireshghallah,Zhijing Jin,Bernhard Schölkopf,Mrinmaya Sachan,Taylor Berg-Kirkpatrick
membership-inferenceprivacymemorizationDOIDBLP
5
Skim · Long · ACL 2023

Cross-lingual Continual Learning

Meryem M'hamdi,Xiang Ren,Jonathan May
continual-learningmultilingualDOIDBLP
5
Skim · Long · ACL 2023

LAIT: Efficient Multi-Segment Encoding in Transformers with Layer-Adjustable Interaction

Jeremiah Milbauer,Annie Louis,Mohammad Javad Hosseini,Alex Fabrikant,Donald Metzler,Tal Schuster
transformerefficient-encodingDOIDBLP
5
Skim · Findings · ACL 2023

Triggering Multi-Hop Reasoning for Question Answering in Language Models using Soft Prompts and Random Walks

Frozen language models often lack explicit reasoning-path planning for multi-hop question answering, while traditional discrete prompt engineering has a large search space and is extremely sensitive to small perturbations.

Kanishka Misra,Cícero Nogueira dos Santos,Siamak Shakeri
Google Researchmulti-hopsoft-promptreasoningDOIDBLP
5
Skim · SRW · ACL 2023

Can LMs Store and Retrieve 1-to-N Relational Knowledge?

Existing knowledge probing focuses mainly on 1-to-1 relations (e.g., "The capital of France is __") and overlooks whether language models can fully store and retrieve 1-to-N relational knowledge (e.g., "The members of The Beatles are __").

Haruki Nagasawa,Benjamin Heinzerling,Kazuma Kokuta,Kentaro Inui
Tohoku UniversityRIKENknowledge-storagerelational-knowledgeprobingDOIDBLP
5
Skim · Long · ACL 2023

DisentQA: Disentangling Parametric and Contextual Knowledge with Counterfactual Question Answering

In RAG and other settings that supply external context, when the given context conflicts with the parametric knowledge the model learned during pre-training, the model often hallucinates, stubbornly outputting its parametric knowledge instead of following the context.

Ella Neeman,Roee Aharoni,Or Honovich,Leshem Choshen,Idan Szpektor,Omri Abend
Googleparametric-knowledgecontextual-knowledgecounterfactualDOIDBLP
5
Skim · Findings · ACL 2023

DiMS: Distilling Multiple Steps of Iterative Non-Autoregressive Transformers for Machine Translation

This paper tackles an old problem in non-autoregressive machine translation: iterative refinement models rely on multi-step decoding to improve quality, but latency creeps back as the number of steps grows. The usual compromise trades few steps for speed or many steps for quality; the authors aim to compress the quality of a multi-step teacher into the inference path of a few-step or even one-step student.

Sajad Norouzi,Rasa Hosseinzadeh,Felipe Pérez,Maksims Volkovs
non-autoregressivedistillationmachine-translationDOIDBLP
5
Skim · Industry · ACL 2023

Application-Agnostic Language Modeling for On-Device ASR

This paper asks how a language model for on-device ASR can be application-agnostic rather than specialized to one fixed scenario or decoder. Traditional on-device ASR designs the LM as a narrow-domain, short-context, tightly constrained companion module; this keeps its size manageable but makes performance unstable when transferring to new applications or new text distributions.

Markus Nußbaum-Thom,Lyan Verwimp,Youssef Oualil
on-devicelanguage-modelasrDOIDBLP
5
Skim · Findings · ACL 2023

Second Language Acquisition of Neural Language Models

The core question: when neural language models learn a second language, do they show patterns similar to human second-language acquisition, rather than multilinguality simply being the product of more data? Prior multilingual LM research has focused on zero-shot transfer and cross-lingual performance, and rarely treats learning order, existing linguistic knowledge, and transfer direction as an acquisition process.

Miyu Oba,Tatsuki Kuribayashi,Hiroki Ouchi,Taro Watanabe
second-language-acquisitionneural-lmlearning-dynamicsDOIDBLP
5
Skim · Long · ACL 2023

ThinkSum: Probabilistic reasoning over sets using large language models

This paper addresses how large language models can perform probabilistic reasoning over sets, instead of being forced to linearize a set into some arbitrary order before generating text. LLMs are inherently sequence models and introduce order bias when handling sets — a systematic flaw for probabilistic aggregation, evidence merging, and permutation-invariant reasoning.

Batu Ozturkler,Nikolay Malkin,Zhen Wang,Nebojsa Jojic
probabilistic-reasoningllm-reasoninginferenceDOIDBLP
5
Skim · Long · ACL 2023

Fact-Checking Complex Claims with Program-Guided Reasoning

Complex fact-checking requires multi-step reasoning over multiple evidence sources (text, tables, computation); end-to-end retrieve-and-classify models struggle to decompose the reasoning chain. The authors turn verification of complex claims into composable programs.

Liangming Pan,Xiaobao Wu,Xinyuan Lu,Anh Tuan Luu,William Yang Wang,Min-Yen Kan,Preslav Nakov
UCSBNanyang Technological Universityprogram-guided-reasoningfact-checkingagentDOIDBLP
5
Skim · Long · ACL 2023

MM-SHAP: A Performance-agnostic Metric for Measuring Multimodal Contributions in Vision and Language Models & Tasks

Existing VLM evaluation cannot answer a basic question: does the model actually use the image for the task, or does it take a text-only unimodal shortcut? High accuracy does not imply good multimodal fusion.

Letitia Parcalabescu,Anette Frank
Heidelberg Universitymultimodal-evaluationinterpretabilityshapley-valuesDOIDBLP
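The Shapley-value idea behind measuring per-modality contributions can be illustrated with a tiny two-player (text vs. image) computation; the score function `v` below is a hypothetical stand-in for a model evaluated with subsets of modalities available, not MM-SHAP itself:

```python
from itertools import permutations

def shapley_two_player(v):
    """Exact Shapley values for two players ('text', 'image'):
    average each player's marginal contribution over all join orders."""
    players = ("text", "image")
    contrib = {p: 0.0 for p in players}
    for order in permutations(players):
        present = set()
        for p in order:
            before = v(frozenset(present))
            present.add(p)
            contrib[p] += v(frozenset(present)) - before
    return {p: c / 2 for p, c in contrib.items()}  # 2! = 2 orderings

# Hypothetical scores: the model does well with text alone and barely
# improves when the image is added — a text-shortcut pattern.
scores = {
    frozenset(): 0.25,
    frozenset({"text"}): 0.80,
    frozenset({"image"}): 0.30,
    frozenset({"text", "image"}): 0.85,
}
phi = shapley_two_player(scores.__getitem__)
print(phi)  # text ≈ 0.55, image ≈ 0.05
```

The two values sum to `v(both) - v(neither)` (the efficiency property), so a near-zero image share flags a unimodal shortcut regardless of overall accuracy.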
5
Skim · Findings · ACL 2023

Enhancing Out-of-Vocabulary Estimation with Subword Attention

Representations of out-of-vocabulary (OOV) words are poor in quality; the traditional approach approximates the word vector by averaging or concatenating subword embeddings, which under-uses the available information.

Raj Patel,Carlotta Domeniconi
George Mason Universitytokenizationsubword-attentionoovDOIDBLP
5
Skim · Findings · ACL 2023

PREADD: Prefix-Adaptive Decoding for Controlled Text Generation

Common approaches to controlled generation (FUDGE, PPLM, GeDi, etc.) either train extra classifiers or incur high inference cost. The authors want a lighter decoding-time method for attribute control, especially suited to settings where retraining an LLM is impractical.

Jonathan Pei,Kevin Yang,Dan Klein
UC Berkeleycontrolled-generationdecoding-strategyDOIDBLP
5
Skim · Demo · ACL 2023

GAIA Search: Hugging Face and Pyserini Interoperability for NLP Training Data Exploration

Aleksandra Piktus,Odunayo Ogundepo,Christopher Akiki,Akintunde Oladipo,Xinyu Zhang,Hailey Schoelkopf,Stella Biderman,Martin Potthast,Jimmy Lin
data-explorationpretraining-datatoolingDOIDBLP
5
Skim · Short · ACL 2023

Multi-Document Summarization with Centroid-Based Pretraining

Ratish Surendran Puduppully,Parag Jain,Nancy Chen,Mark Steedman
pretrainingsummarizationmulti-documentDOIDBLP
5
Skim · Long · ACL 2023

ClarifyDelphi: Reinforced Clarification Questions with Defeasibility Rewards for Social and Moral Situations

Valentina Pyatkin,Jena D. Hwang,Vivek Srikumar,Ximing Lu,Liwei Jiang,Yejin Choi,Chandra Bhagavatula
rlclarificationreward-designDOIDBLP
5
Skim · Long · ACL 2023

Reasoning with Language Model Prompting: A Survey

Shuofei Qiao,Yixin Ou,Ningyu Zhang,Xiang Chen,Yunzhi Yao,Shumin Deng,Chuanqi Tan,Fei Huang,Huajun Chen
promptingreasoningsurveyDOIDBLP
5
Skim · Long · ACL 2023

What Are You Token About? Dense Retrieval as Distributions Over the Vocabulary

Ori Ram,Liat Bezalel,Adi Zicher,Yonatan Belinkov,Jonathan Berant,Amir Globerson
retrievalvocabularyrepresentation-learningDOIDBLP
5
Skim · Long · ACL 2023

A Comparative Study on the Impact of Model Compression Techniques on Fairness in Language Models

Krithika Ramesh,Arnav Chavan,Shrey Pandit,Sunayana Sitaram
model-compressionfairnesspruningDOIDBLP
5
Skim · Long · ACL 2023

Knowledge of cultural moral norms in large language models

Aida Ramezani,Yang Xu
llmevaluationmoral-reasoningDOIDBLP
5
Skim · Long · ACL 2023

Cross-Modal Attribute Insertions for Assessing the Robustness of Vision-and-Language Learning

Shivaen Ramshetty,Gaurav Verma,Srijan Kumar
vlmrobustnesscross-modalDOIDBLP
5
Skim · Findings · ACL 2023

Residual Prompt Tuning: improving prompt tuning with residual reparameterization

Anastasia Razdaibiedina,Yuning Mao,Madian Khabsa,Mike Lewis,Rui Hou,Jimmy Ba,Amjad Almahairi
prompt-tuningparameter-efficientreparameterizationDOIDBLP
5
Skim · Findings · ACL 2023

On the Role of Parallel Data in Cross-lingual Transfer Learning

Machel Reid,Mikel Artetxe
cross-lingualtransfer-learningparallel-dataDOIDBLP
5
Skim · Findings · ACL 2023

Delving into the Openness of CLIP

Shuhuai Ren,Lei Li,Xuancheng Ren,Guangxiang Zhao,Xu Sun
clipvlmopennessDOIDBLP
5
Skim · Short · ACL 2023

Is Anisotropy Truly Harmful? A Case Study on Text Clustering

Mira Ait Saada,Mohamed Nadif
anisotropyrepresentationembeddingDOIDBLP
5
Skim · Findings · ACL 2023

Numeric Magnitude Comparison Effects in Large Language Models

When handling numbers, do LLMs merely memorize surface co-occurrence patterns of strings, or do they, like humans, learn an internal continuous representation of numeric magnitude?

Raj Sanjay Shah,Vijay Marupudi,Reba Koenen,Khushi Bhardwaj,Sashank Varma
numeracyprobingDOIDBLP
5
Skim · Short · ACL 2023

ScoNe: Benchmarking Negation Reasoning in Language Models With Fine-Tuning and In-Context Learning

Language models often fail on logical negation (especially multiple negation or complex scope), and there is no fine-grained benchmark dedicated to these phenomena.

Jingyuan Selena She,Christopher Potts,Samuel R. Bowman,Atticus Geiger
Stanford UniversityNYUAnthropicnegationbenchmarkiclDOIDBLP
5
Skim · Findings · ACL 2023

Rethinking Semi-supervised Learning with Language Models

Traditional semi-supervised learning (SSL, e.g., pseudo-labeling) is usually applied to small models; in the era of modern LLMs with strong pre-trained representations, are these methods still effective?

Zhengxiang Shi,Francesco Tonolini,Nikolaos Aletras,Emine Yilmaz,Gabriella Kazai,Yunlong Jiao
University College Londonsemi-supervisedlm-trainingDOIDBLP
5
Skim · Long · ACL 2023

SLUE Phase-2: A Benchmark Suite of Diverse Spoken Language Understanding Tasks

Speech language models lack benchmarks that evaluate higher-level semantic understanding (e.g., question answering, summarization); most existing ones stop at ASR or simple intent classification.

Suwon Shon,Siddhant Arora,Chyi-Jiunn Lin,Ankita Pasad,Felix Wu,Roshan S. Sharma,Wei-Lun Wu,Hung-yi Lee,Karen Livescu,Shinji Watanabe
speechbenchmarksluDOIDBLP
5
Skim · Findings · ACL 2023

Follow the Wisdom of the Crowd: Effective Text Generation via Minimum Bayes Risk Decoding

This work addresses the mismatch between decoding objectives and evaluation metrics in text generation: standard beam search or sampling mainly optimizes model likelihood, while what we actually care about is output closer to reference answers or task utility. MBR decoding has a history in machine translation but sees little use in general text generation; the authors test whether following the consensus of a candidate set beats chasing a single high-probability path.

Mirac Suzgun,Luke Melas-Kyriazi,Dan Jurafsky
decodingminimum-bayes-risktext-generationDOIDBLP
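The consensus idea described in this entry can be sketched in a few lines: sample candidates, then pick the one with the highest expected utility against the rest. A minimal sketch, using token-overlap F1 as a stand-in utility (real MBR uses a task-appropriate metric):

```python
from collections import Counter

def token_f1(hyp: str, ref: str) -> float:
    """Simple token-overlap F1, a stand-in for a real utility metric."""
    h, r = hyp.split(), ref.split()
    common = sum((Counter(h) & Counter(r)).values())
    if common == 0:
        return 0.0
    p, rec = common / len(h), common / len(r)
    return 2 * p * rec / (p + rec)

def mbr_decode(candidates: list[str], utility=token_f1) -> str:
    """Return the candidate with the highest expected utility against
    all other candidates — the 'consensus' choice."""
    def expected_utility(hyp: str) -> float:
        others = [c for c in candidates if c is not hyp]
        return sum(utility(hyp, ref) for ref in others) / len(others)
    return max(candidates, key=expected_utility)

samples = [
    "the cat sat on the mat",
    "the cat sat on a mat",
    "a dog ran in the park",
]
print(mbr_decode(samples))  # → "the cat sat on a mat"
```

Note the winner is the hypothesis that agrees most with the whole pool (it shares tokens with both other samples), not necessarily the one a model would score as most likely on its own.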
5
Skim · Findings · ACL 2023

Reimagining Retrieval Augmented Language Models for Answering Queries

This work argues that existing RAG pipelines for answering queries remain suboptimal: retrieval, reading, and generation are disjoint, leading to insufficient recall of relevant evidence, unstable evidence fusion, and answers dominated by the generator's prior. The authors rethink the overall design of RALMs instead of patching individual modules.

Wang-Chiew Tan,Yuliang Li,Pedro Rodriguez,Richard James,Xi Victoria Lin,Alon Y. Halevy,Wen-tau Yih
retrieval-augmentedlanguage-modelragDOIDBLP
5
Skim · Findings · ACL 2023

Are Intermediate Layers and Labels Really Necessary? A General Language Model Distillation Method

This work asks whether language model distillation really needs two common but expensive supervision signals: intermediate-layer alignment and human labels. Traditional distillation often relies on teacher logits, hidden states, and even annotated data, which is complex to implement and demands architectural compatibility; the authors want a more general, lighter-weight distillation method.

Shicheng Tan,Weng Lam Tam,Yuanchun Wang,Wenwen Gong,Shu Zhao,Peng Zhang,Jie Tang
knowledge-distillationlanguage-modelintermediate-layersDOIDBLP
5
Skim · Industry · ACL 2023

GKD: A General Knowledge Distillation Framework for Large-scale Pre-trained Language Model

This work addresses the lack of a sufficiently general and scalable framework for distilling large-scale pre-trained language models. Many distillation methods depend on specific tasks, particular architectures, or heavy supervision, which keeps them from becoming a standard component of foundation-model training pipelines.

Shicheng Tan,Weng Lam Tam,Yuanchun Wang,Wenwen Gong,Shu Zhao,Peng Zhang,Jie Tang
knowledge-distillationpretrained-lmcompressionDOIDBLP
5
Skim · Long · ACL 2023

What's the Meaning of Superhuman Performance in Today's NLU?

Simone Tedeschi,Johan Bos,Thierry Declerck,Jan Hajic,Daniel Hershcovich,Eduard H. Hovy ... 2 authors omitted ... ,Steven Schockaert,Rico Sennrich,Ekaterina Shutova,Roberto Navigli
evaluationnlubenchmarksDOIDBLP
6
Skim · Long · ACL 2023

LayoutMask: Enhance Text-Layout Interaction in Multi-modal Pre-training for Document Understanding

In multimodal pre-training for document understanding, text and layout information interact insufficiently; existing methods use global 1D positions as the layout input and cannot capture local correspondences between text and position.

Yi Tu,Ya Guo,Huan Chen,Jinyang Tang
multimodal-pretrainingdocument-understandinglayoutDOIarXivDBLP
5
Skim · Long · ACL 2023

Prompting PaLM for Translation: Assessing Strategies and Performance

Evaluations of LLM translation ability suffer from inconsistent prompt-strategy choices, outdated test sets, and unreasonable metrics; previous assessments of PaLM's translation ability were biased as a result.

David Vilar,Markus Freitag,Colin Cherry,Jiaming Luo,Viresh Ratnakar,George F. Foster
Google Researchin-context-learningmachine-translationpromptingDOIarXivDBLP
6
Skim · Long · ACL 2023

KGA: A General Machine Unlearning Framework Based on Knowledge Gap Alignment

Existing machine unlearning methods mainly target computer vision; in NLP, text data carries more sensitive information, and existing methods rely on strict distributional assumptions and adapt poorly.

Lingzhi Wang,Tong Chen,Wei Yuan,Xingshan Zeng,Kam-Fai Wong,Hongzhi Yin
Chinese University of Hong Kongmachine-unlearningknowledge-editingDOIarXivDBLP
6
Skim · Short · ACL 2023

Learning Multi-Step Reasoning by Solving Arithmetic Tasks

Multi-step mathematical reasoning emerges only in large language models; small models lack an effective way to have this ability injected, and existing CoT methods help small models little.

Tianduo Wang,Wei Lu
reasoningchain-of-thoughtarithmeticDOIarXivDBLP
5
Skim · Findings · ACL 2023

Pre-Trained Language-Meaning Models for Multilingual Parsing and Generation

Chunliu Wang,Huiyuan Lai,Malvina Nissim,Johan Bos
pretrainingsemantic-parsingmultilingualDOIDBLP
5
Skim · Findings · ACL 2023

Benchmarking Diverse-Modal Entity Linking with Generative Models

Existing entity-linking models support only single-modality configurations; there is no unified multimodal entity-linking benchmark or model that can handle linking for textual, visual, and tabular entities.

Sijia Wang,Alexander Hanbo Li,Henghui Zhu,Sheng Zhang,Pramuditha Perera,Chung-Wei Hang ... 2 authors omitted ... ,Zhiguo Wang,Vittorio Castelli,Bing Xiang,Patrick Ng
multimodalgenerative-modelsentity-linkingDOIarXivDBLP
5
Skim · Short · ACL 2023

How to Distill your BERT: An Empirical Study on the Impact of Weight Initialisation and Distillation Objectives

Xinpeng Wang,Leonie Weissweiler,Hinrich Schütze,Barbara Plank
distillationbertcompressionDOIDBLP
5
Skim · Short · ACL 2023

Let Me Check the Examples: Enhancing Demonstration Learning via Explicit Imitation

Sirui Wang,Kaiwen Wei,Hongzhi Zhang,Yuntao Li,Wei Wu
in-context-learningdemonstrationimitationDOIDBLP
5
Skim · Findings · ACL 2023

EfficientVLM: Fast and Accurate Vision-Language Models via Knowledge Distillation and Modal-adaptive Pruning

Tiannan Wang,Wangchunshu Zhou,Yan Zeng,Xinsong Zhang
vlmdistillationpruningDOIDBLP
5
Skim · Findings · ACL 2023

Disagreement Matters: Preserving Label Diversity by Jointly Modeling Item and Annotator Label Distributions with DisCo

Tharindu Cyril Weerasooriya,Alexander Ororbia,Raj Bhensadadia,Ashiqur R. KhudaBukhsh,Christopher Homan
label-qualityannotator-modelingdata-qualityDOIDBLP
5
Skim · Long · ACL 2023

f-Divergence Minimization for Sequence-Level Knowledge Distillation

Yuqiao Wen,Zichao Li,Wenyu Du,Lili Mou
distillationf-divergencesequence-levelDOIDBLP
5
Skim · Findings · ACL 2023

ANALOGICAL - A Novel Benchmark for Long Text Analogy Evaluation in Large Language Models

Thilini Wijesiriwardene,Ruwan Wickramarachchi,Bimal G. Gajera,Shreeyash Mukul Gowaikar,Chandan Gupta,Aman Chadha,Aishwarya Naresh Reganti,Amit P. Sheth,Amitava Das
benchmarkanalogylong-textDOIDBLP
5
Skim · Long · ACL 2023

AD-KD: Attribution-Driven Knowledge Distillation for Language Model Compression

Siyue Wu,Hongzhan Chen,Xiaojun Quan,Qifan Wang,Rui Wang
distillationcompressionattributionDOIDBLP
5
Skim · Findings · ACL 2023

Hence, Socrates is mortal: A Benchmark for Natural Language Syllogistic Reasoning

Evaluates LLMs' ability to carry out strict syllogistic reasoning in natural language, addressing the problem that existing reasoning benchmarks often conflate logical deduction with commonsense knowledge.

Yongkang Wu,Meng Han,Yutao Zhu,Lei Li,Xinyu Zhang,Ruofei Lai,Xiaoguang Li,Yuanhang Ren,Zhicheng Dou,Zhao Cao
Renmin University of ChinaUCSBreasoningbenchmarksyllogismDOIDBLP
5
Skim · Long · ACL 2023

Do PLMs Know and Understand Ontological Knowledge?

Investigates whether pre-trained language models (PLMs) truly understand ontological knowledge (e.g., concept hierarchies, property relations) or merely memorize surface co-occurrence frequencies of words.

Weiqi Wu,Chengyue Jiang,Yong Jiang,Pengjun Xie,Kewei Tu
ShanghaiTech UniversityAlibabaprobingknowledgeplmDOIDBLP
5
Skim · Findings · ACL 2023

Probabilistic Transformer: A Probabilistic Dependency Model for Contextual Word Representation

Addresses the lack of an explicit structural inductive bias in standard Transformer self-attention, attempting to give it a probabilistic graphical-model interpretation and improve it accordingly.

Haoyi Wu,Kewei Tu
ShanghaiTech UniversitytransformerprobabilisticarchitectureDOIDBLP
5
Skim · Findings · ACL 2023

Chain of Thought Prompting Elicits Knowledge Augmentation

Investigates why chain-of-thought (CoT) prompting works, revealing whether, beyond providing a step-by-step reasoning structure, it also activates the model's implicit internal knowledge.

Dingjun Wu,Jing Zhang,Xinmei Huang
University of Science and Technology of ChinacotpromptingknowledgeDOIDBLP
5
Skim · Findings · ACL 2023

Connectivity Patterns are Task Embeddings

Explores a more efficient, more intrinsic way to characterize what a neural network has learned on different tasks, replacing traditional task embeddings based on dense vectors.

Zhiheng Xi,Rui Zheng,Yuansen Zhang,Xuanjing Huang,Zhongyu Wei,Minlong Peng,Mingming Sun,Qi Zhang,Tao Gui
Fudan Universitytask-embeddingconnectivityDOIDBLP
5
Skim · Findings · ACL 2023

GLUE-X: Evaluating Natural Language Understanding Models from an Out-of-Distribution Generalization Perspective

This paper argues that NLU evaluation is over-concentrated on the IID setting, systematically underestimating the OOD generalization problems that really bite once pre-trained models are deployed. GLUE has long served as the general-purpose NLU benchmark, but it largely does not answer how much models degrade under distribution shift, where they degrade, or which methods resist shift better.

Linyi Yang,Shuibai Zhang,Libo Qin,Yafu Li,Yidong Wang,Hanmeng Liu,Jindong Wang,Xing Xie,Yue Zhang
ood-generalizationbenchmarkpretrained-lmDOIarXivDBLP
5
Skim · Long · ACL 2023

FiD-ICL: A Fusion-in-Decoder Approach for Efficient In-Context Learning

This paper tackles a practical bottleneck of ICL: concatenating multiple demonstrations into the input rapidly inflates computation and context-length cost, and many examples can interfere with each other. Conventional few-shot ICL defaults to early fusion, which is inefficient for long sequences and not necessarily the best form of information integration.

Qinyuan Ye,Iz Beltagy,Matthew E. Peters,Xiang Ren,Hannaneh Hajishirzi
Allen Institute for AIUniversity of Washingtonin-context-learningfusion-in-decoderefficiencyDOIDBLP
5
Skim · Long · ACL 2023

How poor is the stimulus? Evaluating hierarchical generalization in neural networks trained on child-directed speech

This paper asks whether the hierarchical generalization children show in syntax acquisition requires an innate hierarchical bias, or whether child-scale, child-style input plus general learning biases suffice. Many earlier neural-network studies used far more data than children actually receive, so they could not speak to the poverty-of-the-stimulus debate.

Aditya Yedetore,Tal Linzen,Robert Frank,R. Thomas McCoy
syntax-acquisitionhierarchical-generalizationtransformerDOIarXivDBLP
5
Skim · Findings · ACL 2023

Do Large Language Models Know What They Don't Know?

Judging from the title, this paper evaluates whether large language models can recognize the boundaries of their own knowledge — expressing uncertainty or abstaining when they do not know, rather than generating fluent but wrong answers. This question is often lumped under hallucination or calibration, but the title suggests the authors care about the metacognitive ability of knowing what one does not know.

Zhangyue Yin,Qiushi Sun,Qipeng Guo,Jiawen Wu,Xipeng Qiu,Xuanjing Huang
calibrationself-knowledgellmDOIDBLP
5
Skim · Long · ACL 2023

Did You Read the Instructions? Rethinking the Effectiveness of Task Definitions in Instruction Learning

This paper asks whether the human-written task definitions in instruction learning are actually fully used by models, and whether those definitions are written efficiently. Much prior work assumed that more complete definitions are better, but models may rely on only a few key tokens, with the rest serving as redundant context.

Fan Yin,Jesse Vig,Philippe Laban,Shafiq Joty,Caiming Xiong,Chien-Sheng Wu
instruction-learningtask-definitionllmDOIarXivDBLP
5
Skim · Long · ACL 2023

NUWA-XL: Diffusion over Diffusion for eXtremely Long Video Generation

This paper tackles two old problems in extremely long video generation: a clear train-inference gap between short-video training and long-video inference, and the inefficiency and fragile temporal consistency of segment-by-segment sequential generation. Conventional approaches extend length autoregressively or in cascades per segment, but once videos span thousands of frames, error accumulation and compute cost become hard to control.

Shengming Yin,Chenfei Wu,Huan Yang,Jianfeng Wang,Xiaodong Wang,Minheng Ni ... 6 authors omitted ... ,Lijuan Wang,Zicheng Liu,Houqiang Li,Nan Duan
diffusionvideo-generationlong-videoDOIarXivDBLP
5
Skim · Long · ACL 2023

ALERT: Adapt Language Models to Reasoning Tasks

Aims to determine whether, in few-shot CoT, large models are using reasoning skills learned during pre-training or just doing finer-grained memorization plus context matching. This distinction is key to judging whether scaling will keep improving reasoning ability.

Ping Yu,Tianlu Wang,Olga Golovneva,Badr AlKhamissi,Siddharth Verma,Zhijing Jin,Gargi Ghosh,Mona T. Diab,Asli Celikyilmaz
Meta AIreasoningfine-tuningllmDOIarXivDBLP
5
Skim · Long · ACL 2023

HyPe: Better Pre-trained Language Model Fine-tuning with Hidden Representation Perturbation

Fine-tuning PLMs on downstream tasks commonly suffers from overfitting and representation collapse (degenerate hidden states). Previous noise injection on inputs or parameters (e.g., R-Drop, adversarial training) touches only a small part of the model and does not directly protect intermediate-layer representations.

Hongyi Yuan,Zheng Yuan,Chuanqi Tan,Fei Huang,Songfang Huang
Alibaba DAMO AcademyTsinghua Universityfine-tuningregularizationrepresentation-collapseDOIarXivDBLP
5
Skim · Long · ACL 2023

One Network, Many Masks: Towards More Parameter-Efficient Transfer Learning

In parameter-efficient transfer learning (PEFT), adapters, LoRA, and similar methods still add extra parameters; can comparable results be achieved without adding any parameters at all? The motivation: when deploying large models, one wants to store only minimal extra information per task.

Guangtao Zeng,Peiyuan Zhang,Wei Lu
Singapore University of Technology and Designpefttransfer-learningmaskingDOIDBLP
5
Skim · Long · ACL 2023

AlignScore: Evaluating Factual Consistency with A Unified Alignment Function

Evaluating the factual consistency of generated text — whether a summary is faithful to its source, whether a QA answer is supported by the evidence — has long lacked a unified, strong automatic metric. Previous NLI-based, QA-based, and embedding-based metrics each cover only one class of tasks and generalize poorly across tasks.

Yuheng Zha,Yichi Yang,Ruichen Li,Zhiting Hu
UC San Diegoevaluationfactual-consistencyalignmentDOIDBLP
5
Skim · Long · ACL 2023

Interpretable Math Word Problem Solution Generation via Step-by-step Planning

When LLMs solve math word problems (MWPs), they either give answers directly without explanation, or use CoT whose intermediate steps are uncontrollable and prone to skipping. The goal is interpretable and controllable step-by-step solving.

Mengxue Zhang,Zichao Wang,Zhichao Yang,Weiqi Feng,Andrew S. Lan
UMass Amherstmath-reasoningstep-by-stepplanningDOIDBLP
5
Skim · Findings · ACL 2023

HyperPELT: Unified Parameter-Efficient Language Model Tuning for Both Language and Vision-and-Language Tasks

Zhengkun Zhang,Wenya Guo,Xiaojun Meng,Yasheng Wang,Yadao Wang,Xin Jiang,Qun Liu,Zhenglu Yang
peftvision-languageunified-tuningDOIDBLP
5
Skim · Long · ACL 2023

Self-Edit: Fault-Aware Code Editor for Code Generation

Kechi Zhang,Zhuo Li,Jia Li,Ge Li,Zhi Jin
code-generationself-editingfault-awareDOIDBLP
5
Skim · Findings · ACL 2023

Revisit Few-shot Intent Classification with PLMs: Direct Fine-tuning vs. Continual Pre-training

Haode Zhang,Haowen Liang,Li-Ming Zhan,Xiao-Ming Wu,Albert Y. S. Lam
continual-pretrainingfine-tuningfew-shotDOIDBLP
5
Skim · Short · ACL 2023

ReAugKD: Retrieval-Augmented Knowledge Distillation For Pre-trained Language Models

Jianyi Zhang,Aashiq Muhamed,Aditya Anantharaman,Guoyin Wang,Changyou Chen,Kai Zhong ... 1 author omitted ... ,Yi Xu,Belinda Zeng,Trishul Chilimbi,Yiran Chen
knowledge-distillationretrieval-augmentationplmDOIDBLP
5
Skim · SRW · ACL 2023

LECO: Improving Early Exiting via Learned Exits and Comparison-based Exiting Mechanism

Jingfan Zhang,Ming Tan,Pengyu Dai,Wei Zhu
early-exitinferenceefficiencyDOIDBLP
5
Skim · Findings · ACL 2023

Learned Adapters Are Better Than Manually Designed Adapters

Adapter structure (bottleneck dimension, insertion position, activation function) is usually hand-designed or chosen by experience, and the optimal structure varies widely across tasks. This paper explores using neural architecture search (NAS) to learn adapter structures automatically, replacing manual design.

Yuming Zhang,Peng Wang,Ming Tan,Wei Zhu
adapterpeftarchitecture-searchDOIDBLP
5
Skim · Findings · ACL 2023

FedPETuning: When Federated Learning Meets the Parameter-Efficient Tuning Methods of Pre-trained Language Models

When applying parameter-efficient fine-tuning (PEFT) to pre-trained language models under federated learning (FL), there is no systematic comparison of PEFT methods (LoRA, prefix tuning, adapters, etc.) in the FL setting, nor is it clear how FL's communication-efficiency and privacy constraints affect the choice of PEFT method.

Zhuo Zhang,Yuanhang Yang,Yong Dai,Qifan Wang,Yue Yu,Lizhen Qu,Zenglin Xu
federated-learningpeftfine-tuningDOIDBLP
5
Skim · Long · ACL 2023

Plug-and-Play Knowledge Injection for Pre-trained Language Models

Updating knowledge in pre-trained language models is hard: injecting new knowledge typically requires continued pre-training or fine-tuning, which is costly and may cause forgetting of old knowledge. This paper proposes a plug-and-play knowledge injection method that injects external knowledge without modifying model parameters.

Zhengyan Zhang,Zhiyuan Zeng,Yankai Lin,Huadong Wang,Deming Ye,Chaojun Xiao ... 1 author omitted ... ,Zhiyuan Liu,Peng Li,Maosong Sun,Jie Zhou
Tsinghua Universityknowledge-injectionpretrained-modeladaptationDOIDBLP
5
Skim · Long · ACL 2023

CHBias: Bias Evaluation and Mitigation of Chinese Conversational Language Models

Bias evaluation for Chinese conversational language models lacks a systematic benchmark and mitigation methods. Existing bias research mainly targets English models; the bias types, manifestations, and evaluation methods for the Chinese setting all need to be built specifically.

Jiaxu Zhao,Meng Fang,Zijing Shi,Yitong Li,Ling Chen,Mykola Pechenizkiy
biasevaluationconversational-lmDOIDBLP
5
Skim · Long · ACL 2023

Verify-and-Edit: A Knowledge-Enhanced Chain-of-Thought Framework

Although chain-of-thought (CoT) reasoning improves LLM accuracy on complex tasks, factual errors in the reasoning chain lead to wrong final answers. There was previously no systematic method to verify and correct factual errors in CoT.

Ruochen Zhao,Xingxuan Li,Shafiq Joty,Chengwei Qin,Lidong Bing
Nanyang Technological UniversityAlibaba DAMO Academychain-of-thoughtverificationeditingDOIDBLP
5
Skim · Findings · ACL 2023

Define, Evaluate, and Improve Task-Oriented Cognitive Capabilities for Instruction Generation Models

Instruction generation models need cognitive capabilities (e.g., spatial reasoning, state tracking) in task-oriented settings, but there is no systematic definition, evaluation framework, or improvement method for measuring and enhancing these capabilities.

Lingjun Zhao,Khanh Nguyen,Hal Daumé III
University of Marylandinstruction-generationevaluationcognitive-capabilityDOIDBLP
5
Skim · Findings · ACL 2023

Click: Controllable Text Generation with Sequence Likelihood Contrastive Learning

Existing controllable text generation methods (e.g., FUDGE or PPLM) mostly intervene at the token level, which easily hurts global coherence and degrades fluency.

Chujie Zheng,Pei Ke,Zheng Zhang,Minlie Huang
Tsinghua Universitycontrollable-generationcontrastivesequence-likelihoodDOIDBLP
5
Skim · Findings · ACL 2023

AugESC: Dialogue Augmentation with Large Language Models for Emotional Support Conversation

The emotional support conversation (ESC) task severely lacks high-quality multi-turn training data; human annotation is too costly to scale.

Chujie Zheng,Sahand Sabour,Jiaxin Wen,Zheng Zhang,Minlie Huang
Tsinghua Universitydata-augmentationdialoguellm-synthesisDOIDBLP
5
Skim · Findings · ACL 2023

Disentangling Reasoning Capabilities from Language Models with Compositional Reasoning Transformers

Standard language models entangle factual knowledge and the reasoning process in the same parameters, making them prone to hallucination when facing unseen entities or complex compositional reasoning.

Wanjun Zhong,Tingting Ma,Jiahai Wang,Jian Yin,Tiejun Zhao,Chin-Yew Lin,Nan Duan
Sun Yat-sen UniversityMSRAreasoningmodulartransformerDOIDBLP
5
Skim · Findings · ACL 2023

Commonsense Knowledge Transfer for Pre-trained Language Models

Small pre-trained language models lack commonsense reasoning ability, while scaling up is too costly and unsuitable for edge deployment.

Wangchunshu Zhou,Ronan Le Bras,Yejin Choi
University of WashingtonAllen Institute for AI (AI2)commonsenseknowledge-transferpretrainDOIDBLP
5
Skim · Long · ACL 2023

Bridging the Gap between Decision and Logits in Decision-based Knowledge Distillation for Pre-trained Language Models

Logits-based knowledge distillation works well but requires storing and transmitting large probability distributions; decision-based (hard-label) distillation is cheap but suffers a serious performance gap.

Qinhong Zhou,Zonghan Yang,Peng Li,Yang Liu
Tsinghua UniversitydistillationpretrainDOIDBLP
5
Skim · Short · ACL 2023

Revisiting Automated Prompting: Are We Actually Doing Better?

Automated prompt search methods often claim to substantially outperform manual prompts, but these conclusions frequently rest on unfair compute-budget comparisons and weak baseline models.

Yulin Zhou,Yiren Zhao,Ilia Shumailov,Robert Mullins,Yarin Gal
University of CambridgeUniversity of OxfordpromptevaluationDOIDBLP
5
Skim · Long · ACL 2023

Weaker Than You Think: A Critical Look at Weakly Supervised Learning

Weakly supervised learning (WSL) methods are often believed to outperform standard supervised learning on low-resource NLP tasks, but this conclusion often holds only because the supervised baselines used for comparison were not properly hyperparameter-tuned.

Dawei Zhu,Xiaoyu Shen,Marius Mosbach,Andreas Stephan,Dietrich Klakow
Saarland Universityweak-supervisionanalysisDOIDBLP
5
Skim · ACL 2023

Mixed Orthographic/Phonemic Language Modeling: Beyond Orthographically Restricted Transformers (BORT)

The core problem: mainstream language models are built almost entirely on orthographic text, which is insufficient for speech, phonology, and clinical language analysis, because spelling does not reliably preserve pronunciation. LLM pre-training data is naturally biased toward writing systems, with weak support for phonemic/phonetic representations such as IPA, so models struggle to exploit semantic and pronunciation-level information jointly.

Robert Gale,Alexandra Salem,Gerasimos Fergadiotis,Steven Bedrick
language-modelingphonemetokenizationDOIDBLP
5
Skim · ACL 2023

One does not fit all! On the Complementarity of Vision Encoders for Vision and Language Tasks

This paper argues that in V&L tasks no single vision encoder is optimal for every task: different vision encoders learn complementary representations, yet in practice they are treated as interchangeable parts. Much prior VLM work fixes one visual backbone and tunes the text side or alignment head, assuming differences among visual representations are a matter of strength rather than of capability type.

Gregor Geigle,Chen Liu,Jonas Pfeiffer,Iryna Gurevych
vision-encodervlmrepresentationDOIDBLP
4
ACL 2023

Grammatical information in BERT sentence embeddings as two-dimensional arrays

The authors ask whether rule-like grammatical information such as subject-verb agreement is truly readable from the sentence embeddings of BERT-like models, and why common one-dimensional pooled embeddings often fail to reveal it. Past probing typically assumed 1D vectors and read out features with linear classifiers, possibly conflating "the information exists but cannot be read out" with "the information is simply not in the representation".

Vivi Nastase,Paola Merlo
bertsentence-embeddingsyntaxDOIarXivDBLP
5
Skim · ACL 2023

Probing Out-of-Distribution Robustness of Language Models with Parameter-Efficient Transfer Learning

Hyunsoo Cho,Choonghyun Park,Junyeob Kim,Hyuhng Joon Kim,Kang Min Yoo,Sang-goo Lee
ood-robustnesspefttransfer-learningDOIDBLP
5
Skim · ACL 2023

Are Language Models Sensitive to Semantic Attraction? A Study on Surprisal

Yan Cong,Emmanuele Chersoni,Yu-Yin Hsu,Alessandro Lenci
surprisalsemanticsprobingDOIDBLP
5
Skim · ACL 2023

Empirical Sufficiency Lower Bounds for Language Modeling with Locally-Bootstrapped Semantic Structures

Jakob Prange,Emmanuele Chersoni
language-modelingsample-efficiencysemanticsDOIDBLP
5
Skim · ACL 2023

Analyzing Syntactic Generalization Capacity of Pre-trained Language Models on Japanese Honorific Conversion

Ryo Sekizawa,Hitomi Yanaka
syntactic-generalizationpretrained-lmjapaneseDOIDBLP
5
Skim · ACL 2023

How Are Idioms Processed Inside Transformer Language Models?

Ye Tian,Isobel James,Hye Son
transformeridiomsrepresentationDOIDBLP
5
Skim · ACL 2023

Syntax and Semantics Meet in the "Middle": Probing the Syntax-Semantics Interface of LMs Through Agentivity

Lindia Tjuatja,Emmy Liu,Lori S. Levin,Graham Neubig
syntax-semanticsprobingagentivityDOIDBLP
5
Skim · ACL 2023

Language models are not naysayers: an analysis of language models on negation benchmarks

Thinh Hung Truong,Timothy Baldwin,Karin Verspoor,Trevor Cohn
negationbenchmarkevaluationDOIDBLP