
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

February 10, 2025
Authors: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko
cs.AI

Abstract

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
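The few-shot rewriting step described above can be sketched as a prompt-construction routine: example toxic/detoxified pairs are prepended to the sentence to be rewritten, and the resulting prompt is sent to an LLM. The example pairs and prompt wording below are illustrative assumptions, not the authors' actual prompts or pipeline code.

```python
# Illustrative sketch of few-shot prompt construction for parallel
# detoxification data generation. The demonstration pairs and the
# instruction text are hypothetical, chosen only to show the pattern.

FEW_SHOT_PAIRS = [
    ("This is absolute garbage, you idiot.",
     "I strongly disagree with this."),
    ("Shut up, nobody cares about your opinion.",
     "I do not find this opinion convincing."),
]

def build_detox_prompt(toxic_sentence: str) -> str:
    """Assemble a few-shot prompt asking an LLM to rewrite a toxic
    sentence into a neutral one while preserving its meaning."""
    lines = [
        "Rewrite each toxic sentence so that it is polite "
        "but keeps the original meaning.\n"
    ]
    for toxic, detoxified in FEW_SHOT_PAIRS:
        lines.append(f"Toxic: {toxic}")
        lines.append(f"Detoxified: {detoxified}\n")
    lines.append(f"Toxic: {toxic_sentence}")
    lines.append("Detoxified:")  # the model completes this line
    return "\n".join(lines)
```

In practice the same prompt template would be instantiated per language and per model, with the model's completion kept only if it passes quality filtering (e.g. toxicity and meaning-preservation checks), as the pipeline's use of multiple LLMs suggests.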
