SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators
February 10, 2025
Authors: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko
cs.AI
Abstract
Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
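To make the few-shot rewriting step concrete, here is a minimal sketch of how one might prompt an open-source LLM to detoxify a sentence. The model choice, prompt wording, and example pairs below are illustrative assumptions, not the authors' actual pipeline or data.

```python
# Minimal sketch of few-shot detoxification prompting with an
# open-source LLM, assuming the Hugging Face `transformers` pipeline.
# Model, prompt, and example pairs are hypothetical placeholders.
from transformers import pipeline

# Any instruction-tuned open-source LLM could stand in here.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

# Few-shot demonstrations: (toxic sentence, detoxified rewrite) pairs.
few_shot = [
    ("This movie is absolute garbage and the director is an idiot.",
     "This movie is poorly made and the direction is weak."),
    ("Shut up, nobody cares about your stupid opinion.",
     "I disagree; I don't find that opinion convincing."),
]

def build_prompt(toxic_sentence: str) -> str:
    """Assemble a few-shot prompt asking the model to rewrite the
    sentence non-toxically while preserving its meaning."""
    lines = ["Rewrite each toxic sentence as a polite sentence "
             "with the same meaning.\n"]
    for toxic, neutral in few_shot:
        lines.append(f"Toxic: {toxic}\nPolite: {neutral}\n")
    lines.append(f"Toxic: {toxic_sentence}\nPolite:")
    return "\n".join(lines)

# Greedy decoding; return only the newly generated continuation.
out = generator(build_prompt("Your code is a useless pile of trash."),
                max_new_tokens=40, do_sample=False, return_full_text=False)
print(out[0]["generated_text"].strip())
```

In the dataset construction described above, each source sentence would be rewritten this way by nine different LLMs, with the resulting candidates filtered for quality to form the parallel pairs.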