
SynthDetoxM: Modern LLMs are Few-Shot Parallel Detoxification Data Annotators

February 10, 2025
Authors: Daniil Moskovskiy, Nikita Sushko, Sergey Pletenev, Elena Tutubalina, Alexander Panchenko
cs.AI

Abstract

Existing approaches to multilingual text detoxification are hampered by the scarcity of parallel multilingual datasets. In this work, we introduce a pipeline for the generation of multilingual parallel detoxification data. We also introduce SynthDetoxM, a manually collected and synthetically generated multilingual parallel text detoxification dataset comprising 16,000 high-quality detoxification sentence pairs across German, French, Spanish, and Russian. The data was sourced from different toxicity evaluation datasets and then rewritten with nine modern open-source LLMs in a few-shot setting. Our experiments demonstrate that models trained on the produced synthetic datasets outperform those trained on the human-annotated MultiParaDetox dataset, even in a data-limited setting. Models trained on SynthDetoxM outperform all evaluated LLMs in a few-shot setting. We release our dataset and code to support further research in multilingual text detoxification.
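The few-shot rewriting step described above can be sketched as a prompt-construction routine: example toxic/detoxified pairs are prepended to the sentence to be rewritten, and the resulting prompt is sent to an LLM. The example pairs and prompt wording below are illustrative assumptions, not the authors' actual prompts or pipeline code.

```python
# Illustrative sketch of few-shot prompt construction for parallel
# detoxification data generation. The demonstration pairs and the
# instruction text are hypothetical, chosen only to show the pattern.

FEW_SHOT_PAIRS = [
    ("This is absolute garbage, you idiot.",
     "I strongly disagree with this."),
    ("Shut up, nobody cares about your opinion.",
     "I do not find this opinion convincing."),
]

def build_detox_prompt(toxic_sentence: str) -> str:
    """Assemble a few-shot prompt asking an LLM to rewrite a toxic
    sentence into a neutral one while preserving its meaning."""
    lines = [
        "Rewrite each toxic sentence so that it is polite "
        "but keeps the original meaning.\n"
    ]
    for toxic, detoxified in FEW_SHOT_PAIRS:
        lines.append(f"Toxic: {toxic}")
        lines.append(f"Detoxified: {detoxified}\n")
    lines.append(f"Toxic: {toxic_sentence}")
    lines.append("Detoxified:")  # the model completes this line
    return "\n".join(lines)
```

In practice the same prompt template would be instantiated per language and per model, with the model's completion kept only if it passes quality filtering (e.g. toxicity and meaning-preservation checks), as the pipeline's use of multiple LLMs suggests.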
