mStyleDistance: Multilingual Style Embeddings and their Evaluation
February 21, 2025
Authors: Justin Qiu, Jiacheng Zhu, Ajay Patel, Marianna Apidianaki, Chris Callison-Burch
cs.AI
Abstract
Style embeddings are useful for stylistic analysis and style transfer;
however, only English style embeddings have been made available. We introduce
Multilingual StyleDistance (mStyleDistance), a multilingual style embedding
model trained using synthetic data and contrastive learning. We train the model
on data from nine languages and create a multilingual STEL-or-Content benchmark
(Wegmann et al., 2022) that serves to assess the embeddings' quality. We also
employ our embeddings in an authorship verification task involving different
languages. Our results show that mStyleDistance embeddings outperform existing
models on these multilingual style benchmarks and generalize well to unseen
features and languages. We make our model publicly available at
https://huggingface.co/StyleDistance/mstyledistance.
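
As a rough illustration of what a STEL-or-Content style check asks of an embedding, the sketch below tests whether an anchor sentence's vector is closer to a sentence that shares its style than to one that shares its content. This is a simplified reading of the idea, not the benchmark's actual protocol, which the abstract does not spell out. The function names and the toy vectors are hypothetical and do not come from the mStyleDistance model.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def prefers_style_over_content(anchor, same_style, same_content):
    """Return True if the anchor embedding is closer to the sentence that
    shares its style than to the sentence that shares its content."""
    return cosine_similarity(anchor, same_style) > cosine_similarity(
        anchor, same_content
    )


# Toy 3-dimensional vectors standing in for real style embeddings
# (hypothetical values chosen only to make the example concrete).
anchor = [0.9, 0.1, 0.2]
same_style = [0.8, 0.2, 0.1]    # different content, same style as the anchor
same_content = [0.1, 0.9, 0.3]  # same content, different style

result = prefers_style_over_content(anchor, same_style, same_content)
print(result)
```

An embedding that captures style rather than topic should pass this kind of comparison more often than a general-purpose sentence embedding would.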