TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

Abstract

Telecom fraud detection is hampered by the lack of high-quality multimodal training data that combines audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) privacy-preserved text-truth sample generation using automatic speech recognition (ASR)-transcribed call recordings (with anonymized original audio), with real-world consistency ensured through text-to-speech (TTS) model regeneration; (2) semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; (3) multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The resulting dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, and we open-source the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
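The abstract says TeleAntiFraud-Bench is built from proportionally sampled instances but does not describe the sampling procedure. As a rough illustration only, the sketch below shows one common way to sample proportionally: group the records by a label and draw the same fraction from each group. The `task` field, the label strings, the toy records, and the `proportional_sample` helper are all hypothetical and are not taken from the project.

```python
import random
from collections import defaultdict


def proportional_sample(records, key, fraction, seed=0):
    """Draw about `fraction` of the records from each group defined by `key`.

    Hypothetical helper for illustration; not the authors' procedure.
    """
    rng = random.Random(seed)  # fixed seed so the draw is reproducible

    # Bucket the records by the value of the grouping field.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for label in sorted(groups):  # sorted for a deterministic group order
        items = groups[label]
        # Keep at least one record per group so no group disappears.
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    return sample


# Toy records; the field name and label values are assumptions.
records = (
    [{"id": i, "task": "scenario_classification"} for i in range(60)]
    + [{"id": i, "task": "fraud_detection"} for i in range(60, 90)]
    + [{"id": i, "task": "fraud_type_classification"} for i in range(90, 100)]
)

bench = proportional_sample(records, key="task", fraction=0.1)
```

With these toy inputs each group contributes a tenth of its records, so the group proportions in `bench` match those in `records`. A real benchmark might stratify on a different field, such as scenario or fraud type.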
