
TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection

March 31, 2025
Authors: Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
cs.AI

Abstract

Telecom fraud detection faces significant challenges due to the lack of high-quality multimodal training data that integrates audio signals with reasoning-oriented textual analysis. To address this gap, we present TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. Our dataset is constructed through three strategies: (1) privacy-preserving generation of text-truth samples from call recordings transcribed by automatic speech recognition (ASR), with the original audio anonymized and real-world consistency ensured by regenerating the speech with a text-to-speech (TTS) model; (2) semantic enhancement via large language model (LLM)-based self-instruction sampling on authentic ASR outputs to expand scenario coverage; and (3) multi-agent adversarial synthesis that simulates emerging fraud tactics through predefined communication scenarios and fraud typologies. The resulting dataset contains 28,511 rigorously processed speech-text pairs, complete with detailed annotations for fraud reasoning. The dataset is divided into three tasks: scenario classification, fraud detection, and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from the dataset, to facilitate systematic testing of model performance on telecom fraud detection tasks. We also contribute a production-optimized supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, and we open-source the data processing framework to enable community-driven dataset expansion. This work establishes a foundational framework for multimodal anti-fraud research while addressing critical challenges in data privacy and scenario diversity. The project will be released at https://github.com/JimmyMa99/TeleAntiFraud.
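
The abstract says TeleAntiFraud-Bench comprises "proportionally sampled instances from the dataset" but does not describe the procedure. As a minimal sketch of one common way to do this (stratified sampling at a fixed fraction per label), the Python below may help make the idea concrete. The function name `proportional_sample`, the `fraud_type` field, the label names, and the 10% fraction are all assumptions made for illustration; they are not taken from the paper or its repository.

```python
import random
from collections import defaultdict


def proportional_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each group defined by `key`.

    Hypothetical helper for illustration; not the authors' implementation.
    """
    rng = random.Random(seed)

    # Group the records by their label so each label is sampled separately.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for label in sorted(groups):
        items = groups[label]
        # Keep at least one instance per label so rare classes stay represented.
        k = max(1, round(len(items) * fraction))
        sample.extend(rng.sample(items, k))
    return sample


# Toy records with made-up labels (assumption: one label field per record).
records = (
    [{"id": i, "fraud_type": "impersonation"} for i in range(60)]
    + [{"id": i, "fraud_type": "investment"} for i in range(60, 90)]
    + [{"id": i, "fraud_type": "none"} for i in range(90, 100)]
)

bench = proportional_sample(records, key="fraud_type", fraction=0.1)
```

Sampling within each label rather than over the whole pool keeps the benchmark's class mix close to that of the full dataset, which is the property the phrase "proportionally sampled" suggests.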

