TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection
March 31, 2025
Authors: Zhiming Ma, Peidong Wang, Minhua Huang, Jingpeng Wang, Kai Wu, Xiangzhao Lv, Yachun Pang, Yin Yang, Wenjie Tang, Yuchen Kang
cs.AI
Abstract
The detection of telecom fraud faces significant challenges due to the lack
of high-quality multimodal training data that integrates audio signals with
reasoning-oriented textual analysis. To address this gap, we present
TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset
specifically designed for automated telecom fraud analysis. Our dataset is
constructed through three strategies: (1) Privacy-preserved text-truth sample
generation using automatic speech recognition (ASR)-transcribed call
recordings (with anonymized original audio), ensuring real-world consistency
through text-to-speech (TTS) model regeneration; (2) Semantic enhancement via
large language model (LLM)-based self-instruction sampling on authentic ASR
outputs to expand scenario coverage; (3) Multi-agent adversarial synthesis that
simulates emerging fraud tactics through predefined communication scenarios and
fraud typologies. The generated dataset contains 28,511 rigorously processed
speech-text pairs, complete with detailed annotations for fraud reasoning. The
dataset is divided into three tasks: scenario classification, fraud detection,
and fraud type classification. Furthermore, we construct TeleAntiFraud-Bench, a
standardized evaluation benchmark comprising proportionally sampled instances
from the dataset, to facilitate systematic testing of model performance on
telecom fraud detection tasks. We also contribute a production-optimized
supervised fine-tuning (SFT) model trained on hybrid real/synthetic data, while
open-sourcing the data processing framework to enable community-driven dataset
expansion. This work establishes a foundational framework for multimodal
anti-fraud research while addressing critical challenges in data privacy and
scenario diversity. The project will be released at
https://github.com/JimmyMa99/TeleAntiFraud.
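
The abstract describes TeleAntiFraud-Bench as proportionally sampled from the dataset. One way such sampling could work is sketched below in Python. This is an illustration under stated assumptions, not the authors' implementation: the helper name `proportional_sample`, the `fraud_type` field, the 10% fraction, and the toy records are all hypothetical.

```python
import random
from collections import defaultdict


def proportional_sample(records, key, fraction, seed=0):
    """Draw roughly `fraction` of the records from each label group.

    Hypothetical helper, not the authors' implementation: it only
    illustrates sampling that preserves label proportions.
    """
    rng = random.Random(seed)

    # Group records by their label so each class is sampled separately.
    groups = defaultdict(list)
    for rec in records:
        groups[rec[key]].append(rec)

    sample = []
    for label in sorted(groups):
        members = groups[label]
        # Keep at least one instance per label so rare classes survive.
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


# Toy records; labels and counts are invented for illustration.
toy = (
    [{"id": i, "fraud_type": "none"} for i in range(80)]
    + [{"id": 80 + i, "fraud_type": "phishing"} for i in range(20)]
)
bench = proportional_sample(toy, key="fraud_type", fraction=0.1)
```

Sampling within each label group, rather than from the pooled records, keeps the class balance of the subset close to that of the full dataset, which is the property a proportionally sampled benchmark would need.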