- [2026.04] We launched the TeleAntiFraud open-source community for telecom anti-fraud research: https://teleantifraud.github.io/
- [2026.04] SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning has been accepted by ACL 2026!
- [2026.04] A sanitized public release with dataset card, audio archives, and preview samples is now available on Hugging Face and ModelScope.
- [2025] TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection has been accepted by ACM MM 2025!
- [Dataset] TeleAntiFraud-28k is available on ModelScope.
TeleAntiFraud-28k is the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. It integrates audio signals with reasoning-oriented textual analysis, providing high-quality multimodal training data for telecom fraud detection research.
- Total Samples: 28,511 rigorously processed speech-text pairs
- Total Audio Duration: 307 hours
- Unique Feature: Detailed annotations for fraud reasoning
- Task Categories: Scenario classification, fraud detection, fraud type classification
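The three task categories above can be pictured as fields on a single annotated sample. A minimal sketch of what one record might look like (all field names here are illustrative, not the actual dataset schema):

```python
# Illustrative record layout for one audio-text sample; the real
# TeleAntiFraud-28k schema may use different field names.
sample = {
    "audio_path": "audio/call_000123.wav",    # relative path into the audio archive
    "transcript": "Hello, this is your bank calling about ...",
    "reasoning": "The caller asks for an OTP code, which banks never do ...",
    "labels": {
        "scenario": "bank_customer_service",  # scenario classification
        "is_fraud": True,                     # binary fraud detection
        "fraud_type": "impersonation",        # fraud type classification
    },
}

# The three benchmark tasks simply read different label fields:
def task_targets(record):
    labels = record["labels"]
    return labels["scenario"], labels["is_fraud"], labels["fraud_type"]

print(task_targets(sample))
```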
We provide a sanitized public release for direct download and benchmarking:
- Hugging Face dataset page
- ModelScope dataset page
- packaged archives: `binary_classification.zip`, `sft.zip`, and `audio.zip`
- preview audio samples directly playable on the dataset page
- normalized relative audio paths without machine-specific absolute paths
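The path normalization mentioned above (replacing machine-specific absolute paths with relative ones) can be sketched as follows; `normalize_audio_path` is an illustrative helper, not the actual release script:

```python
from pathlib import Path

def normalize_audio_path(raw: str, anchor: str = "audio") -> str:
    """Strip any machine-specific prefix, keeping the path from the
    `anchor` directory onward. Falls back to the bare file name."""
    parts = Path(raw).parts
    if anchor in parts:
        idx = parts.index(anchor)
        return str(Path(*parts[idx:]))
    return Path(raw).name

print(normalize_audio_path("/home/alice/data/audio/call_000123.wav"))
# -> audio/call_000123.wav
```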
- Using ASR-transcribed call recordings (with anonymized original audio)
- Ensuring real-world consistency through TTS model regeneration
- Strict adherence to privacy protection standards
- LLM-based self-instruction sampling on authentic ASR outputs
- Expanding scenario coverage to improve model generalization
- Enriching the diversity of conversational contexts
- Simulating emerging fraud tactics
- Generating dialogs through predefined communication scenarios and fraud typologies
- Enhancing dataset adaptability to new fraud techniques
We have constructed TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from TeleAntiFraud-28k, to facilitate systematic testing of model performance and reasoning capabilities on telecom fraud detection tasks.
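The proportional sampling behind such a benchmark can be sketched as a stratified draw that preserves each category's share of the full dataset. This is a minimal illustration under assumed field names, not the actual construction script:

```python
import random
from collections import defaultdict

def proportional_sample(records, key, k, seed=0):
    """Draw ~k records, keeping each stratum's share of the population."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[r[key]].append(r)
    total = len(records)
    picked = []
    for group in strata.values():
        # size each stratum proportionally, keeping at least one sample
        n = max(1, round(k * len(group) / total))
        picked.extend(rng.sample(group, min(n, len(group))))
    return picked

# toy population: 80% normal calls, 20% impersonation fraud
population = [{"fraud_type": "none"}] * 80 + [{"fraud_type": "impersonation"}] * 20
bench = proportional_sample(population, "fraud_type", k=10)
print(len(bench))  # 10, with the 80/20 split preserved (8 vs 2)
```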
We provide sanitized evaluation scripts in `evaluation/`, including:
- classification metrics for scenario classification, fraud detection, and fraud type classification
- preparation of model outputs for reasoning assessment
- an OpenAI-compatible LM-as-judge runner
- the probability-based reasoning-quality judging prompt
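The classification metrics for the three tasks boil down to standard accuracy and macro-F1. A self-contained sketch without external dependencies (the released scripts in `evaluation/` may compute more than this):

```python
def accuracy(gold, pred):
    """Fraction of predictions matching the gold labels."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)

def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 scores."""
    labels = set(gold) | set(pred)
    f1s = []
    for c in labels:
        tp = sum(g == c and p == c for g, p in zip(gold, pred))
        fp = sum(g != c and p == c for g, p in zip(gold, pred))
        fn = sum(g == c and p != c for g, p in zip(gold, pred))
        prec = tp / (tp + fp) if tp + fp else 0.0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1s.append(2 * prec * rec / (prec + rec) if prec + rec else 0.0)
    return sum(f1s) / len(f1s)

gold = ["fraud", "normal", "fraud", "normal"]
pred = ["fraud", "normal", "normal", "normal"]
print(accuracy(gold, pred), round(macro_f1(gold, pred), 3))
# -> 0.75 0.733
```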
The ACL 2026 paper repository SAFE-QAQ, which reports results on TeleAntiFraud, is available at Control-derek/SAFE-QAQ.
We contribute a production-optimized supervised fine-tuning (SFT) model based on Qwen2-Audio, trained on the TeleAntiFraud training set.
- Hugging Face model: JimmyMa99/AntiFraud-SFT
- ModelScope model: JimmyMa99/AntiFraud-SFT
- base model: `Qwen/Qwen2-Audio-7B-Instruct`
- model type: supervised fine-tuning for audio-text telecom fraud detection
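Inputs to the SFT model follow the standard Qwen2-Audio chat format: one user turn that pairs an audio part with a text instruction. A sketch of constructing such a message, assuming the conversation format documented for `Qwen/Qwen2-Audio-7B-Instruct` (loading the checkpoint itself requires `transformers` and a GPU, and is omitted here; the prompt text is illustrative):

```python
# Build a Qwen2-Audio style conversation: one user turn pairing the
# call recording with the analysis instruction.
def build_conversation(audio_url: str, instruction: str):
    return [
        {
            "role": "user",
            "content": [
                {"type": "audio", "audio_url": audio_url},
                {"type": "text", "text": instruction},
            ],
        }
    ]

conv = build_conversation(
    "audio/call_000123.wav",
    "Analyze this call step by step and decide whether it is fraudulent.",
)
print(conv[0]["role"], len(conv[0]["content"]))
```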
Explore our dataset examples to better understand the telecom fraud detection capabilities:
- Case 1: Normal Conversation Analysis - Detailed analysis of a legitimate phone conversation
- Case 2: Fraud Conversation Analysis - Step-by-step reasoning for detecting a fraudulent call
- Evaluation Sample - Representative sample from our evaluation benchmark
- Model Output: Normal Conversation - Our model's reasoning process on a legitimate call
- Model Output: Fraud Detection - Model's analysis and detection of a fraudulent call
To collect fraudulent conversation data:
- Insert your API key in `multi-agents-tools/AntiFraudMatrix/main.py` (uses a SiliconFlow API key)
- Run the following command to generate fraudulent dialog text:

```bash
python multi-agents-tools/AntiFraudMatrix/main.py
```

- Results will be saved in the `result` directory
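After a run, the generated dialogs can be loaded back for inspection. A minimal sketch that assumes one JSON file per dialog in `result/`; the actual file layout and field names produced by the generator may differ:

```python
import json
from pathlib import Path

result_dir = Path("result")
result_dir.mkdir(exist_ok=True)

# Write one toy dialog so the sketch runs end to end; a real run of
# main.py would populate this directory instead. Fields are illustrative.
toy = {"scenario": "bank_impersonation",
       "turns": [{"speaker": "caller", "text": "Your account is at risk..."}]}
(result_dir / "dialog_0001.json").write_text(json.dumps(toy, ensure_ascii=False))

dialogs = [json.loads(p.read_text()) for p in sorted(result_dir.glob("*.json"))]
print(len(dialogs), dialogs[0]["scenario"])
```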
For normal conversation data:
- Use `multi-agents-tools/AntiFraudMatrix-normal/main.py` following the same process
To synthesize speech from the collected text:
- Install the necessary dependencies
- Run the API server:

```bash
fastapi dev ChatTTS/examples/api/main_new_new.py --host 0.0.0.0 --port 8006
```

- Use any of the scripts in `ChatTTS/examples/api/normal_run*.sh` or `ChatTTS/examples/api/run*.sh`. Modify the port in these scripts if needed, then run:

```bash
bash ChatTTS/examples/api/run*.sh
```
- TeleAntiFraud public dataset release (Hugging Face)
- TeleAntiFraud public dataset release (ModelScope)
- AntiFraud-SFT model (Hugging Face)
- AntiFraud-SFT model (ModelScope)
- SAFE-QAQ code repository
- TeleAntiFraud-28k dataset
- TeleAntiFraud-Bench evaluation benchmark
- Evaluation and LM-as-judge utilities
- Data processing framework (supporting community-driven dataset expansion)
- TeleAntiFraud-Qwen2-Audio SFT model
- Establishing a foundational framework for multimodal anti-fraud research
- Addressing critical challenges in data privacy and scenario diversity
- Providing high-quality training data for telecom fraud detection
- Open-sourcing data processing tools to enable community collaboration
We would like to express our sincere gratitude to all the organizations and individuals who have provided invaluable support throughout this project:
- China Mobile Internet Company (中移互联网) - For their industry expertise and technical guidance
- Intern Community (书生社区) - For their open-source ecosystem support and collaboration
- ModelScope Community (魔搭社区) - For their platform support and community resources
- SmartFlowAI Community (机智流社区) - For their technical contributions and collaborative efforts
- Control-derek - For his technical expertise and valuable contributions
- vansin - For his dedicated support and assistance
- Jintao-Huang - For his valuable suggestions and contributions
```bibtex
@inproceedings{ma2025teleantifraud,
  title={TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection},
  author={Ma, Zhiming and Wang, Peidong and Huang, Minhua and Wang, Jinpeng and Wu, Kai and Lv, Xiangzhao and Pang, Yachun and Yang, Yin and Tang, Wenjie and Kang, Yuchen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={5853--5862},
  year={2025}
}
```

```bibtex
@article{wang2026safe,
  title={SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning},
  author={Wang, Peidong and Ma, Zhiming and Dai, Xin and Liu, Yongkang and Feng, Shi and Yang, Xiaocui and Hu, Wenxing and Wang, Zhihao and Pan, Mingjun and Yuan, Li and others},
  journal={arXiv preprint arXiv:2601.01392},
  year={2026}
}
```



