
JimmyMa99/TeleAntiFraud

TeleAntiFraud-28k 📞🛡️

TeleAntiFraud-28k is the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis. This dataset integrates audio signals with reasoning-oriented textual analysis, providing high-quality multimodal training data for telecom fraud detection research. 🔍💡

[Figure: main framework]

📊 Dataset Overview

  • Total Samples: 28,511 rigorously processed speech-text pairs 📋
  • Total Audio Duration: 307 hours ⏱️
  • Unique Feature: Detailed annotations for fraud reasoning 🧠
  • Task Categories: Scenario classification, fraud detection, fraud type classification 🎯

Public Dataset Release

We provide a sanitized public release for direct download and benchmarking:

  • Hugging Face dataset page
  • ModelScope dataset page
  • packaged binary_classification.zip, sft.zip, and audio.zip
  • preview audio samples directly playable on the dataset page
  • normalized relative audio paths without machine-specific absolute paths
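Because the release stores relative audio paths, each path has to be joined onto wherever the archives were extracted locally. The following is a minimal sketch of that step, assuming the paths are POSIX-style; the function name `resolve_audio_path` and the example directory and file names are hypothetical and not taken from the release.

```python
from pathlib import Path, PurePosixPath


def resolve_audio_path(dataset_root: str, relative_path: str) -> Path:
    """Join a relative audio path onto the local extraction root.

    Assumption: paths in the release are POSIX-style and relative.
    """
    rel = PurePosixPath(relative_path)
    # Refuse absolute paths and parent-directory escapes so a malformed
    # entry cannot point outside the extracted dataset.
    if rel.is_absolute() or ".." in rel.parts:
        raise ValueError(f"expected a normalized relative path, got {relative_path!r}")
    return Path(dataset_root).joinpath(*rel.parts)


if __name__ == "__main__":
    # Hypothetical root and file name, for illustration only.
    print(resolve_audio_path("/data/teleantifraud", "audio/example.wav"))
```

Rejecting absolute paths up front keeps the same annotation files usable on any machine, which is the point of shipping relative paths.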

🏗️ Dataset Construction Strategies

[Figure: data flow]

1. 🔒 Privacy-preserved Text-Truth Sample Generation

  • Using ASR-transcribed call recordings (with anonymized original audio)
  • Ensuring real-world consistency through TTS model regeneration
  • Strict adherence to privacy protection standards

2. 🚀 Semantic Enhancement

  • LLM-based self-instruction sampling on authentic ASR outputs
  • Expanding scenario coverage to improve model generalization
  • Enriching the diversity of conversational contexts

3. 🤖 Multi-agent Adversarial Synthesis

[Figure: multi-agent synthesis]

  • Simulation of emerging fraud tactics
  • Generation through predefined communication scenarios and fraud typologies
  • Enhancing dataset adaptability to new fraud techniques
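The turn-taking idea behind multi-agent synthesis can be sketched as two agents that alternate utterances, conditioned on a predefined scenario and fraud type. This is only an illustration and not the implementation in multi-agents-tools/: the `Agent` class, `simulate_dialogue`, the stub reply functions, and the scenario and fraud-type strings are all hypothetical, and the stubs stand in for what would be LLM calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# (role, utterance) pairs accumulated over the call.
History = List[Tuple[str, str]]
# A reply function maps (scenario, fraud_type, history) to the next utterance.
ReplyFn = Callable[[str, str, History], str]


@dataclass
class Agent:
    role: str
    reply: ReplyFn


def simulate_dialogue(caller: Agent, callee: Agent, scenario: str,
                      fraud_type: str, max_turns: int = 4) -> History:
    """Alternate caller and callee turns, starting with the caller."""
    history: History = []
    speakers = (caller, callee)
    for turn in range(max_turns):
        agent = speakers[turn % 2]
        history.append((agent.role, agent.reply(scenario, fraud_type, history)))
    return history


# Hypothetical stubs: a real pipeline would call an LLM here instead.
def stub_caller(scenario: str, fraud_type: str, history: History) -> str:
    return f"<caller utterance {len(history) // 2 + 1}: {fraud_type} in a {scenario} call>"


def stub_callee(scenario: str, fraud_type: str, history: History) -> str:
    return f"<callee reply to: {history[-1][1]}>"


if __name__ == "__main__":
    dialogue = simulate_dialogue(
        Agent("caller", stub_caller),
        Agent("callee", stub_callee),
        scenario="customer service",   # hypothetical scenario label
        fraud_type="impersonation",    # hypothetical fraud type label
    )
    for role, text in dialogue:
        print(f"{role}: {text}")
```

Keeping the scenario and fraud type as explicit arguments makes it straightforward to sweep over a grid of combinations when generating new dialogs.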

🎯 TeleAntiFraud-Bench

Telebench

We have constructed TeleAntiFraud-Bench, a standardized evaluation benchmark comprising proportionally sampled instances from TeleAntiFraud-28k, to facilitate systematic testing of model performance and reasoning capabilities on telecom fraud detection tasks. πŸ“βœ…
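Proportional sampling means each category keeps roughly the same share in the benchmark as in the full dataset. A minimal sketch of that idea follows; the grouping key, the sampling fraction, and the `label` field are assumptions made for illustration, since the exact procedure used to build TeleAntiFraud-Bench is not described here.

```python
import random
from collections import defaultdict


def proportional_sample(items, key, fraction, seed=0):
    """Draw about `fraction` of the items from every group defined by `key`.

    `fraction` is expected to lie in (0, 1]. Each non-empty group
    contributes at least one item so that rare categories stay represented.
    """
    rng = random.Random(seed)
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)

    sample = []
    for label in sorted(groups):  # sorted for reproducible ordering
        members = groups[label]
        k = max(1, round(len(members) * fraction))
        sample.extend(rng.sample(members, k))
    return sample


if __name__ == "__main__":
    # Toy data with a hypothetical "label" field.
    data = [{"label": "fraud"}] * 30 + [{"label": "normal"}] * 70
    subset = proportional_sample(data, key=lambda d: d["label"], fraction=0.1)
    print(len(subset))
```

Fixing the seed makes the benchmark split reproducible across runs.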

Evaluation Utilities

We provide sanitized evaluation scripts in evaluation/, including:

  • classification metrics for scenario classification, fraud detection, and fraud type classification
  • preparation of model outputs for reasoning assessment
  • an OpenAI-compatible LM-as-judge runner
  • the probability-based reasoning-quality judging prompt
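For orientation, classification metrics of the kind listed above can be computed as in the sketch below, which reports accuracy and macro-averaged F1 in plain Python. It is not the code in evaluation/; the function name `classification_metrics` and the label strings are hypothetical, and the choice of macro-F1 is an assumption about which metric is of interest.

```python
def classification_metrics(y_true, y_pred):
    """Return accuracy, macro-F1, and per-label precision/recall/F1."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and of equal length")

    pairs = list(zip(y_true, y_pred))
    labels = sorted(set(y_true) | set(y_pred))
    per_label = {}
    for label in labels:
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_label[label] = {"precision": precision, "recall": recall, "f1": f1}

    accuracy = sum(t == p for t, p in pairs) / len(pairs)
    macro_f1 = sum(m["f1"] for m in per_label.values()) / len(per_label)
    return {"accuracy": accuracy, "macro_f1": macro_f1, "per_label": per_label}


if __name__ == "__main__":
    # Hypothetical labels for the fraud detection task.
    truth = ["fraud", "fraud", "normal", "normal"]
    preds = ["fraud", "normal", "normal", "normal"]
    print(classification_metrics(truth, preds))
```

Macro averaging weights every class equally, which matters when fraud types are imbalanced.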

The ACL 2026 paper repository SAFE-QAQ, which reports results on TeleAntiFraud, is available at Control-derek/SAFE-QAQ.

🤖 Model Contribution

We contribute a production-optimized supervised fine-tuning (SFT) model based on Qwen2-Audio, trained on the TeleAntiFraud training set. 🎨⚡

Released Model

📝 Examples

Explore our dataset examples to better understand the telecom fraud detection capabilities: 👀

🛠️ Multi-Agent Data Collection

To collect fraudulent conversation data: 💼

  1. Insert your SiliconFlow API key in multi-agents-tools/AntiFraudMatrix/main.py 🔑
  2. Run the following command to generate fraudulent dialog text:
    python multi-agents-tools/AntiFraudMatrix/main.py
  3. Results will be saved in the result directory 📁

For normal conversation data: 💬

  • Use multi-agents-tools/AntiFraudMatrix-normal/main.py following the same process

🎙️ Voice Synthesis with ChatTTS

To synthesize speech from the collected text: 🔊

  1. Install the necessary dependencies 📦

  2. Run the API server:

    fastapi dev ChatTTS/examples/api/main_new_new.py --host 0.0.0.0 --port 8006
  3. Use any of the scripts in ChatTTS/examples/api/normal_run*.sh or ChatTTS/examples/api/run*.sh 🚀

    Modify the port in the chosen script if needed, then run it, replacing the wildcard with that script's actual name:

    bash ChatTTS/examples/api/run*.sh

🌟 Open-Source Resources

🎯 Key Contributions

  1. Establishing a foundational framework for multimodal anti-fraud research πŸ—οΈ
  2. Addressing critical challenges in data privacy and scenario diversity πŸ”
  3. Providing high-quality training data for telecom fraud detection πŸ“ˆ
  4. Open-sourcing data processing tools to enable community collaboration 🀝

🙏 Acknowledgements

We would like to express our sincere gratitude to all the organizations and individuals who have provided invaluable support throughout this project: ❤️

Their contributions have been instrumental in making this project a success and advancing the field of telecom fraud detection research. 🚀

📄 Citation

@inproceedings{ma2025teleantifraud,
  title={TeleAntiFraud-28k: An Audio-Text Slow-Thinking Dataset for Telecom Fraud Detection},
  author={Ma, Zhiming and Wang, Peidong and Huang, Minhua and Wang, Jinpeng and Wu, Kai and Lv, Xiangzhao and Pang, Yachun and Yang, Yin and Tang, Wenjie and Kang, Yuchen},
  booktitle={Proceedings of the 33rd ACM International Conference on Multimedia},
  pages={5853--5862},
  year={2025}
}

@article{wang2026safe,
  title={SAFE-QAQ: End-to-End Slow-Thinking Audio-Text Fraud Detection via Reinforcement Learning},
  author={Wang, Peidong and Ma, Zhiming and Dai, Xin and Liu, Yongkang and Feng, Shi and Yang, Xiaocui and Hu, Wenxing and Wang, Zhihao and Pan, Mingjun and Yuan, Li and others},
  journal={arXiv preprint arXiv:2601.01392},
  year={2026}
}

About

[MM '25] TeleAntiFraud-28k, the first open-source audio-text slow-thinking dataset specifically designed for automated telecom fraud analysis.
