👻 PhantomAPI

Stealth WAF-bypass scraping engine with AI-powered structured data extraction.

Turn any website into a structured JSON API, no matter what WAF protects it.


What is PhantomAPI?

PhantomAPI is a production-grade REST API framework that turns any website into a structured data source, even if that site has no public API and is protected by Cloudflare, DataDome, or similar WAF layers.

It drives a real, fingerprint-spoofed Chrome browser, cleans the DOM, then feeds the content to GPT-4o, which returns exactly the data you asked for as a clean JSON object. It supports two extraction modes: synchronous (the JSON is returned in the HTTP response) and asynchronous (the result is delivered to a webhook).


Flow

POST /api/v1/extract
        │
        ▼
┌─────────────────────────────────┐
│  Stealth Chrome Engine          │
│  · undetected-chromedriver      │
│  · Advanced Stealth Flags       │
│  · Proxy rotation               │
│  · Exponential backoff retry    │
└─────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────┐
│  BeautifulSoup DOM Cleaner      │
│  · script / style / svg removed │
│  · Attribute stripping          │
│  · 12 000 char token guard      │
└─────────────────────────────────┘
        │
        ▼
┌─────────────────────────────────┐
│  OpenAI GPT-4o                  │
│  · json_object response mode    │
│  · Zero-temperature extraction  │
└─────────────────────────────────┘
        │
        ▼
   Clean JSON Response (Sync)
            OR
   Webhook Delivery (Async)
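
The DOM-cleaning stage in the diagram can be illustrated with a small sketch. PhantomAPI itself uses BeautifulSoup; the version below swaps in the standard library's html.parser so it runs without extra dependencies, and the names clean_html and AttributeStrippingCleaner are hypothetical, not taken from the project. It drops script, style, and svg subtrees, discards every attribute, and truncates the result at 12 000 characters.

```python
import re
from html.parser import HTMLParser

# Tags whose whole subtree is dropped, as in "script / style / svg removed".
DROPPED_TAGS = {"script", "style", "svg"}
MAX_CONTENT_CHARS = 12_000  # the diagram's character guard


class AttributeStrippingCleaner(HTMLParser):
    """Re-emits tags without attributes and skips dropped subtrees."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a dropped tag
        self._chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in DROPPED_TAGS:
            self._skip_depth += 1
        elif self._skip_depth == 0:
            self._chunks.append(f"<{tag}>")  # attributes intentionally discarded

    def handle_endtag(self, tag):
        if tag in DROPPED_TAGS:
            self._skip_depth = max(0, self._skip_depth - 1)
        elif self._skip_depth == 0:
            self._chunks.append(f"</{tag}>")

    def handle_data(self, data):
        # Keep text outside dropped subtrees, collapsing runs of whitespace.
        if self._skip_depth == 0 and data.strip():
            self._chunks.append(re.sub(r"\s+", " ", data))

    def result(self) -> str:
        return "".join(self._chunks)


def clean_html(html: str, max_chars: int = MAX_CONTENT_CHARS) -> str:
    """Return a slimmed-down copy of the markup, cut off at max_chars."""
    parser = AttributeStrippingCleaner()
    parser.feed(html)
    parser.close()
    return parser.result()[:max_chars]
```

Cutting by character count is a cheap stand-in for a real token count; it bounds the prompt size without needing a tokenizer.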

Stack

| Layer       | Technology                        |
|-------------|-----------------------------------|
| API         | FastAPI + Uvicorn                 |
| Scraping    | undetected-chromedriver + Selenium |
| DOM Parsing | BeautifulSoup4 + lxml             |
| AI Engine   | OpenAI GPT-4o                     |
| Validation  | Pydantic v2                       |
| Rate Limit  | SlowAPI + Asyncio Semaphore       |
| Retries     | Tenacity + exponential backoff    |
| Deployment  | Docker + Docker Compose           |
| Logging     | colorlog                          |
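
To show what "exponential backoff" means for the retry layer, here is a dependency-free sketch. The project lists Tenacity for this; the function below is only an illustration of the idea, and retry_with_backoff and its doubling schedule are assumptions rather than the project's actual policy. The defaults mirror the RETRY_ATTEMPTS and RETRY_DELAY settings documented under Environment Variables.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    func: Callable[[], T],
    attempts: int = 3,        # mirrors RETRY_ATTEMPTS
    base_delay: float = 2.0,  # mirrors RETRY_DELAY (seconds)
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func until it succeeds, doubling the wait after each failure."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(attempts):
        try:
            return func()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the last error
            # Waits base_delay, then 2x, then 4x, and so on.
            sleep(base_delay * (2 ** attempt))
    raise AssertionError("unreachable")
```

Passing sleep as a parameter keeps the function easy to test, because a test can record the delays instead of actually waiting.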

Setup

Option 1 β€” Docker (Recommended)

git clone https://github.com/ossiqn/PhantomAPI.git
cd PhantomAPI
cp .env.example .env
docker-compose up -d --build

Option 2 β€” Local Environment

git clone https://github.com/ossiqn/PhantomAPI.git
cd PhantomAPI
python -m venv venv && source venv/bin/activate
pip install -r requirements.txt
cp .env.example .env
python main.py

Preview

[Image: PhantomAPI terminal preview]


Usage

1. Synchronous Extraction

Returns the extracted JSON directly in the HTTP response.

curl -X POST "http://localhost:8000/api/v1/extract" \
     -H "Content-Type: application/json" \
     -H "X-OpenAI-Key: sk-..." \
     -d '{
           "url": "https://target-site.com/products",
           "prompt": "Extract all product names and prices as a JSON array."
         }'
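
The same synchronous call can be made from Python. This is a minimal client sketch using only the standard library; build_extract_request and extract are hypothetical helper names, and the 120-second timeout is an arbitrary assumption chosen to leave room for browser retries.

```python
import json
import urllib.request


def build_extract_request(
    base_url: str, target_url: str, prompt: str, openai_key: str
) -> urllib.request.Request:
    """Build the POST request for /api/v1/extract without sending it."""
    payload = {"url": target_url, "prompt": prompt}
    return urllib.request.Request(
        f"{base_url}/api/v1/extract",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "X-OpenAI-Key": openai_key},
        method="POST",
    )


def extract(request: urllib.request.Request, timeout: float = 120.0) -> dict:
    """Send the request and decode the JSON response body."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read().decode("utf-8"))
```

With a server running locally, extract(build_extract_request("http://localhost:8000", "https://target-site.com/products", "Extract all product names and prices as a JSON array.", "sk-...")) would return the decoded response as a dict.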

2. Asynchronous Webhook Extraction

Provide a webhook_url. The API immediately returns 202 Accepted with a task_id and processes the extraction in the background. Once complete, the result is sent to your webhook in a POST request.

curl -X POST "http://localhost:8000/api/v1/extract" \
     -H "Content-Type: application/json" \
     -H "X-OpenAI-Key: sk-..." \
     -d '{
           "url": "https://target-site.com/products",
           "prompt": "Extract all product names and prices as a JSON array.",
           "webhook_url": "https://your-server.com/webhook/receive"
         }'
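
On the receiving side, any HTTP endpoint that accepts a POST will do. The sketch below is a minimal standard-library receiver. The exact shape of the webhook payload is not documented here, so the code assumes only that the body is a JSON object; the names parse_webhook_body, WebhookHandler, and serve, and port 9000, are all hypothetical.

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

RECEIVED: list[dict] = []  # in-memory store, for illustration only


def parse_webhook_body(raw: bytes) -> dict:
    """Decode a webhook body; raises ValueError if it is not a JSON object."""
    payload = json.loads(raw.decode("utf-8"))
    if not isinstance(payload, dict):
        raise ValueError("expected a JSON object")
    return payload


class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", "0"))
        try:
            RECEIVED.append(parse_webhook_body(self.rfile.read(length)))
            self.send_response(204)  # accepted, nothing to return
        except ValueError:
            self.send_response(400)  # malformed or non-object body
        self.end_headers()


def serve(port: int = 9000) -> None:
    """Block forever, handling webhook deliveries on the given port."""
    HTTPServer(("0.0.0.0", port), WebhookHandler).serve_forever()
```

Calling serve() starts the listener; a real deployment would also want to authenticate the sender before trusting the payload.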

Request Body

| Field             | Type   | Required | Description                                                  |
|-------------------|--------|----------|--------------------------------------------------------------|
| url               | string | yes      | Full URL of the target page                                  |
| prompt            | string | yes      | What data to extract and how to structure it                 |
| wait_for_selector | string | no       | CSS selector to wait for before capturing the DOM            |
| javascript        | string | no       | Custom JS to execute after page load (max 2000 characters)   |
| webhook_url       | string | no       | Target URL to receive the async extraction result            |
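
The request fields can be validated client-side before sending. The project uses Pydantic v2 for this on the server; the sketch below uses a plain dataclass instead so it has no dependencies. Only the field names, the required/optional split, and the 2000-character limit on javascript come from this document. The http(s)-only URL rule and the non-empty prompt rule are assumptions.

```python
from dataclasses import dataclass
from typing import Optional
from urllib.parse import urlparse

MAX_JS_CHARS = 2000  # documented cap on the javascript field


def _require_http_url(value: str, field: str) -> None:
    # Assumption: only absolute http(s) URLs are acceptable.
    parsed = urlparse(value)
    if parsed.scheme not in ("http", "https") or not parsed.netloc:
        raise ValueError(f"{field} must be a full http(s) URL")


@dataclass(frozen=True)
class ExtractRequest:
    url: str
    prompt: str
    wait_for_selector: Optional[str] = None
    javascript: Optional[str] = None
    webhook_url: Optional[str] = None

    def __post_init__(self) -> None:
        _require_http_url(self.url, "url")
        if not self.prompt.strip():
            raise ValueError("prompt must not be empty")
        if self.javascript is not None and len(self.javascript) > MAX_JS_CHARS:
            raise ValueError(f"javascript exceeds {MAX_JS_CHARS} characters")
        if self.webhook_url is not None:
            _require_http_url(self.webhook_url, "webhook_url")
```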

Headers

| Header       | Required | Description         |
|--------------|----------|---------------------|
| X-OpenAI-Key | yes      | Your OpenAI API key |

Response (Synchronous)

{
  "success": true,
  "url": "https://target-site.com/products",
  "extracted_data": {
    "products": [
      { "name": "Product A", "price": "$19.99" },
      { "name": "Product B", "price": "$34.99" }
    ]
  },
  "tokens_used": 812,
  "proxy_used": "http://1.2.3.4:8080",
  "elapsed_ms": 7430.21
}

Proxy Support

Create a proxies.txt file in the project root:

# Lines starting with # are ignored
http://user:pass@1.2.3.4:8080
socks5://9.10.11.12:1080
http://5.6.7.8:3128
  • Proxies are selected randomly on each request.
  • Bad proxies are auto-removed from the rotation pool on failure.
  • If the file does not exist, PhantomAPI runs on your direct IP without interruption.
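
The three behaviours listed above (random selection, removal on failure, and fallback to a direct connection) can be sketched as follows. This is an illustration, not the contents of proxy_manager.py; ProxyPool, parse_proxy_lines, pick, and mark_bad are hypothetical names.

```python
import random
from pathlib import Path
from typing import Optional


def parse_proxy_lines(text: str) -> list[str]:
    """Return proxy URLs, skipping blank lines and lines starting with '#'."""
    proxies = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            proxies.append(line)
    return proxies


class ProxyPool:
    def __init__(self, proxies: list[str]) -> None:
        self._proxies = list(proxies)

    @classmethod
    def from_file(cls, path: str = "proxies.txt") -> "ProxyPool":
        file = Path(path)
        # A missing file is not an error: the pool is simply empty.
        return cls(parse_proxy_lines(file.read_text()) if file.exists() else [])

    def pick(self) -> Optional[str]:
        """Random proxy, or None to signal a direct connection."""
        return random.choice(self._proxies) if self._proxies else None

    def mark_bad(self, proxy: str) -> None:
        """Drop a failing proxy from the rotation."""
        if proxy in self._proxies:
            self._proxies.remove(proxy)
```

Returning None from pick lets the caller treat "no proxies configured" and "all proxies removed" the same way, by connecting directly.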

Endpoints

| Method | Path            | Description         |
|--------|-----------------|---------------------|
| POST   | /api/v1/extract | Run extraction      |
| GET    | /api/v1/health  | Engine health check |
| GET    | /docs           | Swagger UI          |
| GET    | /redoc          | ReDoc UI            |

Environment Variables

| Variable              | Default     | Description                                    |
|-----------------------|-------------|------------------------------------------------|
| APP_HOST              | 0.0.0.0     | Server bind host                               |
| APP_PORT              | 8000        | Server port                                    |
| APP_ENV               | production  | Environment label                              |
| PAGE_LOAD_TIMEOUT     | 30          | Seconds before browser timeout                 |
| RETRY_ATTEMPTS        | 3           | Max browser retry count                        |
| RETRY_DELAY           | 2           | Base delay between retries in seconds          |
| MAX_CONTENT_CHARS     | 12000       | Max chars forwarded to OpenAI                  |
| PROXY_FILE_PATH       | proxies.txt | Path to proxy list file                        |
| RATE_LIMIT_PER_MINUTE | 30          | Max requests per minute per IP                 |
| MAX_CONCURRENT_TASKS  | 5           | Max simultaneous browser instances (Queue cap) |
| ADVANCED_STEALTH_MODE | true        | Enable extreme WAF bypass Chrome flags         |
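
As a sketch of how these variables and defaults might be read at startup, the loader below maps each one to a typed field. It is not the project's config.py; Settings, load_settings, and the set of strings accepted as true are assumptions.

```python
import os
from dataclasses import dataclass
from typing import Mapping


def _as_bool(value: str) -> bool:
    # Assumption: these spellings count as "enabled".
    return value.strip().lower() in ("1", "true", "yes", "on")


@dataclass(frozen=True)
class Settings:
    app_host: str
    app_port: int
    app_env: str
    page_load_timeout: int
    retry_attempts: int
    retry_delay: float
    max_content_chars: int
    proxy_file_path: str
    rate_limit_per_minute: int
    max_concurrent_tasks: int
    advanced_stealth_mode: bool


def load_settings(env: Mapping[str, str] = os.environ) -> Settings:
    """Read each variable, falling back to the documented default."""
    return Settings(
        app_host=env.get("APP_HOST", "0.0.0.0"),
        app_port=int(env.get("APP_PORT", "8000")),
        app_env=env.get("APP_ENV", "production"),
        page_load_timeout=int(env.get("PAGE_LOAD_TIMEOUT", "30")),
        retry_attempts=int(env.get("RETRY_ATTEMPTS", "3")),
        retry_delay=float(env.get("RETRY_DELAY", "2")),
        max_content_chars=int(env.get("MAX_CONTENT_CHARS", "12000")),
        proxy_file_path=env.get("PROXY_FILE_PATH", "proxies.txt"),
        rate_limit_per_minute=int(env.get("RATE_LIMIT_PER_MINUTE", "30")),
        max_concurrent_tasks=int(env.get("MAX_CONCURRENT_TASKS", "5")),
        advanced_stealth_mode=_as_bool(env.get("ADVANCED_STEALTH_MODE", "true")),
    )
```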

Error Codes

| Status | Meaning                                              |
|--------|------------------------------------------------------|
| 401    | Missing or invalid X-OpenAI-Key header               |
| 408    | Target page timed out after all retry attempts       |
| 422    | Validation error or empty page content               |
| 429    | Rate limit exceeded                                  |
| 503    | WAF bypass failed, OpenAI unreachable, or server full |
| 500    | Unexpected internal error                            |
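
A client can use these codes to decide whether a retry is worthwhile. The sketch below copies the meanings from the table; the choice of which statuses count as retryable (408, 429, 503) is an assumption about sensible client behaviour, not something the API specifies, and describe_error is a hypothetical helper.

```python
# Meanings copied from the error-code table above.
ERROR_MEANINGS = {
    401: "Missing or invalid X-OpenAI-Key header",
    408: "Target page timed out after all retry attempts",
    422: "Validation error or empty page content",
    429: "Rate limit exceeded",
    503: "WAF bypass failed, OpenAI unreachable, or server full",
    500: "Unexpected internal error",
}

# Assumption: transient conditions that may clear up on a later attempt.
RETRYABLE_STATUSES = {408, 429, 503}


def describe_error(status: int) -> tuple[str, bool]:
    """Return (meaning, should_retry) for an HTTP status from the API."""
    meaning = ERROR_MEANINGS.get(status, "Undocumented status")
    return meaning, status in RETRYABLE_STATUSES
```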

Project Structure

PhantomAPI/
├── main.py
├── requirements.txt
├── Dockerfile
├── docker-compose.yml
├── .env.example
├── .gitignore
└── src/
    ├── api/
    │   ├── routes.py
    │   └── middleware.py
    ├── core/
    │   ├── config.py
    │   ├── schemas.py
    │   └── exceptions.py
    ├── services/
    │   ├── scraper.py
    │   └── ai_parser.py
    └── utils/
        ├── proxy_manager.py
        ├── rate_limiter.py
        └── logger.py

Security

  • API keys are never stored, logged, or hardcoded; they are passed per request via header only.
  • Rate limiting is enforced per IP via SlowAPI.
  • Smart Queue (Semaphore) prevents server overload by capping concurrent Chrome instances.
  • Custom JavaScript input is capped at 2000 characters to prevent abuse.
  • All exception traces stay server-side; clients receive sanitized error messages.
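
The semaphore-based cap on concurrent Chrome instances can be illustrated with asyncio alone. In this sketch, TaskGate and demo are hypothetical names and the sleeping coroutine stands in for a browser session. Excess jobs wait here for a free slot; whether PhantomAPI queues or rejects them is not spelled out beyond the 503 "server full" status.

```python
import asyncio
from typing import Awaitable, Callable, TypeVar

T = TypeVar("T")


class TaskGate:
    """Caps how many jobs run at once; extra callers wait their turn."""

    def __init__(self, limit: int = 5) -> None:  # mirrors MAX_CONCURRENT_TASKS
        self._semaphore = asyncio.Semaphore(limit)
        self.active = 0  # jobs currently running
        self.peak = 0    # highest concurrency observed

    async def run(self, job: Callable[[], Awaitable[T]]) -> T:
        async with self._semaphore:
            self.active += 1
            self.peak = max(self.peak, self.active)
            try:
                return await job()
            finally:
                self.active -= 1


async def demo(jobs: int = 12, limit: int = 5) -> int:
    """Run more jobs than slots and report the peak concurrency."""
    gate = TaskGate(limit)

    async def fake_browser_job() -> None:
        await asyncio.sleep(0.01)  # stands in for a Chrome session

    await asyncio.gather(*(gate.run(fake_browser_job) for _ in range(jobs)))
    return gate.peak
```

Running asyncio.run(demo()) starts twelve jobs against five slots, so the reported peak never exceeds the limit.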

License

This project is licensed under the MIT License. See the LICENSE file for details.


🌐 Community & Support

| Platform            | Link                      |
|---------------------|---------------------------|
| 📢 Telegram         | t.me/ossiqn               |
| 📦 Telegram Archive | t.me/ossiqnarsiv          |
| 🌍 Website          | ossiqn.com.tr             |
| 📸 Instagram        | instagram.com/ossiqnstwo  |
| 🛡️ Forum            | blueshield.com.tr         |


Built with 👻 by Ossiqn. PhantomAPI is intended for legal use only. Always ensure you have permission to scrape a target website.