A tool-calling stress test for vLLM-served Qwen3 (default target: qwen3.6-35b-a3b-ex).
Drives the OpenAI-compatible Responses API at BASE_URL/v1/responses, walks a
matrix of scenarios × context sizes × reasoning modes, and grades each run.
The harness is intentionally adversarial — distractor tools, strict schemas, needle-in-haystack arguments, parallel and sequential dependent calls, refusal traps, and 60K-token padding tiers.
A .env file in the working directory is loaded automatically. Real environment variables
take precedence.

    MODEL=qwen3.6-35b-a3b-ex
    BASE_URL=http://alab:8000/
    API_KEY=xxx

| Variable | Purpose |
|---|---|
| `MODEL` | Model id passed to vLLM in every request. |
| `BASE_URL` | Root of the vLLM server. `/v1/responses` is appended. |
| `API_KEY` | Sent as `Authorization: Bearer <key>` (empty = no header). |


    make build            # → ./bin/discombobulator
    ./bin/discombobulator # full matrix, reports in ./results/
    make smoke            # 1K context only, fast sanity
    make quick            # skip the 60K tier

CLI flags (also visible via `-h`):

| Flag | Default | Notes |
|---|---|---|
| `-contexts` | `1024,2048,4096,8192,16384,32768,61440` | Token padding tiers. |
| `-scenarios` | (all) | Comma-separated scenario IDs. |
| `-reasoning` | `on,off` | `on`, `off`, or both. Toggles Qwen `enable_thinking`. |
| `-temp` | `0.7` | Sampling temperature. |
| `-max-output` | `4096` | Per-call output token cap. |
| `-max-rounds` | `8` | Tool-call rounds within one user turn. |
| `-seed` | `0xC0FFEE` | PRNG seed for deterministic padding. |
| `-results` | `results` | Output directory for JSON + markdown reports. |
| `-v` | off | Print failing checks per run on stdout. |
| `-env` | `.env` | Path to env file (missing file is silently ignored). |

Default matrix is 15 scenarios × 7 contexts × 2 reasoning modes = 210 runs.
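
The run count follows from a plain Cartesian product. The sketch below enumerates it; the `run` type, `buildMatrix`, the placeholder scenario IDs, and the loop nesting order are all hypothetical and only illustrate the arithmetic.

```go
package main

import "fmt"

// run identifies one cell of the test matrix (hypothetical type).
type run struct {
	Scenario  string
	Context   int
	Reasoning bool
}

// buildMatrix enumerates every scenario × context × reasoning-mode combination.
func buildMatrix(scenarios []string, contexts []int, modes []bool) []run {
	runs := make([]run, 0, len(scenarios)*len(contexts)*len(modes))
	for _, s := range scenarios {
		for _, c := range contexts {
			for _, m := range modes {
				runs = append(runs, run{Scenario: s, Context: c, Reasoning: m})
			}
		}
	}
	return runs
}

func main() {
	// Placeholder IDs stand in for the 15 real scenarios.
	scenarios := make([]string, 15)
	for i := range scenarios {
		scenarios[i] = fmt.Sprintf("scenario_%02d", i+1)
	}
	contexts := []int{1024, 2048, 4096, 8192, 16384, 32768, 61440}
	modes := []bool{true, false} // reasoning on, reasoning off

	fmt.Println(len(buildMatrix(scenarios, contexts, modes))) // prints 210
}
```
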
Qwen3's chat template honors enable_thinking. The runner passes
chat_template_kwargs: {"enable_thinking": <bool>} on every Responses request.
vLLM forwards this kwarg verbatim to the chat template — no SDK-side handling needed.
Each scenario lives in internal/scenario/scenarios.go.
Tools are catalogued in internal/scenario/tools.go.
| ID | What it stresses |
|---|---|
| `basic_weather` | Single tool, single arg (baseline). |
| `parallel_weather` | Three cities → exactly three parallel calls in one turn. |
| `sequential_lookup_send` | Numeric `user_id` from call N feeds call N+1 (must not pass email as id). |
| `strict_schema_calendar` | Nested attendees array, RFC3339 UTC timestamps, location enum, required flags. |
| `enum_constraint_iso` | "Polish" must become `pl` (regex `^[a-z]{2}$`); avoid deprecated `translate_legacy`. |
| `distractor_haystack` | 21 tools, only `get_weather` is correct; close-name distractors present. |
| `conditional_no_call` | Trivial fact + calculator available: must NOT call. |
| `conditional_must_call` | Exact integer arithmetic: must call calculator and return correct result. |
| `multi_turn_filesystem` | 3 turns: list → read (path from prior result) → summarize (no call). |
| `needle_secret_lookup` | `secret_id` is buried in padding; tool refuses lookup-by-name. |
| `unicode_args` | Cyrillic, Polish diacritics, emoji must round-trip into call args verbatim. |
| `hallucination_resistance` | Slack post requested but no slack tool: must answer in text, no invented calls. |
| `system_user_conflict` | System says celsius, user demands fahrenheit: follow the user. |
| `pagination_loop` | Follow `next_page_token` until empty; report total count correctly. |
| `refusal_with_tool` | Bulk-spam request: must refuse, must NOT call `send_message`. |

Each run writes two files into -results:
- `results-YYYYMMDD-HHMMSS.json`: full machine-readable transcript (every check, every call's parsed arguments, token usage, durations).
- `results-YYYYMMDD-HHMMSS.md`: human summary with a pass-matrix pivot (rows = scenarios, columns = `<ctx> think | <ctx> no-think`) and per-failure detail.

Exit status is non-zero when any run fails, so CI can gate on it.

    cmd/discombobulator/main.go     CLI: env loading, matrix walk, reports
    internal/client/responses.go    Minimal HTTP client for /v1/responses
    internal/padding/padding.go     Deterministic prose padding + needle insertion
    internal/scenario/scenario.go   Scenario / Turn / Expectation types + Evaluate
    internal/scenario/tools.go      Tool catalog (real + distractors)
    internal/scenario/scenarios.go  The 15 scenario definitions
    internal/scenario/checks.go     Argument-validation helpers
    internal/runner/runner.go       Multi-round tool-call driver, per-turn grading
    internal/report/report.go       JSON + markdown writers

- Add a constructor to `internal/scenario/scenarios.go`.
- Append it to the slice returned by `All()`.
- Reuse tools from `tools.go` or add new ones.
- Use `Expectation.MustCall` / `MustNotCall` / `NoCall` / `MinCalls` / `MaxCalls` / `OrderMatters` / `FinalRegex`. `ArgCheck` is a `func(map[string]any) error`; return a useful message, because it lands in the markdown report verbatim.

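
To make the `ArgCheck` contract concrete, here is a minimal sketch of a check in the spirit of `enum_constraint_iso`. Only the `func(map[string]any) error` signature and the `^[a-z]{2}$` pattern come from this README; the function name `checkTargetLang` and the argument key `target_lang` are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// twoLetter matches a lowercase two-letter language code such as "pl".
var twoLetter = regexp.MustCompile(`^[a-z]{2}$`)

// checkTargetLang has the ArgCheck shape: func(map[string]any) error.
// Each error message is written to be readable on its own, since the
// message is copied into the report as-is.
func checkTargetLang(args map[string]any) error {
	raw, ok := args["target_lang"] // hypothetical argument name
	if !ok {
		return fmt.Errorf("missing required argument %q", "target_lang")
	}
	code, ok := raw.(string)
	if !ok {
		return fmt.Errorf("target_lang must be a string, got %T", raw)
	}
	if !twoLetter.MatchString(code) {
		return fmt.Errorf("target_lang %q does not match ^[a-z]{2}$", code)
	}
	return nil
}

func main() {
	fmt.Println(checkTargetLang(map[string]any{"target_lang": "pl"}))
	fmt.Println(checkTargetLang(map[string]any{"target_lang": "Polish"}))
}
```
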
If you need synthetic tool results to drive multi-step plans, set Turn.ToolResultFn. The
runner injects results for every call in the round, then re-issues, up to -max-rounds.
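
The round loop described above can be sketched in isolation. Everything here is hypothetical: the real `Turn.ToolResultFn` signature is not shown in this README, and `modelFn` is a stand-in for an HTTP round trip to the model. The fake model follows page tokens in the manner of `pagination_loop`.

```go
package main

import "fmt"

// toolCall is a simplified view of one tool invocation (hypothetical type).
type toolCall struct {
	Name string
	Args map[string]any
}

// toolResultFn produces a synthetic result for one call (assumed signature).
type toolResultFn func(call toolCall) string

// modelFn stands in for one request to the model: given all tool results
// so far, it returns the next round's calls (none means a final answer).
type modelFn func(results []string) []toolCall

// drive injects a result for every call in a round, then re-issues,
// stopping when the model makes no calls or maxRounds is reached.
func drive(model modelFn, resultFn toolResultFn, maxRounds int) (rounds int, results []string) {
	for rounds < maxRounds {
		calls := model(results)
		if len(calls) == 0 {
			break
		}
		for _, c := range calls {
			results = append(results, resultFn(c))
		}
		rounds++
	}
	return rounds, results
}

// pagingModel requests the next page until the last token seen is empty.
func pagingModel(results []string) []toolCall {
	token := ""
	if len(results) > 0 {
		token = results[len(results)-1]
		if token == "" {
			return nil // last page reached, answer in text
		}
	}
	return []toolCall{{Name: "list_items", Args: map[string]any{"page_token": token}}}
}

// pagingResults returns the next page token for a three-page listing.
func pagingResults(call toolCall) string {
	next := map[string]string{"": "p2", "p2": "p3", "p3": ""}
	token, _ := call.Args["page_token"].(string)
	return next[token]
}

func main() {
	rounds, results := drive(pagingModel, pagingResults, 8)
	fmt.Println(rounds, len(results)) // prints 3 3
}
```
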
Padding is plain technical prose, deterministic per -seed, framed as
"reference material — ignore unless required." The model has to actually search it for
needle_secret_lookup and ignore it for everything else. Tier targets assume ~3.6 chars/token
(slightly conservative); padding always meets or slightly exceeds the requested token count.
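
The sizing rule can be sketched as follows: convert the token target to a character budget at 3.6 chars/token, rounding up, then emit seeded filler until the budget is met. `charBudget`, `pad`, and the word list are hypothetical stand-ins; the real generator in `internal/padding/padding.go` produces technical prose rather than random words.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// charBudget converts a token target into characters at 3.6 chars/token,
// rounding up. Integer arithmetic avoids floating-point rounding surprises.
func charBudget(tokens int) int {
	return (tokens*36 + 9) / 10
}

// vocab is a stand-in word list for this sketch.
var vocab = []string{"cache", "buffer", "scheduler", "latency", "replica", "quorum", "index", "shard"}

// pad emits seeded filler until the character budget is met, so the output
// always meets or slightly exceeds the target and is identical for a given seed.
func pad(tokens int, seed int64) string {
	rng := rand.New(rand.NewSource(seed))
	budget := charBudget(tokens)
	var b strings.Builder
	for b.Len() < budget {
		b.WriteString(vocab[rng.Intn(len(vocab))])
		b.WriteByte(' ')
	}
	return b.String()
}

func main() {
	first := pad(1024, 0xC0FFEE)
	second := pad(1024, 0xC0FFEE)
	fmt.Println(charBudget(1024))                 // prints 3687
	fmt.Println(len(first) >= charBudget(1024))   // prints true
	fmt.Println(first == second)                  // prints true
}
```
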