discombobulator

A tool-calling stress test for vLLM-served Qwen3 (default target: qwen3.6-35b-a3b-ex). Drives the OpenAI-compatible Responses API at BASE_URL/v1/responses, walks a matrix of scenarios × context sizes × reasoning modes, and grades each run.

The harness is intentionally adversarial: it combines distractor tools, strict schemas, needle-in-haystack arguments, parallel and sequential dependent calls, refusal traps, and padding tiers of up to 60K tokens.

Configuration

A .env file in the working directory is loaded automatically. Real environment variables take precedence.

MODEL=qwen3.6-35b-a3b-ex
BASE_URL=http://alab:8000/
API_KEY=xxx

Variable	Purpose
`MODEL`	Model id passed to vLLM in every request.
`BASE_URL`	Root of the vLLM server. `/v1/responses` is appended.
`API_KEY`	Sent as `Authorization: Bearer <key>` (empty = no header).

Build & run

make build                  # → ./bin/discombobulator
./bin/discombobulator       # full matrix, reports in ./results/

make smoke                  # 1K context only, fast sanity
make quick                  # skip the 60K tier

CLI flags (also visible via -h):

Flag	Default	Notes
`-contexts`	`1024,2048,4096,8192,16384,32768,61440`	Token padding tiers.
`-scenarios`	(all)	Comma-separated scenario IDs.
`-reasoning`	`on,off`	`on`, `off`, or both. Toggles Qwen `enable_thinking`.
`-temp`	`0.7`	Sampling temperature.
`-max-output`	`4096`	Per-call output token cap.
`-max-rounds`	`8`	Tool-call rounds within one user turn.
`-seed`	`0xC0FFEE`	PRNG seed for deterministic padding.
`-results`	`results`	Output directory for JSON + markdown reports.
`-v`	off	Print failing checks per run on stdout.
`-env`	`.env`	Path to env file (missing file is silently ignored).

Default matrix is 15 scenarios × 7 contexts × 2 reasoning modes = 210 runs.
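
As a quick check of that arithmetic, the matrix can be enumerated as a plain cross product. The run struct, the buildMatrix helper, the loop order, and the placeholder scenario IDs below are assumptions for illustration; only the context tiers and reasoning modes come from the defaults listed above.

```go
package main

import "fmt"

// run is one cell of the matrix (hypothetical type for this sketch).
type run struct {
	Scenario  string
	Context   int
	Reasoning string
}

// buildMatrix returns the cross product of scenarios, contexts and modes.
func buildMatrix(scenarios []string, contexts []int, modes []string) []run {
	out := make([]run, 0, len(scenarios)*len(contexts)*len(modes))
	for _, s := range scenarios {
		for _, c := range contexts {
			for _, m := range modes {
				out = append(out, run{Scenario: s, Context: c, Reasoning: m})
			}
		}
	}
	return out
}

func main() {
	scenarios := make([]string, 15) // placeholder IDs, not the real ones
	for i := range scenarios {
		scenarios[i] = fmt.Sprintf("scenario_%02d", i+1)
	}
	contexts := []int{1024, 2048, 4096, 8192, 16384, 32768, 61440}
	modes := []string{"on", "off"}
	fmt.Println(len(buildMatrix(scenarios, contexts, modes))) // prints 210
}
```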

Reasoning toggle

Qwen3's chat template honors enable_thinking. The runner passes chat_template_kwargs: {"enable_thinking": <bool>} on every Responses request. vLLM forwards this kwarg verbatim to the chat template, so no SDK-side handling is needed.
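
A request carrying the toggle might be assembled along these lines. This is a sketch rather than the code in internal/client/responses.go: buildBody and newRequest are hypothetical names, the body is reduced to model, input, and chat_template_kwargs, and trimming the trailing slash from BASE_URL before appending /v1/responses is an assumption about how the URL is joined.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"strings"
)

// buildBody marshals a minimal Responses payload with the thinking toggle.
func buildBody(model, input string, thinking bool) ([]byte, error) {
	return json.Marshal(map[string]any{
		"model": model,
		"input": input,
		"chat_template_kwargs": map[string]any{
			"enable_thinking": thinking,
		},
	})
}

// newRequest prepares (but does not send) the POST to /v1/responses.
// The Authorization header is added only when apiKey is non-empty.
func newRequest(baseURL, apiKey string, body []byte) (*http.Request, error) {
	url := strings.TrimRight(baseURL, "/") + "/v1/responses" // assumed joining rule
	req, err := http.NewRequest(http.MethodPost, url, bytes.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	body, err := buildBody("qwen3.6-35b-a3b-ex", "What is the weather in Oslo?", false)
	if err != nil {
		panic(err)
	}
	req, err := newRequest("http://alab:8000/", "xxx", body)
	if err != nil {
		panic(err)
	}
	fmt.Println(req.Method, req.URL)
	fmt.Println(string(body))
}
```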

Scenarios

Each scenario lives in internal/scenario/scenarios.go. Tools are catalogued in internal/scenario/tools.go.

ID	What it stresses
`basic_weather`	Single tool, single arg — baseline.
`parallel_weather`	Three cities → exactly three parallel calls in one turn.
`sequential_lookup_send`	Numeric `user_id` from call N feeds call N+1 (must not pass email as id).
`strict_schema_calendar`	Nested `attendees` array, RFC3339 UTC timestamps, location enum, required flags.
`enum_constraint_iso`	"Polish" must become `pl` (regex `^[a-z]{2}$`); avoid deprecated `translate_legacy`.
`distractor_haystack`	21 tools, only `get_weather` is correct; close-name distractors present.
`conditional_no_call`	Trivial fact + calculator available — must NOT call.
`conditional_must_call`	Exact integer arithmetic — must call calculator and return correct result.
`multi_turn_filesystem`	3 turns: list → read (path from prior result) → summarize (no call).
`needle_secret_lookup`	`secret_id` is buried in padding; tool refuses lookup-by-name.
`unicode_args`	Cyrillic, Polish diacritics, emoji must round-trip into call args verbatim.
`hallucination_resistance`	Slack post requested but no slack tool — must answer in text, no invented calls.
`system_user_conflict`	System says celsius, user demands fahrenheit — follow the user.
`pagination_loop`	Follow `next_page_token` until empty; report total count correctly.
`refusal_with_tool`	Bulk-spam request — must refuse, must NOT call `send_message`.

Reports

Each invocation writes two files into the -results directory:

results-YYYYMMDD-HHMMSS.json — full machine-readable transcript: every check, every call's parsed arguments, token usage, durations.
results-YYYYMMDD-HHMMSS.md — human summary with a pass-matrix pivot (rows = scenarios, columns = <ctx> think | <ctx> no-think) and per-failure detail.

Exit status is non-zero when any run fails, so CI can gate on it.

Layout

cmd/discombobulator/main.go           CLI: env loading, matrix walk, reports
internal/client/responses.go          Minimal HTTP client for /v1/responses
internal/padding/padding.go           Deterministic prose padding + needle insertion
internal/scenario/scenario.go         Scenario / Turn / Expectation types + Evaluate
internal/scenario/tools.go            Tool catalog (real + distractors)
internal/scenario/scenarios.go        The 15 scenario definitions
internal/scenario/checks.go           Argument-validation helpers
internal/runner/runner.go             Multi-round tool-call driver, per-turn grading
internal/report/report.go             JSON + markdown writers

Adding a scenario

1. Add a constructor to internal/scenario/scenarios.go.
2. Append it to the slice returned by All().
3. Reuse tools from tools.go or add new ones.
4. Use Expectation.MustCall / MustNotCall / NoCall / MinCalls / MaxCalls / OrderMatters / FinalRegex. ArgCheck is a func(map[string]any) error; return a useful message, because it lands in the markdown report verbatim.
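
An ArgCheck in the spirit of enum_constraint_iso could look like the sketch below. Only the func(map[string]any) error shape and the ^[a-z]{2}$ pattern come from this README; the argument name target_lang and the function name checkTargetLang are hypothetical, so adjust them to the actual tool schema.

```go
package main

import (
	"fmt"
	"regexp"
)

// Two lowercase letters, as required by the enum_constraint_iso scenario.
var isoCode = regexp.MustCompile(`^[a-z]{2}$`)

// checkTargetLang validates a hypothetical "target_lang" argument.
// Each error message is written to be readable on its own in a report.
func checkTargetLang(args map[string]any) error {
	raw, ok := args["target_lang"]
	if !ok {
		return fmt.Errorf("missing required argument %q", "target_lang")
	}
	s, ok := raw.(string)
	if !ok {
		return fmt.Errorf("target_lang: want string, got %T", raw)
	}
	if !isoCode.MatchString(s) {
		return fmt.Errorf("target_lang: %q does not match %s", s, isoCode.String())
	}
	if s != "pl" {
		return fmt.Errorf("target_lang: got %q, want %q for Polish", s, "pl")
	}
	return nil
}

func main() {
	fmt.Println(checkTargetLang(map[string]any{"target_lang": "pl"}))
	fmt.Println(checkTargetLang(map[string]any{"target_lang": "Polish"}))
	fmt.Println(checkTargetLang(map[string]any{}))
}
```

Decoded JSON arguments arrive as map[string]any, so the type assertion guards against a model passing a number or an object where a string is expected.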

If you need synthetic tool results to drive multi-step plans, set Turn.ToolResultFn. The runner injects results for every call in the round, then re-issues, up to -max-rounds.
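
For a pagination-style scenario, a synthetic result function could be shaped roughly like this. The actual signature of Turn.ToolResultFn is not shown in this README, so the signature below, the page_token argument, the list_items tool name, and the canned pages are all assumptions made for the sketch. The loop in main only imitates a caller following next_page_token for at most 8 rounds.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// page is the canned tool result for one call (hypothetical shape).
type page struct {
	Items         []string `json:"items"`
	NextPageToken string   `json:"next_page_token"`
}

// pages maps an incoming page token to the result to return for it.
var pages = map[string]page{
	"":   {Items: []string{"a", "b"}, NextPageToken: "p2"},
	"p2": {Items: []string{"c", "d"}, NextPageToken: "p3"},
	"p3": {Items: []string{"e"}, NextPageToken: ""},
}

// listItemsResult returns the JSON result for one synthetic tool call.
// Assumed signature; adapt it to whatever Turn.ToolResultFn expects.
func listItemsResult(toolName string, args map[string]any) (string, error) {
	token, _ := args["page_token"].(string) // absent token means first page
	p, ok := pages[token]
	if !ok {
		return "", fmt.Errorf("%s: unknown page_token %q", toolName, token)
	}
	out, err := json.Marshal(p)
	return string(out), err
}

func main() {
	total, token := 0, ""
	for round := 0; round < 8; round++ { // mirrors the -max-rounds default
		res, err := listItemsResult("list_items", map[string]any{"page_token": token})
		if err != nil {
			panic(err)
		}
		var p page
		if err := json.Unmarshal([]byte(res), &p); err != nil {
			panic(err)
		}
		total += len(p.Items)
		if p.NextPageToken == "" {
			break
		}
		token = p.NextPageToken
	}
	fmt.Println(total) // prints 5
}
```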

Notes on context tiers

Padding is plain technical prose, deterministic per -seed, framed as "reference material — ignore unless required." The model has to actually search it for needle_secret_lookup and ignore it for everything else. Tier targets assume ~3.6 chars/token (slightly conservative); padding always meets or slightly exceeds the requested token count.
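
The sizing rule can be illustrated with a small sketch: convert a token target to a character budget at 3.6 chars/token, rounding up, then emit seeded pseudo-random words until the budget is met. This is not the code in internal/padding/padding.go; targetChars, pad, and the word list are hypothetical, and real padding is prose rather than a word shuffle.

```go
package main

import (
	"fmt"
	"math/rand"
	"strings"
)

// targetChars returns ceil(tokens * 3.6) using integer arithmetic.
func targetChars(tokens int) int {
	return (tokens*36 + 9) / 10
}

// vocab is a made-up ASCII word list, so byte length equals character count.
var vocab = []string{"buffer", "latency", "scheduler", "checksum", "replica", "quorum", "index", "shard"}

// pad emits words until the character budget is met or slightly exceeded.
// The same seed always yields the same text.
func pad(tokens int, seed int64) string {
	rng := rand.New(rand.NewSource(seed))
	want := targetChars(tokens)
	var b strings.Builder
	for b.Len() < want {
		b.WriteString(vocab[rng.Intn(len(vocab))])
		b.WriteByte(' ')
	}
	return b.String()
}

func main() {
	p := pad(1024, 0xC0FFEE)
	// Budget, whether it was met, and whether a second call reproduces it.
	fmt.Println(targetChars(1024), len(p) >= targetChars(1024), p == pad(1024, 0xC0FFEE))
	// prints: 3687 true true
}
```

Because the loop stops only after the budget is reached, the output can overshoot by at most one word plus a space, which matches the "meets or slightly exceeds" behavior described above.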
