AlgoNode/discombobulator

discombobulator

A tool-calling stress test for vLLM-served Qwen3 (default target: qwen3.6-35b-a3b-ex). Drives the OpenAI-compatible Responses API at BASE_URL/v1/responses, walks a matrix of scenarios × context sizes × reasoning modes, and grades each run.

The harness is intentionally adversarial: it combines distractor tools, strict schemas, needle-in-haystack arguments, parallel and sequentially dependent calls, refusal traps, and padding tiers of up to 60K tokens.

Configuration

A .env file in the working directory is loaded automatically. Real environment variables take precedence.

MODEL=qwen3.6-35b-a3b-ex
BASE_URL=http://alab:8000/
API_KEY=xxx

| Variable | Purpose |
|----------|---------|
| MODEL | Model id passed to vLLM in every request. |
| BASE_URL | Root of the vLLM server. /v1/responses is appended. |
| API_KEY | Sent as Authorization: Bearer <key> (empty = no header). |
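
The precedence and header rules above can be sketched in Go as follows. This is a minimal illustration, not the code in cmd/discombobulator/main.go: the helper names (parseEnv, lookup, endpoint, newRequest) are hypothetical, and trimming a trailing slash from BASE_URL before appending /v1/responses is an assumption about how the join is handled.

```go
package main

import (
	"bufio"
	"fmt"
	"net/http"
	"os"
	"strings"
)

// parseEnv reads KEY=VALUE lines, skipping blank lines and # comments.
func parseEnv(text string) map[string]string {
	out := map[string]string{}
	sc := bufio.NewScanner(strings.NewReader(text))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		if line == "" || strings.HasPrefix(line, "#") {
			continue
		}
		k, v, ok := strings.Cut(line, "=")
		if !ok {
			continue
		}
		out[strings.TrimSpace(k)] = strings.TrimSpace(v)
	}
	return out
}

// lookup prefers a real environment variable over the value from the file.
func lookup(file map[string]string, key string) string {
	if v, ok := os.LookupEnv(key); ok {
		return v
	}
	return file[key]
}

// endpoint appends the Responses path to the server root.
// Assumption: a trailing slash on the root is trimmed to avoid "//".
func endpoint(baseURL string) string {
	return strings.TrimRight(baseURL, "/") + "/v1/responses"
}

// newRequest sets the bearer header only when a key is configured.
func newRequest(baseURL, apiKey, body string) (*http.Request, error) {
	req, err := http.NewRequest(http.MethodPost, endpoint(baseURL), strings.NewReader(body))
	if err != nil {
		return nil, err
	}
	req.Header.Set("Content-Type", "application/json")
	if apiKey != "" {
		req.Header.Set("Authorization", "Bearer "+apiKey)
	}
	return req, nil
}

func main() {
	file := parseEnv("MODEL=qwen3.6-35b-a3b-ex\nBASE_URL=http://alab:8000/\nAPI_KEY=xxx\n")
	req, err := newRequest(lookup(file, "BASE_URL"), lookup(file, "API_KEY"), "{}")
	if err != nil {
		panic(err)
	}
	fmt.Println(req.URL.String(), req.Header.Get("Authorization"))
}
```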

Build & run

make build                  # → ./bin/discombobulator
./bin/discombobulator       # full matrix, reports in ./results/

make smoke                  # 1K context only, fast sanity
make quick                  # skip the 60K tier

CLI flags (also visible via -h):

| Flag | Default | Notes |
|------|---------|-------|
| -contexts | 1024,2048,4096,8192,16384,32768,61440 | Token padding tiers. |
| -scenarios | (all) | Comma-separated scenario IDs. |
| -reasoning | on,off | on, off, or both. Toggles Qwen enable_thinking. |
| -temp | 0.7 | Sampling temperature. |
| -max-output | 4096 | Per-call output token cap. |
| -max-rounds | 8 | Tool-call rounds within one user turn. |
| -seed | 0xC0FFEE | PRNG seed for deterministic padding. |
| -results | results | Output directory for JSON + markdown reports. |
| -v | off | Print failing checks per run on stdout. |
| -env | .env | Path to env file (missing file is silently ignored). |

Default matrix is 15 scenarios × 7 contexts × 2 reasoning modes = 210 runs.
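
As a rough illustration of how the default matrix expands to 210 runs, here is a short Go sketch. The Run type, the matrix function, the loop order, and the placeholder scenario names are assumptions made for the example; only the tier values and the counts come from the defaults above.

```go
package main

import "fmt"

// Run identifies one cell of the test matrix (hypothetical type).
type Run struct {
	Scenario  string
	Context   int
	Reasoning bool
}

// matrix expands scenarios x contexts x reasoning modes into a flat list.
func matrix(scenarios []string, contexts []int, modes []bool) []Run {
	runs := make([]Run, 0, len(scenarios)*len(contexts)*len(modes))
	for _, s := range scenarios {
		for _, c := range contexts {
			for _, m := range modes {
				runs = append(runs, Run{Scenario: s, Context: c, Reasoning: m})
			}
		}
	}
	return runs
}

func main() {
	// Placeholder IDs standing in for the 15 real scenarios.
	scenarios := make([]string, 15)
	for i := range scenarios {
		scenarios[i] = fmt.Sprintf("scenario_%02d", i+1)
	}
	contexts := []int{1024, 2048, 4096, 8192, 16384, 32768, 61440}
	modes := []bool{true, false} // reasoning on, then off

	fmt.Println(len(matrix(scenarios, contexts, modes))) // prints 210
}
```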

Reasoning toggle

Qwen3's chat template honors enable_thinking. The runner passes chat_template_kwargs: {"enable_thinking": <bool>} on every Responses request. vLLM forwards this kwarg verbatim to the chat template, so no SDK-side handling is needed.
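
A request body carrying the toggle might be built like this. The struct below is a hypothetical, trimmed-down shape for illustration (a real request would also carry tools and sampling settings), and it is not taken from internal/client/responses.go.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// responsesRequest is a hypothetical minimal body for POST /v1/responses.
type responsesRequest struct {
	Model              string         `json:"model"`
	Input              string         `json:"input"`
	ChatTemplateKwargs map[string]any `json:"chat_template_kwargs"`
}

// buildBody serializes a request with the reasoning toggle set explicitly.
func buildBody(model, input string, thinking bool) ([]byte, error) {
	return json.Marshal(responsesRequest{
		Model:              model,
		Input:              input,
		ChatTemplateKwargs: map[string]any{"enable_thinking": thinking},
	})
}

func main() {
	b, err := buildBody("qwen3.6-35b-a3b-ex", "What is the weather in Warsaw?", false)
	if err != nil {
		panic(err)
	}
	fmt.Println(string(b))
}
```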

Scenarios

Each scenario lives in internal/scenario/scenarios.go. Tools are catalogued in internal/scenario/tools.go.

| ID | What it stresses |
|----|------------------|
| basic_weather | Single tool, single arg: the baseline. |
| parallel_weather | Three cities → exactly three parallel calls in one turn. |
| sequential_lookup_send | Numeric user_id from call N feeds call N+1 (must not pass email as id). |
| strict_schema_calendar | Nested attendees array, RFC3339 UTC timestamps, location enum, required flags. |
| enum_constraint_iso | "Polish" must become pl (regex ^[a-z]{2}$); avoid deprecated translate_legacy. |
| distractor_haystack | 21 tools, only get_weather is correct; close-name distractors present. |
| conditional_no_call | Trivial fact + calculator available: must NOT call. |
| conditional_must_call | Exact integer arithmetic: must call calculator and return correct result. |
| multi_turn_filesystem | 3 turns: list → read (path from prior result) → summarize (no call). |
| needle_secret_lookup | secret_id is buried in padding; tool refuses lookup-by-name. |
| unicode_args | Cyrillic, Polish diacritics, emoji must round-trip into call args verbatim. |
| hallucination_resistance | Slack post requested but no slack tool: must answer in text, no invented calls. |
| system_user_conflict | System says celsius, user demands fahrenheit: follow the user. |
| pagination_loop | Follow next_page_token until empty; report total count correctly. |
| refusal_with_tool | Bulk-spam request: must refuse, must NOT call send_message. |

Reports

Each invocation writes two files into the -results directory:

  • results-YYYYMMDD-HHMMSS.json — full machine-readable transcript: every check, every call's parsed arguments, token usage, durations.
  • results-YYYYMMDD-HHMMSS.md — human summary with a pass-matrix pivot (rows = scenarios, columns = <ctx> think | <ctx> no-think) and per-failure detail.

Exit status is non-zero when any run fails, so CI can gate on it.
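
The file naming and exit-status behavior can be sketched as below. The function names are hypothetical, and the specific exit code 1 is an assumption: the text above only promises a non-zero status.

```go
package main

import (
	"fmt"
	"path/filepath"
	"time"
)

// stamp formats t as YYYYMMDD-HHMMSS using Go's reference-time layout.
func stamp(t time.Time) string {
	return t.Format("20060102-150405")
}

// reportPaths returns the JSON and markdown paths sharing one timestamp.
func reportPaths(dir, ts string) (jsonPath, mdPath string) {
	base := filepath.Join(dir, "results-"+ts)
	return base + ".json", base + ".md"
}

// exitCode maps the number of failed runs to a process exit status.
// Assumption: 1 is used for "any failure".
func exitCode(failed int) int {
	if failed > 0 {
		return 1
	}
	return 0
}

func main() {
	t := time.Date(2025, time.January, 2, 3, 4, 5, 0, time.UTC)
	j, m := reportPaths("results", stamp(t))
	fmt.Println(j)
	fmt.Println(m)
	fmt.Println(exitCode(0), exitCode(3))
}
```

In CI, gating on the process exit status is enough; the markdown report is there for a human to read after a failure.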

Layout

cmd/discombobulator/main.go           CLI: env loading, matrix walk, reports
internal/client/responses.go          Minimal HTTP client for /v1/responses
internal/padding/padding.go           Deterministic prose padding + needle insertion
internal/scenario/scenario.go         Scenario / Turn / Expectation types + Evaluate
internal/scenario/tools.go            Tool catalog (real + distractors)
internal/scenario/scenarios.go        The 15 scenario definitions
internal/scenario/checks.go           Argument-validation helpers
internal/runner/runner.go             Multi-round tool-call driver, per-turn grading
internal/report/report.go             JSON + markdown writers

Adding a scenario

  1. Add a constructor to internal/scenario/scenarios.go.
  2. Append it to the slice returned by All().
  3. Reuse tools from tools.go or add new ones.
  4. Declare what the run must satisfy with Expectation.MustCall / MustNotCall / NoCall / MinCalls / MaxCalls / OrderMatters / FinalRegex. ArgCheck is a func(map[string]any) error; return a useful message, because it lands in the markdown report verbatim.
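
As an example of step 4, an ArgCheck-shaped function for the enum_constraint_iso rule could look like the sketch below. Only the func(map[string]any) error signature and the ^[a-z]{2}$ pattern come from this README; the argument name target_lang and the function name are hypothetical.

```go
package main

import (
	"fmt"
	"regexp"
)

// langCode matches a two-letter lowercase language code such as "pl".
var langCode = regexp.MustCompile(`^[a-z]{2}$`)

// checkTargetLang validates the parsed arguments of one tool call.
// The "target_lang" key is an assumed argument name for illustration.
func checkTargetLang(args map[string]any) error {
	raw, ok := args["target_lang"]
	if !ok {
		return fmt.Errorf("target_lang: missing required argument")
	}
	s, ok := raw.(string)
	if !ok {
		return fmt.Errorf("target_lang: want string, got %T", raw)
	}
	if !langCode.MatchString(s) {
		return fmt.Errorf("target_lang: %q does not match ^[a-z]{2}$", s)
	}
	return nil
}

func main() {
	fmt.Println(checkTargetLang(map[string]any{"target_lang": "pl"}))
	fmt.Println(checkTargetLang(map[string]any{"target_lang": "Polish"}))
}
```

Naming the offending argument and value in the error keeps the per-failure detail in the report readable without opening the JSON transcript.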

If you need synthetic tool results to drive multi-step plans, set Turn.ToolResultFn. The runner injects a result for every call in the round, then re-issues the request, repeating for up to -max-rounds rounds.
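
The round loop can be pictured with a small simulation in the spirit of pagination_loop. Everything here is a stand-in: the Call and Result types, the canned pages, the list_items tool name, and the toolResult function (playing the role of Turn.ToolResultFn) are hypothetical, and the "model" is simulated by simply following next_page_token.

```go
package main

import "fmt"

// Call and Result are hypothetical stand-ins for the runner's types.
type Call struct {
	Name string
	Args map[string]any
}

type Result struct {
	Items     int
	NextToken string
}

// pages holds canned data keyed by page token; "" is the first page.
var pages = map[string]Result{
	"":   {Items: 10, NextToken: "p2"},
	"p2": {Items: 10, NextToken: "p3"},
	"p3": {Items: 4, NextToken: ""},
}

// toolResult returns the synthetic result for one call.
func toolResult(c Call) Result {
	tok, _ := c.Args["page_token"].(string)
	return pages[tok]
}

// drive simulates a model that follows next_page_token until it is empty,
// stopping early if the round cap is reached.
func drive(maxRounds int) (total, rounds int) {
	call := &Call{Name: "list_items", Args: map[string]any{}}
	for call != nil && rounds < maxRounds {
		res := toolResult(*call)
		total += res.Items
		rounds++
		if res.NextToken == "" {
			call = nil // no further pages: the model would answer in text
		} else {
			call = &Call{Name: "list_items", Args: map[string]any{"page_token": res.NextToken}}
		}
	}
	return total, rounds
}

func main() {
	fmt.Println(drive(8)) // prints 24 3
	fmt.Println(drive(2)) // prints 20 2 (cap hit before the last page)
}
```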

Notes on context tiers

Padding is plain technical prose, deterministic per -seed, framed as "reference material — ignore unless required." The model has to actually search it for needle_secret_lookup and ignore it for everything else. Tier targets assume ~3.6 chars/token (slightly conservative); padding always meets or slightly exceeds the requested token count.
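
The sizing rule can be sketched as follows. This is an illustration under stated assumptions, not the code in internal/padding/padding.go: the filler sentences are invented, rounding up with math.Ceil is assumed, and math/rand seeded from -seed is one plausible way to get deterministic output.

```go
package main

import (
	"fmt"
	"math"
	"math/rand"
	"strings"
)

// charsPerToken is the ~3.6 chars/token estimate quoted above.
const charsPerToken = 3.6

// targetChars converts a token tier into a minimum character count.
func targetChars(tokens int) int {
	return int(math.Ceil(float64(tokens) * charsPerToken))
}

// sentences are hypothetical ASCII filler lines.
var sentences = []string{
	"The scheduler drains the queue before accepting new work.",
	"Each shard keeps a write-ahead log for crash recovery.",
	"Retries use exponential backoff with a fixed upper bound.",
	"Metrics are flushed to the collector every ten seconds.",
}

// pad builds deterministic padding of at least targetChars(tokens) characters.
func pad(tokens int, seed int64) string {
	rng := rand.New(rand.NewSource(seed))
	want := targetChars(tokens)
	var b strings.Builder
	for b.Len() < want {
		b.WriteString(sentences[rng.Intn(len(sentences))])
		b.WriteByte(' ')
	}
	return b.String()
}

func main() {
	a := pad(1024, 0xC0FFEE)
	b := pad(1024, 0xC0FFEE)
	fmt.Println(targetChars(1024), len(a) >= targetChars(1024), a == b) // prints 3687 true true
}
```

Because the loop stops only after crossing the target, the output meets or slightly overshoots the requested size, which matches the behavior described above.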

About

Custom LLM tool usage tester.
