A benchmark for measuring how well LLMs follow JSON response format instructions across RAG-inspired tasks. Supports OpenAI, Anthropic, Google, Ollama, and Modal/vLLM providers out of the box.
You can find our research paper on arXiv!
Install dependencies:

```bash
uv sync
```

Set your API key and configure `structured_rag/configs/benchmark.yaml`:
```bash
export OPENAI_API_KEY=sk-...
```

```yaml
provider: openai # openai | anthropic | google | ollama | ollama_cloud | modal_vllm
model: gpt-5-nano
api_key_env: OPENAI_API_KEY # which env var to read the key from
strategy: fstring # fstring | fstring_structured | dspy | dspy_opro | all
tasks:
- AssessAnswerability # or "all" for all 7 tasks
save_dir: results
```

Run:
```bash
uv run python -m structured_rag.scripts.run_benchmark
```

Or point to a custom config:
```bash
uv run python -m structured_rag.scripts.run_benchmark path/to/custom.yaml
```

The benchmark tests 7 RAG-inspired structured output tasks across different JSON complexity levels:
| Output Type | Task | Example |
|---|---|---|
| `string` | GenerateAnswer | `{"answer": "The National Gallery of Art..."}` |
| `integer` | RateContext | `{"context_score": 5}` |
| `boolean` | AssessAnswerability | `{"answerable_question": true}` |
| `List[string]` | ParaphraseQuestions | `{"paraphrased_questions": ["...", "...", "..."]}` |
| `composite` | GenerateAnswerWithConfidence | `{"answer": "...", "confidence": 5}` |
| `List[composite]` | GenerateAnswersWithConfidence | `[{"answer": "...", "confidence": 5}, ...]` |
| `composite` | RAGAS | `{"faithfulness_score": 2.5, "answer_relevance_score": 1.0, ...}` |
The composite output types correspond to Pydantic models, for example:

```python
from pydantic import BaseModel

class GenerateAnswerWithConfidence(BaseModel):
    answer: str
    confidence: int

class RAGASMetrics(BaseModel):
    faithfulness_score: float
    answer_relevance_score: float
    context_relevance_score: float
```
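A response for a composite task counts as format-valid only if it parses cleanly into the corresponding schema; for example (illustrative usage, not part of the benchmark code):

```python
raw = '{"faithfulness_score": 2.5, "answer_relevance_score": 1.0, "context_relevance_score": 4.0}'

# Pydantic v2: parse the JSON string and validate field names/types in one step.
metrics = RAGASMetrics.model_validate_json(raw)
print(metrics.faithfulness_score)  # 2.5
```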
| Strategy | Description |
|---|---|
| `fstring` | f-string prompting with inline JSON format instructions |
| `fstring_structured` | f-string prompting with provider-native structured outputs (OpenAI, Google) |
| `dspy` | DSPy Follow-the-Format (FF) prompting |
| `dspy_opro` | DSPy with OPRO-optimized JSON signature |
| `all` | Run all 4 strategies |
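As a rough illustration (not the repository's exact prompt template), the `fstring` strategy boils down to embedding the expected JSON shape directly in the prompt text:

```python
context = "The National Gallery of Art is located in Washington, D.C."
question = "Where is the National Gallery of Art located?"

# Inline JSON format instruction appended to the task description (illustrative only).
prompt = f"""Assess whether the question can be answered from the given context.

Context: {context}
Question: {question}

Respond only with valid JSON in this exact format: {{"answerable_question": true}}"""
```

The `fstring_structured` variant instead enforces the schema through the provider's native structured-output API.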
| Provider | Config value | API key env var |
|---|---|---|
| OpenAI | `openai` | `OPENAI_API_KEY` |
| Anthropic | `anthropic` | `ANTHROPIC_API_KEY` |
| Google Gemini | `google` | `GOOGLE_API_KEY` |
| Ollama (local) | `ollama` | -- |
| Ollama Cloud | `ollama_cloud` | `OLLAMA_API_KEY` |
| Modal vLLM | `modal_vllm` | `MODAL_API_KEY` |
The benchmark reports two separate scores:
- JSON Format Success Rate -- did the LLM produce valid, parseable JSON matching the expected schema?
- Task Accuracy -- for tasks with ground truth (e.g. AssessAnswerability), did the LLM get the right answer?
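A minimal sketch of the format check, assuming Pydantic-based validation (class and function names here are illustrative; the benchmark's actual metric code lives in `core/domain/`):

```python
from pydantic import BaseModel, ValidationError

class AssessAnswerabilityResponse(BaseModel):
    answerable_question: bool

def is_format_valid(raw_response: str) -> bool:
    """True if the response is parseable JSON that matches the expected schema."""
    try:
        AssessAnswerabilityResponse.model_validate_json(raw_response)
        return True
    except ValidationError:
        return False

print(is_format_valid('{"answerable_question": true}'))  # True
print(is_format_valid('Sure! The answer is: yes'))        # False
```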
The codebase follows hexagonal (ports & adapters) architecture:

```
structured_rag/
  core/
    domain/      # Pydantic models, task definitions, validation metrics
    ports/       # Abstract interfaces (LLMPort, PromptingStrategy)
    services/    # Experiment runner, result saving
  adapters/
    llm/         # One adapter per provider (OpenAI, Anthropic, Google, Ollama, Modal/vLLM)
    prompting/   # Strategy implementations (f-string, DSPy)
  configs/       # benchmark.yaml
  scripts/       # run_benchmark.py entry point
```
Adding a new LLM provider requires creating one adapter file implementing `LLMPort` and registering it in `adapters/llm/registry.py`.
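A hypothetical adapter might look like the sketch below; the actual `LLMPort` interface is defined in `core/ports/`, so the import path and method name here are assumptions:

```python
# structured_rag/adapters/llm/my_provider_adapter.py (hypothetical file)
from structured_rag.core.ports import LLMPort  # assumed import path

class MyProviderAdapter(LLMPort):
    """Adapter for a new LLM provider; implements the LLMPort interface."""

    def __init__(self, model: str, api_key: str | None = None):
        self.model = model
        self.api_key = api_key

    def generate(self, prompt: str) -> str:  # assumed method name
        # Call the provider's completion API here and return the raw text response.
        raise NotImplementedError
```

The new adapter is then registered under its config value in `adapters/llm/registry.py`.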
The WikiQuestions dataset contains 112 samples built from Wikipedia title-abstract pairs with generated answerable/unanswerable questions. Also available on HuggingFace Datasets.
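To pull the data from the Hub instead, something like the following should work (the dataset ID and split shown are guesses; check the HuggingFace Hub for the exact repo name):

```python
from datasets import load_dataset

# "weaviate/WikiQuestions" is a hypothetical dataset ID; substitute the actual one.
wiki_questions = load_dataset("weaviate/WikiQuestions", split="train")
print(len(wiki_questions))  # expected: 112 samples
print(wiki_questions[0])
```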
- Weaviate Podcast #119 with Will Kurt and Cameron Pfiffer from dottxt.ai -- YouTube | Spotify
- Weaviate Podcast #108 with Zhi Rui Tam on "Let Me Speak Freely?" -- YouTube | Spotify
```bibtex
@misc{shorten2024,
  title={StructuredRAG: JSON Response Formatting with Large Language Models},
  author={Connor Shorten and Charles Pierse and Thomas Benjamin Smith and Erika Cardenas and Akanksha Sharma and John Trengrove and Bob van Luijt},
  year={2024},
  eprint={2408.11061},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11061},
}
```