A benchmark for measuring how well LLMs follow JSON response format instructions across RAG-inspired tasks. Supports OpenAI, Anthropic, Google, Ollama, and Modal/vLLM providers out of the box.
You can find our research paper on arXiv!
Install dependencies:

```bash
uv sync
```

Set your API key and configure `structured_rag/configs/benchmark.yaml`:
```bash
export OPENAI_API_KEY=sk-...
```

```yaml
provider: openai # openai | anthropic | google | ollama | ollama_cloud | modal_vllm
model: gpt-5-nano
api_key_env: OPENAI_API_KEY # which env var to read the key from
strategy: fstring # fstring | fstring_structured | dspy | dspy_opro | all
tasks:
- AssessAnswerability # or "all" for all 7 tasks
save_dir: results
```

Run:
```bash
uv run python -m structured_rag.scripts.run_benchmark
```

Or point to a custom config:
```bash
uv run python -m structured_rag.scripts.run_benchmark path/to/custom.yaml
```

The benchmark tests 7 RAG-inspired structured output tasks across different JSON complexity levels:
| Output Type | Task | Example |
|---|---|---|
| `string` | GenerateAnswer | `{"answer": "The National Gallery of Art..."}` |
| `integer` | RateContext | `{"context_score": 5}` |
| `boolean` | AssessAnswerability | `{"answerable_question": true}` |
| `List[string]` | ParaphraseQuestions | `{"paraphrased_questions": ["...", "...", "..."]}` |
| `composite` | GenerateAnswerWithConfidence | `{"answer": "...", "confidence": 5}` |
| `List[composite]` | GenerateAnswersWithConfidence | `[{"answer": "...", "confidence": 5}, ...]` |
| `composite` | RAGAS | `{"faithfulness_score": 2.5, "answer_relevance_score": 1.0, ...}` |
The composite output types correspond to Pydantic models, for example:

```python
from pydantic import BaseModel

class GenerateAnswerWithConfidence(BaseModel):
    answer: str
    confidence: int

class RAGASMetrics(BaseModel):
    faithfulness_score: float
    answer_relevance_score: float
    context_relevance_score: float
```
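A response for a composite task counts as format-valid only if it parses cleanly into the corresponding schema; for example (illustrative usage, not part of the benchmark code):

```python
raw = '{"faithfulness_score": 2.5, "answer_relevance_score": 1.0, "context_relevance_score": 4.0}'

# Pydantic v2: parse the JSON string and validate field names/types in one step.
metrics = RAGASMetrics.model_validate_json(raw)
print(metrics.faithfulness_score)  # 2.5
```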
| Strategy | Description |
|---|---|
| `fstring` | f-string prompting with inline JSON format instructions |
| `fstring_structured` | f-string prompting with provider-native structured outputs (OpenAI, Google) |
| `dspy` | DSPy Follow-the-Format (FF) prompting |
| `dspy_opro` | DSPy with OPRO-optimized JSON signature |
| `all` | Run all 4 strategies |
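As a rough illustration (not the repository's exact prompt template), the `fstring` strategy boils down to embedding the expected JSON shape directly in the prompt text:

```python
context = "The National Gallery of Art is located in Washington, D.C."
question = "Where is the National Gallery of Art located?"

# Inline JSON format instruction appended to the task description (illustrative only).
prompt = f"""Assess whether the question can be answered from the given context.

Context: {context}
Question: {question}

Respond only with valid JSON in this exact format: {{"answerable_question": true}}"""
```

The `fstring_structured` variant instead enforces the schema through the provider's native structured-output API.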
| Provider | Config value | API key env var |
|---|---|---|
| OpenAI | `openai` | `OPENAI_API_KEY` |
| Anthropic | `anthropic` | `ANTHROPIC_API_KEY` |
| Google Gemini | `google` | `GOOGLE_API_KEY` |
| Ollama (local) | `ollama` | -- |
| Ollama Cloud | `ollama_cloud` | `OLLAMA_API_KEY` |
| Modal vLLM | `modal_vllm` | `MODAL_API_KEY` |
The benchmark reports two separate scores:
- JSON Format Success Rate -- did the LLM produce valid, parseable JSON matching the expected schema?
- Task Accuracy -- for tasks with ground truth (e.g. AssessAnswerability), did the LLM get the right answer?
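A minimal sketch of the format check, assuming Pydantic-based validation (class and function names here are illustrative; the benchmark's actual metric code lives in `core/domain/`):

```python
from pydantic import BaseModel, ValidationError

class AssessAnswerabilityResponse(BaseModel):
    answerable_question: bool

def is_format_valid(raw_response: str) -> bool:
    """True if the response is parseable JSON that matches the expected schema."""
    try:
        AssessAnswerabilityResponse.model_validate_json(raw_response)
        return True
    except ValidationError:
        return False

print(is_format_valid('{"answerable_question": true}'))  # True
print(is_format_valid('Sure! The answer is: yes'))        # False
```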
The codebase follows hexagonal (ports & adapters) architecture:

```
structured_rag/
  core/
    domain/      # Pydantic models, task definitions, validation metrics
    ports/       # Abstract interfaces (LLMPort, PromptingStrategy)
    services/    # Experiment runner, result saving
  adapters/
    llm/         # One adapter per provider (OpenAI, Anthropic, Google, Ollama, Modal/vLLM)
    prompting/   # Strategy implementations (f-string, DSPy)
  configs/       # benchmark.yaml
  scripts/       # run_benchmark.py entry point
```
Adding a new LLM provider requires creating one adapter file implementing `LLMPort` and registering it in `adapters/llm/registry.py`.
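A hypothetical adapter might look like the sketch below; the actual `LLMPort` interface is defined in `core/ports/`, so the import path and method name here are assumptions:

```python
# structured_rag/adapters/llm/my_provider_adapter.py (hypothetical file)
from structured_rag.core.ports import LLMPort  # assumed import path

class MyProviderAdapter(LLMPort):
    """Adapter for a new LLM provider; implements the LLMPort interface."""

    def __init__(self, model: str, api_key: str | None = None):
        self.model = model
        self.api_key = api_key

    def generate(self, prompt: str) -> str:  # assumed method name
        # Call the provider's completion API here and return the raw text response.
        raise NotImplementedError
```

The new adapter is then registered under its config value in `adapters/llm/registry.py`.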
The WikiQuestions dataset contains 112 samples built from Wikipedia title-abstract pairs with generated answerable/unanswerable questions. Also available on HuggingFace Datasets.
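To pull the data from the Hub instead, something like the following should work (the dataset ID and split shown are guesses; check the HuggingFace Hub for the exact repo name):

```python
from datasets import load_dataset

# "weaviate/WikiQuestions" is a hypothetical dataset ID; substitute the actual one.
wiki_questions = load_dataset("weaviate/WikiQuestions", split="train")
print(len(wiki_questions))  # expected: 112 samples
print(wiki_questions[0])
```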
- Weaviate Podcast #119 with Will Kurt and Cameron Pfiffer from dottxt.ai -- YouTube | Spotify
- Weaviate Podcast #108 with Zhi Rui Tam on "Let Me Speak Freely?" -- YouTube | Spotify
```bibtex
@misc{shorten2024,
  title={StructuredRAG: JSON Response Formatting with Large Language Models},
  author={Connor Shorten and Charles Pierse and Thomas Benjamin Smith and Erika Cardenas and Akanksha Sharma and John Trengrove and Bob van Luijt},
  year={2024},
  eprint={2408.11061},
  archivePrefix={arXiv},
  primaryClass={cs.CL},
  url={https://arxiv.org/abs/2408.11061},
}
```