Ingest documents. Extract intelligence. Generate reports.
An automated pipeline that ingests documents or web pages, extracts structured data using LLMs, analyzes cross-source patterns, and generates polished reports rendered with Quarto.
```
Sources (URLs/files)
  → Parse → Chunk
  → LLM Extract (map-reduce)
  → LLM Analyze (per-source + clustering + cross-source synthesis)
  → LLM Report Writer (section-by-section)
  → Quarto .qmd → HTML/PDF/DOCX
```
- Input Ingestion — Detects whether each source is a URL or local file, downloads if needed, identifies the file type, and dispatches to the appropriate parser.
- Chunking — Splits parsed text into overlapping chunks sized for LLM context windows. Tables are kept as separate chunks. Handles paragraph-less documents with newline/word-split fallbacks.
- Extraction (Map-Reduce) — Each chunk is sent to an LLM for structured data extraction (entities, statistics, claims). Results are batch-reduced into a single JSON extraction per source.
- Analysis & Synthesis — Per-source deep analysis, source clustering by topic similarity, and cross-source synthesis that identifies themes, connections, contradictions, and key takeaways.
- Report Generation — Section-by-section LLM calls build a complete Quarto `.qmd` with narrative prose, charts (matplotlib/seaborn), and data-driven visualizations.
- Rendering — Quarto compiles the `.qmd` into a self-contained HTML file (or PDF/DOCX). Resources are embedded — no extra folders needed.
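The chunking and map-reduce extraction steps can be sketched as follows. This is a minimal illustration, not the project's actual API: `chunk_text`, `extract`, and the `call_llm` callable are hypothetical stand-ins for `chunker.py`, `extractor.py`, and the litellm wrapper.

```python
# Sketch of overlapping chunking plus map-reduce extraction.
# All names here are illustrative, not the real module API.

def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows sized for an LLM context."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step back by `overlap` chars each window
    return chunks

def extract(chunks: list[str], call_llm) -> dict:
    """Map: extract per chunk. Reduce: merge partial results into one dict."""
    partials = [call_llm(f"Extract entities/stats/claims as JSON:\n{c}") for c in chunks]
    merged: dict = {"entities": [], "statistics": [], "claims": []}
    for p in partials:  # naive reduce: concatenate lists field by field
        for key in merged:
            merged[key].extend(p.get(key, []))
    return merged
```

The real pipeline batch-reduces with a second LLM call rather than a plain list merge, but the map-then-reduce shape is the same.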
| Format | Parser |
|---|---|
| Web pages (URL) | trafilatura |
| PDF | PyMuPDF + pdfplumber |
| Word (.docx) | python-docx |
| Excel (.xlsx) | pandas + openpyxl |
| CSV | pandas |
| Plain text | built-in |
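Ingestion amounts to routing each source to the matching parser by URL scheme or file extension. A rough sketch, with hypothetical parser functions standing in for the modules under `input_processing/parsers/`:

```python
from pathlib import Path

# Hypothetical stand-ins for the real parser modules.
def parse_pdf(p): return f"pdf:{p}"
def parse_docx(p): return f"docx:{p}"
def parse_excel(p): return f"xlsx:{p}"
def parse_csv(p): return f"csv:{p}"
def parse_text(p): return f"txt:{p}"

PARSERS = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".xlsx": parse_excel,
    ".csv": parse_csv,
    ".txt": parse_text,
}

def dispatch(source: str) -> str:
    """Route URLs to the web parser, local files to a parser by extension."""
    if source.startswith(("http://", "https://")):
        return f"web:{source}"  # the real pipeline calls trafilatura here
    parser = PARSERS.get(Path(source).suffix.lower(), parse_text)
    return parser(source)
```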
Prerequisites:

Install dependencies:

```bash
pip install -r requirements.txt
```

Configure environment — create a `.env` file in the project root:

```
OPENAI_API_KEY=your-key-here
```
If Quarto picks up the wrong Python interpreter, set:

```powershell
$env:QUARTO_PYTHON = "C:\path\to\python.exe"
```

Usage:

```
python ReGen.py [sources...] [options]
```

Examples:

```bash
# Single URL, standard mode
python ReGen.py https://example.com/article

# Multiple sources with detailed mode, auto-render to HTML
python ReGen.py https://example.com/article data/study.pdf -m detailed --render

# Sources from a text file (one URL/path per line, # comments ignored)
python ReGen.py sources.txt -m brief --name my_report

# PDF output with custom model and verbose logging
python ReGen.py paper.pdf -o pdf --model gpt-4o -v

# Quiet mode — only errors and final path printed
python ReGen.py https://example.com -q --render
```

Options:
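For reference, a `sources.txt` passed as a source might look like this (illustrative contents):

```text
# mixed URLs and local paths, one per line
https://example.com/article
data/study.pdf
notes.txt
```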
| Flag | Description | Default |
|---|---|---|
| `sources` | URLs, file paths, or `.txt` files (one source per line) | (required) |
| `-m, --mode` | Report detail level: `brief`, `standard`, `detailed` | `standard` |
| `-o, --output` | Output format: `html`, `pdf`, `docx` | `html` |
| `--name` | Output filename (without extension) | `report` |
| `--model` | LLM model name (any litellm-supported model) | `gpt-3.5-turbo` |
| `--render` | Auto-render the `.qmd` with Quarto after generation | off |
| `-v, --verbose` | Show chunk-level extraction and reduce progress | off |
| `-q, --quiet` | Suppress all output except errors and final path | off |
| `-n, --notion` | Convert the report to `.md` and export to a Notion database | off |
Modes:

| Mode | Description |
|---|---|
| `brief` | Quick summary, minimal sections, single-call generation |
| `standard` | Themes, cross-source findings, clusters — section-by-section |
| `detailed` | Everything in `standard` + per-source deep-dives, more themes/takeaways |
The generated .qmd is saved to reports/. Rendered HTML is fully self-contained — open it on any machine, no extra files needed.
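The self-contained output comes from Quarto's `embed-resources` option; the generated `.qmd` front matter presumably includes something like:

```yaml
format:
  html:
    embed-resources: true
```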
```
report_generator/
├── ReGen.py                  # Pipeline orchestrator + CLI entry point
├── input_processing/
│   ├── reader.py             # Source detection, download, MIME routing
│   ├── chunker.py            # Paragraph-based chunking with overlap
│   └── parsers/
│       ├── text_parser.py
│       ├── csv_parser.py
│       ├── docx_parser.py
│       ├── pdf_parser.py
│       ├── excelParser.py
│       └── web_parser.py
├── models/
│   └── model.py              # LLM wrapper (litellm, provider-agnostic)
├── extractor/
│   └── extractor.py          # Map-reduce extraction pipeline
├── analyzer/
│   └── analyzer.py           # Per-source analysis, clustering, synthesis
├── reportgenerator/
│   └── reportMaker.py        # Section-by-section Quarto .qmd generation
├── reports/                  # Generated reports output directory
├── assets/                   # README banner and other assets
├── requirements.txt
└── .env                      # API keys (not committed)
```
- Multi-source support — accept a list of URLs/files, extract each independently, synthesize across all
- Report modes — `brief`, `standard`, `detailed` with scaling themes, takeaways, and section depth
- Analyzer layer — per-source analysis, topic clustering, cross-source synthesis
- Section-by-section generation — avoids LLM output token limits on longer reports
- Self-contained HTML — `embed-resources` for portable single-file reports
- CLI interface — argparse-based CLI with sources, mode, output format, render, verbose/quiet flags
- Image handling
- Edit Mode Agent — references the JSONs saved during generation and edits the final report on user request
- Local fine-tuned models — swap cloud LLMs for locally hosted models for cost, privacy, and offline use
- Research mode (optional) — given a topic, auto-search the web for relevant sources and feed the best ones into the pipeline