Ingest documents. Extract intelligence. Generate reports.
An automated pipeline that ingests documents or web pages, extracts structured data using LLMs, analyzes cross-source patterns, and generates polished reports rendered with Quarto.
```
Sources (URLs/files)
  → Parse → Chunk
  → LLM Extract (map-reduce)
  → LLM Analyze (per-source + clustering + cross-source synthesis)
  → LLM Report Writer (section-by-section)
  → Quarto .qmd → HTML/PDF/DOCX
```
- Input Ingestion — Detects whether each source is a URL or local file, downloads if needed, identifies the file type, and dispatches to the appropriate parser.
- Chunking — Splits parsed text into overlapping chunks sized for LLM context windows. Tables are kept as separate chunks. Handles paragraph-less documents with newline/word-split fallbacks.
- Extraction (Map-Reduce) — Each chunk is sent to an LLM for structured data extraction (entities, statistics, claims). Results are batch-reduced into a single JSON extraction per source.
- Analysis & Synthesis — Per-source deep analysis, source clustering by topic similarity, and cross-source synthesis that identifies themes, connections, contradictions, and key takeaways.
- Report Generation — Section-by-section LLM calls build a complete Quarto `.qmd` with narrative prose, charts (matplotlib/seaborn), and data-driven visualizations.
- Rendering — Quarto compiles the `.qmd` into a self-contained HTML file (or PDF/DOCX). Resources are embedded — no extra folders needed.
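The chunking and map-reduce extraction steps can be sketched as follows. This is a minimal illustration, not the project's actual API: `chunk_text`, `extract`, and the `call_llm` callable are hypothetical stand-ins for `chunker.py`, `extractor.py`, and the litellm wrapper.

```python
# Sketch of overlapping chunking plus map-reduce extraction.
# All names here are illustrative, not the real module API.

def chunk_text(text: str, size: int = 2000, overlap: int = 200) -> list[str]:
    """Split text into overlapping windows sized for an LLM context."""
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + size])
        start += size - overlap  # step back by `overlap` chars each window
    return chunks

def extract(chunks: list[str], call_llm) -> dict:
    """Map: extract per chunk. Reduce: merge partial results into one dict."""
    partials = [call_llm(f"Extract entities/stats/claims as JSON:\n{c}") for c in chunks]
    merged: dict = {"entities": [], "statistics": [], "claims": []}
    for p in partials:  # naive reduce: concatenate lists field by field
        for key in merged:
            merged[key].extend(p.get(key, []))
    return merged
```

The real pipeline batch-reduces with a second LLM call rather than a plain list merge, but the map-then-reduce shape is the same.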
| Format | Parser |
|---|---|
| Web pages (URL) | trafilatura |
| PDF | PyMuPDF + pdfplumber |
| Word (.docx) | python-docx |
| Excel (.xlsx) | pandas + openpyxl |
| CSV | pandas |
| Plain text | built-in |
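Ingestion amounts to routing each source to the matching parser by URL scheme or file extension. A rough sketch, with hypothetical parser functions standing in for the modules under `input_processing/parsers/`:

```python
from pathlib import Path

# Hypothetical stand-ins for the real parser modules.
def parse_pdf(p): return f"pdf:{p}"
def parse_docx(p): return f"docx:{p}"
def parse_excel(p): return f"xlsx:{p}"
def parse_csv(p): return f"csv:{p}"
def parse_text(p): return f"txt:{p}"

PARSERS = {
    ".pdf": parse_pdf,
    ".docx": parse_docx,
    ".xlsx": parse_excel,
    ".csv": parse_csv,
    ".txt": parse_text,
}

def dispatch(source: str) -> str:
    """Route URLs to the web parser, local files to a parser by extension."""
    if source.startswith(("http://", "https://")):
        return f"web:{source}"  # the real pipeline calls trafilatura here
    parser = PARSERS.get(Path(source).suffix.lower(), parse_text)
    return parser(source)
```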
Prerequisites:

Install dependencies:

```bash
pip install -r requirements.txt
```

Configure environment — create a `.env` file in the project root:

```
OPENAI_API_KEY=your-key-here
```
If Quarto picks up the wrong Python interpreter, set:

```powershell
$env:QUARTO_PYTHON = "C:\path\to\python.exe"
```

Usage:

```
python ReGen.py [sources...] [options]
```

Examples:

```bash
# Single URL, standard mode
python ReGen.py https://example.com/article

# Multiple sources with detailed mode, auto-render to HTML
python ReGen.py https://example.com/article data/study.pdf -m detailed --render

# Sources from a text file (one URL/path per line, # comments ignored)
python ReGen.py sources.txt -m brief --name my_report

# PDF output with custom model and verbose logging
python ReGen.py paper.pdf -o pdf --model gpt-4o -v

# Quiet mode — only errors and final path printed
python ReGen.py https://example.com -q --render
```

Options:
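For reference, a `sources.txt` passed as a source might look like this (illustrative contents):

```text
# mixed URLs and local paths, one per line
https://example.com/article
data/study.pdf
notes.txt
```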
| Flag | Description | Default |
|---|---|---|
| `sources` | URLs, file paths, or `.txt` files (one source per line) | (required) |
| `-m, --mode` | Report detail level: `brief`, `standard`, `detailed` | `standard` |
| `-o, --output` | Output format: `html`, `pdf`, `docx` | `html` |
| `--name` | Output filename (without extension) | `report` |
| `--model` | LLM model name (any litellm-supported model) | `gpt-3.5-turbo` |
| `--render` | Auto-render the `.qmd` with Quarto after generation | off |
| `-v, --verbose` | Show chunk-level extraction and reduce progress | off |
| `-q, --quiet` | Suppress all output except errors and final path | off |
| `-n, --notion` | Convert the report to `.md` and export to a Notion database | off |
Modes:

| Mode | Description |
|---|---|
| `brief` | Quick summary, minimal sections, single-call generation |
| `standard` | Themes, cross-source findings, clusters — section-by-section |
| `detailed` | Everything in `standard` + per-source deep-dives, more themes/takeaways |
The generated .qmd is saved to reports/. Rendered HTML is fully self-contained — open it on any machine, no extra files needed.
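The self-contained output comes from Quarto's `embed-resources` option; the generated `.qmd` front matter presumably includes something like:

```yaml
format:
  html:
    embed-resources: true
```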
```
report_generator/
├── ReGen.py                  # Pipeline orchestrator + CLI entry point
├── input_processing/
│   ├── reader.py             # Source detection, download, MIME routing
│   ├── chunker.py            # Paragraph-based chunking with overlap
│   └── parsers/
│       ├── text_parser.py
│       ├── csv_parser.py
│       ├── docx_parser.py
│       ├── pdf_parser.py
│       ├── excelParser.py
│       └── web_parser.py
├── models/
│   └── model.py              # LLM wrapper (litellm, provider-agnostic)
├── extractor/
│   └── extractor.py          # Map-reduce extraction pipeline
├── analyzer/
│   └── analyzer.py           # Per-source analysis, clustering, synthesis
├── reportgenerator/
│   └── reportMaker.py        # Section-by-section Quarto .qmd generation
├── reports/                  # Generated reports output directory
├── assets/                   # README banner and other assets
├── requirements.txt
└── .env                      # API keys (not committed)
```
- Multi-source support — accept a list of URLs/files, extract each independently, synthesize across all
- Report modes — `brief`, `standard`, `detailed` with scaling themes, takeaways, and section depth
- Analyzer layer — per-source analysis, topic clustering, cross-source synthesis
- Section-by-section generation — avoids LLM output token limits on longer reports
- Self-contained HTML — `embed-resources` for portable single-file reports
- CLI interface — argparse-based CLI with sources, mode, output format, render, verbose/quiet flags
- Image handling
- Edit Mode Agent — references the JSONs saved during generation and edits the final report on user request
- Local fine-tuned models — swap cloud LLMs for locally hosted models for cost, privacy, and offline use
- Research mode (optional) — given a topic, auto-search the web for relevant sources and feed the best ones into the pipeline