digitalnelson/docunetic
Docunetic

This repository provides a small Python service that watches a directory for new or updated documents, converts them to Docling JSON, and stores the resulting representation in a PostgreSQL database.

Features

  • Watches a directory (recursively by default) for changes using watchdog.
  • Converts supported documents via Docling.
  • Debounces rapid file modifications to avoid duplicate conversions.
  • Stores Docling JSON, file hashes, and timestamps in PostgreSQL.
  • Provides a CLI with environment variable and command-line configuration.
  • Optional semantic search integration using Semantic Kernel.
  • Background PubMed enrichment that augments extracted references with metadata.

Requirements

  • Python 3.11+
  • PostgreSQL 13+
  • The docling Python package and its system dependencies. Installing Docling will download sizeable machine learning models (PyTorch), so budget sufficient disk space and time.

Installation

Create and activate a virtual environment, then install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Docling can require additional system libraries depending on which features you use (OCR, vision models, etc.). Consult the Docling documentation for platform-specific guidance.

Configuration

The application is primarily configured through environment variables:

| Variable | Description | Required | Default |
|---|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string (e.g. `postgresql://user:pass@localhost/db`). | Yes | — |
| `WATCH_DIRECTORY` | Directory to monitor for document changes. | Yes | — |
| `WATCH_RECURSIVE` | Watch subdirectories (`true`/`false`). | No | `true` |
| `WATCH_DEBOUNCE` | Seconds to wait before converting a touched file. | No | `0.5` |
| `WATCH_EXTENSIONS` | Comma-separated list of file extensions (e.g. `.pdf,.docx`). | No | `.pdf,.docx,.pptx,.xlsx,.rtf,.txt,.md` |
| `SEMANTIC_SEARCH_PROVIDER` | Provider for semantic search embeddings (`openai` or `azure-openai`). | No | `openai` |
| `SEMANTIC_SEARCH_MODEL` | Embedding model identifier (defaults to a provider-specific sensible value). | No | `text-embedding-3-small` |
| `SEMANTIC_SEARCH_MAX_DOCUMENTS` | Number of documents to embed per semantic query. | No | `200` |
| `SEMANTIC_SEARCH_MAX_DOCUMENT_CHARS` | Maximum characters to embed from each document. | No | `4000` |
| `PUBMED_ENABLED` | Enable or disable PubMed enrichment (`true`/`false`). | No | `true` |
| `PUBMED_EMAIL` | Contact email sent to the E-utilities API (recommended by NCBI). | No | — |
| `PUBMED_API_KEY` | Optional NCBI API key to increase rate limits. | No | — |
| `PUBMED_MAX_RESULTS` | Maximum PubMed matches to store per reference. | No | `3` |
| `PUBMED_TIMEOUT` | Timeout, in seconds, for PubMed HTTP requests. | No | `10.0` |
| `PUBMED_REQUEST_INTERVAL` | Minimum seconds between PubMed API calls (throttling). | No | `0.34` |

You can also provide overrides via CLI arguments.

An example .env file is provided in .env.example to document these settings.
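Based on the table above, a minimal `.env` might look like the following (the values are illustrative placeholders, not defaults from the repository's `.env.example`):

```
DATABASE_URL=postgresql://user:pass@localhost:5432/docunetic
WATCH_DIRECTORY=/path/to/watch
WATCH_RECURSIVE=true
WATCH_DEBOUNCE=0.5
WATCH_EXTENSIONS=.pdf,.docx
PUBMED_ENABLED=true
PUBMED_EMAIL=you@example.com
```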

Usage

Run the watcher after configuring your environment variables:

export DATABASE_URL="postgresql://user:pass@localhost:5432/docunetic"
export WATCH_DIRECTORY="/path/to/watch"
python -m docunetic

Alternatively, override options via CLI flags:

python -m docunetic --directory ./incoming --extensions .pdf,.docx --debounce 1.5

Press Ctrl+C (SIGINT) to stop the service gracefully.

Model Context Protocol (MCP) server

Docunetic can expose stored conversions through the Model Context Protocol for use with MCP-capable clients. The server is packaged in the repository and can be started after installing the Python dependencies:

pip install -r requirements.txt
export DATABASE_URL="postgresql://user:pass@host:5432/docunetic"
python -m docunetic.mcp_server

The mcp/ directory contains a manifest (mcp/servers/docunetic.json) that clients can load directly. See mcp/README.md for the full list of available tools. When Semantic Kernel credentials are provided, the MCP server also exposes a semantic_search_documents tool that performs embedding-powered retrieval across stored Docling conversions.

Docker image

A Dockerfile is provided to build a container image of the watcher. The image sets WATCH_DIRECTORY=/data by default and requires a DATABASE_URL at runtime.

docker build -t docunetic .
docker run \
  -e DATABASE_URL="postgresql://user:pass@host:5432/docunetic" \
  -e WATCH_DIRECTORY=/data \
  -v $(pwd)/incoming:/data \
  docunetic

You can override environment variables with Docker's `--env`/`-e` flag, and pass CLI flags to the watcher by appending them after the image name when starting the container.

Podman deployment with a local PDF directory

Podman can run the same container image without requiring the Docker daemon. To watch a local directory of PDFs from Podman, mount the host folder into the container at /data (or whichever directory you configure in WATCH_DIRECTORY).

# Build the image (only needed once; Podman understands the Dockerfile syntax)
podman build -t docunetic .

# Launch the watcher against a local directory of PDFs
podman run \
  --rm \
  --name docunetic \
  -e DATABASE_URL="postgresql://user:pass@host:5432/docunetic" \
  -e WATCH_DIRECTORY=/data \
  -v /absolute/path/to/pdfs:/data:Z \
  docunetic

  • /absolute/path/to/pdfs is the host directory containing PDFs (and other supported files) that you want Docunetic to monitor.
  • The :Z suffix relabels the mount for SELinux-enabled hosts. If you are not using SELinux, you can omit it or replace it with :z for shared labels.
  • Adjust the environment variables or add CLI flags after docunetic to change watcher behaviour (e.g., set WATCH_EXTENSIONS or WATCH_DEBOUNCE).

When the container starts, Docunetic will watch the mounted directory for new or updated PDFs and write the converted Docling JSON to the configured PostgreSQL database.

Managed PostgreSQL on Railway

Instructions for provisioning a managed PostgreSQL instance on Railway are available in deploy/railway/README.md. After obtaining the connection string, set DATABASE_URL in your environment or platform configuration and the watcher will create the required schema on start-up.

Database Schema

The service manages a single table docling_documents with the following columns:

  • id: auto-incrementing identifier.
  • file_path: absolute path to the source document (unique).
  • sha256: SHA-256 hash of the file contents.
  • last_modified: filesystem modification timestamp.
  • processed_at: timestamp when the conversion was stored.
  • docling_json: JSONB column containing the Docling representation.
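As an illustration of how the `sha256` and `last_modified` columns relate to a source file, the following is a minimal sketch (a hypothetical helper, not the repository's actual code):

```python
import hashlib
from pathlib import Path


def fingerprint(path: str) -> tuple[str, float]:
    """Return (sha256 hex digest of contents, filesystem mtime) for a file."""
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    return digest, p.stat().st_mtime
```

Comparing the stored hash against a freshly computed one is what lets the watcher skip files whose contents have not actually changed.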

Development Notes

  • The queue-based worker serialises conversions to keep Docling operations thread-safe.
  • Debounce timers reduce redundant work when files are written multiple times in rapid succession.
  • Logging defaults to INFO level; pass --verbose for additional diagnostics.
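The per-file debounce described above can be sketched with `threading.Timer`: each new event for a path cancels the pending timer, so the callback fires only after the file has been quiet for the configured delay. This is a simplified illustration, not the repository's implementation:

```python
import threading


class Debouncer:
    """Coalesce rapid events per key; fire callback once after `delay` quiet seconds."""

    def __init__(self, delay: float, callback):
        self.delay = delay
        self.callback = callback
        self._timers: dict[str, threading.Timer] = {}
        self._lock = threading.Lock()

    def touch(self, key: str) -> None:
        with self._lock:
            # Cancel any pending timer for this key and restart the countdown.
            timer = self._timers.get(key)
            if timer is not None:
                timer.cancel()
            timer = threading.Timer(self.delay, self._fire, args=(key,))
            self._timers[key] = timer
            timer.start()

    def _fire(self, key: str) -> None:
        with self._lock:
            self._timers.pop(key, None)
        self.callback(key)
```

With `delay=0.5` (the `WATCH_DEBOUNCE` default), a document written in several quick bursts triggers a single conversion half a second after the last write.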

Testing

No automated tests are included yet. The watcher interacts with external processes (filesystem events and PostgreSQL), so integration testing typically requires provisioning those services.
