This repository provides a small Python service that watches a directory for new or updated documents, converts them to Docling JSON, and stores the resulting representation in a PostgreSQL database.
- Watches a directory (recursively by default) for changes using watchdog.
- Converts supported documents via Docling.
- Debounces rapid file modifications to avoid duplicate conversions.
- Stores Docling JSON, file hashes, and timestamps in PostgreSQL.
- Provides a CLI with environment variable and command-line configuration.
- Optional semantic search integration using Semantic Kernel.
- Background PubMed enrichment that augments extracted references with metadata.
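The debouncing described above can be sketched as a per-file timer that resets on every event, so only the last write in a burst triggers a conversion. This is an illustrative sketch, not the project's actual implementation:

```python
import threading


class Debouncer:
    """Collapse bursts of events per key into a single callback."""

    def __init__(self, delay, callback):
        self.delay = delay        # seconds to wait after the last event
        self.callback = callback  # invoked once the burst settles
        self._timers = {}
        self._lock = threading.Lock()

    def trigger(self, key):
        with self._lock:
            # Cancel any pending timer for this file and start a fresh one,
            # so only the final event in a rapid burst causes a conversion.
            timer = self._timers.get(key)
            if timer is not None:
                timer.cancel()
            timer = threading.Timer(self.delay, self._fire, args=(key,))
            self._timers[key] = timer
            timer.start()

    def _fire(self, key):
        with self._lock:
            self._timers.pop(key, None)
        self.callback(key)
```

In this sketch each filesystem event would call `trigger(path)`, and the conversion callback fires only once per burst, after `delay` seconds of quiet.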
- Python 3.11+
- PostgreSQL 13+
- The `docling` Python package and its system dependencies. Installing Docling will download sizeable machine learning models (PyTorch), so budget sufficient disk space and time.
Create and activate a virtual environment, then install dependencies:
```bash
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
```

Docling can require additional system libraries depending on which features you use (OCR, vision models, etc.). Consult the Docling documentation for platform-specific guidance.
The application is primarily configured through environment variables:
| Variable | Description | Required | Default |
|---|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string (e.g. `postgresql://user:pass@localhost/db`). | ✅ | — |
| `WATCH_DIRECTORY` | Directory to monitor for document changes. | ✅ | — |
| `WATCH_RECURSIVE` | Watch subdirectories (`true`/`false`). | ❌ | `true` |
| `WATCH_DEBOUNCE` | Seconds to wait before converting a touched file. | ❌ | `0.5` |
| `WATCH_EXTENSIONS` | Comma-separated list of file extensions (e.g. `.pdf,.docx`). | ❌ | `.pdf,.docx,.pptx,.xlsx,.rtf,.txt,.md` |
| `SEMANTIC_SEARCH_PROVIDER` | Provider for semantic search embeddings (`openai` or `azure-openai`). | ❌ | `openai` |
| `SEMANTIC_SEARCH_MODEL` | Embedding model identifier (defaults to a provider-specific value). | ❌ | `text-embedding-3-small` |
| `SEMANTIC_SEARCH_MAX_DOCUMENTS` | Number of documents to embed per semantic query. | ❌ | `200` |
| `SEMANTIC_SEARCH_MAX_DOCUMENT_CHARS` | Maximum characters to embed from each document. | ❌ | `4000` |
| `PUBMED_ENABLED` | Enable or disable PubMed enrichment (`true`/`false`). | ❌ | `true` |
| `PUBMED_EMAIL` | Contact email sent to the E-utilities API (recommended by NCBI). | ❌ | — |
| `PUBMED_API_KEY` | Optional NCBI API key to increase rate limits. | ❌ | — |
| `PUBMED_MAX_RESULTS` | Maximum PubMed matches to store per reference. | ❌ | `3` |
| `PUBMED_TIMEOUT` | Timeout, in seconds, for PubMed HTTP requests. | ❌ | `10.0` |
| `PUBMED_REQUEST_INTERVAL` | Minimum seconds between PubMed API calls (throttling). | ❌ | `0.34` |
You can also provide overrides via CLI arguments.
An example .env file is provided in .env.example to document these settings.
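As a quick orientation, a minimal `.env` along these lines covers the required settings (all values here are placeholders; see `.env.example` in the repository for the authoritative template):

```ini
# Required
DATABASE_URL=postgresql://user:pass@localhost:5432/docunetic
WATCH_DIRECTORY=/path/to/watch

# Optional watcher tuning
WATCH_RECURSIVE=true
WATCH_DEBOUNCE=0.5
WATCH_EXTENSIONS=.pdf,.docx

# Optional PubMed enrichment
PUBMED_ENABLED=true
PUBMED_EMAIL=you@example.org
```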
Run the watcher after configuring your environment variables:
```bash
export DATABASE_URL="postgresql://user:pass@localhost:5432/docunetic"
export WATCH_DIRECTORY="/path/to/watch"
python -m docunetic
```

Alternatively, override options via CLI flags:

```bash
python -m docunetic --directory ./incoming --extensions .pdf,.docx --debounce 1.5
```

Press Ctrl+C (SIGINT) to stop the service gracefully.
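The layering of CLI flags over environment variables can be sketched with argparse defaults drawn from the environment. This is a hypothetical stand-alone sketch of the pattern; the project's actual CLI module may differ:

```python
import argparse
import os


def build_parser():
    """CLI flags override environment variables, which override hard-coded defaults."""
    parser = argparse.ArgumentParser(prog="docunetic")
    parser.add_argument(
        "--directory",
        default=os.environ.get("WATCH_DIRECTORY"),
        help="Directory to watch (env: WATCH_DIRECTORY)",
    )
    parser.add_argument(
        "--debounce",
        type=float,
        default=float(os.environ.get("WATCH_DEBOUNCE", "0.5")),
        help="Seconds to wait before converting (env: WATCH_DEBOUNCE)",
    )
    parser.add_argument(
        "--extensions",
        default=os.environ.get("WATCH_EXTENSIONS", ".pdf,.docx"),
        help="Comma-separated extensions (env: WATCH_EXTENSIONS)",
    )
    return parser
```

With this arrangement, `--debounce 1.5` on the command line wins over `WATCH_DEBOUNCE=0.5` in the environment, which in turn wins over the built-in default.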
Docunetic can expose stored conversions through the Model Context Protocol for use with MCP-capable clients. The server is packaged in the repository and can be started after installing the Python dependencies:
```bash
pip install -r requirements.txt
export DATABASE_URL="postgresql://user:pass@host:5432/docunetic"
python -m docunetic.mcp_server
```

The `mcp/` directory contains a manifest (`mcp/servers/docunetic.json`) that clients can load directly. See `mcp/README.md` for the full list of available tools.
When Semantic Kernel credentials are provided, the MCP server also exposes a
semantic_search_documents tool that performs embedding-powered retrieval across
stored Docling conversions.
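Embedding-powered retrieval of this kind typically ranks stored documents by cosine similarity between a query embedding and each document embedding. A minimal, provider-agnostic sketch of the ranking step (the embedding call itself is assumed to come from the configured provider):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_documents(query_vec, doc_vecs, top_k=5):
    """Return (doc_id, score) pairs sorted by descending similarity."""
    scored = [
        (doc_id, cosine_similarity(query_vec, vec))
        for doc_id, vec in doc_vecs.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)[:top_k]
```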
A Dockerfile is provided to build a container image of the watcher. The image
sets WATCH_DIRECTORY=/data by default and requires a DATABASE_URL at runtime.
```bash
docker build -t docunetic .
docker run \
  -e DATABASE_URL="postgresql://user:pass@host:5432/docunetic" \
  -e WATCH_DIRECTORY=/data \
  -v $(pwd)/incoming:/data \
  docunetic
```

You can override environment variables with Docker's `--env` flag, or append CLI flags after the image name, when starting the container.
Podman can run the same container image without requiring the Docker daemon.
To watch a local directory of PDFs from Podman, mount the host folder into the
container at /data (or whichever directory you configure in WATCH_DIRECTORY).
```bash
# Build the image (only needed once; Podman understands the Dockerfile syntax)
podman build -t docunetic .

# Launch the watcher against a local directory of PDFs
podman run \
  --rm \
  --name docunetic \
  -e DATABASE_URL="postgresql://user:pass@host:5432/docunetic" \
  -e WATCH_DIRECTORY=/data \
  -v /absolute/path/to/pdfs:/data:Z \
  docunetic
```

- `/absolute/path/to/pdfs` is the host directory containing PDFs (and other supported files) that you want Docunetic to monitor.
- The `:Z` suffix relabels the mount for SELinux-enabled hosts. If you are not using SELinux, you can omit it or replace it with `:z` for shared labels.
- Adjust the environment variables or add CLI flags after `docunetic` to change watcher behaviour (e.g., set `WATCH_EXTENSIONS` or `WATCH_DEBOUNCE`).
When the container starts, Docunetic will watch the mounted directory for new or updated PDFs and write the converted Docling JSON to the configured PostgreSQL database.
Instructions for provisioning a managed PostgreSQL instance on Railway are
available in deploy/railway/README.md. After obtaining the connection string,
set DATABASE_URL in your environment or platform configuration and the watcher
will create the required schema on start-up.
The service manages a single table `docling_documents` with the following columns:

- `id`: auto-incrementing identifier.
- `file_path`: absolute path to the source document (unique).
- `sha256`: SHA-256 hash of the file contents.
- `last_modified`: filesystem modification timestamp.
- `processed_at`: timestamp when the conversion was stored.
- `docling_json`: JSONB column containing the Docling representation.
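The `sha256` and `last_modified` columns support change detection: a file only needs reconverting when its content hash differs from the stored value. A sketch of computing such a hash, streaming the file so large documents are not loaded fully into memory:

```python
import hashlib
from pathlib import Path


def file_sha256(path):
    """Return the hex SHA-256 digest of a file's contents, read in chunks."""
    digest = hashlib.sha256()
    with Path(path).open("rb") as handle:
        # Read 64 KiB at a time; iter() stops at the empty-bytes sentinel (EOF).
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()
```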
- The queue-based worker serialises conversions to keep Docling operations thread-safe.
- Debounce timers reduce redundant work when files are written multiple times in rapid succession.
- Logging defaults to INFO level; pass `--verbose` for additional diagnostics.
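The queue-based serialisation mentioned above can be sketched as a single worker thread draining a `queue.Queue`, so only one conversion runs at a time. This is an illustrative sketch; the project's actual worker may differ:

```python
import queue
import threading


def start_worker(convert):
    """Run `convert` on queued paths one at a time on a dedicated thread."""
    jobs = queue.Queue()

    def loop():
        while True:
            path = jobs.get()
            if path is None:   # sentinel: shut the worker down
                break
            convert(path)      # only one conversion is ever in flight
            jobs.task_done()

    thread = threading.Thread(target=loop, daemon=True)
    thread.start()
    return jobs, thread
```

Filesystem event handlers would only ever call `jobs.put(path)`, keeping all Docling calls confined to the single worker thread.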
No automated tests are included yet. The watcher depends on external systems (filesystem events and PostgreSQL), so integration testing typically requires provisioning those services.