digitalnelson/docunetic
Docunetic

This repository provides a small Python service that watches a directory for new or updated documents, converts them to Docling JSON, and stores the resulting representation in a PostgreSQL database.

Features

  • Watches a directory (recursively by default) for changes using watchdog.
  • Converts supported documents via Docling.
  • Debounces rapid file modifications to avoid duplicate conversions.
  • Stores Docling JSON, file hashes, and timestamps in PostgreSQL.
  • Provides a CLI with environment variable and command-line configuration.
  • Optional semantic search integration using Semantic Kernel.
  • Background PubMed enrichment that augments extracted references with metadata.

Requirements

  • Python 3.11+
  • PostgreSQL 13+
  • The docling Python package and its system dependencies. Installing Docling will download sizeable machine learning models (PyTorch), so budget sufficient disk space and time.

Installation

Create and activate a virtual environment, then install dependencies:

python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

Docling can require additional system libraries depending on which features you use (OCR, vision models, etc.). Consult the Docling documentation for platform-specific guidance.

Configuration

The application is primarily configured through environment variables:

| Variable | Description | Required | Default |
|---|---|---|---|
| `DATABASE_URL` | PostgreSQL connection string (e.g. `postgresql://user:pass@localhost/db`). | Yes | — |
| `WATCH_DIRECTORY` | Directory to monitor for document changes. | Yes | — |
| `WATCH_RECURSIVE` | Watch subdirectories (`true`/`false`). | No | `true` |
| `WATCH_DEBOUNCE` | Seconds to wait before converting a touched file. | No | `0.5` |
| `WATCH_EXTENSIONS` | Comma-separated list of file extensions (e.g. `.pdf,.docx`). | No | `.pdf,.docx,.pptx,.xlsx,.rtf,.txt,.md` |
| `SEMANTIC_SEARCH_PROVIDER` | Provider for semantic search embeddings (`openai` or `azure-openai`). | No | `openai` |
| `SEMANTIC_SEARCH_MODEL` | Embedding model identifier (defaults to a provider-specific sensible value). | No | `text-embedding-3-small` |
| `SEMANTIC_SEARCH_MAX_DOCUMENTS` | Number of documents to embed per semantic query. | No | `200` |
| `SEMANTIC_SEARCH_MAX_DOCUMENT_CHARS` | Maximum characters to embed from each document. | No | `4000` |
| `PUBMED_ENABLED` | Enable or disable PubMed enrichment (`true`/`false`). | No | `true` |
| `PUBMED_EMAIL` | Contact email sent to the E-utilities API (recommended by NCBI). | No | — |
| `PUBMED_API_KEY` | Optional NCBI API key to increase rate limits. | No | — |
| `PUBMED_MAX_RESULTS` | Maximum PubMed matches to store per reference. | No | `3` |
| `PUBMED_TIMEOUT` | Timeout, in seconds, for PubMed HTTP requests. | No | `10.0` |
| `PUBMED_REQUEST_INTERVAL` | Minimum seconds between PubMed API calls (throttling). | No | `0.34` |

You can also provide overrides via CLI arguments.

An example .env file is provided in .env.example to document these settings.
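Based on the table above, a minimal `.env` might look like the following (the values are illustrative placeholders, not defaults from the repository's `.env.example`):

```
DATABASE_URL=postgresql://user:pass@localhost:5432/docunetic
WATCH_DIRECTORY=/path/to/watch
WATCH_RECURSIVE=true
WATCH_DEBOUNCE=0.5
WATCH_EXTENSIONS=.pdf,.docx
PUBMED_ENABLED=true
PUBMED_EMAIL=you@example.com
```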

Usage

Run the watcher after configuring your environment variables:

export DATABASE_URL="postgresql://user:pass@localhost:5432/docunetic"
export WATCH_DIRECTORY="/path/to/watch"
python -m docunetic

Alternatively, override options via CLI flags:

python -m docunetic --directory ./incoming --extensions .pdf,.docx --debounce 1.5

Press Ctrl+C (SIGINT) to stop the service gracefully.

Model Context Protocol (MCP) server

Docunetic can expose stored conversions through the Model Context Protocol for use with MCP-capable clients. The server is packaged in the repository and can be started after installing the Python dependencies:

pip install -r requirements.txt
export DATABASE_URL="postgresql://user:pass@host:5432/docunetic"
python -m docunetic.mcp_server

The mcp/ directory contains a manifest (mcp/servers/docunetic.json) that clients can load directly. See mcp/README.md for the full list of available tools. When Semantic Kernel credentials are provided, the MCP server also exposes a semantic_search_documents tool that performs embedding-powered retrieval across stored Docling conversions.

Docker image

A Dockerfile is provided to build a container image of the watcher. The image sets WATCH_DIRECTORY=/data by default and requires a DATABASE_URL at runtime.

docker build -t docunetic .
docker run \
  -e DATABASE_URL="postgresql://user:pass@host:5432/docunetic" \
  -e WATCH_DIRECTORY=/data \
  -v $(pwd)/incoming:/data \
  docunetic

You can override environment variables with Docker's `--env`/`-e` flag, and pass CLI flags to the watcher by appending them after the image name when starting the container.

Podman deployment with a local PDF directory

Podman can run the same container image without requiring the Docker daemon. To watch a local directory of PDFs from Podman, mount the host folder into the container at /data (or whichever directory you configure in WATCH_DIRECTORY).

# Build the image (only needed once; Podman understands the Dockerfile syntax)
podman build -t docunetic .

# Launch the watcher against a local directory of PDFs
podman run \
  --rm \
  --name docunetic \
  -e DATABASE_URL="postgresql://user:pass@host:5432/docunetic" \
  -e WATCH_DIRECTORY=/data \
  -v /absolute/path/to/pdfs:/data:Z \
  docunetic

  • /absolute/path/to/pdfs is the host directory containing PDFs (and other supported files) that you want Docunetic to monitor.
  • The :Z suffix relabels the mount for SELinux-enabled hosts. If you are not using SELinux, you can omit it or replace it with :z for shared labels.
  • Adjust the environment variables or add CLI flags after docunetic to change watcher behaviour (e.g., set WATCH_EXTENSIONS or WATCH_DEBOUNCE).

When the container starts, Docunetic will watch the mounted directory for new or updated PDFs and write the converted Docling JSON to the configured PostgreSQL database.

Managed PostgreSQL on Railway

Instructions for provisioning a managed PostgreSQL instance on Railway are available in deploy/railway/README.md. After obtaining the connection string, set DATABASE_URL in your environment or platform configuration and the watcher will create the required schema on start-up.

Database Schema

The service manages a single table docling_documents with the following columns:

  • id: auto-incrementing identifier.
  • file_path: absolute path to the source document (unique).
  • sha256: SHA-256 hash of the file contents.
  • last_modified: filesystem modification timestamp.
  • processed_at: timestamp when the conversion was stored.
  • docling_json: JSONB column containing the Docling representation.
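As an illustration of how the `sha256` and `last_modified` columns relate to a source file, the following is a minimal sketch (a hypothetical helper, not the repository's actual code):

```python
import hashlib
from pathlib import Path


def fingerprint(path: str) -> tuple[str, float]:
    """Return (sha256 hex digest of contents, filesystem mtime) for a file."""
    p = Path(path)
    digest = hashlib.sha256(p.read_bytes()).hexdigest()
    return digest, p.stat().st_mtime
```

Comparing the stored hash against a freshly computed one is what lets the watcher skip files whose contents have not actually changed.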

Development Notes

  • The queue-based worker serialises conversions to keep Docling operations thread-safe.
  • Debounce timers reduce redundant work when files are written multiple times in rapid succession.
  • Logging defaults to INFO level; pass --verbose for additional diagnostics.
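The per-file debounce described above can be sketched with `threading.Timer`: each new event for a path cancels the pending timer, so the callback fires only after the file has been quiet for the configured delay. This is a simplified illustration, not the repository's implementation:

```python
import threading


class Debouncer:
    """Coalesce rapid events per key; fire callback once after `delay` quiet seconds."""

    def __init__(self, delay: float, callback):
        self.delay = delay
        self.callback = callback
        self._timers: dict[str, threading.Timer] = {}
        self._lock = threading.Lock()

    def touch(self, key: str) -> None:
        with self._lock:
            # Cancel any pending timer for this key and restart the countdown.
            timer = self._timers.get(key)
            if timer is not None:
                timer.cancel()
            timer = threading.Timer(self.delay, self._fire, args=(key,))
            self._timers[key] = timer
            timer.start()

    def _fire(self, key: str) -> None:
        with self._lock:
            self._timers.pop(key, None)
        self.callback(key)
```

With `delay=0.5` (the `WATCH_DEBOUNCE` default), a document written in several quick bursts triggers a single conversion half a second after the last write.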

Testing

No automated tests are included yet. The watcher interacts with external processes (filesystem events and PostgreSQL), so integration testing typically requires provisioning those services.
