
droher/boxball


Update: I have released a new project, baseball.computer, designed as the successor to Boxball. It is much easier to use (no Docker required; it runs entirely in your browser or program) and includes many more tables, features, and quality controls. The event schema is different, which will be the main migration pain point. I aim to continue Boxball maintenance and updates as long as people are still using it, and I may try to rebase Boxball on top of the new project to make maintaining both easier. If there are things you can do in Boxball that you can't yet do in baseball.computer, please let me know by filing an issue on the repo or reaching me at david.roher@baseball.computer.

Introduction

Boxball creates prepopulated databases of the two most significant open source baseball datasets: Retrosheet and the Baseball Databank. Retrosheet contains information on every major-league pitch since 2000, every play since 1928, every box score since 1901, and every game since 1871. The Databank (based on the Lahman Database) contains yearly summaries for every player and team in history. In addition to the data and databases themselves, Boxball relies on the following tools:

  • Docker for repeatable builds and easy distribution
  • SQLAlchemy for abstracting away DDL differences between databases
  • Chadwick for translating Retrosheet's complex event files into a relational format

Follow the instructions below to install your distribution of choice. The full set of images is also available on Docker Hub.

The Retrosheet schema is extensively documented in the code; see the source here until I find a prettier solution.

If you find the project useful, please consider donating.

Feel free to contact me with questions or comments!

Requirements

  • Docker (v18.06, earlier versions may not work)
  • 2-20GB Disk space (depends on distribution choice)
  • 500MB-8GB RAM available to Docker (depends on distribution choice)

Distributions

Column-Oriented Databases

Postgres columnar (Recommended)

This distribution uses Citus' native USING columnar table access method to turn PostgreSQL into a column-oriented database. This means you get the rich feature set of Postgres, but with a large improvement in query speed and disk usage on the wide play-by-play tables. Citus columnar replaces the older cstore_fdw extension that this project used to ship; see ADR 0001 for the rationale. To install and run the database server:

docker run --name postgres-columnar -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres-columnar:/var/lib/postgresql/data doublewick/boxball:postgres-columnar-latest

Roughly an hour after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user postgres with password postgres on port 5433 (either using the psql command line tool or a database client of your choice). The data will be persisted on your machine in ~/boxball/postgres-columnar (~1.5GB), which means you can stop/remove the container without having to reload the data when you turn it back on.
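The connection parameters above can be expressed as a standard Postgres connection URL, usable with psql, SQLAlchemy, or psycopg. This is a minimal stdlib sketch; the database name `postgres` is the image default assumed here:

```python
# Build a Postgres connection URL for the postgres-columnar container
# started above (user/password/port come from the docker run command).
from urllib.parse import quote_plus

def pg_url(user="postgres", password="postgres", host="localhost",
           port=5433, dbname="postgres"):
    """Return a libpq/SQLAlchemy-style connection URL."""
    return f"postgresql://{user}:{quote_plus(password)}@{host}:{port}/{dbname}"

print(pg_url())  # postgresql://postgres:postgres@localhost:5433/postgres
```

Pass the resulting URL to `psql "<url>"` on the command line or to `sqlalchemy.create_engine(<url>)` in Python.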

Note: Citus columnar tables are append-only — UPDATE, DELETE, and foreign keys are not supported on them. Use the plain postgres target if you need a mutable copy of the data.

arm64 note: This image is amd64-only. Citus does not publish arm64 packages, so on arm64 hosts (Apple Silicon, AWS Graviton) the image runs under emulation (slow) — or use the plain postgres target for native arm64.

Clickhouse

Clickhouse is a database developed by Yandex with some very impressive performance benchmarks. It uses less disk space than Postgres columnar, but significantly more RAM (~5GB). I've yet to run any query performance comparisons. To install and run the database server:

docker run --name clickhouse -d -p 8123:8123 -v ~/boxball/clickhouse:/var/lib/clickhouse doublewick/boxball:clickhouse-latest

15-30 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect either by attaching to the container and using the clickhouse-client CLI or by using a local database client on port 8123 as the user default. The data will be persisted on your machine in ~/boxball/clickhouse (~700MB), which means you can stop/remove the container without having to reload the data when you turn it back on.
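Port 8123 is ClickHouse's HTTP interface, which accepts SQL directly as a URL parameter. A minimal stdlib sketch of building such a request (the URL shape follows ClickHouse's documented HTTP interface; no table names are assumed):

```python
# Build a URL for ClickHouse's HTTP interface; GET works for read-only queries.
from urllib.parse import urlencode

def ch_http_url(sql, host="localhost", port=8123, user="default"):
    """Return an HTTP-interface URL carrying the query and user parameters."""
    params = urlencode({"query": sql, "user": user})
    return f"http://{host}:{port}/?{params}"

print(ch_http_url("SELECT 1"))  # http://localhost:8123/?query=SELECT+1&user=default
```

With the container running, `urllib.request.urlopen(ch_http_url("SELECT 1")).read()` fetches the result; most HTTP clients and BI tools can hit the same endpoint.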

Traditional (Row-oriented) Databases

Note: these frameworks are likely to be prohibitively slow when querying play-by-play data, and they take up significantly more disk space than their columnar counterparts.

Postgres

Similar configuration to the columnar version above, but stored in the conventional row-oriented way (and supports UPDATE/DELETE/foreign keys).

docker run --name postgres -d -p 5432:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres:/var/lib/postgresql/data doublewick/boxball:postgres-latest

Roughly 90 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user postgres with password postgres on port 5432 (either using the psql command line tool or a database client of your choice). The data will be persisted on your machine in ~/boxball/postgres (~12GB), which means you can stop/remove the container without having to reload the data when you turn it back on.

MySQL

To install and run:

docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-latest

Roughly two hours after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the user root on port 3306. The data will be persisted on your machine in ~/boxball/mysql (~12GB), which means you can stop/remove the container without having to reload the data when you turn it back on.

SQLite (with web UI)

To install and run:

docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-latest

Roughly two minutes after the image is downloaded, the data will be fully loaded into the database. localhost:8080 will provide a web UI where you can write queries and perform schema exploration.
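Because the database is persisted as a plain SQLite file in the mounted volume, you can also query it directly with Python's stdlib driver instead of the web UI. The exact filename under ~/boxball/sqlite is an assumption to check on your machine; the sketch below uses an in-memory stand-in with a toy table so it runs without the container:

```python
import sqlite3

# Replace ":memory:" with the actual .db file under ~/boxball/sqlite.
conn = sqlite3.connect(":memory:")

# Toy table standing in for the real schema (names are illustrative).
conn.execute("CREATE TABLE batting (year INTEGER, hr INTEGER)")
conn.executemany("INSERT INTO batting VALUES (?, ?)", [(1927, 60), (1961, 61)])

# Schema exploration, equivalent to what the web UI offers:
tables = [r[0] for r in conn.execute(
    "SELECT name FROM sqlite_master WHERE type='table'")]
print(tables)  # ['batting']

best = conn.execute(
    "SELECT year FROM batting ORDER BY hr DESC LIMIT 1").fetchone()
print(best[0])  # 1961
```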

Flat File Downloads

Parquet

Parquet is a columnar data format originally developed for the Hadoop ecosystem. It has solid support in Spark, Pandas, and many other frameworks. Download: OneDrive

CSV

The original CSVs from the extract step (each CSV file is compressed in the ZSTD format). Download: OneDrive

Development

The host-side dev workflow uses uv for environment management:

uv sync                      # provision dev env from pyproject.toml + uv.lock
uv run pytest --cov          # tests
uv run ruff check .          # lint
uv run basedpyright          # type check (advisory; baseline-only)
make ci                      # full CI pipeline locally via act

Local builds

The full chain (extract → transform → every load/<target>) is built single-platform for amd64. amd64 matches CI (ubuntu-latest GitHub runners) and the published artifact for postgres-columnar (Citus has no arm64 packages). Pinning amd64 across the whole chain keeps the local image store coherent, so each downstream FROM doublewick/boxball:<stage>-${VERSION} resolves locally instead of falling back to the registry's previous-release tag. arm64 hosts (Apple Silicon, Graviton) emulate amd64 via QEMU, which is slow but coherent.

BUILD_ENV=test make build-local      # full chain, fixture data (~CI smoke)
make build-local                     # full chain, real data (long)
docker compose build postgres        # single target, host-arch (native arm64 OK)

make build-local exports DOCKER_DEFAULT_PLATFORM=linux/amd64 and runs three serialized waves:

docker compose build extract
docker compose build ddl parquet csv
docker compose build clickhouse postgres postgres-columnar mysql sqlite

Serialization matters: docker compose build (no args) builds services in parallel, which races — downstream FROM <tag> lookups beat the upstream tag landing in the local image store, so buildx falls back to the registry.

Multi-arch (release)

Multi-arch images ship from docker-bake.hcl, layered on top of docker-compose.yml. bake overrides platforms per target (most are linux/amd64 + linux/arm64; postgres-columnar stays amd64-only). Compose itself stays single-platform because buildx cannot --load a manifest list: multi-arch must --push to a registry.

Local dry-run (validates the bake graph; no build):

make bake-print

Multi-arch push (release pipeline; needs docker login against $REPO):

make buildx-bootstrap    # one-time: create docker-container builder "boxball"
make bake-push           # builds + pushes manifest lists for every target

Until PLE-358 wires bake-push into a release workflow, multi-arch validation is via docker manifest inspect on bases + make bake-print. CI's e2e-test job validates the host-arch (amd64) chain end-to-end via make build-local.

Build-time logging

Pipeline scripts log via stdlib logging (configured by extract/parsers/_logging.py and transform/src/_logging.py). The extract, parquet, and ddl Dockerfiles default to BOXBALL_LOG_LEVEL=INFO and tag every log line with a stage label (BOXBALL_STAGE). Override the level from the host shell — Compose forwards it as a build arg:

BOXBALL_LOG_LEVEL=DEBUG docker compose build extract
BOXBALL_LOG_LEVEL=WARNING docker compose build parquet ddl
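A stage-tagged, environment-driven setup like the one described above can be sketched with stdlib logging alone. This is illustrative only, not the project's actual _logging.py; only the BOXBALL_LOG_LEVEL and BOXBALL_STAGE variable names come from the docs:

```python
import logging
import os

def configure_logging(stage: str) -> logging.Logger:
    """Configure stdlib logging: level from BOXBALL_LOG_LEVEL (default INFO),
    with the stage label baked into every log line."""
    level = os.environ.get("BOXBALL_LOG_LEVEL", "INFO").upper()
    handler = logging.StreamHandler()
    handler.setFormatter(logging.Formatter(
        f"%(asctime)s [{stage}] %(levelname)s %(name)s: %(message)s"))
    logger = logging.getLogger("boxball")
    logger.handlers[:] = [handler]  # replace any previously attached handlers
    logger.setLevel(level)
    return logger

log = configure_logging(stage=os.environ.get("BOXBALL_STAGE", "extract"))
log.info("parsing event files")   # emitted at INFO and below
log.debug("row-level details")    # only emitted when BOXBALL_LOG_LEVEL=DEBUG
```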

Acknowledgements

Ted Turocy's Chadwick Bureau developed the tools and repos that made this project possible. I am also grateful to Sean Lahman for creating his database, which I have been using for over 15 years. I was able to develop and host this project for free thanks to the generous open-source plans of JetBrains, GitHub, and Docker Hub.

Retrosheet represents the collective effort of thousands of baseball fans over 150 years of scorekeeping and data entry. I hope Boxball facilitates more historical research to continue this tradition.

License(s)

All code is released under the Apache 2.0 license. Baseball Databank data is distributed under the CC-SA 4.0 license. Retrosheet data is released under the condition that the below text appear prominently:

The information used here was obtained free of
charge from and is copyrighted by Retrosheet.  Interested
parties may contact Retrosheet at "www.retrosheet.org".
