- Notifications
You must be signed in to change notification settings - Fork18
Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.
License
droher/boxball
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Update: I have released a new project,baseball.computer, which is designedas the successor to boxball. It is much easier to use (no Docker required, runs entirely in your browser/program)and includes many more tables, features, and quality controls. The event schema is different, which will be the main migration pain point.I aim to continue Boxball maintenence and updates as long as people are still using it, and I may try to rebaseboxball on top of the new project to make maintaining both easier. Please let me know if there are things you can do in Boxball that you can't do yet in baseball.computer by filing an issue on therepo or reaching me atdavid.roher@baseball.computer.
Boxball creates prepopulated databases of the two most significant open source baseball datasets:Retrosheet and theBaseball Databank.Retrosheet contains information on every major-league pitch since 2000, every play since 1928,every box score since 1901, and every game since 1871.The Databank (based on theLahman Database) contains yearlysummaries for every player and team in history. In addition to the data and databases themselves, Boxball relies on the following tools:
- Docker for repeatable builds and easy distribution
- SQLAlchemy for abstracting away DDL differences between databases
- Chadwick for translating Retrosheet's complex event files into a relational format
Follow the instructions below to install your distribution of choice. The full set of images is also available onDocker Hub.
The Retrosheet schema is extensively documented in the code; see the sourcehereuntil I find a prettier solution.
If you find the project useful, please consider donating to:
- TheAli Forney Center for homeless LGBTQ youth
- 350.org, a grassroots international climate change organization
Feel free tocontact me with questions or comments!
- Docker (v18.06, earlier versions may not work)
- 2-20GB Disk space (depends on distribution choice)
- 500MB-8GB RAM available to Docker (depends on distribution choice)
This distribution uses thecstore_fdw extension to turn PostgreSQLinto a column-oriented database. This means that you get the rich featureset of Postgres,but with a huge improvement in speed and disk usage. To install and run the database server:
docker run --name postgres-cstore-fdw -d -p 5433:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres-cstore-fdw:/var/lib/postgresql/data doublewick/boxball:postgres-cstore-fdw-latest
Roughly an hour after the image is downloaded, the data will be fully loaded into the database, and you can connect to it as the userpostgres
with passwordpostgres
on port5433
(either using thepsql
command line tool or a database client of your choice). The data will be persisted on your machine in~/boxball/postgres-cstore-fdw
(~1.5GB), which means you can stop/remove the container without having to reload the datawhen you turn it back on.
Clickhouse is a database developed by Yandex with some very impressive performance benchmarks. It uses lessdisk space than Postgres cstore_fdw, but significantly more RAM (~5GB). I've yet to run any query performance comparisons.To install and run the database server:
docker run --name clickhouse -d -p 8123:8123 -v ~/boxball/clickhouse:/var/lib/clickhouse doublewick/boxball:clickhouse-latest
15-30 minutes after the image is downloaded, the data will be fully loaded into the database, and you can connect to it either by attaching thecontainer and using theclickhouse-client
CLI or by using a local database client on port8123
as the userdefault
.The data will be persisted on your machine in~/boxball/clickhouse
(~700MB), which means you can stop/remove the container without having to reload the datawhen you turn it back on.
Drill is a framework that allows for SQL queries directly on files, without having to declare any schema.It is usually used on a computing cluster with massive datasets, but we use a single-node setup. To install and run:
docker run --name drill -id -p 8047:8047 -p 31010:31010 -v ~/boxball/drill:/data doublewick/boxball:drill-latest
Data will be immediately available to query after the image is downloaded. Use port8047
to access the Web UI(which includes a SQL runner) and port31010
to connect via a database client.You may also attach the container and query from the command line.The data will be persisted on your machine in~/boxball/drill
(~700MB).
Note: these frameworks are likely to be prohibitively slow when querying play-by-play data, and they take up significantlymore disk space than their columnar counterparts.
Similar configuration to the cstore_fdw extended version above, but stored in the conventional way.
docker run --name postgres -d -p 5432:5432 -e POSTGRES_PASSWORD="postgres" -v ~/boxball/postgres:/var/lib/postgresql/data doublewick/boxball:postgres-latest
Roughly 90 minutes after the image is downloaded, the data will be fully loaded into the database,and you can connect to it as the userpostgres
with passwordpostgres
on port5432
(either using thepsql
command line tool or a database client of your choice). The data will be persisted on your machine in~/boxball/postgres
(~12GB), which means you can stop/remove the container without having to reload the datawhen you turn it back on.
To install and run:
docker run --name mysql -d -p 3306:3306 -v ~/boxball/mysql:/var/lib/mysql doublewick/boxball:mysql-latest
Roughly two hours after the image is downloaded, the data will be fully loaded into the database,and you can connect to it as the userroot
on port3306
. The data will be persisted on your machine in~/boxball/mysql
(~12GB), which means you can stop/remove the container without having to reload the datawhen you turn it back on.
To install and run:
docker run --name sqlite -d -p 8080:8080 -v ~/boxball/sqlite:/db doublewick/boxball:sqlite-latest
Roughly two minutes after the image is downloaded, the data will be fully loaded into the database.localhost:8080
will provide aweb UI where you can write queries and perform schema exploration.
Parquet is a columnar data format originally developed for the Hadoop ecosystem. It has solid support in Spark, Pandas,and many other frameworks.OneDrive
The original CSVs from the extract step (each CSV file is compressed in the ZSTD format).OneDrive
Ted Turocy'sChadwick Bureau developed the tools and repos that made this project possible. I am also grateful toSeanLahman for creating his database, which I have been using for over 15 years. I was ableto develop and host this project for free thanks to the generous open-source plans ofJetbrains, CircleCI, Github, and Docker Hub.
Retrosheet represents the collective effort of thousands of baseball fans over 150 years of scorekeeping and data entry.I hope Boxball facilitates more historical research to continue this tradition.
All code is released under the Apache 2.0 license. Baseball Databank data is distributed under theCC-SA 4.0license. Retrosheet data is released under the condition that the below text appear prominently:
The information used here was obtained free ofcharge from and is copyrighted by Retrosheet. Interestedparties may contact Retrosheet at "www.retrosheet.org".
About
Prebuilt Docker images with Retrosheet's complete baseball history data for many analytical frameworks. Includes Postgres, cstore_fdw, MySQL, SQLite, Clickhouse, Drill, Parquet, and CSV.