Knowledge graph of the English language between 1800 and 2019 using open source data, Python 3, and ffmpeg.
GraphTechnologyDevelopers/english-words-knowledge-graph
Watch the entire English language blossom from Wiktionary + Google Books N-grams, rendered as a living, breathing prefix galaxy.
- Zero-config takeover – `./setup.sh` spins up the virtualenv, fetches every dataset, caches the heavy lifts, and ships final MP4/GIF output.
- Radial growth cinematics – the trie erupts from the core alphabet, framing decades of linguistic evolution as a neon fractal.
- Repeatable science – every artifact (lemmata, first-year inference, trie counts, layouts) checkpoints to disk and into a reusable tarball for instant re-renders.
- Battle-tested – streams 26 full 1-gram shards, handles 1.4GB Wiktionary dumps, and renders 220 frames in glorious 1080p.
Share it, remix it, drop it in your next data-viz thread.

    cd /Users/grey/Projects/graph-visualizations
    bash setup.sh

The script will:
- Create/upgrade `venv/` with Python 3.
- Download Wiktionary + Google Books 1-gram shards (`a`–`z`).
- Extract English lemmas, infer first-use years, aggregate prefix counts.
- Render 220 radial frames (`outputs/frames/frame-0000.png` → `frame-0219.png`).
- Encode `outputs/english_trie_timelapse.mp4` and a share-ready GIF.
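
As a rough illustration of the "aggregate prefix counts" step, here is a minimal sketch that counts how many words first appear in each year under each prefix. It is not the code in `src/build/build_prefix_trie.py`: the function name, the `max_depth` cutoff, and the sample words with their years are placeholders invented for this example.

```python
from collections import Counter


def prefix_counts_by_year(first_years, max_depth=3):
    """Count words per (year, prefix) for prefixes up to max_depth letters.

    first_years maps word -> first-use year. max_depth is a hypothetical
    cutoff for this sketch, not a documented option of the project.
    """
    counts = Counter()
    for word, year in first_years.items():
        for depth in range(1, min(len(word), max_depth) + 1):
            counts[(year, word[:depth])] += 1
    return counts


# Placeholder data: these years are NOT real first-use dates.
sample = {"graph": 1878, "grape": 1800, "gram": 1800}
counts = prefix_counts_by_year(sample)
print(counts[(1800, "gr")])   # words starting with "gr" that first appear in 1800
```

Keying the counter by `(year, prefix)` keeps the data flat, which makes it easy to write one JSON object per line if a JSONL artifact is the target.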
Rerun the script anytime—artifact caching means future passes jump straight to rendering.
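
The caching behaviour described above can be pictured with a small sketch: a stage runs only when its output file is missing. This is an assumption about the general pattern, not the logic in `setup.sh`; `run_cached` and its `build` callback are hypothetical names.

```python
from pathlib import Path


def run_cached(output_path, build):
    """Call build() only if output_path does not exist yet; return the path.

    build is a hypothetical zero-argument callable returning the artifact
    text. The result is written to a temporary file first so an interrupted
    run does not leave a half-written artifact behind.
    """
    path = Path(output_path)
    if path.exists():
        return path  # cached: skip the expensive stage
    path.parent.mkdir(parents=True, exist_ok=True)
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(build(), encoding="utf-8")
    tmp.replace(path)  # atomic rename on the same filesystem
    return path
```

Deleting an artifact file is then enough to force that stage to run again on the next pass.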
| Stage | Script | Output |
|---|---|---|
| Lemma extraction | src/ingest/wiktionary_extract.py | artifacts/lemmas/lemmas.tsv |
| First-year inference | src/ingest/ngram_first_year.py | artifacts/years/first_years.tsv |
| Prefix aggregation | src/build/build_prefix_trie.py | artifacts/trie/prefix_counts.jsonl |
| Layout generation | src/viz/layout.py | artifacts/layout/prefix_positions.json (legacy back-compat) |
| Frame rendering | src/viz/render_frames.py | outputs/frames/ |
| Encoding | src/viz/encode.py | outputs/english_trie_timelapse.mp4 + .gif |
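
To make the first-year inference stage more concrete, the sketch below takes one 1-gram line and returns the earliest year in which the word is attested. It assumes the shard layout is `ngram<TAB>year,match_count,volume_count<TAB>...`; check the downloaded files before relying on that. The `min_count` threshold is a hypothetical parameter, the counts in the sample line are invented, and the default 1800–2019 window comes from the project description. It is not the implementation in `src/ingest/ngram_first_year.py`, and it does not handle part-of-speech-tagged entries.

```python
def first_year(line, min_count=1, start=1800, end=2019):
    """Return (ngram, earliest qualifying year), or (ngram, None) if none.

    Assumes each tab-separated field after the ngram is
    "year,match_count,volume_count". min_count is a hypothetical noise
    threshold for this sketch.
    """
    ngram, *entries = line.rstrip("\n").split("\t")
    years = []
    for entry in entries:
        year, match_count, _volumes = (int(x) for x in entry.split(","))
        if start <= year <= end and match_count >= min_count:
            years.append(year)
    return ngram, (min(years) if years else None)


# Invented sample line; the counts are placeholders, not real N-gram data.
sample_line = "graph\t1795,2,1\t1850,40,12\t1900,90,30"
print(first_year(sample_line))
print(first_year(sample_line, min_count=50))
```

Raising `min_count` trades earlier dates for robustness against scanning errors, which is one plausible reason to expose such a threshold.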

    source venv/bin/activate
    python -m src.viz.render_frames artifacts/trie/prefix_counts.jsonl outputs/frames
    python -m src.viz.encode outputs/frames outputs/english_trie_timelapse.mp4 outputs/english_trie_timelapse.gif

Use flags such as `--min-radius`, `--max-radius`, `--base-edge-alpha`, or `--start-progress` to tune the vibe.
Load `artifacts/years/first_years.tsv` to explore in Neo4j (Community & Enterprise safe):

    :param batch => $rows;
    UNWIND $rows AS row
    WITH row
    WHERE row.word IS NOT NULL AND row.word <> ""
    MERGE (w:Word {text: row.word})
    SET w.first_year = CASE WHEN row.first_year = "" THEN NULL ELSE toInteger(row.first_year) END;
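
The query expects a `$rows` parameter: a list of maps with `word` and `first_year` keys. One way to prepare it in Python is sketched below, assuming the TSV has a header row with exactly those column names (the README does not state this). `read_rows` and `batches` are hypothetical helpers, and the sample rows are placeholders; passing each batch to Neo4j through a driver is left out.

```python
import csv
import io


def read_rows(handle):
    """Read a tab-separated file with a header into a list of row dicts.

    Missing first_year values become "" so the CASE expression in the
    Cypher query can turn them into NULL.
    """
    reader = csv.DictReader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [
        {"word": r["word"], "first_year": r["first_year"] or ""}
        for r in reader
    ]


def batches(rows, size):
    """Yield consecutive slices of rows, each with at most size items."""
    for i in range(0, len(rows), size):
        yield rows[i:i + size]


# Placeholder content standing in for artifacts/years/first_years.tsv.
sample = io.StringIO("word\tfirst_year\nalpha\t1850\nbeta\t\n")
rows = read_rows(sample)
for batch in batches(rows, size=1):
    print(batch)  # each batch would be sent as the $rows parameter
```

Sending rows in batches keeps each transaction small, which tends to matter once the word list reaches hundreds of thousands of entries.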
Full documentation is available at the project documentation site.
To run the documentation site locally:

    cd docs
    bundle install --path vendor/bundle
    bundle exec jekyll serve --baseurl ""

Visit http://localhost:4000 to view the site locally.
- Drop the GIF in language history threads (#linguistics #dataart).
- Remix the radial layout with alternative color ramps or depth cutoffs.
- Pair the timelapse with poetry readings for maximum feels.
- Wiktionary community & Google Books N-gram team for open data.
- You, for showing the world how beautifully language grows.
For more open source software and content on Knowledge Graphs, GNNs, and Graph Databases, join our community on X!