Surface your highest-value records with information gain, novelty, and quality scoring. A universal SDK + CLI that ranks and subsets text, JSONL, CSV, logs, and mixed corpora so you see the signal first. Built on submodular selection (facility location), stable embeddings, diversity, and fast heuristics.
Company: The World’s Data Company • Product: The World’s Data Filter™
✨ What it does
Universal features — pluggable extractors for text, JSON/CSV/tabular, and generic blobs.
Information Gain — greedy facility-location selection to cover the dataset with minimal redundancy.
Novelty — distances from dataset centroid / past cache to prioritize new signal.
Quality filters — language/length heuristics for text; null/variance checks for tabular; duplicate/similarity suppression.
Explainable — scores per item: `coverage_gain`, `novelty`, `quality`, and a `value_score` aggregate.
SDK & CLI — embed in Python or run as `wdf` from the terminal.
Deterministic — stable SHA‑256–based embeddings by default (swap for your own encoder at any time).
No heavy models — NumPy/SciPy core; scikit-learn is optional (`[text]` extra) for TF-IDF.
Year 2 roadmap: The World’s Data Index (persistent vector/metadata store) — this repo stays the stateless filter/selector.
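As a hedged illustration of the aggregate score (the function name and formula are assumptions about how the three components combine, with default weights matching the SDK example in this README, not the package's actual internals):

```python
def value_score(coverage_gain: float, novelty: float, quality: float,
                w_cov: float = 0.7, w_nov: float = 0.2, w_qual: float = 0.1) -> float:
    """Illustrative aggregate: weighted sum of the three per-item scores."""
    return w_cov * coverage_gain + w_nov * novelty + w_qual * quality
```

With all three components at their maximum of 1.0 and weights summing to 1.0, the aggregate is also 1.0, so the score stays easy to compare across corpora.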
🚀 Quickstart (Windows / macOS / Linux)
```bash
# 1) Create a virtualenv (Python 3.10+)
python -m venv .venv
# Windows
.\.venv\Scripts\Activate.ps1
# macOS/Linux
# source .venv/bin/activate

# 2) Install
pip install -U pip
pip install -e .[dev]   # add [text] for TF-IDF utilities if you like

# 3) Run the demo
wdf score examples/news.jsonl --text-field text --out scores.csv
wdf filter examples/news.jsonl --text-field text --k 10 --out selected.jsonl --explain
```
Each item yields a vector $x_i$ (unit-normalized) and auxiliary quality features.
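The default encoder is described above as SHA-256-based and deterministic; one minimal way such an embedding could work is signed feature hashing of tokens followed by L2 normalization (a sketch under that assumption, not the package's actual implementation):

```python
import hashlib

import numpy as np


def hash_embed(text: str, dim: int = 256) -> np.ndarray:
    """Deterministic bag-of-tokens embedding via SHA-256 feature hashing (illustrative)."""
    v = np.zeros(dim)
    for tok in text.lower().split():
        h = hashlib.sha256(tok.encode("utf-8")).digest()
        idx = int.from_bytes(h[:4], "big") % dim   # bucket index from first 4 hash bytes
        sign = 1.0 if h[4] % 2 == 0 else -1.0      # signed hashing reduces collision bias
        v[idx] += sign
    n = np.linalg.norm(v)
    return v / n if n > 0 else v                   # unit-normalize, as x_i above
```

Because the hash is content-addressed, the same text always produces the same vector, with no model weights to download or pin.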
Scoring
**Facility Location (coverage)** — $F(S)=\sum_j \max_{i\in S} \text{sim}(x_i, x_j)$: select items that best cover the rest. Greedy selection approximates the optimum and doubles as a redundancy filter.

**Novelty** — distance from the dataset centroid (or a past cache) highlights unusual / new items.

**Quality** — text heuristics (language guess, length, printable ratio), tabular health (missing-ness, low variance), duplicate checks.
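The greedy coverage step can be sketched in a few lines of NumPy (an illustration of the technique, not the package's code): at each iteration, add the item that most increases $\sum_j \max_{i\in S} \text{sim}(x_i, x_j)$.

```python
import numpy as np


def greedy_facility_location(X: np.ndarray, k: int) -> list[int]:
    """Greedily pick k row indices of X (unit-norm rows) maximizing coverage."""
    sim = X @ X.T                      # cosine similarity for unit-norm rows
    best = np.full(len(X), -np.inf)    # current max_{i in S} sim(x_i, x_j) per j
    chosen: list[int] = []
    for _ in range(min(k, len(X))):
        # Coverage each candidate would achieve if added to the current set S
        gains = np.maximum(sim, best).sum(axis=1)
        gains[chosen] = -np.inf        # never re-pick a selected item
        i = int(np.argmax(gains))
        chosen.append(i)
        best = np.maximum(best, sim[i])
    return chosen
```

Because facility location is submodular, this greedy loop carries a (1 − 1/e) approximation guarantee, and a duplicate of an already-selected item contributes no gain — which is why the same routine doubles as a redundancy filter.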
```bash
# Score a JSONL corpus (one object per line) with a 'text' field
wdf score examples/news.jsonl --text-field text --out scores.csv

# Filter top-K by value score (explain is on by default)
wdf select examples/news.jsonl --text-field text --k 50 --out selected.jsonl

# Prefer compact JSONL (disable explanations)
wdf select examples/news.jsonl --text-field text --k 50 --out selected.jsonl --no-explain

# From a CSV (choose a text column)
wdf score examples/sample.csv --csv --text-field body --id-field id --out scores.csv

# Tune weights + disable novelty
wdf filter examples/news.jsonl --text-field text --k 20 --w-cov 0.8 --w-nov 0.0 --w-qual 0.2 --out selected.jsonl
```
Input types supported today
.jsonl (id, text, and/or arbitrary fields)
.csv (choose columns)
Directory of .txt files (`--dir`)
Anything else you can adapt via a custom extractor (see worlddatafilter/extractors/base.py).
You can register your own extractor in ~20 lines — the SDK passes through `meta` and `text` to downstream systems.
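The real base-class API lives in worlddatafilter/extractors/base.py; as a standalone sketch (the class and method names here are assumptions for illustration, not the package's actual interface), an extractor just maps raw records to text plus passthrough metadata:

```python
from dataclasses import dataclass, field
from typing import Any, Iterable


@dataclass
class ExtractedItem:
    """What downstream scoring consumes: text plus passthrough metadata."""
    text: str
    meta: dict[str, Any] = field(default_factory=dict)


class MarkdownExtractor:
    """Hypothetical extractor: flattens .md blobs by stripping '#' heading markers."""
    suffixes = (".md",)

    def extract(self, records: Iterable[dict]) -> list[ExtractedItem]:
        items = []
        for rec in records:
            lines = rec["raw"].splitlines()
            body = " ".join(l.lstrip("# ").strip() for l in lines if l.strip())
            items.append(ExtractedItem(text=body, meta={"path": rec.get("path")}))
        return items
```

The key contract is simply "emit text for scoring, keep meta intact" — everything format-specific stays inside the extractor.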
📦 Python SDK
```python
from worlddatafilter import WorldDataFilter, loaders

docs = loaders.load_jsonl("examples/news.jsonl", text_field="text")
wdf = WorldDataFilter()
scores = wdf.score(docs)  # list of ItemScore
selected = wdf.select(docs, k=25, weights=dict(cov=0.7, nov=0.2, qual=0.1))
```