dushyantkhosla/cli4ds

Learn how to crunch fits-on-disk data using Open Source CLI Tools!


Say you have a .csv file with a few columns. You need to run basic descriptive statistics on these columns, and maybe a few group-by operations (okay, pivot tables). If the file is a few thousand rows (under 100MB in size), you will probably double-click it straight away and run the analysis in Excel.

Give yourself a pat on the back, you chose the right tool for the right job.

Who's a good analyst? Yes, you are!

But what if the file was 750MB? Assuming you have enough RAM (8GB or more), of course you'll use dplyr in R, or pandas in Python, or (gasp) write a data step in SAS.

Right? Excellent.

But what if the file was 6GB?

15GB?

If the word Hadoop is stuck in your throat, I implore you to swallow it. This repository focuses on open-source command-line utilities that can do the same job.

Yes, there are Python libraries that allow you to work with larger-than-RAM files on a single machine (Spark, Dask, and perhaps some more), but we'll keep that for later.

Why

  • Because I've met too many 'data scientists' who reach for big-data tools before exhausting what a single machine can do
  • Because there is an entire ecosystem of wonderful open-source software for data analysis
  • Because renting servers with more RAM or more cores is now easier and cheaper than ever.
  • Because too many businesses do not have massive data and are spending money and resources trying to solve their problems with the wrong (and expensive) tools
    • The closest analogy I can think of is someone trying to break a pebble with a sledgehammer. Of course, the pebble will break, but wouldn't you rather first try using the hammer hanging in your toolshed?
  • But mostly, because I like to teach! 😇

Some Quotes

In forecasting applications, we never observe the whole population. The problem is to forecast from a finite sample. Hence statistics such as means and standard deviations must be estimated with error.

"At Facebook, 90% of the jobs have input sizes under 100GB."

"For workloads that process multi-GB rather than multi-TB, a big memory server will provide better performance-per-dollar than a cluster."

Tools

  • GNU Coreutils, everyday tools like grep, sed, cut, shuf, and cat for working on text files
  • GNU awk, a programming language designed for text processing, typically used as a data-extraction and reporting tool
  • GNU Datamash, a command-line program that performs basic numeric, textual, and statistical operations on input text files
  • xsv, a fast CSV toolkit written in Rust
  • csvkit, a suite of command-line tools for converting to and working with CSV, written in Python
  • Miller, which is like awk, sed, cut, join, and sort for name-indexed data such as CSV, TSV, and tabular JSON; written in C
  • csvtk, a cross-platform, efficient, practical, and pretty CSV/TSV toolkit written in Go
  • textql, which executes SQL against structured text like CSV or TSV; written in Go
  • q, which allows direct execution of SQL-like queries on CSVs/TSVs (and other tabular text files)
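
For a taste of how these tools compose, here is a minimal sketch. The file name data.csv, the comma delimiter, and the column positions are illustrative assumptions, not files from this repo:

# count data rows, skipping the header (assumes a comma-separated data.csv with a header row)
tail -n +2 data.csv | wc -l

# frequency table of the 3rd column, using coreutils alone
tail -n +2 data.csv | cut -d',' -f3 | sort | uniq -c | sort -rn | head

# mean of column 2 grouped by column 1, using GNU datamash
tail -n +2 data.csv | datamash -t',' --sort -g 1 mean 2

# per-column summary statistics, using xsv
xsv stats data.csv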

Docker Image

  • I've created a Docker image with all of these tools, and tutorials on how to use them.
  • It also contains
    • Miniconda3
    • A conda environment ds-py3 configured with the PyData stack (pandas, scikit-learn, ...)
  • Build (or pull) the Docker image

# clone this repo, then cd into docker/
docker build -t cli-4-ds .
# or
docker pull dushyantkhosla/cli-4-ds:latest
  • Run a container with the image
docker run -it --privileged \
    -v $(pwd):/home \
    -p 8888:8888 \
    -p 5000:5000 \
    -p 3128:3128 \
    dushyantkhosla/cli-4-ds:latest
  • Learn how to use these tools with the notebooks in tutorials/

    • There is a dedicated notebook for each of the tools above
  • Run the start.sh script to see helpful messages

bash /root/start.sh

Get the Data

  • To generate data for these tutorials, cd into the data/ directory and
    • Run the get-csvs.sh script to download the flightDelays and KDDCup datasets
    • PS: This will download ~1.5GB of data
  • Run the make-data.py script to create a synthetic dataset with 10 million rows

cd data/
bash get-csvs.sh
python make-data.py

Part 2: SQL Analytics with Metabase

  • You might want to try out Metabase, which has a nice front-end for writing SQL

docker pull metabase/metabase:v0.19.0

  • I recommend this version over the latest because it works with SQLite

  • If you want to run other DBs like PostgreSQL, you can get the latest image instead

  • Then, run a container

docker run -d -v $(pwd):/tmp -p 3000:3000 metabase/metabase:v0.19.0

  • The -d switch is for running the container in detached mode
  • Navigate to localhost:3000, connect to a .db file or run another DB and connect to it
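
If your tables are still sitting in CSVs, one way to produce a .db file for Metabase to connect to is csvkit's csvsql; the file names below are hypothetical:

# load a CSV into a new SQLite database using csvkit
csvsql --db sqlite:///flights.db --insert flights.csv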

Appendix

  • There are no rules of thumb, but we can try:

Data Size      Remedy
up to 1GB      Pandas
up to 10GB     Get more RAM. Try Spark/Dask.
up to 100GB    Postgres
500GB+         Hadoop
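
To see where a file falls in this table, a quick check (file name hypothetical):

du -h flights.csv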

tl;dr

As long as your data fits on disk (i.e., a few hundred GB or less):

  • For filter or transform jobs (like WHERE and CASE WHEN), use

    • cli-tools or Python scripts (line-by-line or stream processing; see the sketch after this list)
    • break files into chunks, use pandas (chunk processing)
  • For reductions or group-by jobs (like AVERAGE and PIVOT),

    • think deeply about your data, draw representative samples
      • you're a better statistician than a programmer after all, aren't you?
    • use bootstrap measures to quantify uncertainty
  • For machine-learning jobs,

    • Cluster your data, then pull samples from each group (stratify)
    • Fit your first model on a 10% sample
      • Build a learning curve
      • Use cross-validated measures to quantify uncertainty
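
Here is a minimal sketch of the stream-processing and sampling ideas above, assuming a comma-separated big.csv with a header row; the file name, column positions, and threshold values are illustrative assumptions:

# filter (WHERE-style): keep the header plus rows where column 4 equals "DELAYED"
awk -F',' 'NR == 1 || $4 == "DELAYED"' big.csv > delayed.csv

# transform (CASE WHEN-style): append a flag column that is 1 when column 5 exceeds 60
awk -F',' 'BEGIN { OFS = "," }
           NR == 1 { print $0, "over_60"; next }
           { print $0, ($5 > 60 ? 1 : 0) }' big.csv > flagged.csv

# sample: draw ~10% of a 10M-row file at random, keeping the header
(head -n 1 big.csv; tail -n +2 big.csv | shuf -n 1000000) > sample.csv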

Further Reading

  • In a later post, I'd like to talk about extracting more juice out of your hardware with parallel processing
    • For now, here's something to munch on
