
A Python library for iterative and interactive data wrangling at laptop-scale.


jbn/vaquero


What is Vaquero?


TL;DR

It's a library for iterative and interactive data wrangling at laptop-scale. If you spend a lot of time in a Jupyter notebook, trying to clean dirty, raw data, it's probably useful.

It would be nice if it were possible to write data cleaning code correctly. But the people who pay you to do data analysis don't do data analysis and don't understand how dangerous dirty data are, so you rarely get the luxury of feeling secure with what you extract. Vaquero tries to find a balance between "business" demands and good hygiene. Borrowing from Larry Wall, it tries "to make the easy things easy, and the hard things possible." In this context, "hard things" refers to those wonderfully fun situations where you write some code that you know will break in the future but you have no time to fix it; then, three months later, it breaks and you have no idea what your code does.

See also: On Disappearing Code

An Example

It's easier to get a sense of "why" by looking at a notebook.

Expecting Exceptions

Vaquero expects exceptions, making them pretty unexceptional. But Python's exception handling is cheap, so that's fine (i.e. EAFP -- Easier to Ask for Forgiveness than Permission). Plus, with dirty data, you know it will probably fail for some records. During development, rather than halting each time, vaquero continues on its merry way, up to some failure limit. For each failure, the library logs the exception, including the name of the file and the arguments which resulted in a failure.

After you have processed all the documents, you can then inspect the errors. This helps you scan for error patterns, rather than programming by the coincidence of the first error raised. Moreover, since you have the offending function and its arguments, it is easy to update the function and confirm that the new version passes on the prior bad example. Vaquero reloads the pipeline for you. (Or, at least tries to, because reloading is tricky.)
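The collect-then-inspect idea can be sketched in plain Python. This is a minimal illustration under stated assumptions, not vaquero's API: `process_all`, `FailureLimitReached`, `max_failures`, and the fields of each failure record are hypothetical names chosen for this example.

```python
class FailureLimitReached(Exception):
    """Raised once too many records have failed (hypothetical name)."""

    def __init__(self, failures):
        super().__init__(f"stopped after {len(failures)} failures")
        self.failures = failures


def process_all(func, records, max_failures=10):
    """Apply func to every record, logging failures instead of halting."""
    results, failures = [], []
    for record in records:
        try:
            results.append(func(record))
        except Exception as exc:
            # Keep enough context to replay the failure later.
            failures.append({
                "func_name": func.__name__,
                "args": record,
                "exc_type": type(exc).__name__,
                "exc_msg": str(exc),
            })
            if len(failures) >= max_failures:
                raise FailureLimitReached(failures) from exc
    return results, failures


def parse_age(record):
    return int(record["AGE1"])


records = [{"AGE1": "42"}, {"AGE1": "n/a"}, {}, {"AGE1": "7"}]
results, failures = process_all(parse_age, records)

# Scan for patterns across all failures rather than fixing only the first.
for failure in failures:
    print(failure["func_name"], failure["exc_type"], failure["args"])
```

Because each failure record keeps the function name and the offending argument, you can call the fixed function on `failure["args"]` directly to check that the old bad example now passes.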

Modules as Pipelines

Namespaces are one honking great idea -- let's do more of those!

Programmers use namespaces everywhere to organize their code. Yet, when writing data cleaning code, everything ends up in a big file with lots of poorly-named functions. Think: from hellishlib import *. The perfectionist in me says, "this is awful, and I should write it properly, as a full library with lots of unit tests!" But, for "perfectionists with deadlines," that's not possible.

Furthermore, the single-file-of-functions pattern emerges not only because of time constraints; it's a reflection of the problem! ELT code is inherently tightly-coupled. Code that extracts this variable probably depends on that one, which in turn also depends on some other one. This leads to a tree of transformations, encapsulated by function calls.

Recognizing this, vaquero doesn't try to move you away from collecting all your ELT code in a single file. It's going to happen anyway. Instead, it makes it safer with some conventions.

  1. A module represents a single encapsulated pipeline. It should process a well-defined document.
  2. The function definition order is meaningful. Functions at the top of the file execute before those below them. Again, it's a pipeline.
  3. As per pythonic convention, functions prefixed with _ are private. Here, that means the pipeline constructor ignores them when compiling the pipeline. This gives you nice helper functions.
  4. You're probably not going to use unit tests -- you don't have time. But, since it's a module, pepper it with assertions. And, using the _-prefix, you can actually write namespaced tests (e.g. _my_test()), and immediately call them in the module. (I actually write a lot of my code with unittest in the pipeline module, and it gets called right before the module fully imports.) Then, when you break something, you can't even start pipeline processing. It fails fast. (You can deviate from this pattern -- but, in general, don't.)
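The four conventions above can be sketched without vaquero itself. The block below is an illustration under assumptions: `build_pipeline` and `run_pipeline` are hypothetical helpers, not the library's constructor, and the pipeline module is built from a string only so the example stays self-contained. In practice it would be an ordinary `.py` file.

```python
import types

# Stand-in for the contents of a pipeline module file.
SOURCE = '''
def _to_int(s):
    return int(s.strip())

def extract_age(src_d, dst_d):
    dst_d["age"] = _to_int(src_d["AGE1"])

def flag_minor(src_d, dst_d):
    dst_d["is_minor"] = dst_d["age"] < 18

def _test_to_int():
    assert _to_int(" 7 ") == 7

_test_to_int()  # runs at import time, so a broken helper fails fast
'''


def build_pipeline(module):
    """Collect the module's public functions in definition order."""
    funcs = [
        obj for name, obj in vars(module).items()
        if isinstance(obj, types.FunctionType)
        and obj.__module__ == module.__name__  # skip imported functions
        and not name.startswith("_")           # skip private helpers and tests
    ]
    return sorted(funcs, key=lambda f: f.__code__.co_firstlineno)


def run_pipeline(pipeline, src_d):
    """Run every step against one source document."""
    dst_d = {}
    for func in pipeline:
        func(src_d, dst_d)
    return dst_d


module = types.ModuleType("people_pipeline")
exec(SOURCE, module.__dict__)

pipeline = build_pipeline(module)
doc = run_pipeline(pipeline, {"AGE1": " 16 "})
print([f.__name__ for f in pipeline], doc)
```

Sorting on the first line number of each function is one way to recover definition order; it keeps working even if a name is rebound later in the module.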

Installation

pip install vaquero

Tips

The f(src, dst) Pattern

For most of my pipelines, I tend to write functions that look like,

    def f(src_d, dst_d):
        dst_d['age'] = int(src_d['AGE1'])

Coming from functional languages, I'd prefer immutable objects. But, in Python, that tends to be painfully slow. This pattern represents a compromise that usually works well. On the one side (dst_d) you have already processed elements; on the other, the raw data.
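A slightly longer sketch shows why keeping the two sides separate is useful: a later step can read both the raw record and values that earlier steps already cleaned. The `SURVEY_YEAR` field and the `extract_birth_year` function are made up for illustration.

```python
def extract_age(src_d, dst_d):
    dst_d["age"] = int(src_d["AGE1"])


def extract_birth_year(src_d, dst_d):
    # Combines a raw field with the already-cleaned age.
    dst_d["birth_year"] = int(src_d["SURVEY_YEAR"]) - dst_d["age"]


src = {"AGE1": "34", "SURVEY_YEAR": "2016"}
dst = {}
for step in (extract_age, extract_birth_year):
    step(src, dst)

print(dst)
```

Only `dst` is mutated; `src` stays as it was read, so a failing step can always be replayed against the original raw record.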

Hidden field pattern

  • Assume you are processing a pipeline with a dict destination document. Use '_key_name' fields for intermediary results in a document. You can delete them at the end of the pipeline (easily, via vaquero.transformations.remove_private_keys), but in the interim, you'll see these fields on failure.
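As a rough sketch of the hidden field pattern: `drop_private_keys` below is a hypothetical stand-in written for this example. It is not the implementation of vaquero.transformations.remove_private_keys, whose signature and behavior (for instance, whether it mutates in place) are not shown here. The `NAME` field is also invented.

```python
def drop_private_keys(doc):
    """Return a copy of doc without keys that start with an underscore."""
    return {k: v for k, v in doc.items() if not str(k).startswith("_")}


def split_name(src_d, dst_d):
    # Intermediate result, kept under a private key.
    dst_d["_name_parts"] = src_d["NAME"].split(",")


def extract_last_name(src_d, dst_d):
    dst_d["last_name"] = dst_d["_name_parts"][0].strip()


src = {"NAME": "Wall, Larry"}
dst = {}
for step in (split_name, extract_last_name):
    step(src, dst)

# If a step had failed, dst would still show _name_parts for debugging.
cleaned = drop_private_keys(dst)
print(cleaned)
```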

Disclaimer

I have this big monstrous library called vaquero on my computer. It's a collection of lots of functions I've written over (entirely too) many data munging projects. I use it often, and keep telling myself "once I find the time, I'll release it!" And that never happens. It's too big to clean up in a way that makes me comfortable. Instead, I'll be releasing little bits of code in an ad hoc, just-in-time fashion. When I absolutely need some feature of the big library going forward, I'll extract it and put it here.

That makes me wildly uncomfortable, but...I'm starving for time.

In any case, library-user beware. Things will break.
