A Python package for manipulating 2-dimensional tabular data structures
h2oai/datatable
This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.
Requirements: Python 3.6+ (64 bit) and pip 20.3+.
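
The interpreter half of these requirements can be checked before installing. The snippet below is a minimal sketch using only the standard library; the function name `meets_datatable_requirements` is hypothetical and is not part of datatable, and the pip version is not checked here.

```python
import struct
import sys


def meets_datatable_requirements(version_info=sys.version_info,
                                 pointer_bits=struct.calcsize("P") * 8):
    """Return True if the interpreter is Python 3.6+ and 64-bit.

    Hypothetical helper for illustration; the defaults describe the
    running interpreter, and explicit values can be passed for testing.
    """
    new_enough = tuple(version_info[:2]) >= (3, 6)
    is_64_bit = pointer_bits == 64
    return new_enough and is_64_bit


if __name__ == "__main__":
    if meets_datatable_requirements():
        print("Interpreter looks compatible")
    else:
        print("Python 3.6+ (64 bit) is required")
```

The pointer size reported by `struct.calcsize("P")` is 8 bytes on a 64-bit build, which is why it is multiplied by 8 and compared against 64.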
datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.
The set of features that we want to implement with datatable is at least the following:
- Column-oriented data storage.
- Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.
- Support for date-time and categorical types. Object type is also supported, but promotion into object is discouraged.
- All types should support null values, with as little overhead as possible.
- Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.
- Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.
- Fast data reading from CSV and other formats.
- Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.
- Efficient algorithms for sorting/grouping/joining.
- Expressive query syntax (similar to data.table).
- Minimal amount of data copying, copy-on-write semantics for shared data.
- Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.
- Interoperability with pandas / numpy / pyarrow / pure python: the users should have the ability to convert to another data-processing framework with ease.
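
To make the "rowindex" idea from the list above concrete, here is a minimal pure-Python sketch. It is an illustration of the general technique only, not datatable's actual implementation: the `ColumnView` class and its methods are hypothetical names. Filtering and sorting produce a new array of row positions while the underlying column buffer is shared rather than copied.

```python
from array import array


class ColumnView:
    """A column seen through an optional row index (hypothetical class)."""

    def __init__(self, data, rowindex=None):
        self.data = data          # shared storage, never copied by views
        self.rowindex = rowindex  # None means "all rows, in storage order"

    def __len__(self):
        return len(self.data) if self.rowindex is None else len(self.rowindex)

    def _physical(self, i):
        # Map a logical row number to its position in the shared storage.
        return i if self.rowindex is None else self.rowindex[i]

    def __getitem__(self, i):
        return self.data[self._physical(i)]

    def filter(self, predicate):
        # Only the positions of the surviving rows are materialized.
        keep = array("q", (self._physical(i)
                           for i in range(len(self)) if predicate(self[i])))
        return ColumnView(self.data, keep)

    def sort(self):
        # Sorting reorders the row index, not the stored values.
        order = sorted(range(len(self)), key=self.__getitem__)
        return ColumnView(self.data,
                          array("q", (self._physical(i) for i in order)))

    def to_list(self):
        return [self[i] for i in range(len(self))]


values = array("d", [3.5, 1.0, 7.25, 2.0])
col = ColumnView(values)
big = col.filter(lambda x: x > 1.5)   # view over rows 0, 2, 3
ordered = big.sort()                  # view over rows 3, 0, 2
print(ordered.to_list())              # [2.0, 3.5, 7.25]
print(ordered.data is values)         # True: the buffer was not copied
```

The cost of a view here is one integer per selected row, regardless of how wide the stored values are, which is the motivation for applying the same indirection to filtering, sorting, grouping, and joining.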
On macOS, Linux and Windows systems installing datatable is as easy as
pip install datatable
On all other platforms a source distribution will be needed. For more information see Build instructions.