datatable

This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

The set of features that we want to implement with datatable is at least the following:

Column-oriented data storage.
Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.
Support for date-time and categorical types. Object type is also supported, but promotion into object is discouraged.
All types should support null values, with as little overhead as possible.
Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.
Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.
Fast data reading from CSV and other formats.
Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.
Efficient algorithms for sorting/grouping/joining.
Expressive query syntax (similar to data.table).
Minimal amount of data copying, copy-on-write semantics for shared data.
Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.
Interoperability with pandas / numpy / pyarrow / pure python: the users should have the ability to convert to another data-processing framework with ease.
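
Two of the goals above, storing data on disk in the same format as in memory and using "rowindex" views to avoid copying, can be illustrated with a small sketch. It uses only the Python standard library and is not datatable's actual implementation or API: the names `write_column`, `MappedColumn`, and `RowIndexView` are hypothetical, and the column is assumed to hold 64-bit integers with no null values.

```python
import mmap
import os
import tempfile
from array import array


def write_column(path, values):
    """Write int64 values as a raw buffer, the same layout array('q') has in memory."""
    with open(path, "wb") as fh:
        fh.write(array("q", values).tobytes())


class MappedColumn:
    """Read-only int64 column backed by a memory-mapped file (hypothetical helper)."""

    def __init__(self, path):
        self._fh = open(path, "rb")
        self._map = mmap.mmap(self._fh.fileno(), 0, access=mmap.ACCESS_READ)
        self._raw = memoryview(self._map)
        # Reinterpret the mapped bytes as int64 without copying them.
        self.data = self._raw.cast("q")

    def close(self):
        # Release the views before closing the mapping they point into.
        self.data.release()
        self._raw.release()
        self._map.close()
        self._fh.close()


class RowIndexView:
    """Filtered or reordered view: stores row numbers only, never copies the values."""

    def __init__(self, column, rows):
        self.column = column
        self.rows = rows

    def __len__(self):
        return len(self.rows)

    def __getitem__(self, i):
        return self.column[self.rows[i]]

    def to_list(self):
        return [self.column[r] for r in self.rows]


def demo():
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "col.bin")
        write_column(path, [5, -2, 9, 0, 7])
        col = MappedColumn(path)
        try:
            # Filtering keeps the row numbers of positive values.
            positive = RowIndexView(
                col.data, [i for i, v in enumerate(col.data) if v > 0]
            )
            # Sorting reorders row numbers; the mapped data stays untouched.
            ordered = RowIndexView(
                col.data, sorted(range(len(col.data)), key=col.data.__getitem__)
            )
            return positive.to_list(), ordered.to_list()
        finally:
            col.close()


if __name__ == "__main__":
    print(demo())
```

Because the file holds the raw in-memory representation, opening it requires no parsing, and the operating system pages in only the parts of the column that are actually read. The filter and sort results each cost one list of row numbers, independent of how wide the underlying values are.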