A Python package for manipulating 2-dimensional tabular data structures

h2oai/datatable

This is a Python package for manipulating 2-dimensional tabular data structures (aka data frames). It is close in spirit to pandas or SFrame; however, we put specific emphasis on speed and big data support. As the name suggests, the package is closely related to R's data.table and attempts to mimic its core algorithms and API.

Requirements: Python 3.6+ (64 bit) and pip 20.3+.
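
The interpreter half of these requirements can be checked before installing. The snippet below is only a sketch: `interpreter_problems` is a hypothetical helper name, not part of datatable, and it relies on the standard `sys.maxsize > 2**32` test for a 64-bit build. It does not check pip; `pip --version` reports that separately.

```python
import sys


def interpreter_problems(version_info=sys.version_info, maxsize=sys.maxsize):
    """Return the reasons this interpreter misses the stated requirements.

    Hypothetical helper for illustration; an empty list means no problem found.
    """
    problems = []
    # Requirement: Python 3.6 or newer.
    if tuple(version_info[:2]) < (3, 6):
        problems.append("Python 3.6 or newer is required")
    # Requirement: 64-bit build. sys.maxsize exceeds 2**32 only on 64-bit builds.
    if maxsize <= 2**32:
        problems.append("a 64-bit Python build is required")
    return problems


print(interpreter_problems() or "interpreter looks compatible")
```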

Project goals

datatable started in 2017 as a toolkit for performing big data (up to 100GB) operations on a single-node machine, at the maximum speed possible. Such requirements are dictated by modern machine-learning applications, which need to process large volumes of data and generate many features in order to achieve the best model accuracy. The first user of datatable was Driverless.ai.

The set of features that we want to implement with datatable is at least the following:

  • Column-oriented data storage.

  • Native-C implementation for all datatypes, including strings. Packages such as pandas and numpy already do that for numeric columns, but not for strings.

  • Support for date-time and categorical types. Object type is also supported, but promotion into object is discouraged.

  • All types should support null values, with as little overhead as possible.

  • Data should be stored on disk in the same format as in memory. This will allow us to memory-map data on disk and work on out-of-memory datasets transparently.

  • Work with memory-mapped datasets to avoid loading into memory more data than necessary for each particular operation.

  • Fast data reading from CSV and other formats.

  • Multi-threaded data processing: time-consuming operations should attempt to utilize all cores for maximum efficiency.

  • Efficient algorithms for sorting/grouping/joining.

  • Expressive query syntax (similar to data.table).

  • Minimal amount of data copying, copy-on-write semantics for shared data.

  • Use "rowindex" views in filtering/sorting/grouping/joining operators to avoid unnecessary data copying.

  • Interoperability with pandas / numpy / pyarrow / pure python: the users should have the ability to convert to another data-processing framework with ease.
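
Two of the goals above, storing data on disk in the in-memory format and using "rowindex" views, can be illustrated with the standard library alone. This is a conceptual sketch, not datatable's implementation: `ColumnView`, `write_int64_column`, and `filter_and_sort_mapped_column` are hypothetical names, and the on-disk layout is assumed to be a plain run of native 8-byte integers. The column is memory-mapped rather than read, and filtering and sorting produce only lists of row positions; values are copied once, when the final result is materialized.

```python
import mmap
import os
import tempfile
from array import array


class ColumnView:
    """A column seen through a row index: it holds positions, not copied values."""

    def __init__(self, data, rowindex):
        self.data = data          # shared underlying buffer, never copied
        self.rowindex = rowindex  # positions of the rows that belong to this view

    def __len__(self):
        return len(self.rowindex)

    def __getitem__(self, i):
        return self.data[self.rowindex[i]]

    def to_list(self):
        """Materialize the view; this is the only place values are copied."""
        return [self.data[j] for j in self.rowindex]


def write_int64_column(path, values):
    """Write values as raw native 8-byte integers, the same layout as in memory."""
    with open(path, "wb") as fh:
        array("q", values).tofile(fh)


def filter_and_sort_mapped_column(values, threshold):
    """Keep values >= threshold, sorted ascending, using a memory-mapped column."""
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "column.bin")
        write_int64_column(path, values)
        with open(path, "rb") as fh, \
                mmap.mmap(fh.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            # Typed view over the mapped bytes; no read() call, no copy.
            with memoryview(mapped) as raw, raw.cast("q") as column:
                # Filtering yields row positions only.
                keep = [i for i in range(len(column)) if column[i] >= threshold]
                # Sorting reorders those positions; the column itself is untouched.
                order = sorted(keep, key=lambda i: column[i])
                return ColumnView(column, order).to_list()


print(filter_and_sort_mapped_column([50, 10, 40, 20, 30], 20))  # [20, 30, 40, 50]
```

Because each operator only produces a new list of positions, chaining a filter and a sort never duplicates the column data, which is the point of the rowindex goal.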

Installation

On macOS, Linux and Windows systems installing datatable is as easy as

pip install datatable

On all other platforms a source distribution will be needed. For more information see Build instructions.

See also

