
fastparquet

 
 


fastparquet is a python implementation of the parquet format, aiming to integrate into python-based big data workflows. It is used implicitly by the projects Dask, Pandas and intake-parquet.

We offer a high degree of support for the features of the parquet format, and very competitive performance, in a small install size and codebase.

Details of this project, how to use it and comparisons to other work can be found in the documentation.

Requirements

(all development is against recent versions in the default anaconda channels and/or conda-forge)

Required:

  • numpy
  • pandas
  • cython >= 0.29.23 (if building from pyx files)
  • cramjam
  • fsspec

Supported compression algorithms (a short usage sketch follows this list):

  • Available by default:
    • gzip
    • snappy
    • brotli
    • lz4
    • zstandard
  • Optionally supported
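
As a sketch of how one of the codecs above is chosen when writing (the file names and the small DataFrame here are placeholders, and per-column codec selection via a dict is described in the fastparquet documentation):

import pandas as pd
from fastparquet import write

df = pd.DataFrame({'a': range(5), 'b': ['x', 'y', 'x', 'y', 'x']})

# Apply one codec to every column of the file...
write('example_snappy.parq', df, compression='SNAPPY')

# ...or, per the fastparquet docs, choose a codec per column with a dict.
write('example_mixed.parq', df, compression={'a': 'GZIP', 'b': 'SNAPPY'})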

Installation

Install using conda, to get the latest compiled version:

conda install -c conda-forge fastparquet

or install from PyPI:

pip install fastparquet

You may wish to install numpy first, to help pip's resolver. This may install an appropriate wheel, or compile from source. For the latter, you will need a suitable C compiler toolchain on your system.

You can also install the latest version from GitHub:

pip install git+https://github.com/dask/fastparquet

in which case you should also have cython to be able to rebuild the C files.

Usage

Please refer to the documentation.

Reading

from fastparquet import ParquetFile
pf = ParquetFile('myfile.parq')
df = pf.to_pandas()
df2 = pf.to_pandas(['col1', 'col2'], categories=['col1'])

You may specify which columns to load and which of those to keep as categoricals (if the data uses dictionary encoding). The file path can be a single file, a metadata file pointing to other data files, or a directory (tree) containing data files. The latter is what is typically output by hive/spark.
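
As a sketch of the directory case (the path 'mydata/' and the column names are hypothetical; such a dataset is what file_scheme='hive' in the Writing section below produces):

from fastparquet import ParquetFile

# Point at the dataset directory (or its _metadata file); fastparquet
# discovers the individual data files and any partition values.
pf = ParquetFile('mydata/')
print(pf.columns)  # column names found in the schema

# Load only the columns needed, keeping 'col1' as a pandas categorical.
df = pf.to_pandas(columns=['col1', 'col2'], categories=['col1'])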

Writing

from fastparquet import write
write('outfile.parq', df)
write('outfile2.parq', df, row_group_offsets=[0, 10000, 20000],
      compression='GZIP', file_scheme='hive')

The default is to produce a single output file with a single row-group (i.e., logical segment) and no compression. At the moment, only simple data-types and plain encoding are supported, so expect performance to be similar to numpy.savez.
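
To sketch how those defaults can be overridden (the path and column names are illustrative; file_scheme, partition_on, compression and append are keyword arguments of fastparquet.write, described in its documentation):

import pandas as pd
from fastparquet import write

df = pd.DataFrame({'key': ['a', 'a', 'b'], 'value': [1.0, 2.0, 3.0]})

# Write a hive-style directory dataset: one sub-directory per value of 'key',
# with the data pages compressed using snappy.
write('outdir', df, file_scheme='hive', partition_on=['key'],
      compression='SNAPPY')

# Later, append further rows to the same dataset instead of overwriting it.
more = pd.DataFrame({'key': ['b', 'c'], 'value': [4.0, 5.0]})
write('outdir', more, file_scheme='hive', partition_on=['key'],
      compression='SNAPPY', append=True)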

History

This project forked in October 2016 from parquet-python, which was not designed for vectorised loading of big data or parallel access.

