
Python Pandas - Parquet File Format



Parquet File Format in Pandas

The Parquet file format in Pandas is a binary, columnar file format designed for efficient serialization and deserialization of Pandas DataFrames. It supports all Pandas data types, including extension types such as categorical and timezone-aware datetime types. The format is based on Apache Arrow's memory specification, enabling high-performance I/O operations.

Apache Parquet is a popular, open-source, column-oriented storage format designed for efficient reading and writing of DataFrames, and it makes it easy to share data across data analysis languages. It supports multiple compression methods to reduce file size while maintaining efficient read performance.

Pandas provides robust support for the Parquet file format, enabling efficient data serialization and deserialization. In this tutorial, we will learn how to handle the Parquet file format using Python's Pandas library.

Important Considerations

When working with Parquet files in Pandas, you need to keep the following key points in mind −

  • Column Name Restrictions: Duplicate column names and non-string column names are not supported. If index level names are specified, they must also be strings.

  • Choosing an Engine: Supported engines include pyarrow, fastparquet, and auto. If no engine is specified, Pandas uses the pd.options.io.parquet.engine setting. If set to auto, Pandas tries to use pyarrow first and falls back to fastparquet if necessary.

  • Index Handling: The pyarrow engine writes the index by default, while fastparquet only writes non-default indexes. This difference can cause issues for non-Pandas consumers. Use the index argument to control this explicitly.

  • Categorical Data Types: The pyarrow engine supports categorical data types, including the ordered flag for string categories. The fastparquet engine also supports categorical types but does not preserve the ordered flag.

  • Unsupported Data Types: Data types like Interval and object types are not supported and will raise serialization errors.

  • Extension Data Types: The pyarrow engine preserves Pandas extension types like nullable integer and string data types (starting from pyarrow version 0.16.0). These types must implement the required protocols for serialization.

Keeping these considerations in mind ensures smooth data serialization and deserialization when working with Parquet files in Pandas.
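The index-handling point above can be sketched with a small example. This is a minimal illustration, assuming pyarrow is installed; the file name no_index_demo.parquet is just a placeholder −

import pandas as pd

# Hypothetical file name for illustration; requires the pyarrow engine
df = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "b", "c"])

# index=False drops the string index, so non-Pandas readers
# see only the data column in the resulting file
df.to_parquet("no_index_demo.parquet", engine="pyarrow", index=False)

restored = pd.read_parquet("no_index_demo.parquet")
print(restored.index)  # a default RangeIndex; the original labels were not stored

Omitting index=False here would instead store the "a", "b", "c" labels in the file.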

Saving a Pandas DataFrame to a Parquet File

To save a Pandas DataFrame to a Parquet file, you can use the DataFrame.to_parquet() method, which writes the data of the DataFrame to a file in Parquet format.

Note: Before saving or retrieving data from a Parquet file, you need to ensure that either the pyarrow or fastparquet library is installed. These are optional Python dependencies that can be installed using the following commands −

pip install pyarrow
pip install fastparquet

Example

This example shows how to save a DataFrame to the Parquet file format using the DataFrame.to_parquet() method; here we save it with the "pyarrow" engine.

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
   "a": list("abc"),
   "b": list(range(1, 4)),
   "c": np.arange(3, 6).astype("u1"),
   "d": np.arange(4.0, 7.0),
   "e": [True, False, True],
   "f": pd.Categorical(list("abc")),
   "g": pd.date_range("20240101", periods=3)
})
print("Original DataFrame:")
print(df)

# Save the DataFrame as a parquet file
df.to_parquet("df_parquet_file.parquet", engine="pyarrow")
print("\nDataFrame is successfully saved as a parquet file.")

When we run the above program, it produces the following result −

Original DataFrame:
   a  b  c    d      e  f          g
0  a  1  3  4.0   True  a 2024-01-01
1  b  2  4  5.0  False  b 2024-01-02
2  c  3  5  6.0   True  c 2024-01-03

DataFrame is successfully saved as a parquet file.
If you visit the folder where the Parquet file is saved, you can observe the generated file.

Reading Data from a Parquet File

To read Parquet file data into a Pandas DataFrame, you can use the Pandas read_parquet() function. This function supports reading Parquet files from a variety of storage backends, including local files, URLs, and cloud storage services.
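Because Parquet is columnar, read_parquet() can also load just a subset of columns via its columns parameter, which avoids reading the rest of the file. A minimal sketch, assuming pyarrow or fastparquet is installed; the file name subset_demo.parquet is just an example −

import pandas as pd

# Hypothetical file name for illustration
df = pd.DataFrame({"a": [1, 2], "b": [3.0, 4.0], "c": ["x", "y"]})
df.to_parquet("subset_demo.parquet")

# Only the requested columns are read back; column "b" is skipped
subset = pd.read_parquet("subset_demo.parquet", columns=["a", "c"])
print(subset)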

Example

This example reads a Pandas DataFrame from a Parquet file using the Pandas read_parquet() function.

import pandas as pd
import numpy as np

# Create a sample DataFrame
df = pd.DataFrame({
   "a": list("abc"),
   "b": list(range(1, 4)),
   "c": np.arange(3, 6).astype("u1"),
   "d": np.arange(4.0, 7.0),
   "e": [True, False, True],
   "f": pd.Categorical(list("abc")),
   "g": pd.date_range("20240101", periods=3)
})

# Save the DataFrame as a parquet file
df.to_parquet("df_parquet_file.parquet")

# Load the parquet file
result = pd.read_parquet("df_parquet_file.parquet")

# Display the DataFrame
print('Loaded DataFrame:')
print(result)

# Verify data types
print("\nData type of each column:")
print(result.dtypes)

On executing the above code, we get the following output −

Loaded DataFrame:
   a  b  c    d      e  f          g
0  a  1  3  4.0   True  a 2024-01-01
1  b  2  4  5.0  False  b 2024-01-02
2  c  3  5  6.0   True  c 2024-01-03

Data type of each column:
a            object
b             int64
c             uint8
d           float64
e              bool
f          category
g    datetime64[ns]
dtype: object

Reading and Writing Parquet Files In-Memory

You can also store and retrieve Parquet-format data in Python in-memory. In-memory files store data in RAM instead of writing to disk, making them ideal for temporary data processing while avoiding file I/O operations. Python provides several types of in-memory files; here we will use BytesIO for reading and writing the Parquet-format data.

Example

This example demonstrates reading and writing a DataFrame in Parquet format in-memory using the read_parquet() and DataFrame.to_parquet() methods with the help of the io.BytesIO class.

import pandas as pd
import io

# Create a DataFrame
df = pd.DataFrame({"Col_1": range(5), "Col_2": range(5, 10)})
print("Original DataFrame:")
print(df)

# Save the DataFrame to an in-memory parquet buffer
buf = io.BytesIO()
df.to_parquet(buf)

# Read the DataFrame from the in-memory buffer
loaded_df = pd.read_parquet(buf)
print("\nDataFrame Loaded from In-Memory parquet:")
print(loaded_df)

Following is an output of the above code −

Original DataFrame:
   Col_1  Col_2
0      0      5
1      1      6
2      2      7
3      3      8
4      4      9

DataFrame Loaded from In-Memory parquet:
   Col_1  Col_2
0      0      5
1      1      6
2      2      7
3      3      8
4      4      9