
Scaling to large datasets#

pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.

This document provides a few recommendations for scaling your analysis to larger datasets. It’s a complement to Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.

Load less data#

Suppose our raw dataset on disk has many columns.

In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
   ...:     index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
   ...:     n = len(index)
   ...:     state = np.random.RandomState(seed)
   ...:     columns = {
   ...:         "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
   ...:         "id": state.poisson(1000, size=n),
   ...:         "x": state.rand(n) * 2 - 1,
   ...:         "y": state.rand(n) * 2 - 1,
   ...:     }
   ...:     df = pd.DataFrame(columns, index=index, columns=sorted(columns))
   ...:     if df.index[-1] == end:
   ...:         df = df.iloc[:-1]
   ...:     return df
   ...: 

In [4]: timeseries = [
   ...:     make_timeseries(freq="1min", seed=i).rename(columns=lambda x: f"{x}_{i}")
   ...:     for i in range(10)
   ...: ]
   ...: 

In [5]: ts_wide = pd.concat(timeseries, axis=1)

In [6]: ts_wide.head()
Out[6]: 
                     id_0 name_0       x_0  ...   name_9       x_9       y_9
timestamp                                   ...
2000-01-01 00:00:00   977  Alice -0.821225  ...  Charlie -0.957208 -0.757508
2000-01-01 00:01:00  1018    Bob -0.219182  ...    Alice -0.414445 -0.100298
2000-01-01 00:02:00   927  Alice  0.660908  ...  Charlie -0.325838  0.581859
2000-01-01 00:03:00   997    Bob -0.852458  ...      Bob  0.992033 -0.686692
2000-01-01 00:04:00   965    Bob  0.717283  ...  Charlie -0.924556 -0.184161

[5 rows x 40 columns]

In [7]: ts_wide.to_parquet("timeseries_wide.parquet")

To load the columns we want, we have two options. Option 1 loads in all the data and then filters to what we need.

In [8]: columns = ["id_0", "name_0", "x_0", "y_0"]

In [9]: pd.read_parquet("timeseries_wide.parquet")[columns]
Out[9]: 
                     id_0 name_0       x_0       y_0
timestamp
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

Option 2 only loads the columns we request.

In [10]: pd.read_parquet("timeseries_wide.parquet", columns=columns)
Out[10]: 
                     id_0 name_0       x_0       y_0
timestamp
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]

If we were to measure the memory usage of the two calls, we’d see that specifying columns uses about 1/10th the memory in this case.
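One way to check this yourself is to measure the peak memory around each call with the standard-library tracemalloc module. The sketch below is only approximate (tracemalloc sees allocations routed through Python’s allocator, so the absolute numbers can differ from your process’s real footprint), but the gap between the two options should be clear.

import tracemalloc

import pandas as pd

columns = ["id_0", "name_0", "x_0", "y_0"]


def peak_bytes(load):
    # Run the loading function and report the peak traced memory, in bytes.
    tracemalloc.start()
    load()
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return peak


# Option 1: the peak includes all 40 columns before the subset is taken.
peak_all = peak_bytes(lambda: pd.read_parquet("timeseries_wide.parquet")[columns])

# Option 2: only the four requested columns are ever materialized.
peak_cols = peak_bytes(lambda: pd.read_parquet("timeseries_wide.parquet", columns=columns))

print(f"read everything: {peak_all:,} bytes")
print(f"read only needed columns: {peak_cols:,} bytes")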

With pandas.read_csv(), you can specify usecols to limit the columns read into memory. Not all file formats that can be read by pandas provide an option to read a subset of columns.
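A minimal sketch of usecols, assuming a hypothetical timeseries_wide.csv with the same layout as the parquet file above:

import pandas as pd

# Only the listed columns are parsed and held in memory; everything else in
# the file is skipped while reading.
df = pd.read_csv(
    "timeseries_wide.csv",  # hypothetical CSV version of the dataset above
    usecols=["timestamp", "id_0", "name_0", "x_0", "y_0"],
    parse_dates=["timestamp"],
    index_col="timestamp",
)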

Use efficient datatypes#

The default pandas data types are not the most memory efficient. This is especially true for text data columns with relatively few unique values (commonly referred to as “low-cardinality” data). By using more efficient data types, you can store larger datasets in memory.

In [11]: ts = make_timeseries(freq="30s", seed=0)

In [12]: ts.to_parquet("timeseries.parquet")

In [13]: ts = pd.read_parquet("timeseries.parquet")

In [14]: ts
Out[14]: 
                       id     name         x         y
timestamp
2000-01-01 00:00:00  1041    Alice  0.889987  0.281011
2000-01-01 00:00:30   988      Bob -0.455299  0.488153
2000-01-01 00:01:00  1018    Alice  0.096061  0.580473
2000-01-01 00:01:30   992      Bob  0.142482  0.041665
2000-01-01 00:02:00   960      Bob -0.036235  0.802159
...                   ...      ...       ...       ...
2000-12-30 23:58:00  1022    Alice  0.266191  0.875579
2000-12-30 23:58:30   974    Alice -0.009826  0.413686
2000-12-30 23:59:00  1028  Charlie  0.307108 -0.656789
2000-12-30 23:59:30  1002    Alice  0.202602  0.541335
2000-12-31 00:00:00   987    Alice  0.200832  0.615972

[1051201 rows x 4 columns]

Now, let’s inspect the data types and memory usage to see where we should focus our attention.

In [15]: ts.dtypes
Out[15]: 
id        int64
name     object
x       float64
y       float64
dtype: object

In [16]: ts.memory_usage(deep=True)  # memory usage in bytes
Out[16]: 
Index     8409608
id        8409608
name     65176434
x         8409608
y         8409608
dtype: int64

The name column is taking up much more memory than any other. It has just a few unique values, so it’s a good candidate for converting to a pandas.Categorical. With a pandas.Categorical, we store each unique name once and use space-efficient integers to know which specific name is used in each row.

In [17]: ts2 = ts.copy()

In [18]: ts2["name"] = ts2["name"].astype("category")

In [19]: ts2.memory_usage(deep=True)
Out[19]: 
Index    8409608
id       8409608
name     1051495
x        8409608
y        8409608
dtype: int64

We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().

In [20]: ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")

In [21]: ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")

In [22]: ts2.dtypes
Out[22]: 
id        uint16
name    category
x        float32
y        float32
dtype: object

In [23]: ts2.memory_usage(deep=True)
Out[23]: 
Index    8409608
id       2102402
name     1051495
x        4204804
y        4204804
dtype: int64

In [24]: reduction = ts2.memory_usage(deep=True).sum() / ts.memory_usage(deep=True).sum()

In [25]: print(f"{reduction:0.2f}")
0.20

In all, we’ve reduced the in-memory footprint of this dataset to 1/5 of its original size.

See Categorical data for more on pandas.Categorical and dtypes for an overview of all of pandas’ dtypes.

Use chunking#

Some workloads can be achieved with chunking: splitting a large problem into a bunch of small problems. For example, convert an individual CSV file into a Parquet file, then repeat that for each file in a directory. As long as each chunk fits in memory, you can work with datasets that are much larger than memory.
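A minimal sketch of that CSV-to-Parquet pattern, assuming a hypothetical data/csv/ directory with one file per year and a timestamp column (the paths and column name are illustrative, not part of the example dataset below):

import pathlib

import pandas as pd

src = pathlib.Path("data/csv")      # hypothetical input directory, one CSV per year
dst = pathlib.Path("data/parquet")
dst.mkdir(parents=True, exist_ok=True)

# Convert one file at a time, so only a single year is ever in memory.
for csv_path in sorted(src.glob("*.csv")):
    df = pd.read_csv(csv_path, parse_dates=["timestamp"], index_col="timestamp")
    df.to_parquet(dst / csv_path.with_suffix(".parquet").name)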

Note

Chunking works well when the operation you’re performing requires zero or minimal coordination between chunks. For more complicated workflows, you’re better off using other libraries.

Suppose we have an even larger “logical dataset” on disk that’s a directory of parquet files. Each file in the directory represents a different year of the entire dataset.

In [26]: import pathlib

In [27]: N = 12

In [28]: starts = [f"20{i:>02d}-01-01" for i in range(N)]

In [29]: ends = [f"20{i:>02d}-12-13" for i in range(N)]

In [30]: pathlib.Path("data/timeseries").mkdir(exist_ok=True)

In [31]: for i, (start, end) in enumerate(zip(starts, ends)):
   ....:     ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
   ....:     ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
   ....: 
data
└── timeseries
    ├── ts-00.parquet
    ├── ts-01.parquet
    ├── ts-02.parquet
    ├── ts-03.parquet
    ├── ts-04.parquet
    ├── ts-05.parquet
    ├── ts-06.parquet
    ├── ts-07.parquet
    ├── ts-08.parquet
    ├── ts-09.parquet
    ├── ts-10.parquet
    └── ts-11.parquet

Now we’ll implement an out-of-core pandas.Series.value_counts(). The peak memory usage of this workflow is the single largest chunk, plus a small series storing the unique value counts up to this point. As long as each individual file fits in memory, this will work for arbitrary-sized datasets.

In [32]: %%time
   ....: files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
   ....: counts = pd.Series(dtype=int)
   ....: for path in files:
   ....:     df = pd.read_parquet(path)
   ....:     counts = counts.add(df["name"].value_counts(), fill_value=0)
   ....: counts.astype(int)
   ....: 
CPU times: user 985 ms, sys: 45.9 ms, total: 1.03 s
Wall time: 1.02 s
Out[32]: 
name
Alice      1994645
Bob        1993692
Charlie    1994875
dtype: int64

Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
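For instance, pandas.read_csv() accepts a chunksize argument that yields the file in pieces rather than all at once. A sketch of the same value-count workflow over a single, hypothetical large CSV (the file name and "name" column are assumptions for illustration):

import pandas as pd

counts = pd.Series(dtype=int)

# chunksize makes read_csv return an iterator of DataFrames with (up to)
# one million rows each instead of one giant DataFrame.
with pd.read_csv("timeseries.csv", chunksize=1_000_000) as reader:  # hypothetical file
    for chunk in reader:
        counts = counts.add(chunk["name"].value_counts(), fill_value=0)

counts = counts.astype(int)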

Manually chunking is an OK option for workflows that don’t require overly sophisticated operations. Some operations, like pandas.DataFrame.groupby(), are much harder to do chunkwise. In these cases, you may be better off switching to a different library that implements these out-of-core algorithms for you.

Use Other Libraries#

There are other libraries that provide APIs similar to pandas and work nicely with pandas DataFrames, and that can scale your large dataset processing and analytics through parallel runtimes, distributed memory, clustering, etc. You can find more information on the ecosystem page.

