Scaling to large datasets
pandas provides data structures for in-memory analytics, which makes using pandas to analyze datasets that are larger than memory somewhat tricky. Even datasets that are a sizable fraction of memory become unwieldy, as some pandas operations need to make intermediate copies.
This document provides a few recommendations for scaling your analysis to larger datasets. It's a complement to Enhancing performance, which focuses on speeding up analysis for datasets that fit in memory.
Load less data
Suppose our raw dataset on disk has many columns.
In [1]: import pandas as pd

In [2]: import numpy as np

In [3]: def make_timeseries(start="2000-01-01", end="2000-12-31", freq="1D", seed=None):
   ...:     index = pd.date_range(start=start, end=end, freq=freq, name="timestamp")
   ...:     n = len(index)
   ...:     state = np.random.RandomState(seed)
   ...:     columns = {
   ...:         "name": state.choice(["Alice", "Bob", "Charlie"], size=n),
   ...:         "id": state.poisson(1000, size=n),
   ...:         "x": state.rand(n) * 2 - 1,
   ...:         "y": state.rand(n) * 2 - 1,
   ...:     }
   ...:     df = pd.DataFrame(columns, index=index, columns=sorted(columns))
   ...:     if df.index[-1] == end:
   ...:         df = df.iloc[:-1]
   ...:     return df
   ...:

In [4]: timeseries = [
   ...:     make_timeseries(freq="1min", seed=i).rename(columns=lambda x: f"{x}_{i}")
   ...:     for i in range(10)
   ...: ]
   ...:

In [5]: ts_wide = pd.concat(timeseries, axis=1)

In [6]: ts_wide.head()
Out[6]:
                     id_0 name_0       x_0  ...   name_9       x_9       y_9
timestamp                                   ...
2000-01-01 00:00:00   977  Alice -0.821225  ...  Charlie -0.957208 -0.757508
2000-01-01 00:01:00  1018    Bob -0.219182  ...    Alice -0.414445 -0.100298
2000-01-01 00:02:00   927  Alice  0.660908  ...  Charlie -0.325838  0.581859
2000-01-01 00:03:00   997    Bob -0.852458  ...      Bob  0.992033 -0.686692
2000-01-01 00:04:00   965    Bob  0.717283  ...  Charlie -0.924556 -0.184161

[5 rows x 40 columns]

In [7]: ts_wide.to_parquet("timeseries_wide.parquet")
To load the columns we want, we have two options. Option 1 loads in all the data and then filters to what we need.
In [8]: columns = ["id_0", "name_0", "x_0", "y_0"]

In [9]: pd.read_parquet("timeseries_wide.parquet")[columns]
Out[9]:
                     id_0 name_0       x_0       y_0
timestamp
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]
Option 2 only loads the columns we request.
In [10]: pd.read_parquet("timeseries_wide.parquet", columns=columns)
Out[10]:
                     id_0 name_0       x_0       y_0
timestamp
2000-01-01 00:00:00   977  Alice -0.821225  0.906222
2000-01-01 00:01:00  1018    Bob -0.219182  0.350855
2000-01-01 00:02:00   927  Alice  0.660908 -0.798511
2000-01-01 00:03:00   997    Bob -0.852458  0.735260
2000-01-01 00:04:00   965    Bob  0.717283  0.393391
...                   ...    ...       ...       ...
2000-12-30 23:56:00  1037    Bob -0.814321  0.612836
2000-12-30 23:57:00   980    Bob  0.232195 -0.618828
2000-12-30 23:58:00   965  Alice -0.231131  0.026310
2000-12-30 23:59:00   984  Alice  0.942819  0.853128
2000-12-31 00:00:00  1003  Alice  0.201125 -0.136655

[525601 rows x 4 columns]
If we were to measure the memory usage of the two calls, we'd see that specifying columns uses about 1/10th the memory in this case.
With pandas.read_csv(), you can specify usecols to limit the columns read into memory. Not all file formats that can be read by pandas provide an option to read a subset of columns.
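For example, a minimal sketch of the same column selection with a CSV reader (the file timeseries_wide.csv is hypothetical):

import pandas as pd

# Only the requested columns end up in the resulting DataFrame.
columns = ["id_0", "name_0", "x_0", "y_0"]
df = pd.read_csv("timeseries_wide.csv", usecols=columns)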
Use efficient datatypes
The default pandas data types are not the most memory efficient. This isespecially true for text data columns with relatively few unique values (commonlyreferred to as “low-cardinality” data). By using more efficient data types, youcan store larger datasets in memory.
In [11]: ts = make_timeseries(freq="30s", seed=0)

In [12]: ts.to_parquet("timeseries.parquet")

In [13]: ts = pd.read_parquet("timeseries.parquet")

In [14]: ts
Out[14]:
                       id     name         x         y
timestamp
2000-01-01 00:00:00  1041    Alice  0.889987  0.281011
2000-01-01 00:00:30   988      Bob -0.455299  0.488153
2000-01-01 00:01:00  1018    Alice  0.096061  0.580473
2000-01-01 00:01:30   992      Bob  0.142482  0.041665
2000-01-01 00:02:00   960      Bob -0.036235  0.802159
...                   ...      ...       ...       ...
2000-12-30 23:58:00  1022    Alice  0.266191  0.875579
2000-12-30 23:58:30   974    Alice -0.009826  0.413686
2000-12-30 23:59:00  1028  Charlie  0.307108 -0.656789
2000-12-30 23:59:30  1002    Alice  0.202602  0.541335
2000-12-31 00:00:00   987    Alice  0.200832  0.615972

[1051201 rows x 4 columns]
Now, let's inspect the data types and memory usage to see where we should focus our attention.
In [15]: ts.dtypes
Out[15]:
id        int64
name     object
x       float64
y       float64
dtype: object
In [16]: ts.memory_usage(deep=True)  # memory usage in bytes
Out[16]:
Index     8409608
id        8409608
name     65176434
x         8409608
y         8409608
dtype: int64
The name column is taking up much more memory than any other. It has just a few unique values, so it's a good candidate for converting to a pandas.Categorical. With a pandas.Categorical, we store each unique name once and use space-efficient integers to know which specific name is used in each row.
In [17]: ts2 = ts.copy()

In [18]: ts2["name"] = ts2["name"].astype("category")

In [19]: ts2.memory_usage(deep=True)
Out[19]:
Index    8409608
id       8409608
name     1051495
x        8409608
y        8409608
dtype: int64
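To see what the categorical actually stores, a small illustrative sketch (assuming the ts2 DataFrame from above):

# The unique names are stored once ...
ts2["name"].cat.categories
# ... and each row holds a compact integer code pointing into those categories.
ts2["name"].cat.codes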
We can go a bit further and downcast the numeric columns to their smallest types using pandas.to_numeric().
In [20]: ts2["id"] = pd.to_numeric(ts2["id"], downcast="unsigned")

In [21]: ts2[["x", "y"]] = ts2[["x", "y"]].apply(pd.to_numeric, downcast="float")

In [22]: ts2.dtypes
Out[22]:
id        uint16
name    category
x        float32
y        float32
dtype: object
In [23]: ts2.memory_usage(deep=True)
Out[23]:
Index    8409608
id       2102402
name     1051495
x        4204804
y        4204804
dtype: int64
In [24]: reduction = ts2.memory_usage(deep=True).sum() / ts.memory_usage(deep=True).sum()

In [25]: print(f"{reduction:0.2f}")
0.20
In all, we've reduced the in-memory footprint of this dataset to 1/5 of its original size.
See Categorical data for more on pandas.Categorical and dtypes for an overview of all of pandas' dtypes.
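Efficient dtypes can also be requested while loading, for readers that support it, such as the dtype parameter of pandas.read_csv. A minimal sketch (the file timeseries.csv is hypothetical; Parquet already stores the dtypes it was written with, so this matters mainly for text formats):

import pandas as pd

# Hypothetical CSV version of the timeseries data from above.
ts = pd.read_csv(
    "timeseries.csv",
    index_col="timestamp",
    parse_dates=["timestamp"],
    dtype={"name": "category", "id": "uint16"},  # efficient dtypes applied while reading
)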
Use chunking
Some workloads can be achieved with chunking by splitting a large problem into a bunch of small problems. For example, converting an individual CSV file into a Parquet file and repeating that for each file in a directory. As long as each chunk fits in memory, you can work with datasets that are much larger than memory.
Note
Chunking works well when the operation you're performing requires zero or minimal coordination between chunks. For more complicated workflows, you're better off using other libraries.
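As a minimal sketch of the CSV-to-Parquet pattern mentioned above (the data/csv directory and its contents are hypothetical):

import pathlib

import pandas as pd

# Convert each CSV file to Parquet one file at a time, so only one chunk
# is ever in memory.
for csv_path in pathlib.Path("data/csv").glob("*.csv"):
    df = pd.read_csv(csv_path)
    df.to_parquet(csv_path.with_suffix(".parquet"))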
Suppose we have an even larger "logical dataset" on disk that's a directory of parquet files. Each file in the directory represents a different year of the entire dataset.
In [26]: import pathlib

In [27]: N = 12

In [28]: starts = [f"20{i:>02d}-01-01" for i in range(N)]

In [29]: ends = [f"20{i:>02d}-12-13" for i in range(N)]

In [30]: pathlib.Path("data/timeseries").mkdir(exist_ok=True)

In [31]: for i, (start, end) in enumerate(zip(starts, ends)):
   ....:     ts = make_timeseries(start=start, end=end, freq="1min", seed=i)
   ....:     ts.to_parquet(f"data/timeseries/ts-{i:0>2d}.parquet")
   ....:
data
└── timeseries
    ├── ts-00.parquet
    ├── ts-01.parquet
    ├── ts-02.parquet
    ├── ts-03.parquet
    ├── ts-04.parquet
    ├── ts-05.parquet
    ├── ts-06.parquet
    ├── ts-07.parquet
    ├── ts-08.parquet
    ├── ts-09.parquet
    ├── ts-10.parquet
    └── ts-11.parquet
Now we'll implement an out-of-core pandas.Series.value_counts(). The peak memory usage of this workflow is the single largest chunk, plus a small series storing the unique value counts up to this point. As long as each individual file fits in memory, this will work for arbitrary-sized datasets.
In [32]: %%time
   ....: files = pathlib.Path("data/timeseries/").glob("ts*.parquet")
   ....: counts = pd.Series(dtype=int)
   ....: for path in files:
   ....:     df = pd.read_parquet(path)
   ....:     counts = counts.add(df["name"].value_counts(), fill_value=0)
   ....: counts.astype(int)
   ....:
CPU times: user 985 ms, sys: 45.9 ms, total: 1.03 s
Wall time: 1.02 s
Out[32]:
name
Alice      1994645
Bob        1993692
Charlie    1994875
dtype: int64
Some readers, like pandas.read_csv(), offer parameters to control the chunksize when reading a single file.
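For instance, a minimal sketch of the same value-counts workflow against a single large CSV file (the file name and chunk size are illustrative):

import pandas as pd

counts = pd.Series(dtype=int)
# pandas.read_csv with chunksize yields DataFrames of at most 1,000,000 rows each.
for chunk in pd.read_csv("timeseries.csv", chunksize=1_000_000):
    counts = counts.add(chunk["name"].value_counts(), fill_value=0)
counts = counts.astype(int)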
Manually chunking is an OK option for workflows that don't require overly sophisticated operations. Some operations, like pandas.DataFrame.groupby(), are much harder to do chunkwise. In these cases, you may be better off switching to a different library that implements these out-of-core algorithms for you.
Use Other Libraries
There are other libraries that provide similar APIs to pandas and work nicely with pandas DataFrames, and they can give you the ability to scale your large-dataset processing and analytics through parallel runtimes, distributed memory, clustering, and so on. You can find more information on the ecosystem page.
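As one illustration only (Dask is one of the libraries listed on the ecosystem page, not an endorsement; this sketch assumes it is installed), a pandas-like API can process the chunked dataset from the previous section in parallel:

import dask.dataframe as dd

# Read the directory of Parquet files as one lazy, partitioned DataFrame.
ddf = dd.read_parquet("data/timeseries/")

# Familiar pandas-style operations; .compute() runs them in parallel, chunk by chunk.
result = ddf["name"].value_counts().compute()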