I am trying to read a large CSV file (approx. 6 GB) in pandas and I am getting a memory error:
```
MemoryError                               Traceback (most recent call last)
<ipython-input-58-67a72687871b> in <module>()
----> 1 data=pd.read_csv('aphro.csv',sep=';')
...
MemoryError:
```

Any help on this?
- Curiously, a very similar question was asked almost a year before this one... – DarkCygnus, Jun 26, 2017 at 21:46
- Possible duplicate of Reading large text files with Pandas – unode, Nov 22, 2017 at 17:05
- Does this answer your question? "Large data" work flows using pandas – AMC, Mar 16, 2020 at 21:09
16 Answers
The error shows that the machine does not have enough memory to read the entire CSV into a DataFrame at one time. Assuming you do not need the entire dataset in memory all at one time, one way to avoid the problem would be to process the CSV in chunks (by specifying the chunksize parameter):
```python
chunksize = 10 ** 6
for chunk in pd.read_csv(filename, chunksize=chunksize):
    # chunk is a DataFrame. To "process" the rows in the chunk:
    for index, row in chunk.iterrows():
        print(row)
```

The chunksize parameter specifies the number of rows per chunk. (The last chunk may contain fewer than chunksize rows, of course.)
pandas >= 1.2
read_csv with chunksize returns a context manager, to be used like so:
```python
chunksize = 10 ** 6
with pd.read_csv(filename, chunksize=chunksize) as reader:
    for chunk in reader:
        process(chunk)
```

See GH38225.
- Avoid calling DF.append(chunk) inside the loop: each call to DF.append returns a new DataFrame, so a loop of appends requires O(N^2) copying operations. It is better to append the aggregated data to a list, and then build the DataFrame from the list with one call to pd.DataFrame or pd.concat (depending on the type of aggregated data), which reduces the copying to O(N).
- The chunksize parameter refers to the number of rows per chunk. The last chunk may contain fewer than chunksize rows, of course.
- Calling pd.concat([list_of_dfs]) once after the loop is much faster than calling pd.concat or df.append many times within the loop. Of course, you'll need a considerable amount of memory to hold the entire 6 GB CSV as one DataFrame, so chunking shouldn't always be the first port of call for this problem.
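To make the comments above concrete, here is a minimal sketch of the list-then-concat pattern; the inline CSV is just a stand-in for the real 6 GB file:

```python
# Sketch of the pattern the comments recommend: buffer chunks in a list,
# then concatenate ONCE after the loop (O(N) copying instead of O(N^2)).
# The inline CSV below stands in for the real large file.
import io

import pandas as pd

csv_data = io.StringIO("a;b\n1;2\n3;4\n5;6\n")

chunks = []
for chunk in pd.read_csv(csv_data, sep=";", chunksize=2):
    chunks.append(chunk)                   # cheap: just grows a Python list

df = pd.concat(chunks, ignore_index=True)  # single concat outside the loop
print(df.shape)  # (3, 2)
```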
Is the file large due to repeated non-numeric data or unwanted columns?
If so, you can sometimes see massive memory savings by reading in columns as categories and selecting required columns via the pd.read_csv usecols parameter.

Does your workflow require slicing, manipulating, exporting?
If so, you can use dask.dataframe to slice, perform your calculations and export iteratively. Chunking is performed silently by dask, which also supports a subset of the pandas API.
If all else fails, read line by line via chunks.
Chunk via pandas or via the csv library as a last resort.
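A minimal sketch of the first idea above, combining usecols with categorical dtypes; the column names are invented for illustration:

```python
# Hedged sketch: skip unwanted columns with usecols and read repeated
# strings as 'category'. Column names here are made up for illustration.
import io

import pandas as pd

raw = io.StringIO("id,city,temp,notes\n1,Oslo,3.2,x\n2,Oslo,4.1,y\n3,Bergen,6.0,z\n")

df = pd.read_csv(
    raw,
    usecols=["city", "temp"],      # never materialize 'id' or 'notes'
    dtype={"city": "category"},    # repeated strings become small integer codes
)
print(df.dtypes)
```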
For large data I recommend you use the library "dask", e.g.:
```python
# Dataframes implement the Pandas API
import dask.dataframe as dd
df = dd.read_csv('s3://.../2018-*-*.csv')
```

You can read more from the documentation here.
Another great alternative would be to use modin, because all the functionality is identical to pandas yet it leverages distributed dataframe libraries such as dask.
From my projects, another superior library is datatable.
```python
# Datatable python library
import datatable as dt
df = dt.fread("s3://.../2018-*-*.csv")
```
I proceeded like this:
```python
chunks = pd.read_table('aphro.csv', chunksize=1000000, sep=';',
                       names=['lat', 'long', 'rf', 'date', 'slno'],
                       index_col='slno', header=None, parse_dates=['date'])

df = pd.DataFrame()
%time df = pd.concat(chunk.groupby(['lat', 'long',
    chunk['date'].map(lambda x: x.year)])['rf'].agg(['sum']) for chunk in chunks)
```
You can read in the data as chunks and save each chunk as pickle.
```python
import pandas as pd
import pickle

in_path = ""         # Path where the large file is
out_path = ""        # Path to save the pickle files to
chunk_size = 400000  # size of chunks relies on your available memory
separator = "~"

reader = pd.read_csv(in_path, sep=separator, chunksize=chunk_size,
                     low_memory=False)

for i, chunk in enumerate(reader):
    out_file = out_path + "/data_{}.pkl".format(i + 1)
    with open(out_file, "wb") as f:
        pickle.dump(chunk, f, pickle.HIGHEST_PROTOCOL)
```

In the next step you read in the pickles and append each pickle to your desired dataframe.
```python
import glob

pickle_path = ""  # Same path as out_path, i.e. where the pickle files are
data_p_files = []
for name in glob.glob(pickle_path + "/data_*.pkl"):
    data_p_files.append(name)

df = pd.DataFrame([])
for i in range(len(data_p_files)):
    df = df.append(pd.read_pickle(data_p_files[i]), ignore_index=True)
```
I want to make a more comprehensive answer based on most of the potential solutions that are already provided. I also want to point out one more potential aid that may help the reading process.
Option 1: dtypes
"dtypes" is a pretty powerful parameter that you can use to reduce the memory pressure of read methods. See this and this answer. By default, pandas tries to infer the dtypes of the data.
Every value stored in a data structure requires a memory allocation. At a basic level, refer to the values below (the table below illustrates values for the C programming language):
```
The maximum value of UNSIGNED CHAR = 255
The minimum value of SHORT INT     = -32768
The maximum value of SHORT INT     = 32767
The minimum value of INT           = -2147483648
The maximum value of INT           = 2147483647
The minimum value of CHAR          = -128
The maximum value of CHAR          = 127
The minimum value of LONG          = -9223372036854775808
The maximum value of LONG          = 9223372036854775807
```

Refer to this page to see the matching between NumPy and C types.
Let's say you have an array of integer digits. You could, both theoretically and practically, assign it, say, a 16-bit integer type, but you would then allocate more memory than you actually need to store that array. To prevent this, you can set the dtype option on read_csv: you do not want to store the array items as long integers when they actually fit in an 8-bit integer (np.int8 or np.uint8).
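As a quick illustration of the difference this makes, compare the same values stored as 64-bit and as 8-bit integers:

```python
# Illustration: identical values, 8x difference in memory footprint.
import numpy as np
import pandas as pd

s_big = pd.Series(range(256), dtype=np.int64)    # pandas' default for integers
s_small = pd.Series(range(256), dtype=np.uint8)  # enough for values 0..255

print(s_big.memory_usage(index=False))    # 2048 bytes (8 per value)
print(s_small.memory_usage(index=False))  # 256 bytes (1 per value)
```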
Observe the following dtype map.
Source: https://pbpython.com/pandas_dtypes.html
You can pass the dtype parameter to pandas read methods as a dict like {column: type}.
```python
import numpy as np
import pandas as pd

df_dtype = {
    "column_1": int,
    "column_2": str,
    "column_3": np.int16,
    "column_4": np.uint8,
    ...
    "column_n": np.float32
}

df = pd.read_csv('path/to/file', dtype=df_dtype)
```

Option 2: Read by Chunks
Reading the data in chunks allows you to access a part of the data in memory, and you can apply preprocessing to your data and preserve the processed data rather than the raw data. It'd be much better if you combine this option with the first one, dtypes.
I want to point out the pandas cookbook sections for that process, which you can find here. Note these two sections there:
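A minimal sketch combining the two options: read in chunks with an explicit dtype and keep only small per-chunk aggregates instead of the raw rows (the file contents and column names here are invented):

```python
# Hedged sketch: chunked read with explicit dtypes, keeping only the
# per-chunk aggregates rather than the raw rows. Data is invented.
import io

import numpy as np
import pandas as pd

raw = io.StringIO("key,value\na,1\nb,2\na,3\nb,4\n")

partials = []
for chunk in pd.read_csv(raw, chunksize=2, dtype={"value": np.int32}):
    partials.append(chunk.groupby("key")["value"].sum())  # small result only

# combine the partial sums into the final answer
result = pd.concat(partials).groupby(level=0).sum()
print(result.to_dict())  # {'a': 4, 'b': 6}
```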
Option 3: Dask
Dask is a framework that is defined on Dask's website as:
Dask provides advanced parallelism for analytics, enabling performance at scale for the tools you love
It was born to cover the parts that pandas cannot reach. Dask is a powerful framework that gives you access to much more data by processing it in a distributed way.
You can use Dask to preprocess your data as a whole; Dask takes care of the chunking part, so unlike pandas you can just define your processing steps and let Dask do the work. Dask does not apply the computations before they are explicitly triggered by compute and/or persist (see the answer here for the difference).
Other Aids (Ideas)
- ETL flow designed for the data. Keeping only what is needed from the raw data.
- First, apply ETL to whole data with frameworks like Dask or PySpark, and export the processed data.
- Then see if the processed data can be fit in the memory as a whole.
- Consider increasing your RAM.
- Consider working with that data on a cloud platform.
Before using the chunksize option, if you want to be sure about the process function that you want to write inside the chunking for-loop (as mentioned by @unutbu), you can simply use the nrows option.
```python
small_df = pd.read_csv(filename, nrows=100)
```

Once you are sure that the process block is ready, you can put that in the chunking for loop for the entire dataframe.
The functions read_csv and read_table are almost the same, but you must assign the delimiter "," when you use read_table in your program (its default delimiter is a tab).
```python
def get_from_action_data(fname, chunk_size=100000):
    reader = pd.read_csv(fname, header=0, iterator=True)
    chunks = []
    loop = True
    while loop:
        try:
            chunk = reader.get_chunk(chunk_size)[["user_id", "type"]]
            chunks.append(chunk)
        except StopIteration:
            loop = False
            print("Iteration is stopped")

    df_ac = pd.concat(chunks, ignore_index=True)
```
Solution 1:
Solution 2:
```python
TextFileReader = pd.read_csv(path, chunksize=1000)  # the number of rows per chunk

dfList = []
for df in TextFileReader:
    dfList.append(df)

df = pd.concat(dfList, sort=False)
```
Here follows an example:
```python
chunkTemp = []
queryTemp = []
query = pd.DataFrame()

for chunk in pd.read_csv(file, header=0, chunksize=<your_chunksize>,
                         iterator=True, low_memory=False):

    # REPLACING BLANK SPACES AT COLUMNS' NAMES FOR SQL OPTIMIZATION
    chunk = chunk.rename(columns={c: c.replace(' ', '') for c in chunk.columns})

    # YOU CAN EITHER:
    # 1) BUFFER THE CHUNKS IN ORDER TO LOAD YOUR WHOLE DATASET
    chunkTemp.append(chunk)

    # 2) DO YOUR PROCESSING OVER A CHUNK AND STORE THE RESULT OF IT
    query = chunk[chunk[<column_name>].str.startswith(<some_pattern>)]

    # BUFFERING PROCESSED DATA
    queryTemp.append(query)

# ! NEVER DO pd.concat OR pd.DataFrame() INSIDE A LOOP
print("Database: CONCATENATING CHUNKS INTO A SINGLE DATAFRAME")
chunk = pd.concat(chunkTemp)
print("Database: LOADED")

# CONCATENATING PROCESSED DATA
query = pd.concat(queryTemp)
print(query)
```
You can try sframe, which has the same syntax as pandas but allows you to manipulate files that are bigger than your RAM.
If you use pandas to read a large file in chunks and then yield row by row, here is what I have done:
```python
import pandas as pd

def chunck_generator(filename, header=False, chunk_size=10 ** 5):
    for chunk in pd.read_csv(filename, delimiter=',', iterator=True,
                             chunksize=chunk_size, parse_dates=[1]):
        yield chunk

def _generator(filename, header=False, chunk_size=10 ** 5):
    chunk = chunck_generator(filename, header=False, chunk_size=10 ** 5)
    for row in chunk:
        yield row

if __name__ == "__main__":
    filename = r'file.csv'
    generator = _generator(filename=filename)
    while True:
        print(next(generator))
```
In case someone is still looking for something like this, I found that this new library called modin can help. It uses distributed computing that can help with the read. Here's a nice article comparing its functionality with pandas. It essentially uses the same functions as pandas.
```python
import modin.pandas as pd
pd.read_csv(CSV_FILE_NAME)
```
In addition to the answers above, for those who want to process CSV and then export to csv, parquet or SQL, d6tstack is another good option. You can load multiple files and it deals with data schema changes (added/removed columns). Chunked out-of-core support is already built in.
```python
def apply(dfg):
    # do stuff
    return dfg

c = d6tstack.combine_csv.CombinerCSV(['bigfile.csv'], apply_after_read=apply,
                                     sep=',', chunksize=1e6)

# or
c = d6tstack.combine_csv.CombinerCSV(glob.glob('*.csv'),
                                     apply_after_read=apply, chunksize=1e6)

# output to various formats, automatically chunked to reduce memory consumption
c.to_csv_combine(filename='out.csv')
c.to_parquet_combine(filename='out.pq')
c.to_psql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')   # fast for postgres
c.to_mysql_combine('mysql+mysqlconnector://usr:pwd@localhost/db', 'tablename') # fast for mysql
c.to_sql_combine('postgresql+psycopg2://usr:pwd@localhost/db', 'tablename')    # slow but flexible
```
If you have a csv file with millions of data entries and you want to load the full dataset, you should use dask_cudf:
```python
import dask_cudf as dc

df = dc.read_csv("large_data.csv")
```
```python
def read_csv_with_progress(file_path, sep):
    import os
    import pandas as pd
    from tqdm import tqdm

    chunk_size = 50000  # Number of lines to read in each iteration

    # Estimate the total number of lines from the file size and a sample
    # of the first few lines' average length
    print("Calculating average line length + getting file size")
    counter = 0
    total_length = 0
    num_to_sample = 10
    for line in open(file_path, 'r'):
        counter += 1
        if counter > 1:
            total_length += len(line)
        if counter == num_to_sample + 1:
            break
    file_size = os.path.getsize(file_path)
    avg_line_length = total_length / num_to_sample
    avg_number_of_lines = int(file_size / avg_line_length)

    chunks = []
    with tqdm(total=avg_number_of_lines, desc='Reading CSV') as pbar:
        for chunk in pd.read_csv(file_path, chunksize=chunk_size,
                                 low_memory=False, sep=sep):
            chunks.append(chunk)
            pbar.update(chunk.shape[0])

    print("Concating...")
    df = pd.concat(chunks, ignore_index=True)
    return df
```