I am trying to manipulate a large CSV file using Pandas. When I run

    df = pd.read_csv(strFileName, sep='\t', delimiter='\t')

it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory". wc -l indicates there are 13822117 lines. I need to aggregate over this CSV as a data frame. Is there a way to handle this other than splitting the CSV into several files and writing code to merge the results? Any suggestions on how to do that? Thanks
The input is like this:
columns=[ka, kb_1, kb_2, timeofEvent, timeInterval]

    0: '3M' '2345' '2345' '2014-10-5',  3000
    1: '3M' '2958' '2152' '2015-3-22',  5000
    2: 'GE' '2183' '2183' '2012-12-31', 515
    3: '3M' '2958' '2958' '2015-3-10',  395
    4: 'GE' '2183' '2285' '2015-4-19',  1925
    5: 'GE' '2598' '2598' '2015-3-17',  1915

And the desired output is like this:
columns=[ka, kb, errorNum, errorRate, totalNum of records]

    '3M', '2345', 0, 0%,  1
    '3M', '2958', 1, 50%, 2
    'GE', '2183', 1, 50%, 2
    'GE', '2598', 0, 0%,  1

If the data set were small, the code below (provided by another answer) could be used:
    df2 = df.groupby(['ka','kb_1'])['isError'].agg({ 'errorNum': 'sum', 'recordNum': 'count' })
    df2['errorRate'] = df2['errorNum'] / df2['recordNum']

            ka  kb_1  recordNum  errorNum  errorRate
            3M  2345          1         0        0.0
                2958          2         1        0.5
            GE  2183          2         1        0.5
                2598          1         0        0.0

(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as abnormal.)
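For context, the isError column used above is not part of the input; one way it might be derived from the stated definition is sketched here with a small made-up frame (and the list form of agg, since the dict form shown above has since been removed from pandas):

    import pandas as pd

    # Toy frame mirroring the sample input above (values copied from the question)
    df = pd.DataFrame({
        'ka':   ['3M', '3M', 'GE', '3M', 'GE', 'GE'],
        'kb_1': ['2345', '2958', '2183', '2958', '2183', '2598'],
        'kb_2': ['2345', '2152', '2183', '2958', '2285', '2598'],
    })

    # An error record is one where kb_1 != kb_2
    df['isError'] = (df['kb_1'] != df['kb_2']).astype(int)

    df2 = df.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count'])
    df2.columns = ['errorNum', 'recordNum']
    df2['errorRate'] = df2['errorNum'] / df2['recordNum']
    print(df2)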
- No need to specify delimiter since sep is already provided. Also, pd.read_table() assumes sep='\t', so you could just call that instead of pd.read_csv(). – chrisaycock, May 14, 2015 at 19:32
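To illustrate the comment's point (strFileName and the tab separator come from the question; the path here is just a placeholder):

    import pandas as pd

    strFileName = "data/petaJoined.csv"  # placeholder; substitute your file

    # Equivalent ways to read a tab-separated file:
    df = pd.read_csv(strFileName, sep='\t')   # sep alone is enough; delimiter is an alias
    df = pd.read_table(strFileName)           # read_table defaults to sep='\t'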
3 Answers
You haven't stated what your intended aggregation would be, but if it's just sum and count, then you could aggregate in chunks:
    dfs = pd.DataFrame()
    reader = pd.read_table(strFileName, chunksize=16*1024)  # choose as appropriate
    for chunk in reader:
        temp = chunk.agg(...)      # your logic here
        dfs = dfs.append(temp)     # append returns a new frame, so reassign
    df = dfs.agg(...)              # redo your logic here
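Since the question's aggregation is a per-(ka, kb_1) error count and record count, one way to make this pattern concrete (my sketch, not the answer author's code; the column names and strFileName come from the question, and a header row is assumed) is to reduce each chunk to partial counts and combine them at the end:

    import pandas as pd

    strFileName = "data/petaJoined.csv"  # placeholder; substitute your file

    partials = []
    for chunk in pd.read_table(strFileName, chunksize=16 * 1024):
        chunk['isError'] = (chunk['kb_1'] != chunk['kb_2']).astype(int)
        # Per-chunk partial aggregation: error count and record count per group
        partials.append(chunk.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count']))

    # Sums and counts add across chunks, so combining the partials is safe
    combined = pd.concat(partials).groupby(level=['ka', 'kb_1']).sum()
    combined.columns = ['errorNum', 'recordNum']
    combined['errorRate'] = combined['errorNum'] / combined['recordNum']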
Based on your snippet in out of memory error when reading csv file in chunk, here is a line-by-line approach. I assume that kb_2 is the error indicator:
    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        for line in large_file:
            arr = line.split('\t')  # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
            k = arr[0] + ',' + arr[1]
            if not (k in groups.keys()):
                groups[k] = {'record_count': 0, 'error_sum': 0}
            groups[k]['record_count'] = groups[k]['record_count'] + 1
            groups[k]['error_sum'] = groups[k]['error_sum'] + float(arr[2])

    for k, v in groups.items():
        print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores all the groups in a dictionary and calculates the error rate after reading the entire file.
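Note that the question actually defines an abnormal record as kb_1 != kb_2 rather than a numeric flag in kb_2; under that assumption the same streaming idea could be adapted as follows (my sketch, variable names mine):

    # Same streaming idea, with the error flag derived from kb_1 != kb_2
    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        next(large_file, None)  # skip the header row, if the file has one
        for line in large_file:
            ka, kb_1, kb_2 = line.rstrip('\n').split('\t')[:3]
            stats = groups.setdefault((ka, kb_1), {'record_count': 0, 'error_sum': 0})
            stats['record_count'] += 1
            stats['error_sum'] += int(kb_1 != kb_2)

    for (ka, kb_1), v in groups.items():
        print('{},{}: {:.0%}'.format(ka, kb_1, v['error_sum'] / v['record_count']))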
It will still hit an out-of-memory error if there are too many distinct group combinations, since every group is kept in the dictionary.
What @chrisaycock suggested is the preferred method if you need to sum or count.
If you need to average, it won't work, because averaging per-chunk averages is not in general the same as averaging the whole: for example, avg(1,2,3,10) = 4, but avg(avg(1,2,3), avg(10)) = avg(2, 10) = 6.
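If you do want an average with the chunked pandas approach, one workaround (my sketch, assuming timeInterval is the column being averaged and strFileName points at the file) is to carry a running sum and count and divide only once at the end:

    import pandas as pd

    strFileName = "data/petaJoined.csv"  # placeholder; substitute your file

    total = 0.0
    count = 0
    for chunk in pd.read_table(strFileName, chunksize=16 * 1024):
        total += chunk['timeInterval'].sum()
        count += len(chunk)

    overall_average = total / count  # correct even when chunks have unequal sizes

This is the same sum-and-count trick the streaming reducer below uses.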
I suggest using a map-reduce-like approach with streaming.
Create a file called map-col.py:
    import sys

    col = 4  # 0-based index of the column to extract, e.g. timeInterval

    for line in sys.stdin:
        print(line.split('\t')[col])

And a file named reduce-avg.py:
    import sys

    s = 0
    n = 0
    for line in sys.stdin:
        s = s + float(line)
        n = n + 1
    print(s / n)

And in order to run the whole thing:
    cat strFileName | python map-col.py | python reduce-avg.py > output.txt

This method will work regardless of the size of the file and will not run out of memory.
