I am trying to manipulate a large CSV file using Pandas. When I run

    df = pd.read_csv(strFileName, sep='\t', delimiter='\t')

it raises "pandas.parser.CParserError: Error tokenizing data. C error: out of memory". wc -l indicates there are 13822117 lines. I need to aggregate over this CSV as a data frame. Is there a way to handle this other than splitting the CSV into several files and writing code to merge the results? Any suggestions on how to do that? Thanks
The input is like this:
columns=[ka, kb_1, kb_2, timeofEvent, timeInterval]

    0: '3M' '2345' '2345' '2014-10-5',  3000
    1: '3M' '2958' '2152' '2015-3-22',  5000
    2: 'GE' '2183' '2183' '2012-12-31', 515
    3: '3M' '2958' '2958' '2015-3-10',  395
    4: 'GE' '2183' '2285' '2015-4-19',  1925
    5: 'GE' '2598' '2598' '2015-3-17',  1915

And the desired output is like this:
columns=[ka, kb, errorNum, errorRate, totalNum of records]

    '3M', '2345', 0, 0%,  1
    '3M', '2958', 1, 50%, 2
    'GE', '2183', 1, 50%, 2
    'GE', '2598', 0, 0%,  1

If the data set were small, the code below (provided by another answer) could be used:
    df2 = df.groupby(['ka','kb_1'])['isError'].agg({ 'errorNum': 'sum', 'recordNum': 'count' })
    df2['errorRate'] = df2['errorNum'] / df2['recordNum']

            ka  kb_1  recordNum  errorNum  errorRate
            3M  2345          1         0        0.0
                2958          2         1        0.5
            GE  2183          2         1        0.5
                2598          1         0        0.0

(Definition of an error record: when kb_1 != kb_2, the corresponding record is treated as abnormal.)
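For context, the isError column used above is not part of the input; one way it might be derived from the stated definition is sketched here with a small made-up frame (and the list form of agg, since the dict form shown above has since been removed from pandas):

    import pandas as pd

    # Toy frame mirroring the sample input above (values copied from the question)
    df = pd.DataFrame({
        'ka':   ['3M', '3M', 'GE', '3M', 'GE', 'GE'],
        'kb_1': ['2345', '2958', '2183', '2958', '2183', '2598'],
        'kb_2': ['2345', '2152', '2183', '2958', '2285', '2598'],
    })

    # An error record is one where kb_1 != kb_2
    df['isError'] = (df['kb_1'] != df['kb_2']).astype(int)

    df2 = df.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count'])
    df2.columns = ['errorNum', 'recordNum']
    df2['errorRate'] = df2['errorNum'] / df2['recordNum']
    print(df2)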
- No need to specify delimiter since sep is already provided. Also, pd.read_table() assumes sep='\t', so you could just call that instead of pd.read_csv(). – chrisaycock, May 14, 2015 at 19:32
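To illustrate the comment's point (strFileName and the tab separator come from the question; the path here is just a placeholder):

    import pandas as pd

    strFileName = "data/petaJoined.csv"  # placeholder; substitute your file

    # Equivalent ways to read a tab-separated file:
    df = pd.read_csv(strFileName, sep='\t')   # sep alone is enough; delimiter is an alias
    df = pd.read_table(strFileName)           # read_table defaults to sep='\t'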
3 Answers
You haven't stated what your intended aggregation would be, but if it's just sum and count, then you could aggregate in chunks:
    dfs = pd.DataFrame()
    reader = pd.read_table(strFileName, chunksize=16*1024)  # choose as appropriate
    for chunk in reader:
        temp = chunk.agg(...)      # your logic here
        dfs = dfs.append(temp)     # append returns a new frame, so reassign
    df = dfs.agg(...)              # redo your logic here
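Since the question's aggregation is a per-(ka, kb_1) error count and record count, one way to make this pattern concrete (my sketch, not the answer author's code; the column names and strFileName come from the question, and a header row is assumed) is to reduce each chunk to partial counts and combine them at the end:

    import pandas as pd

    strFileName = "data/petaJoined.csv"  # placeholder; substitute your file

    partials = []
    for chunk in pd.read_table(strFileName, chunksize=16 * 1024):
        chunk['isError'] = (chunk['kb_1'] != chunk['kb_2']).astype(int)
        # Per-chunk partial aggregation: error count and record count per group
        partials.append(chunk.groupby(['ka', 'kb_1'])['isError'].agg(['sum', 'count']))

    # Sums and counts add across chunks, so combining the partials is safe
    combined = pd.concat(partials).groupby(level=['ka', 'kb_1']).sum()
    combined.columns = ['errorNum', 'recordNum']
    combined['errorRate'] = combined['errorNum'] / combined['recordNum']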
Based on your snippet in out of memory error when reading csv file in chunk, here is a line-by-line approach. I assume that kb_2 is the error indicator:
    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        for line in large_file:
            arr = line.split('\t')  # assuming this structure: ka,kb_1,kb_2,timeofEvent,timeInterval
            k = arr[0] + ',' + arr[1]
            if not (k in groups.keys()):
                groups[k] = {'record_count': 0, 'error_sum': 0}
            groups[k]['record_count'] = groups[k]['record_count'] + 1
            groups[k]['error_sum'] = groups[k]['error_sum'] + float(arr[2])

    for k, v in groups.items():
        print('{group}: {error_rate}'.format(group=k, error_rate=v['error_sum'] / v['record_count']))

This code snippet stores all the groups in a dictionary and calculates the error rate after reading the entire file.
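Note that the question actually defines an abnormal record as kb_1 != kb_2 rather than a numeric flag in kb_2; under that assumption the same streaming idea could be adapted as follows (my sketch, variable names mine):

    # Same streaming idea, with the error flag derived from kb_1 != kb_2
    groups = {}
    with open("data/petaJoined.csv", "r") as large_file:
        next(large_file, None)  # skip the header row, if the file has one
        for line in large_file:
            ka, kb_1, kb_2 = line.rstrip('\n').split('\t')[:3]
            stats = groups.setdefault((ka, kb_1), {'record_count': 0, 'error_sum': 0})
            stats['record_count'] += 1
            stats['error_sum'] += int(kb_1 != kb_2)

    for (ka, kb_1), v in groups.items():
        print('{},{}: {:.0%}'.format(ka, kb_1, v['error_sum'] / v['record_count']))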
It will still hit an out-of-memory error if there are too many distinct group combinations, since every group is kept in the dictionary.
What @chrisaycock suggested is the preferred method if you need to sum or count.
If you need to average, it won't work, because averaging per-chunk averages is not in general the same as averaging the whole: for example, avg(1,2,3,10) = 4, but avg(avg(1,2,3), avg(10)) = avg(2, 10) = 6.
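If you do want an average with the chunked pandas approach, one workaround (my sketch, assuming timeInterval is the column being averaged and strFileName points at the file) is to carry a running sum and count and divide only once at the end:

    import pandas as pd

    strFileName = "data/petaJoined.csv"  # placeholder; substitute your file

    total = 0.0
    count = 0
    for chunk in pd.read_table(strFileName, chunksize=16 * 1024):
        total += chunk['timeInterval'].sum()
        count += len(chunk)

    overall_average = total / count  # correct even when chunks have unequal sizes

This is the same sum-and-count trick the streaming reducer below uses.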
I suggest using a map-reduce-like approach with streaming.
Create a file called map-col.py:
    import sys

    col = 4  # 0-based index of the column to extract, e.g. timeInterval

    for line in sys.stdin:
        print(line.split('\t')[col])

And a file named reduce-avg.py:
    import sys

    s = 0
    n = 0
    for line in sys.stdin:
        s = s + float(line)
        n = n + 1
    print(s / n)

And in order to run the whole thing:
    cat strFileName | python map-col.py | python reduce-avg.py > output.txt

This method will work regardless of the size of the file and will not run out of memory.
