Description
I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.
Symptoms:
- With extra trailing commas and index_col=False set, read_csv() fails with: IndexError: list index out of range
- When one or more CSV rows have additional trailing commas (compared to previous rows), read_csv() fails with: CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
I believe fixing these exceptions would simply require ignoring any extra trailing commas. This is how many CSV readers work, for example Excel when opening a CSV. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, which causes read_csv() to fail. Perhaps it's possible to have an option to ignore trailing commas, or even better, an option to ignore/skip any malformed rows without raising a terminal exception. :)
-Gagi
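Until such an option exists, the "ignore extra trailing commas" behaviour requested above could be approximated by cleaning the text before it reaches read_csv(). The following is only a sketch using the standard-library csv module; the helper name strip_trailing_empty_fields is hypothetical and not part of pandas. It assumes that trailing empty fields never carry meaning in the file, so a genuinely empty last column would be dropped as well.

```python
import csv
import io


def strip_trailing_empty_fields(text, delimiter=","):
    """Return CSV text with trailing empty fields removed from every row.

    Hypothetical helper (not a pandas API). Assumes trailing empty
    fields are noise; a real empty last column is stripped too.
    """
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    for row in csv.reader(io.StringIO(text), delimiter=delimiter):
        # Drop empty strings from the right-hand end of the row only.
        while row and row[-1] == "":
            row.pop()
        writer.writerow(row)
    return out.getvalue()


# Sample inputs taken from this report.
data = "a,b,c\n4,apple,bat,,\n8,orange,cow,,"
data2 = "a,b,c\n4,apple,bat,,\n8,orange,cow,,,"

cleaned = strip_trailing_empty_fields(data)
cleaned2 = strip_trailing_empty_fields(data2)
print(cleaned2)
```

The cleaned string can then be wrapped in a file-like object (StringIO.StringIO on Python 2, io.StringIO on Python 3) and passed to read_csv(). This has not been verified against the 0.10.1 build discussed here.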
If a CSV has 'n' matched extra trailing columns and you do not specify any index_col, the parser will correctly assume that the first 'n' columns are the index. If you set index_col=False, it fails with: IndexError: list index out of range
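To tell the two cases apart before parsing, it may help to count the fields in each row and compare them with the header. This is a rough diagnostic sketch, assuming the first row is the header; field_counts and classify are hypothetical names, and the labels "matched" and "mismatched" simply mirror the wording used in this report.

```python
import csv
import io


def field_counts(text, delimiter=","):
    """Return the number of fields in each row, header first."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    return [len(row) for row in reader]


def classify(text, delimiter=","):
    """Label the text as 'clean', 'matched', or 'mismatched'.

    'matched' means every data row differs from the header by the same
    number of fields; 'mismatched' means the difference varies by row.
    """
    counts = field_counts(text, delimiter)
    header, body = counts[0], counts[1:]
    extras = {n - header for n in body}
    if not body or extras == {0}:
        return "clean"
    if len(extras) == 1:
        return "matched"
    return "mismatched"


# Sample inputs taken from this report.
data = "a,b,c\n4,apple,bat,,\n8,orange,cow,,"
data2 = "a,b,c\n4,apple,bat,,\n8,orange,cow,,,"

print(field_counts(data), classify(data))
print(field_counts(data2), classify(data2))
```

For data2 the counts show 5 fields on line 2 and 6 on line 3, which lines up with the "Expected 5 fields in line 3, saw 6" message reported by the tokenizer.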
    In [1]: import numpy as np

    In [2]: import pandas as pd

    In [3]: import StringIO

    In [4]: pd.__version__
    Out[4]: '0.10.1'

    In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'     <-- Matched extra commas

    In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'   <-- Mismatched extra commas

    In [7]: print data
    a,b,c
    4,apple,bat,,
    8,orange,cow,,

    In [8]: print data2
    a,b,c
    4,apple,bat,,
    8,orange,cow,,,

    In [9]: df = pd.read_csv(StringIO.StringIO(data))

    In [10]: df
    Out[10]:
                a   b   c
    4 apple   bat NaN NaN
    8 orange  cow NaN NaN

    In [11]: df.index
    Out[11]: MultiIndex[(4, apple), (8, orange)]

    In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-12-4d7ece45eef3> in <module>()
    ----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
        397                     buffer_lines=buffer_lines)
        398
    --> 399         return _read(filepath_or_buffer, kwds)
        400
        401     parser_f.__name__ = name

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
        213         return parser
        214
    --> 215     return parser.read()
        216
        217 _parser_defaults = {

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        629         #     self._engine.set_error_bad_lines(False)
        630
    --> 631         ret = self._engine.read(nrows)
        632
        633         if self.options.get('as_recarray'):

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        952
        953         try:
    --> 954             data = self._reader.read(nrows)
        955         except StopIteration:
        956             if nrows is None:

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

    IndexError: list index out of range

    In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)
    ---------------------------------------------------------------------------
    CParserError                              Traceback (most recent call last)
    <ipython-input-13-441bfd4aff6e> in <module>()
    ----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
        397                     buffer_lines=buffer_lines)
        398
    --> 399         return _read(filepath_or_buffer, kwds)
        400
        401     parser_f.__name__ = name

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
        213         return parser
        214
    --> 215     return parser.read()
        216
        217 _parser_defaults = {

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        629         #     self._engine.set_error_bad_lines(False)
        630
    --> 631         ret = self._engine.read(nrows)
        632
        633         if self.options.get('as_recarray'):

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        952
        953         try:
    --> 954             data = self._reader.read(nrows)
        955         except StopIteration:
        956             if nrows is None:

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

    CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6