read_csv() & extra trailing comma(s) cause parsing issues. #2886

Closed

Labels

Docs, IO Data (IO issues that don't fit into a more specific label)

Milestone

0.11

Description

dragoljub

opened

on Feb 17, 2013

I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.

Symptoms:

With extra trailing commas and index_col=False, read_csv() fails with: IndexError: list index out of range
When one or more CSV rows have additional trailing commas (compared to previous rows), read_csv() fails with: CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
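To make the second symptom concrete, here is a minimal sketch that finds the rows a tokenizer would complain about. It uses only the standard-library csv module and Python 3 (the session in this issue uses Python 2.7); the function name find_ragged_rows is hypothetical and is not part of pandas.

```python
import csv
import io


def find_ragged_rows(text):
    """Return (expected_width, [(line_number, field_count), ...]).

    The expected width is taken from the first data row. Line numbers are
    1-based and count the header as line 1.
    """
    reader = csv.reader(io.StringIO(text))
    next(reader, None)  # skip the header row
    expected = None
    ragged = []
    for line_number, row in enumerate(reader, start=2):
        if expected is None:
            expected = len(row)  # first data row sets the expectation
        elif len(row) > expected:
            ragged.append((line_number, len(row)))
    return expected, ragged


data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'
expected, ragged = find_ragged_rows(data2)
print(expected, ragged)
```

On the mismatched sample this reports an expected width of 5 and flags line 3 with 6 fields, which lines up with the numbers in the CParserError message.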

I believe fixing these exceptions would simply require ignoring any extra trailing commas. This is how many CSV readers work, such as Excel when opening a CSV. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, causing read_csv() to fail. Perhaps it's possible to have an option to ignore trailing commas or, even better, an option to ignore/skip any malformed rows without raising a terminal exception. :)
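Until such an option exists, one possible workaround is to clean the text before it reaches the parser. The sketch below is an assumption about how "ignore trailing commas" could be approximated, not how pandas does or would implement it. It uses only the standard-library csv module and Python 3, and the helper name strip_trailing_empty_fields is hypothetical.

```python
import csv
import io


def strip_trailing_empty_fields(text):
    """Return CSV text with surplus empty trailing fields removed.

    A row is only trimmed while it is wider than the header and its last
    field is empty, so real data is never dropped.
    """
    reader = csv.reader(io.StringIO(text))
    out = io.StringIO()
    writer = csv.writer(out, lineterminator='\n')
    header = next(reader, None)
    if header is None:
        return ''
    writer.writerow(header)
    width = len(header)
    for row in reader:
        while len(row) > width and row[-1] == '':
            row.pop()
        writer.writerow(row)
    return out.getvalue()


cleaned = strip_trailing_empty_fields('a,b,c\n4,apple,bat,,\n8,orange,cow,,,')
print(cleaned)
```

The cleaned string can then be wrapped in a file-like object and handed to read_csv(). Rows whose extra fields are not empty are left as they are, so genuinely malformed rows would still need separate handling.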

-Gagi

If a CSV has 'n' matched extra trailing columns and you do not specify any index_col, then the parser will correctly assume that the first 'n' columns are the index. If you set index_col=False, it fails with: IndexError: list index out of range

    In [1]: import numpy as np

    In [2]: import pandas as pd

    In [3]: import StringIO

    In [4]: pd.__version__
    Out[4]: '0.10.1'

    In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'    # <-- Matched extra commas

    In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'  # <-- Mismatched extra commas

    In [7]: print data
    a,b,c
    4,apple,bat,,
    8,orange,cow,,

    In [8]: print data2
    a,b,c
    4,apple,bat,,
    8,orange,cow,,,

    In [9]: df = pd.read_csv(StringIO.StringIO(data))

    In [10]: df
    Out[10]:
                a   b   c
    4 apple   bat NaN NaN
    8 orange  cow NaN NaN

    In [11]: df.index
    Out[11]:
    MultiIndex
    [(4, apple), (8, orange)]

    In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-12-4d7ece45eef3> in <module>()
    ----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
        397                     buffer_lines=buffer_lines)
        398
    --> 399         return _read(filepath_or_buffer, kwds)
        400
        401     parser_f.__name__ = name

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
        213         return parser
        214
    --> 215     return parser.read()
        216
        217 _parser_defaults = {

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        629         #     self._engine.set_error_bad_lines(False)
        630
    --> 631         ret = self._engine.read(nrows)
        632
        633         if self.options.get('as_recarray'):

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        952
        953         try:
    --> 954             data = self._reader.read(nrows)
        955         except StopIteration:
        956             if nrows is None:

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

    IndexError: list index out of range

    In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)
    ---------------------------------------------------------------------------
    CParserError                              Traceback (most recent call last)
    <ipython-input-13-441bfd4aff6e> in <module>()
    ----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
        397                     buffer_lines=buffer_lines)
        398
    --> 399         return _read(filepath_or_buffer, kwds)
        400
        401     parser_f.__name__ = name

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
        213         return parser
        214
    --> 215     return parser.read()
        216
        217 _parser_defaults = {

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        629         #     self._engine.set_error_bad_lines(False)
        630
    --> 631         ret = self._engine.read(nrows)
        632
        633         if self.options.get('as_recarray'):

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        952
        953         try:
    --> 954             data = self._reader.read(nrows)
        955         except StopIteration:
        956             if nrows is None:

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

    CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
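The index inference described above can be modelled with a little arithmetic: when every data row is wider than the header by the same amount 'n', the first 'n' columns are treated as the index. The sketch below only illustrates that rule as stated in this report; it is not pandas' actual implementation, the function name implied_index_width is hypothetical, and it is written for Python 3 with the standard-library csv module.

```python
import csv
import io


def implied_index_width(text):
    """Return how many leading columns would be taken as the index.

    Returns None when the data rows disagree on their width, which is the
    case that triggers the tokenizing error in the report.
    """
    rows = list(csv.reader(io.StringIO(text)))
    if len(rows) < 2:
        return None
    header, body = rows[0], rows[1:]
    widths = {len(row) for row in body}
    if len(widths) != 1:
        return None  # mismatched extra commas
    return max(widths.pop() - len(header), 0)


data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'
data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'
print(implied_index_width(data), implied_index_width(data2))
```

For the matched sample the result is 2, consistent with the two-level MultiIndex (4, apple), (8, orange) shown in the session; for the mismatched sample no single width exists.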

