read_csv() & extra trailing comma(s) cause parsing issues. #2886

Closed
Labels: Docs, IO Data (IO issues that don't fit into a more specific label)
@dragoljub


I have run into a few opportunities to further improve the wonderful read_csv() function. I'm using the latest x64 0.10.1 build from 10-Feb-2013 16:52.

Symptoms:

  1. With extra trailing commas and index_col=False set, read_csv() fails with: IndexError: list index out of range
  2. When one or more CSV rows have additional trailing commas (compared to previous rows), read_csv() fails with: CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
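To locate the rows behind the second symptom before calling read_csv(), a small pre-check along these lines may help. This is only a sketch using the Python 3 standard library, not pandas' own tokenizer; the helper name `find_inconsistent_rows` is hypothetical, and it assumes no quoted field spans multiple lines.

```python
import csv
import io


def find_inconsistent_rows(text, delimiter=","):
    """Return (line_number, expected, seen) for each data row whose field
    count differs from the first data row. Hypothetical helper."""
    mismatches = []
    expected = None
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    # Line numbers are 1-based; this assumes one CSV row per physical line.
    for line_number, row in enumerate(reader, start=1):
        if line_number == 1:
            continue  # skip the header row
        if expected is None:
            expected = len(row)  # the first data row sets the expected width
        elif len(row) != expected:
            mismatches.append((line_number, expected, len(row)))
    return mismatches


data2 = "a,b,c\n4,apple,bat,,\n8,orange,cow,,,"
mismatches = find_inconsistent_rows(data2)
print(mismatches)
```

For `data2` this reports line 3 with 5 fields expected and 6 seen, which lines up with the numbers in the CParserError message.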

I believe fixing these exceptions would simply require ignoring any extra trailing commas. This is how many CSV readers work, for example when opening a CSV in Excel. In my case I regularly work with 100K-line CSVs that occasionally have extra trailing columns, causing read_csv() to fail. Perhaps it's possible to have an option to ignore trailing commas or, even better, an option to ignore/skip any malformed rows without raising a terminal exception. :)
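Until such an option exists, one possible workaround is to normalize the text before handing it to read_csv(). The sketch below uses only the Python 3 standard library; `strip_trailing_empty_fields` is a hypothetical helper, not a pandas function, and it assumes the header row defines the intended number of columns.

```python
import csv
import io


def strip_trailing_empty_fields(text, delimiter=","):
    """Drop empty trailing fields that extend a row past the header width.
    Hypothetical helper; non-empty extra fields are left untouched."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    out = io.StringIO()
    writer = csv.writer(out, delimiter=delimiter, lineterminator="\n")
    width = None
    for row in reader:
        if width is None:
            width = len(row)  # the header row defines the expected width
        # Remove only empty fields, and only those beyond the header width.
        while len(row) > width and row[-1] == "":
            row.pop()
        writer.writerow(row)
    return out.getvalue()


data2 = "a,b,c\n4,apple,bat,,\n8,orange,cow,,,"
cleaned = strip_trailing_empty_fields(data2)
print(cleaned)
```

The cleaned text can then be wrapped in a file-like object and passed to read_csv(). For skipping malformed rows outright, recent pandas releases (1.3 and later) expose an `on_bad_lines` parameter on read_csv(); that is a newer API than the 0.10.1 build discussed here.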

-Gagi

If a CSV has 'n' matched extra trailing columns and you do not specify index_col, the parser correctly assumes that the first 'n' columns are the index. If you set index_col=False, it fails with: IndexError: list index out of range

    In [1]: import numpy as np

    In [2]: import pandas as pd

    In [3]: import StringIO

    In [4]: pd.__version__
    Out[4]: '0.10.1'

    In [5]: data = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,'  # <-- Matched extra commas

    In [6]: data2 = 'a,b,c\n4,apple,bat,,\n8,orange,cow,,,'  # <-- Mismatched extra commas

    In [7]: print data
    a,b,c
    4,apple,bat,,
    8,orange,cow,,

    In [8]: print data2
    a,b,c
    4,apple,bat,,
    8,orange,cow,,,

    In [9]: df = pd.read_csv(StringIO.StringIO(data))

    In [10]: df
    Out[10]:
                a   b   c
    4 apple   bat NaN NaN
    8 orange  cow NaN NaN

    In [11]: df.index
    Out[11]: MultiIndex[(4, apple), (8, orange)]

    In [12]: df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)
    ---------------------------------------------------------------------------
    IndexError                                Traceback (most recent call last)
    <ipython-input-12-4d7ece45eef3> in <module>()
    ----> 1 df2 = pd.read_csv(StringIO.StringIO(data), index_col=False)

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
        397                     buffer_lines=buffer_lines)
        398
    --> 399         return _read(filepath_or_buffer, kwds)
        400
        401     parser_f.__name__ = name

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
        213         return parser
        214
    --> 215     return parser.read()
        216
        217 _parser_defaults = {

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        629                 #     self._engine.set_error_bad_lines(False)
        630
    --> 631         ret = self._engine.read(nrows)
        632
        633         if self.options.get('as_recarray'):

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        952
        953         try:
    --> 954             data = self._reader.read(nrows)
        955         except StopIteration:
        956             if nrows is None:

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6946)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._convert_column_data (pandas\src\parser.c:7670)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._get_column_name (pandas\src\parser.c:10545)()

    IndexError: list index out of range

    In [13]: df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)
    ---------------------------------------------------------------------------
    CParserError                              Traceback (most recent call last)
    <ipython-input-13-441bfd4aff6e> in <module>()
    ----> 1 df3 = pd.read_csv(StringIO.StringIO(data2), index_col=False)

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in parser_f(filepath_or_buffer, sep, dialect, compression, doublequote, escapechar, quotechar, quoting, skipinitialspace, lineterminator, header, index_col, names, prefix, skiprows, skipfooter, skip_footer, na_values, true_values, false_values, delimiter, converters, dtype, usecols, engine, delim_whitespace, as_recarray, na_filter, compact_ints, use_unsigned, low_memory, buffer_lines, warn_bad_lines, error_bad_lines, keep_default_na, thousands, comment, decimal, parse_dates, keep_date_col, dayfirst, date_parser, memory_map, nrows, iterator, chunksize, verbose, encoding, squeeze)
        397                     buffer_lines=buffer_lines)
        398
    --> 399         return _read(filepath_or_buffer, kwds)
        400
        401     parser_f.__name__ = name

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in _read(filepath_or_buffer, kwds)
        213         return parser
        214
    --> 215     return parser.read()
        216
        217 _parser_defaults = {

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        629                 #     self._engine.set_error_bad_lines(False)
        630
    --> 631         ret = self._engine.read(nrows)
        632
        633         if self.options.get('as_recarray'):

    C:\Python27\lib\site-packages\pandas\io\parsers.pyc in read(self, nrows)
        952
        953         try:
    --> 954             data = self._reader.read(nrows)
        955         except StopIteration:
        956             if nrows is None:

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader.read (pandas\src\parser.c:5915)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_low_memory (pandas\src\parser.c:6132)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._read_rows (pandas\src\parser.c:6734)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.TextReader._tokenize_rows (pandas\src\parser.c:6619)()

    C:\Python27\lib\site-packages\pandas\_parser.pyd in pandas._parser.raise_parser_error (pandas\src\parser.c:17023)()

    CParserError: Error tokenizing data. C error: Expected 5 fields in line 3, saw 6
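The implicit-index behaviour seen in Out[10] and Out[11] can be illustrated roughly as follows. This is a simplified sketch in Python 3 using the standard library, not pandas' actual implementation; `split_implicit_index` is a hypothetical name, and the rule it encodes (rows wider than the header by n fields donate their first n fields to the index) is taken from the description in this issue.

```python
import csv
import io


def split_implicit_index(text, delimiter=","):
    """Split each data row into (index_key, values), treating the first
    n fields as the index when a row has n more fields than the header.
    Hypothetical helper for illustration only."""
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    header = next(reader)
    records = []
    for row in reader:
        # Number of surplus fields; never negative for short rows.
        n_index = max(len(row) - len(header), 0)
        index_key = tuple(row[:n_index])
        values = dict(zip(header, row[n_index:]))
        records.append((index_key, values))
    return records


data = "a,b,c\n4,apple,bat,,\n8,orange,cow,,"
records = split_implicit_index(data)
for index_key, values in records:
    print(index_key, values)
```

With `data`, the two surplus fields per row turn ('4', 'apple') and ('8', 'orange') into index keys, leaving 'bat' and 'cow' under column a and empty strings under b and c, which mirrors the NaN columns in the DataFrame above.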
