
Python Enhancement Proposals

PEP 305 – CSV File API

Author:
Kevin Altis <altis at semi-retired.com>, Dave Cole <djc at object-craft.com.au>, Andrew McNamara <andrewm at object-craft.com.au>, Skip Montanaro <skip at pobox.com>, Cliff Wells <LogiplexSoftware at earthlink.net>
Discussions-To:
Csv list
Status:
Final
Type:
Standards Track
Created:
26-Jan-2003
Python-Version:
2.3
Post-History:
31-Jan-2003, 13-Feb-2003


Abstract

The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like line.split(",") is eventually bound to fail. This PEP defines an API for reading and writing CSV files. It is accompanied by a corresponding module which implements the API.

To Do (Notes for the Interested and Ambitious)

Application Domain

This PEP is about doing one thing well: parsing tabular data which may use a variety of field separators, quoting characters, quote escape mechanisms and line endings. The authors intend the proposed module to solve this one parsing problem efficiently. The authors do not intend to address any of these related topics:

  • data interpretation (is a field containing the string “10” supposed to be a string, a float or an int? is it a number in base 10, base 16 or base 2? is a number in quotes a number or a string?)
  • locale-specific data representation (should the number 1.23 be written as “1.23” or “1,23” or “1 23”?) – this may eventually be addressed.
  • fixed width tabular data - can already be parsed reliably.

Rationale

Often, CSV files are formatted simply enough that you can get by reading them line-by-line and splitting on the commas which delimit the fields. This is especially true if all the data being read is numeric. This approach may work for a while, then come back to bite you in the butt when somebody puts something unexpected in the data like a comma. As you dig into the problem you may eventually come to the conclusion that you can solve the problem using regular expressions. This will work for a while, then break mysteriously one day. The problem grows, so you dig deeper and eventually realize that you need a purpose-built parser for the format.
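A quick sketch (using the csv module as it later shipped in the standard library, modern Python 3 syntax) shows the kind of input that defeats the naive approach:

import csv

line = '"Smith, John",42,Portland'

# Naive splitting breaks the quoted first field in two.
print(line.split(","))            # ['"Smith', ' John"', '42', 'Portland']

# A real CSV parser keeps the quoted field intact.
print(next(csv.reader([line])))   # ['Smith, John', '42', 'Portland']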

CSV formats are not well-defined and different implementations have a number of subtle corner cases. It has been suggested that the “V” in the acronym stands for “Vague” instead of “Values”. Different delimiters and quoting characters are just the start. Some programs generate whitespace after each delimiter which is not part of the following field. Others quote embedded quoting characters by doubling them, others by prefixing them with an escape character. The list of weird ways to do things can seem endless.

All this variability means it is difficult for programmers to reliably parse CSV files from many sources or generate CSV files designed to be fed to specific external programs without a thorough understanding of those sources and programs. This PEP and the software which accompanies it attempt to make the process less fragile.

Existing Modules

This problem has been tackled before. At least three modules currently available in the Python community enable programmers to read and write CSV files:

  • Object Craft’s CSV module [2]
  • Cliff Wells’ Python-DSV module [3]
  • Laurence Tratt’s ASV module [4]

Each has a different API, making it somewhat difficult for programmers to switch between them. More of a problem may be that they interpret some of the CSV corner cases differently, so even after surmounting the differences between the different module APIs, the programmer has to also deal with semantic differences between the packages.

Module Interface

This PEP supports three basic APIs, one to read and parse CSV files, one to write them, and one to identify different CSV dialects to the readers and writers.

Reading CSV Files

CSV readers are created with the reader factory function:

obj = reader(iterable [, dialect='excel'] [optional keyword args])

A reader object is an iterator which takes an iterable object returning lines as the sole required parameter. If it supports a binary mode (file objects do), the iterable argument to the reader function must have been opened in binary mode. This gives the reader object full control over the interpretation of the file’s contents. The optional dialect parameter is discussed below. The reader function also accepts several optional keyword arguments which define specific format settings for the parser (see the section “Formatting Parameters”). Readers are typically used as follows:

csvreader = csv.reader(file("some.csv"))
for row in csvreader:
    process(row)

Each row returned by a reader object is a list of strings or Unicode objects.

When both a dialect parameter and individual formatting parameters are passed to the constructor, first the dialect is queried for formatting parameters, then individual formatting parameters are examined.

Writing CSV Files

Creating writers is similar:

obj = writer(fileobj [, dialect='excel'], [optional keyword args])

A writer object is a wrapper around a file-like object opened for writing in binary mode (if such a distinction is made). It accepts the same optional keyword parameters as the reader constructor.

Writers are typically used as follows:

csvwriter = csv.writer(file("some.csv", "w"))
for row in someiterable:
    csvwriter.writerow(row)

To generate a set of field names as the first row of the CSV file, the programmer must explicitly write it, e.g.:

csvwriter = csv.writer(file("some.csv", "w"), fieldnames=names)
csvwriter.write(names)
for row in someiterable:
    csvwriter.write(row)

or arrange for it to be the first row in the iterable being written.
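For comparison, a hedged sketch of the same pattern with the csv module as it eventually shipped (modern Python 3 syntax; the shipped writer has no fieldnames argument, so the header is simply written as an ordinary first row):

import csv

names = ["first", "last", "age"]
rows = [["John", "Smith", 42], ["Jane", "Doe", 37]]

with open("some.csv", "w", newline="") as f:
    csvwriter = csv.writer(f)
    csvwriter.writerow(names)      # header row
    for row in rows:
        csvwriter.writerow(row)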

Managing Different Dialects

Because CSV is a somewhat ill-defined format, there are plenty of ways one CSV file can differ from another, yet contain exactly the same data. Many tools which can import or export tabular data allow the user to indicate the field delimiter, quote character, line terminator, and other characteristics of the file. These can be fairly easily determined, but are still mildly annoying to figure out, and make for fairly long function calls when specified individually.

To try and minimize the difficulty of figuring out and specifying a bunch of formatting parameters, reader and writer objects support a dialect argument which is just a convenient handle on a group of these lower level parameters. When a dialect is given as a string it identifies one of the dialects known to the module via its registration functions, otherwise it must be an instance of the Dialect class as described below.

Dialects will generally be named after applications or organizations which define specific sets of format constraints. Two dialects are defined in the module as of this writing, “excel”, which describes the default format constraints for CSV file export by Excel 97 and Excel 2000, and “excel-tab”, which is the same as “excel” but specifies an ASCII TAB character as the field delimiter.

Dialects are implemented as attribute only classes to enable users to construct variant dialects by subclassing. The “excel” dialect is a subclass of Dialect and is defined as follows:

class Dialect:
    # placeholders
    delimiter = None
    quotechar = None
    escapechar = None
    doublequote = None
    skipinitialspace = None
    lineterminator = None
    quoting = None

class excel(Dialect):
    delimiter = ','
    quotechar = '"'
    doublequote = True
    skipinitialspace = False
    lineterminator = '\r\n'
    quoting = QUOTE_MINIMAL

The “excel-tab” dialect is defined as:

class exceltsv(excel):
    delimiter = '\t'

(For a description of the individual formatting parameters see the section “Formatting Parameters”.)
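As an illustration, a reader for a tab-delimited file can name the dialect directly (a sketch in modern Python 3 syntax; some.tsv is a hypothetical input file):

import csv

with open("some.tsv", newline="") as f:
    for row in csv.reader(f, dialect="excel-tab"):
        print(row)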

To enable string references to specific dialects, the module defines several functions:

dialect = get_dialect(name)
names = list_dialects()
register_dialect(name, dialect)
unregister_dialect(name)

get_dialect() returns the dialect instance associated with the given name. list_dialects() returns a list of all registered dialect names. register_dialect() associates a string name with a dialect class. unregister_dialect() deletes a name/dialect association.
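For example, a variant dialect can be defined by subclassing and then registered under a name (a sketch against the csv module as shipped; the pipe-delimited "pipes" dialect and the some.psv input file are made up for illustration):

import csv

class pipes(csv.excel):          # like "excel", but pipe-delimited
    delimiter = '|'

csv.register_dialect("pipes", pipes)
print(csv.list_dialects())       # now includes "pipes"

with open("some.psv", newline="") as f:
    for row in csv.reader(f, dialect="pipes"):
        print(row)

csv.unregister_dialect("pipes")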

Formatting Parameters

In addition to the dialect argument, both the reader and writer constructors take several specific formatting parameters, specified as keyword parameters. The formatting parameters understood are listed below (a brief sketch follows the list):

  • quotechar specifies a one-character string to use as the quoting character. It defaults to '"'. Setting this to None has the same effect as setting quoting to csv.QUOTE_NONE.
  • delimiter specifies a one-character string to use as the field separator. It defaults to ','.
  • escapechar specifies a one-character string used to escape the delimiter when quotechar is set to None.
  • skipinitialspace specifies how to interpret whitespace which immediately follows a delimiter. It defaults to False, which means that whitespace immediately following a delimiter is part of the following field.
  • lineterminator specifies the character sequence which should terminate rows.
  • quoting controls when quotes should be generated by the writer. It can take on any of the following module constants:
    • csv.QUOTE_MINIMAL means only when required, for example, when a field contains either the quotechar or the delimiter
    • csv.QUOTE_ALL means that quotes are always placed around fields.
    • csv.QUOTE_NONNUMERIC means that quotes are always placed around nonnumeric fields.
    • csv.QUOTE_NONE means that quotes are never placed around fields.
  • doublequote controls the handling of quotes inside fields. When True two consecutive quotes are interpreted as one during read, and when writing, each quote is written as two quotes.
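A brief sketch of two of these parameters in action (modern Python 3 syntax, writing to sys.stdout for illustration):

import csv
import sys

# QUOTE_NONNUMERIC quotes every non-numeric field and leaves numbers bare.
w = csv.writer(sys.stdout, quoting=csv.QUOTE_NONNUMERIC)
w.writerow(["needs, quoting", 1.5])    # "needs, quoting",1.5

# skipinitialspace=True drops the whitespace that follows each delimiter.
print(next(csv.reader(["a, b, c"], skipinitialspace=True)))   # ['a', 'b', 'c']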

When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed before the individual formatting parameters. This makes it easy to choose a dialect, then override one or more of the settings without defining a new dialect class. For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character and a colon as the delimiter, you could create a reader like:

csvreader = csv.reader(file("some.csv"), dialect="excel", quotechar="'", delimiter=':')

Other details of how Excel generates CSV files would be handled automatically because of the reference to the “excel” dialect.

Reader Objects

Reader objects are iterables whose next() method returns a sequence of strings, one string per field in the row.
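A minimal sketch (in modern Python 3, where the built-in next() replaces a direct reader.next() call):

import csv

reader = csv.reader(["name,age", "Jane,37"])
header = next(reader)    # ['name', 'age']
row = next(reader)       # ['Jane', '37']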

Writer Objects

Writer objects have two methods, writerow() and writerows(). The former accepts an iterable (typically a list) of fields which are to be written to the output. The latter accepts a list of iterables and calls writerow() for each.
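For instance (a sketch in modern Python 3 syntax; out.csv is a hypothetical output file):

import csv

with open("out.csv", "w", newline="") as f:
    w = csv.writer(f)
    w.writerow(["col1", "col2"])             # a single row
    w.writerows([["a", 1], ["b", 2]])        # several rows in one call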

Implementation

There is a sample implementation available. [1] The goal is for it to efficiently implement the API described in the PEP. It is heavily based on the Object Craft csv module. [2]

Testing

The sample implementation [1] includes a set of test cases.

Issues

  1. Should a parameter control how consecutive delimiters are interpreted? Our thought is “no”. Consecutive delimiters should always denote an empty field.
  2. What about Unicode? Is it sufficient to pass a file object gotten from codecs.open()? For example:
    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))

    In the first example, text would be assumed to be encoded as cp1252. Should the system be aggressive in converting to Unicode or should Unicode strings only be returned if necessary?

    In the second example, the file will take care of automatically encoding Unicode strings as utf-8 before writing to disk.

    Note: As of this writing, the csv module doesn’t handle Unicode data.

  3. What about alternate escape conventions? If the dialect in use includes an escapechar parameter which is not None and the quoting parameter is set to QUOTE_NONE, delimiters appearing within fields will be prefixed by the escape character when writing and are expected to be prefixed by the escape character when reading.
  4. Should there be a “fully quoted” mode for writing? What about “fully quoted except for numeric values”? Both are implemented (QUOTE_ALL and QUOTE_NONNUMERIC, respectively).
  5. What about end-of-line? If I generate a CSV file on a Unix system, will Excel properly recognize the LF-only line terminators? Files must be opened for reading or writing as appropriate using binary mode. Specify the lineterminator sequence as '\r\n'. The resulting file will be written correctly.
  6. What about an option to generate dicts from the reader and accept dicts by the writer? See the DictReader and DictWriter classes in csv.py (a short sketch follows this list).
  7. Are quote character and delimiters limited to single characters? For the time being, yes.
  8. How should rows of different lengths be handled? Interpretation of the data is the application’s job. There is no such thing as a “short row” or a “long row” at this level.
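As mentioned in item 6, dict-oriented wrappers are provided; a hedged sketch of how they are used in the csv module as shipped (modern Python 3 syntax; some.csv and out.csv are hypothetical files with "name" and "age" columns):

import csv

# DictReader maps each row onto the field names taken from the first row.
with open("some.csv", newline="") as f:
    for record in csv.DictReader(f):
        print(record["name"], record["age"])

# DictWriter accepts dicts and writes their values out in fieldname order.
with open("out.csv", "w", newline="") as f:
    w = csv.DictWriter(f, fieldnames=["name", "age"])
    w.writeheader()
    w.writerow({"name": "Jane", "age": 37})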

References

[1] csv module, Python Sandbox
    (http://cvs.sourceforge.net/cgi-bin/viewcvs.cgi/python/python/nondist/sandbox/csv/)
[2] csv module, Object Craft
    (http://www.object-craft.com.au/projects/csv)
[3] Python-DSV module, Wells
    (http://sourceforge.net/projects/python-dsv/)
[4] ASV module, Tratt
    (http://tratt.net/laurie/python/asv/)

There are many references to other CSV-related projects on the Web. A few are included here.

Copyright

This document has been placed in the public domain.


Source: https://github.com/python/peps/blob/main/peps/pep-0305.rst
