The Comma Separated Values (CSV) file format is the most common import and export format for spreadsheets and databases. Although many CSV files are simple to parse, the format is not formally defined by a stable specification and is subtle enough that parsing lines of a CSV file with something like line.split(",") is eventually bound to fail. This PEP defines an API for reading and writing CSV files. It is accompanied by a corresponding module which implements the API.
This PEP is about doing one thing well: parsing tabular data which may use a variety of field separators, quoting characters, quote escape mechanisms and line endings. The authors intend the proposed module to solve this one parsing problem efficiently. The authors do not intend to address any of these related topics:
Often, CSV files are formatted simply enough that you can get by reading them line-by-line and splitting on the commas which delimit the fields. This is especially true if all the data being read is numeric. This approach may work for a while, then come back to bite you in the butt when somebody puts something unexpected in the data like a comma. As you dig into the problem you may eventually come to the conclusion that you can solve the problem using regular expressions. This will work for a while, then break mysteriously one day. The problem grows, so you dig deeper and eventually realize that you need a purpose-built parser for the format.
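The failure mode described above can be illustrated with a short sketch. It uses the csv module as shipped with current Python 3 releases, which may differ in detail from the API proposed in this PEP; the sample line is hypothetical.

```python
import csv
import io

# Hypothetical sample data: the second field contains a comma inside quotes.
line = 'Smith,"Portland, OR",42\r\n'

# Naive approach: split on every comma, ignoring the quoting.
naive = line.rstrip("\r\n").split(",")

# Purpose-built parser: the quoted field is kept together.
parsed = next(csv.reader(io.StringIO(line, newline="")))

print(naive)   # the quoted field is broken into two pieces
print(parsed)  # the quoted field stays intact, without its quotes
```

The naive split yields four fragments for a three-field row, which is exactly the kind of surprise a dedicated parser avoids.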
CSV formats are not well-defined and different implementations have a number of subtle corner cases. It has been suggested that the “V” in the acronym stands for “Vague” instead of “Values”. Different delimiters and quoting characters are just the start. Some programs generate whitespace after each delimiter which is not part of the following field. Others quote embedded quoting characters by doubling them, others by prefixing them with an escape character. The list of weird ways to do things can seem endless.
All this variability means it is difficult for programmers to reliably parse CSV files from many sources or generate CSV files designed to be fed to specific external programs without a thorough understanding of those sources and programs. This PEP and the software which accompanies it attempt to make the process less fragile.
This problem has been tackled before. At least three modules currently available in the Python community enable programmers to read and write CSV files:
Each has a different API, making it somewhat difficult for programmers to switch between them. More of a problem may be that they interpret some of the CSV corner cases differently, so even after surmounting the differences between the different module APIs, the programmer has to also deal with semantic differences between the packages.
This PEP supports three basic APIs, one to read and parse CSV files, one to write them, and one to identify different CSV dialects to the readers and writers.
CSV readers are created with the reader factory function:
    obj = reader(iterable [, dialect='excel'] [optional keyword args])
A reader object is an iterator which takes an iterable object returning lines as the sole required parameter. If it supports a binary mode (file objects do), the iterable argument to the reader function must have been opened in binary mode. This gives the reader object full control over the interpretation of the file’s contents. The optional dialect parameter is discussed below. The reader function also accepts several optional keyword arguments which define specific format settings for the parser (see the section “Formatting Parameters”). Readers are typically used as follows:
    csvreader = csv.reader(file("some.csv"))
    for row in csvreader:
        process(row)
Each row returned by a reader object is a list of strings or Unicode objects.
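As a rough, runnable illustration of the reader pattern, the sketch below uses the csv module from current Python 3 releases (where the input is text rather than a binary-mode file) and an in-memory buffer standing in for some.csv. The data is hypothetical.

```python
import csv
import io

# Hypothetical in-memory data standing in for a file on disk.
data = io.StringIO(
    'name,qty\r\nwidget,3\r\n"gadget, large",10\r\n',
    newline="",
)

# Each row comes back as a list of strings; no type conversion is done.
rows = [row for row in csv.reader(data)]
for row in rows:
    print(row)
```

Note that the quantity column is returned as the strings "3" and "10"; interpreting field values is left to the caller.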
When both a dialect parameter and individual formatting parameters are passed to the constructor, first the dialect is queried for formatting parameters, then individual formatting parameters are examined.
Creating writers is similar:
    obj = writer(fileobj [, dialect='excel'], [optional keyword args])
A writer object is a wrapper around a file-like object opened for writing in binary mode (if such a distinction is made). It accepts the same optional keyword parameters as the reader constructor.
Writers are typically used as follows:
    csvwriter = csv.writer(file("some.csv", "w"))
    for row in someiterable:
        csvwriter.writerow(row)
To generate a set of field names as the first row of the CSV file, the programmer must explicitly write it, e.g.:
    csvwriter = csv.writer(file("some.csv", "w"))
    csvwriter.writerow(names)
    for row in someiterable:
        csvwriter.writerow(row)
or arrange for it to be the first row in the iterable being written.
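A minimal sketch of writing a header row explicitly, using the csv module from current Python 3 releases and an in-memory buffer in place of a file. The field names and records are hypothetical.

```python
import csv
import io

names = ["name", "qty"]                      # hypothetical field names
records = [["widget", 3], ["gadget", 10]]    # hypothetical data rows

out = io.StringIO(newline="")
csvwriter = csv.writer(out)

# The header is just another row; the writer gives it no special treatment.
csvwriter.writerow(names)
for row in records:
    csvwriter.writerow(row)

text = out.getvalue()
print(repr(text))
```

Non-string values such as the integers here are converted to strings on output, and each row ends with the default '\r\n' terminator of the "excel" dialect.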
Because CSV is a somewhat ill-defined format, there are plenty of ways one CSV file can differ from another, yet contain exactly the same data. Many tools which can import or export tabular data allow the user to indicate the field delimiter, quote character, line terminator, and other characteristics of the file. These can be fairly easily determined, but are still mildly annoying to figure out, and make for fairly long function calls when specified individually.
To try and minimize the difficulty of figuring out and specifying a bunch of formatting parameters, reader and writer objects support a dialect argument which is just a convenient handle on a group of these lower level parameters. When a dialect is given as a string it identifies one of the dialects known to the module via its registration functions, otherwise it must be an instance of the Dialect class as described below.
Dialects will generally be named after applications or organizations which define specific sets of format constraints. Two dialects are defined in the module as of this writing, “excel”, which describes the default format constraints for CSV file export by Excel 97 and Excel 2000, and “excel-tab”, which is the same as “excel” but specifies an ASCII TAB character as the field delimiter.
Dialects are implemented as attribute only classes to enable users to construct variant dialects by subclassing. The “excel” dialect is a subclass of Dialect and is defined as follows:
    class Dialect:
        # placeholders
        delimiter = None
        quotechar = None
        escapechar = None
        doublequote = None
        skipinitialspace = None
        lineterminator = None
        quoting = None

    class excel(Dialect):
        delimiter = ','
        quotechar = '"'
        doublequote = True
        skipinitialspace = False
        lineterminator = '\r\n'
        quoting = QUOTE_MINIMAL
The “excel-tab” dialect is defined as:
    class exceltsv(excel):
        delimiter = '\t'
(For a description of the individual formatting parameters see the section “Formatting Parameters”.)
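Constructing a variant dialect by subclassing can be sketched as follows. This assumes the csv module as shipped with current Python 3 releases, where the "excel" dialect is available as csv.excel; the semicolon class and the sample line are hypothetical.

```python
import csv
import io


class semicolon(csv.excel):
    """Hypothetical variant: Excel conventions, but ';' as the delimiter."""
    delimiter = ";"


data = io.StringIO('a;"b;c";d\r\n', newline="")

# Only the delimiter is overridden; quoting rules are inherited from excel.
row = next(csv.reader(data, dialect=semicolon))
print(row)
```

Because the subclass changes a single attribute, everything else (quote character, quote doubling, line terminator) follows the parent dialect.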
To enable string references to specific dialects, the module defines several functions:
    dialect = get_dialect(name)
    names = list_dialects()
    register_dialect(name, dialect)
    unregister_dialect(name)
get_dialect() returns the dialect instance associated with the given name. list_dialects() returns a list of all registered dialect names. register_dialect() associates a string name with a dialect class. unregister_dialect() deletes a name/dialect association.
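The four registry functions can be exercised together as in the sketch below, which relies on the csv module from current Python 3 releases. The dialect name "pipes" and its class are hypothetical and exist only for this illustration.

```python
import csv


class pipes(csv.excel):
    """Hypothetical dialect used only for this illustration."""
    delimiter = "|"


# Associate a string name with the dialect class.
csv.register_dialect("pipes", pipes)
registered = "pipes" in csv.list_dialects()

# Look the dialect up by name and inspect one of its settings.
delim = csv.get_dialect("pipes").delimiter

# Remove the association again.
csv.unregister_dialect("pipes")
still_registered = "pipes" in csv.list_dialects()

print(registered, delim, still_registered)
```

Once registered, the name can be passed as the dialect argument of a reader or writer in place of the class itself.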
In addition to the dialect argument, both the reader and writer constructors take several specific formatting parameters, specified as keyword parameters. The formatting parameters understood are:
- quotechar specifies a one-character string to use as the quoting character. It defaults to '"'. Setting this to None has the same effect as setting quoting to csv.QUOTE_NONE.
- delimiter specifies a one-character string to use as the field separator. It defaults to ','.
- escapechar specifies a one-character string used to escape the delimiter when quotechar is set to None.
- skipinitialspace specifies how to interpret whitespace which immediately follows a delimiter. It defaults to False, which means that whitespace immediately following a delimiter is part of the following field.
- lineterminator specifies the character sequence which should terminate rows.
- quoting controls when quotes should be generated by the writer. It takes one of the module's quoting constants, for example csv.QUOTE_MINIMAL or csv.QUOTE_NONE.
- doublequote controls the handling of quotes inside fields. When True, two consecutive quotes are interpreted as one during read, and when writing, each quote is written as two quotes.
When processing a dialect setting and one or more of the other optional parameters, the dialect parameter is processed before the individual formatting parameters. This makes it easy to choose a dialect, then override one or more of the settings without defining a new dialect class. For example, if a CSV file was generated by Excel 2000 using single quotes as the quote character and a colon as the delimiter, you could create a reader like:
    csvreader = csv.reader(file("some.csv"), dialect="excel",
                           quotechar="'", delimiter=':')
Other details of how Excel generates CSV files would be handled automatically because of the reference to the “excel” dialect.
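A self-contained version of this override, run against hypothetical in-memory data with the csv module from current Python 3 releases, might look like this:

```python
import csv
import io

# Hypothetical line using ':' as delimiter and single quotes for quoting.
data = io.StringIO("'a:b':c:'it''s'\r\n", newline="")

# Start from the "excel" dialect, then override two of its settings.
csvreader = csv.reader(data, dialect="excel", quotechar="'", delimiter=":")
row = next(csvreader)
print(row)
```

The doubled single quote inside the last field collapses to one because doublequote is inherited from the "excel" dialect, even though the quote character itself was overridden.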
Reader objects are iterables whose next() method returns a sequence of strings, one string per field in the row.
Writer objects have two methods, writerow() and writerows(). The former accepts an iterable (typically a list) of fields which are to be written to the output. The latter accepts a list of iterables and calls writerow() for each.
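The two writer methods can be compared in a short sketch, again using the csv module from current Python 3 releases with an in-memory buffer; the rows are hypothetical.

```python
import csv
import io

out = io.StringIO(newline="")
csvwriter = csv.writer(out)

# writerow() takes the fields of a single row.
csvwriter.writerow(["id", "comment"])

# writerows() takes several rows at once.
csvwriter.writerows([
    [1, 'said "hi"'],
    [2, "a, b"],
])

text = out.getvalue()
print(text)
```

With the default settings, only fields containing the delimiter or the quote character are quoted, and embedded quotes are doubled.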
There is a sample implementation available.[1] The goal is for it to efficiently implement the API described in the PEP. It is heavily based on the Object Craft csv module.[2]
The sample implementation[1] includes a set of test cases.
    csvreader = csv.reader(codecs.open("some.csv", "r", "cp1252"))
    csvwriter = csv.writer(codecs.open("some.csv", "w", "utf-8"))
In the first example, text would be assumed to be encoded as cp1252. Should the system be aggressive in converting to Unicode or should Unicode strings only be returned if necessary?
In the second example, the file will take care of automatically encoding Unicode strings as utf-8 before writing to disk.
Note: As of this writing, the csv module doesn’t handle Unicode data.
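For comparison, the sketch below shows how the same idea is commonly expressed with the csv module in current Python 3 releases, where files are opened in text mode with an explicit encoding. This reflects later behaviour of the shipped module, not the state of the implementation described in this PEP; the sample data and temporary path are hypothetical.

```python
import csv
import os
import tempfile

rows = [["name", "city"], ["Zoë", "Kraków"]]  # hypothetical sample data
path = os.path.join(tempfile.mkdtemp(), "some.csv")

# newline="" leaves line-ending handling to the csv module.
with open(path, "w", encoding="utf-8", newline="") as f:
    csv.writer(f).writerows(rows)

with open(path, "r", encoding="utf-8", newline="") as f:
    read_back = list(csv.reader(f))

print(read_back)
```

Here the file object handles encoding and decoding, much as the codecs.open() examples above suggest, and the reader and writer deal only in text strings.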
When the escapechar parameter is not None and the quoting parameter is set to QUOTE_NONE, delimiters appearing within fields will be prefixed by the escape character when writing and are expected to be prefixed by the escape character when reading.
Specify the lineterminator sequence as '\r\n'. The resulting file will be written correctly.
There are many references to other CSV-related projects on the Web. A few are included here.
This document has been placed in the public domain.
Source: https://github.com/python/peps/blob/main/peps/pep-0305.rst
Last modified: 2025-02-01 08:59:27 GMT