Reading and Writing CSV files#
Arrow supports reading and writing columnar data from/to CSV files. The features currently offered are the following:

- multi-threaded or single-threaded reading
- automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)
- fetching column names from the first row in the CSV file
- column-wise type inference and conversion to one of null, int64, float64, date32, time32[s], timestamp[s], timestamp[ns], duration (from numeric strings), string or binary data
- opportunistic dictionary encoding of string and binary columns (disabled by default)
- detecting various spellings of null values such as NaN or #N/A
- writing CSV files with options to configure the exact output format
Usage#
CSV reading and writing functionality is available through the pyarrow.csv module. In many cases, you will simply call the read_csv() function with the file path you want to read from:
>>> from pyarrow import csv
>>> fn = 'tips.csv.gz'
>>> table = csv.read_csv(fn)
>>> table
pyarrow.Table
total_bill: double
tip: double
sex: string
smoker: string
day: string
time: string
size: int64
>>> len(table)
244
>>> df = table.to_pandas()
>>> df.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4
To write CSV files, just call write_csv() with a pyarrow.RecordBatch or pyarrow.Table and a path or file-like object:
>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> csv.write_csv(table, "tips.csv")
>>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
...     csv.write_csv(table, out)
Note
The writer does not yet support all Arrow types.
Customized parsing#
To alter the default parsing settings when reading CSV files with an unusual structure, you should create a ParseOptions instance and pass it to read_csv():
import pyarrow as pa
import pyarrow.csv as csv

def skip_handler(row):
    # Called for each row that fails to parse; return "skip" to
    # discard the row or "error" to raise.
    return "skip"

table = csv.read_csv('tips.csv.gz',
                     parse_options=csv.ParseOptions(delimiter=";",
                                                    invalid_row_handler=skip_handler))
Available parsing options are:
delimiter
    The character delimiting individual cells in the CSV data.
quote_char
    The character used optionally for quoting CSV values (False if quoting is not allowed).
double_quote
    Whether two quotes in a quoted CSV value denote a single quote in the data.
escape_char
    The character used optionally for escaping special characters (False if escaping is not allowed).
newlines_in_values
    Whether newline characters are allowed in CSV values.
ignore_empty_lines
    Whether empty lines are ignored in CSV input.
invalid_row_handler
    Optional handler for invalid rows.
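For example, a non-default delimiter can be exercised without touching the filesystem by parsing an in-memory buffer; a minimal sketch (the data and column names below are made up for illustration):

import pyarrow as pa
import pyarrow.csv as csv

# Parse semicolon-delimited data from an in-memory buffer.
data = b"a;b\n1;2\n3;4\n"
table = csv.read_csv(pa.BufferReader(data),
                     parse_options=csv.ParseOptions(delimiter=";"))
print(table)  # two inferred int64 columns named "a" and "b"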
See also
For more examples see ParseOptions.
Customized conversion#
To alter how CSV data is converted to Arrow types and data, you should create a ConvertOptions instance and pass it to read_csv():
import pyarrow as pa
import pyarrow.csv as csv

table = csv.read_csv('tips.csv.gz',
                     convert_options=csv.ConvertOptions(
                         column_types={
                             'total_bill': pa.decimal128(precision=10, scale=2),
                             'tip': pa.decimal128(precision=10, scale=2),
                         }))
Note
To assign a column as duration, the CSV values must be numeric strings that match the expected unit (e.g. 60000 for 60 seconds when using duration[ms]).
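To illustrate the note above, here is a small sketch (with made-up data) that reads a column as duration[ms]:

import pyarrow as pa
import pyarrow.csv as csv

# "60000" becomes a 60-second duration when the column is typed duration[ms].
data = b"elapsed\n60000\n1500\n"
table = csv.read_csv(pa.BufferReader(data),
                     convert_options=csv.ConvertOptions(
                         column_types={'elapsed': pa.duration('ms')}))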
Available convert options are:
check_utf8
    Whether to check UTF8 validity of string columns.
column_types
    Explicitly map column names to column types.
null_values
    A sequence of strings that denote nulls in the data.
true_values
    A sequence of strings that denote true booleans in the data.
false_values
    A sequence of strings that denote false booleans in the data.
decimal_point
    The character used as decimal point in floating-point and decimal data.
timestamp_parsers
    A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given).
strings_can_be_null
    Whether string / binary columns can have null values.
quoted_strings_can_be_null
    Whether quoted values can be null.
auto_dict_encode
    Whether to try to automatically dict-encode string / binary data.
auto_dict_max_cardinality
    The maximum dictionary cardinality for auto_dict_encode.
include_columns
    The names of columns to include in the Table.
include_missing_columns
    If false, columns in include_columns but not in the CSV file will error out.
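As a small illustration of these options (with made-up data), the following treats a custom marker as a null value:

import pyarrow as pa
import pyarrow.csv as csv

# Treat the string "missing" as a null value during conversion.
data = b"x,y\n1,hello\nmissing,world\n"
table = csv.read_csv(pa.BufferReader(data),
                     convert_options=csv.ConvertOptions(null_values=["missing"]))
# Column "x" is inferred as int64, with a null in the second row.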
See also
For more examples see ConvertOptions.
Incremental reading#
For memory-constrained environments, it is also possible to read a CSV file one batch at a time, using open_csv(); see the sketch after the caveats below.
There are a few caveats:
- For now, the incremental reader is always single-threaded (regardless of ReadOptions.use_threads).
- Type inference is done on the first block and types are frozen afterwards; to make sure the right data types are inferred, either set ReadOptions.block_size to a large enough value, or use ConvertOptions.column_types to set the desired data types explicitly.
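A minimal sketch of incremental reading (the print call stands in for real per-batch processing):

import pyarrow as pa
import pyarrow.csv as csv

with csv.open_csv('tips.csv.gz') as reader:
    for batch in reader:
        # Each item is a pyarrow.RecordBatch; process it and drop the
        # reference so memory can be reclaimed before the next batch.
        print(batch.num_rows)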
Character encoding#
By default, CSV files are expected to be encoded in UTF8. Non-UTF8 data is accepted for binary columns. The encoding can be changed using the ReadOptions class:
import pyarrow as pa
import pyarrow.csv as csv

table = csv.read_csv('tips.csv.gz',
                     read_options=csv.ReadOptions(encoding='latin1'))
Available read options are:
use_threads
    Whether to use multiple threads to accelerate reading.
block_size
    How many bytes to process at a time from the input stream.
skip_rows
    The number of rows to skip before the column names (if any) and the CSV data.
skip_rows_after_names
    The number of rows to skip after the column names.
column_names
    The column names of the target table.
autogenerate_column_names
    Whether to autogenerate column names if column_names is empty.
encoding
    The character encoding of the CSV data.
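For instance, skip_rows and column_names can be combined to replace a file's header row; a minimal sketch (in-memory data made up for illustration):

import pyarrow as pa
import pyarrow.csv as csv

# Skip the file's own header row and supply explicit column names.
data = b"col1,col2,col3\nflamingo,2,2022-01-01\nhorse,4,2022-01-02\n"
table = csv.read_csv(pa.BufferReader(data),
                     read_options=csv.ReadOptions(column_names=["animals", "n_legs", "entry"],
                                                  skip_rows=1))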
See also
For more examples see ReadOptions.
Customized writing#
To alter the default write settings when writing CSV files with different conventions, you can create a WriteOptions instance and pass it to write_csv():
>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> # Omit the header row (include_header=True is the default)
>>> options = csv.WriteOptions(include_header=False)
>>> csv.write_csv(table, "data.csv", options)
Incremental writing#
To write CSV files one batch at a time, create a CSVWriter. This requires the output (a path or file-like object), the schema of the data to be written, and optionally write options as described above:
>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> with csv.CSVWriter("data.csv", table.schema) as writer:
...     writer.write_table(table)
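Data can also be written batch by batch, which is useful when batches are produced incrementally; a short sketch reusing the table from above:

>>> with csv.CSVWriter("data.csv", table.schema) as writer:
...     for batch in table.to_batches():
...         writer.write_batch(batch)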
Performance#
Due to the structure of CSV files, one cannot expect the same levels of performance as when reading dedicated binary formats like Parquet. Nevertheless, Arrow strives to reduce the overhead of reading CSV files. A reasonable expectation is at least 100 MB/s per core on a performant desktop or laptop computer (measured in source CSV bytes, not target Arrow data bytes).
Performance options can be controlled through the ReadOptions class. Multi-threaded reading is the default for highest performance, distributing the workload efficiently over all available cores.
Note
The number of concurrent threads is automatically inferred by Arrow. You can inspect and change it using the cpu_count() and set_cpu_count() functions, respectively.
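For example (the inferred count depends on your machine):

>>> import pyarrow as pa
>>> n = pa.cpu_count()   # number of threads Arrow will use by default
>>> pa.set_cpu_count(4)  # restrict Arrow to 4 threads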

