Reading and Writing CSV files

Arrow supports reading and writing columnar data from/to CSV files. The features currently offered are the following:

  • multi-threaded or single-threaded reading

  • automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)

  • fetching column names from the first row in the CSV file

  • column-wise type inference and conversion to one of null, int64, float64, date32, time32[s], timestamp[s], timestamp[ns], duration (from numeric strings), string or binary data

  • opportunistic dictionary encoding of string and binary columns (disabled by default)

  • detecting various spellings of null values such as NaN or #N/A

  • writing CSV files with options to configure the exact output format

Usage

CSV reading and writing functionality is available through the pyarrow.csv module. In many cases, you will simply call the read_csv() function with the file path you want to read from:

>>> from pyarrow import csv
>>> fn = 'tips.csv.gz'
>>> table = csv.read_csv(fn)
>>> table
pyarrow.Table
total_bill: double
tip: double
sex: string
smoker: string
day: string
time: string
size: int64
>>> len(table)
244
>>> df = table.to_pandas()
>>> df.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

To write CSV files, just call write_csv() with a pyarrow.RecordBatch or pyarrow.Table and a path or file-like object:

>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> csv.write_csv(table, "tips.csv")
>>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
...     csv.write_csv(table, out)

Note

The writer does not yet support all Arrow types.

Customized parsing

To alter the default parsing settings in case of reading CSV files with an unusual structure, you should create a ParseOptions instance and pass it to read_csv():

import pyarrow as pa
import pyarrow.csv as csv

# Handler invoked for each row that fails parsing; returning "skip"
# drops the row, while returning "error" raises instead.
def skip_handler(row):
    return "skip"

table = csv.read_csv('tips.csv.gz',
                     parse_options=csv.ParseOptions(
                         delimiter=";",
                         invalid_row_handler=skip_handler))

Available parsing options are listed below, followed by a short combined example:

delimiter

The character delimiting individual cells in the CSV data.

quote_char

The character used optionally for quoting CSV values (False if quoting is not allowed).

double_quote

Whether two quotes in a quoted CSV value denote a single quote in the data.

escape_char

The character used optionally for escaping special characters (False if escaping is not allowed).

newlines_in_values

Whether newline characters are allowed in CSV values.

ignore_empty_lines

Whether empty lines are ignored in CSV input.

invalid_row_handler

Optional handler for invalid rows.
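
Several of the options above can be combined. As a minimal sketch (the inline data and option values here are made up for illustration), the following parses semicolon-delimited input in which quoted values may contain newlines:

import io
import pyarrow.csv as csv

# Made-up data: ';'-delimited, with a quoted cell spanning two lines
data = b'name;notes\nAlice;"first line\nsecond line"\nBob;ok\n'
table = csv.read_csv(
    io.BytesIO(data),
    parse_options=csv.ParseOptions(delimiter=";",
                                   newlines_in_values=True))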

See also

For more examples see ParseOptions.

Customized conversion

To alter how CSV data is converted to Arrow types and data, you should create a ConvertOptions instance and pass it to read_csv():

import pyarrow as pa
import pyarrow.csv as csv

table = csv.read_csv('tips.csv.gz',
                     convert_options=csv.ConvertOptions(
                         column_types={
                             'total_bill': pa.decimal128(precision=10, scale=2),
                             'tip': pa.decimal128(precision=10, scale=2),
                         }))

Note

To assign a column as duration, the CSV values must be numeric strings that match the expected unit (e.g. 60000 for 60 seconds when using duration[ms]).
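
A minimal sketch of such a conversion (the column names and values are invented for the example):

import io
import pyarrow as pa
import pyarrow.csv as csv

# Made-up data: the 'wait' column holds millisecond counts as numeric strings
data = b"event,wait\nstart,60000\nstop,500\n"
table = csv.read_csv(
    io.BytesIO(data),
    convert_options=csv.ConvertOptions(
        column_types={'wait': pa.duration('ms')}))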

Available convert options are listed below, followed by a short example of custom null handling:

check_utf8

Whether to check UTF8 validity of string columns.

column_types

Explicitly map column names to column types.

null_values

A sequence of strings that denote nulls in the data.

true_values

A sequence of strings that denote true booleans in the data.

false_values

A sequence of strings that denote false booleans in the data.

decimal_point

The character used as decimal point in floating-point and decimal data.

timestamp_parsers

A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given).

strings_can_be_null

Whether string / binary columns can have null values.

quoted_strings_can_be_null

Whether quoted values can be null.

auto_dict_encode

Whether to try to automatically dict-encode string / binary data.

auto_dict_max_cardinality

The maximum dictionary cardinality for auto_dict_encode.

include_columns

The names of columns to include in the Table.

include_missing_columns

If false, columns in include_columns but not in the CSV file will error out.
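
For instance, custom null spellings can be recognized as follows (a sketch with made-up data and a made-up "(missing)" marker):

import io
import pyarrow.csv as csv

# Made-up data using '(missing)' as a null marker
data = b"city,score\nParis,(missing)\nTokyo,3\n"
table = csv.read_csv(
    io.BytesIO(data),
    convert_options=csv.ConvertOptions(null_values=["(missing)"],
                                       strings_can_be_null=True))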

See also

For more examples see ConvertOptions.

Incremental reading

For memory-constrained environments, it is also possible to read a CSV file one batch at a time, using open_csv() (a short example follows the caveats below).

There are a few caveats:

  1. For now, the incremental reader is always single-threaded (regardless of ReadOptions.use_threads)

  2. Type inference is done on the first block and types are frozen afterwards; to make sure the right data types are inferred, either set ReadOptions.block_size to a large enough value, or use ConvertOptions.column_types to set the desired data types explicitly.
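
A sketch of batch-at-a-time reading, assuming the tips.csv.gz file from above:

import pyarrow.csv as csv

total_rows = 0
reader = csv.open_csv('tips.csv.gz')
for batch in reader:          # each item is a pyarrow.RecordBatch
    total_rows += batch.num_rows
print(total_rows)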

Character encoding

By default, CSV files are expected to be encoded in UTF8. Non-UTF8 data is accepted for binary columns. The encoding, along with other read settings such as column names and skipped rows, can be changed using the ReadOptions class:

import pyarrow as pa
import pyarrow.csv as csv

table = csv.read_csv('tips.csv.gz',
                     read_options=csv.ReadOptions(
                         column_names=["animals", "n_legs", "entry"],
                         skip_rows=1))
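
For example, a Latin-1 encoded file could be read as follows (a minimal sketch; the file name is a placeholder, and 'latin1' stands in for whatever codec your data uses):

import pyarrow.csv as csv

# Hypothetical Latin-1 encoded input file
table = csv.read_csv('data_latin1.csv',
                     read_options=csv.ReadOptions(encoding='latin1'))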

Available read options are:

use_threads

Whether to use multiple threads to accelerate reading.

block_size

How many bytes to process at a time from the input stream.

skip_rows

The number of rows to skip before the column names (if any) and the CSV data.

skip_rows_after_names

The number of rows to skip after the column names.

column_names

The column names of the target table.

autogenerate_column_names

Whether to autogenerate column names if column_names is empty.

encoding

The character encoding of the CSV data (UTF8 by default).

See also

For more examples see ReadOptions.

Customized writing

To alter the default write settings in case of writing CSV files with different conventions, you can create a WriteOptions instance and pass it to write_csv():

>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> # Omit the header row (include_header=True is the default)
>>> options = csv.WriteOptions(include_header=False)
>>> csv.write_csv(table, "data.csv", options)

Incremental writing

To write CSV files one batch at a time, create a CSVWriter. This requires the output (a path or file-like object), the schema of the data to be written, and optionally write options as described above:

>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> with csv.CSVWriter("data.csv", table.schema) as writer:
...     writer.write_table(table)
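
The writer can also be fed record batches one at a time; a sketch, reusing the table read earlier:

>>> with csv.CSVWriter("data.csv", table.schema) as writer:
...     for batch in table.to_batches():
...         writer.write_batch(batch)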

Performance

Due to the structure of CSV files, one cannot expect the same levels of performance as when reading dedicated binary formats like Parquet. Nevertheless, Arrow strives to reduce the overhead of reading CSV files. A reasonable expectation is at least 100 MB/s per core on a performant desktop or laptop computer (measured in source CSV bytes, not target Arrow data bytes).

Performance options can be controlled through the ReadOptions class. Multi-threaded reading is the default for highest performance, distributing the workload efficiently over all available cores.

Note

The number of concurrent threads is automatically inferred by Arrow. You can inspect and change it using the cpu_count() and set_cpu_count() functions, respectively.
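
For instance (the reported count is machine-dependent):

>>> import pyarrow as pa
>>> pa.cpu_count()        # inspect the current thread count
8
>>> pa.set_cpu_count(4)   # e.g. cap Arrow at 4 threads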