Reading and Writing CSV files

Arrow supports reading and writing columnar data from/to CSV files. The features currently offered are the following:

  • multi-threaded or single-threaded reading

  • automatic decompression of input files (based on the filename extension, such as my_data.csv.gz)

  • fetching column names from the first row in the CSV file

  • column-wise type inference and conversion to one of null, int64, float64, date32, time32[s], timestamp[s], timestamp[ns], duration (from numeric strings), string or binary data

  • opportunistic dictionary encoding of string and binary columns (disabled by default)

  • detecting various spellings of null values such as NaN or #N/A

  • writing CSV files with options to configure the exact output format

Usage

CSV reading and writing functionality is available through the pyarrow.csv module. In many cases, you will simply call the read_csv() function with the file path you want to read from:

>>> from pyarrow import csv
>>> fn = 'tips.csv.gz'
>>> table = csv.read_csv(fn)
>>> table
pyarrow.Table
total_bill: double
tip: double
sex: string
smoker: string
day: string
time: string
size: int64
>>> len(table)
244
>>> df = table.to_pandas()
>>> df.head()
   total_bill   tip     sex smoker  day    time  size
0       16.99  1.01  Female     No  Sun  Dinner     2
1       10.34  1.66    Male     No  Sun  Dinner     3
2       21.01  3.50    Male     No  Sun  Dinner     3
3       23.68  3.31    Male     No  Sun  Dinner     2
4       24.59  3.61  Female     No  Sun  Dinner     4

To write CSV files, just call write_csv() with a pyarrow.RecordBatch or pyarrow.Table and a path or file-like object:

>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> csv.write_csv(table, "tips.csv")
>>> with pa.CompressedOutputStream("tips.csv.gz", "gzip") as out:
...     csv.write_csv(table, out)

Note

The writer does not yet support all Arrow types.

Customized parsing

To alter the default parsing settings in case of reading CSV files with an unusual structure, you should create a ParseOptions instance and pass it to read_csv():

import pyarrow as pa
import pyarrow.csv as csv

# Handler invoked for each row that fails parsing; returning "skip"
# drops the row, while returning "error" raises instead.
def skip_handler(row):
    return "skip"

table = csv.read_csv('tips.csv.gz',
                     parse_options=csv.ParseOptions(
                         delimiter=";",
                         invalid_row_handler=skip_handler))

Available parsing options are listed below, followed by a short combined example:

delimiter

The character delimiting individual cells in the CSV data.

quote_char

The character used optionally for quoting CSV values (False if quoting is not allowed).

double_quote

Whether two quotes in a quoted CSV value denote a single quote in the data.

escape_char

The character used optionally for escaping special characters (False if escaping is not allowed).

newlines_in_values

Whether newline characters are allowed in CSV values.

ignore_empty_lines

Whether empty lines are ignored in CSV input.

invalid_row_handler

Optional handler for invalid rows.
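
Several of the options above can be combined. As a minimal sketch (the inline data and option values here are made up for illustration), the following parses semicolon-delimited input in which quoted values may contain newlines:

import io
import pyarrow.csv as csv

# Made-up data: ';'-delimited, with a quoted cell spanning two lines
data = b'name;notes\nAlice;"first line\nsecond line"\nBob;ok\n'
table = csv.read_csv(
    io.BytesIO(data),
    parse_options=csv.ParseOptions(delimiter=";",
                                   newlines_in_values=True))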

See also

For more examples see ParseOptions.

Customized conversion

To alter how CSV data is converted to Arrow types and data, you should create a ConvertOptions instance and pass it to read_csv():

import pyarrow as pa
import pyarrow.csv as csv

table = csv.read_csv('tips.csv.gz',
                     convert_options=csv.ConvertOptions(
                         column_types={
                             'total_bill': pa.decimal128(precision=10, scale=2),
                             'tip': pa.decimal128(precision=10, scale=2),
                         }))

Note

To assign a column as duration, the CSV values must be numeric strings that match the expected unit (e.g. 60000 for 60 seconds when using duration[ms]).
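
A minimal sketch of such a conversion (the column names and values are invented for the example):

import io
import pyarrow as pa
import pyarrow.csv as csv

# Made-up data: the 'wait' column holds millisecond counts as numeric strings
data = b"event,wait\nstart,60000\nstop,500\n"
table = csv.read_csv(
    io.BytesIO(data),
    convert_options=csv.ConvertOptions(
        column_types={'wait': pa.duration('ms')}))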

Available convert options are listed below, followed by a short example of custom null handling:

check_utf8

Whether to check UTF8 validity of string columns.

column_types

Explicitly map column names to column types.

null_values

A sequence of strings that denote nulls in the data.

true_values

A sequence of strings that denote true booleans in the data.

false_values

A sequence of strings that denote false booleans in the data.

decimal_point

The character used as decimal point in floating-point and decimal data.

timestamp_parsers

A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given).

strings_can_be_null

Whether string / binary columns can have null values.

quoted_strings_can_be_null

Whether quoted values can be null.

auto_dict_encode

Whether to try to automatically dict-encode string / binary data.

auto_dict_max_cardinality

The maximum dictionary cardinality for auto_dict_encode.

include_columns

The names of columns to include in the Table.

include_missing_columns

If false, columns in include_columns but not in the CSV file will error out.
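
For instance, custom null spellings can be recognized as follows (a sketch with made-up data and a made-up "(missing)" marker):

import io
import pyarrow.csv as csv

# Made-up data using '(missing)' as a null marker
data = b"city,score\nParis,(missing)\nTokyo,3\n"
table = csv.read_csv(
    io.BytesIO(data),
    convert_options=csv.ConvertOptions(null_values=["(missing)"],
                                       strings_can_be_null=True))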

See also

For more examples see ConvertOptions.

Incremental reading

For memory-constrained environments, it is also possible to read a CSV file one batch at a time, using open_csv() (a short example follows the caveats below).

There are a few caveats:

  1. For now, the incremental reader is always single-threaded (regardless of ReadOptions.use_threads)

  2. Type inference is done on the first block and types are frozen afterwards; to make sure the right data types are inferred, either set ReadOptions.block_size to a large enough value, or use ConvertOptions.column_types to set the desired data types explicitly.
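
A sketch of batch-at-a-time reading, assuming the tips.csv.gz file from above:

import pyarrow.csv as csv

total_rows = 0
reader = csv.open_csv('tips.csv.gz')
for batch in reader:          # each item is a pyarrow.RecordBatch
    total_rows += batch.num_rows
print(total_rows)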

Character encoding

By default, CSV files are expected to be encoded in UTF8. Non-UTF8 data is accepted for binary columns. The encoding, along with other read settings such as column names and skipped rows, can be changed using the ReadOptions class:

import pyarrow as pa
import pyarrow.csv as csv

table = csv.read_csv('tips.csv.gz',
                     read_options=csv.ReadOptions(
                         column_names=["animals", "n_legs", "entry"],
                         skip_rows=1))
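
For example, a Latin-1 encoded file could be read as follows (a minimal sketch; the file name is a placeholder, and 'latin1' stands in for whatever codec your data uses):

import pyarrow.csv as csv

# Hypothetical Latin-1 encoded input file
table = csv.read_csv('data_latin1.csv',
                     read_options=csv.ReadOptions(encoding='latin1'))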

Available read options are:

use_threads

Whether to use multiple threads to accelerate reading.

block_size

How many bytes to process at a time from the input stream.

skip_rows

The number of rows to skip before the column names (if any) and the CSV data.

skip_rows_after_names

The number of rows to skip after the column names.

column_names

The column names of the target table.

autogenerate_column_names

Whether to autogenerate column names if column_names is empty.

encoding

The character encoding of the CSV data (UTF8 by default).

See also

For more examples see ReadOptions.

Customized writing

To alter the default write settings in case of writing CSV files with different conventions, you can create a WriteOptions instance and pass it to write_csv():

>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> # Omit the header row (include_header=True is the default)
>>> options = csv.WriteOptions(include_header=False)
>>> csv.write_csv(table, "data.csv", options)

Incremental writing

To write CSV files one batch at a time, create a CSVWriter. This requires the output (a path or file-like object), the schema of the data to be written, and optionally write options as described above:

>>> import pyarrow as pa
>>> import pyarrow.csv as csv
>>> with csv.CSVWriter("data.csv", table.schema) as writer:
...     writer.write_table(table)
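
The writer can also be fed record batches one at a time; a sketch, reusing the table read earlier:

>>> with csv.CSVWriter("data.csv", table.schema) as writer:
...     for batch in table.to_batches():
...         writer.write_batch(batch)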

Performance

Due to the structure of CSV files, one cannot expect the same levels of performance as when reading dedicated binary formats like Parquet. Nevertheless, Arrow strives to reduce the overhead of reading CSV files. A reasonable expectation is at least 100 MB/s per core on a performant desktop or laptop computer (measured in source CSV bytes, not target Arrow data bytes).

Performance options can be controlled through the ReadOptions class. Multi-threaded reading is the default for highest performance, distributing the workload efficiently over all available cores.

Note

The number of concurrent threads is automatically inferred by Arrow. You can inspect and change it using the cpu_count() and set_cpu_count() functions, respectively.
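
For instance (the reported count is machine-dependent):

>>> import pyarrow as pa
>>> pa.cpu_count()        # inspect the current thread count
8
>>> pa.set_cpu_count(4)   # e.g. cap Arrow at 4 threads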