pyarrow.csv.ConvertOptions#

class pyarrow.csv.ConvertOptions(check_utf8=None, *, column_types=None, null_values=None, true_values=None, false_values=None, decimal_point=None, strings_can_be_null=None, quoted_strings_can_be_null=None, include_columns=None, include_missing_columns=None, auto_dict_encode=None, auto_dict_max_cardinality=None, timestamp_parsers=None)#

Bases: _Weakrefable

Options for converting CSV data.

Parameters:
check_utf8 bool, optional (default True)

Whether to check UTF8 validity of string columns.

column_types pyarrow.Schema or dict, optional

Explicitly map column names to column types. Passing this argument disables type inference on the defined columns.

null_values list, optional

A sequence of strings that denote nulls in the data (defaults are appropriate in most cases). Note that by default, string columns are not checked for null values. To enable null checking for those, specify strings_can_be_null=True.

true_values list, optional

A sequence of strings that denote true booleans in the data (defaults are appropriate in most cases).

false_values list, optional

A sequence of strings that denote false booleans in the data (defaults are appropriate in most cases).

decimal_point 1-character str, optional (default '.')

The character used as decimal point in floating-point and decimal data.

strings_can_be_null bool, optional (default False)

Whether string / binary columns can have null values. If true, then strings in null_values are considered null for string columns. If false, then all strings are valid string values.

quoted_strings_can_be_null bool, optional (default True)

Whether quoted values can be null. If true, then strings in null_values are also considered null when they appear quoted in the CSV file. Otherwise, quoted values are never considered null.

include_columns list, optional

The names of columns to include in the Table. If empty, the Table will include all columns from the CSV file. If not empty, only these columns will be included, in this order.

include_missing_columns bool, optional (default False)

If false, columns in include_columns but not in the CSV file will error out. If true, columns in include_columns but not in the CSV file will produce a column of nulls (whose type is selected using column_types, or null by default). This option is ignored if include_columns is empty.

auto_dict_encode bool, optional (default False)

Whether to try to automatically dict-encode string / binary data. If true, then when type inference detects a string or binary column, it is dict-encoded up to auto_dict_max_cardinality distinct values (per chunk), after which it switches to regular encoding. This setting is ignored for non-inferred columns (those in column_types).

auto_dict_max_cardinality int, optional

The maximum dictionary cardinality for auto_dict_encode. This value is per chunk.

timestamp_parsers list, optional

A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given). By default, a fast built-in ISO-8601 parser is used.

Examples

Define example data:

>>> import io
>>> s = (
...     "animals,n_legs,entry,fast\n"
...     "Flamingo,2,01/03/2022,Yes\n"
...     "Horse,4,02/03/2022,Yes\n"
...     "Brittle stars,5,03/03/2022,No\n"
...     "Centipede,100,04/03/2022,No\n"
...     ",6,05/03/2022,"
... )
>>> print(s)
animals,n_legs,entry,fast
Flamingo,2,01/03/2022,Yes
Horse,4,02/03/2022,Yes
Brittle stars,5,03/03/2022,No
Centipede,100,04/03/2022,No
,6,05/03/2022,

Change the type of a column:

>>> import pyarrow as pa
>>> from pyarrow import csv
>>> convert_options = csv.ConvertOptions(column_types={"n_legs": pa.float64()})
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: double
entry: string
fast: string
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]
entry: [["01/03/2022","02/03/2022","03/03/2022","04/03/2022","05/03/2022"]]
fast: [["Yes","Yes","No","No",""]]

Define a date parsing format to get a timestamp type column (in case dates are not in ISO format and not converted by default):

>>> convert_options = csv.ConvertOptions(
...     timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"])
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
entry: timestamp[s]
fast: string
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]
entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,2022-04-03 00:00:00,2022-05-03 00:00:00]]
fast: [["Yes","Yes","No","No",""]]

Specify a subset of columns to be read:

>>> convert_options = csv.ConvertOptions(
...     include_columns=["animals", "n_legs"])
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]

List an additional column to be included as a null-typed column:

>>> convert_options = csv.ConvertOptions(
...     include_columns=["animals", "n_legs", "location"],
...     include_missing_columns=True)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
location: null
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]
n_legs: [[2,4,5,100,6]]
location: [5 nulls]

Define columns as dictionary type (by default only the string/binary columns are dictionary encoded):

>>> convert_options = csv.ConvertOptions(
...     timestamp_parsers=["%m/%d/%Y", "%m-%d-%Y"],
...     auto_dict_encode=True)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: dictionary<values=string, indices=int32, ordered=0>
n_legs: int64
entry: timestamp[s]
fast: dictionary<values=string, indices=int32, ordered=0>
----
animals: [  -- dictionary:
["Flamingo","Horse","Brittle stars","Centipede",""]  -- indices:
[0,1,2,3,4]]
n_legs: [[2,4,5,100,6]]
entry: [[2022-01-03 00:00:00,2022-02-03 00:00:00,2022-03-03 00:00:00,2022-04-03 00:00:00,2022-05-03 00:00:00]]
fast: [  -- dictionary:
["Yes","No",""]  -- indices:
[0,0,1,1,2]]

Set an upper limit for the number of categories. If the number of categories exceeds the limit, the conversion to dictionary will not happen:

>>> convert_options = csv.ConvertOptions(
...     include_columns=["animals"],
...     auto_dict_encode=True,
...     auto_dict_max_cardinality=2)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",""]]

Set empty strings to missing values:

>>> convert_options = csv.ConvertOptions(include_columns=["animals", "n_legs"],
...     strings_can_be_null=True)
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
animals: string
n_legs: int64
----
animals: [["Flamingo","Horse","Brittle stars","Centipede",null]]
n_legs: [[2,4,5,100,6]]

Define values to be True and False when converting a column into a bool type:

>>> convert_options = csv.ConvertOptions(
...     include_columns=["fast"],
...     false_values=["No"],
...     true_values=["Yes"])
>>> csv.read_csv(io.BytesIO(s.encode()), convert_options=convert_options)
pyarrow.Table
fast: bool
----
fast: [[true,true,false,false,null]]
__init__(*args, **kwargs)#

Methods

__init__(*args, **kwargs)

equals(self, ConvertOptions other)

validate(self)

Attributes

auto_dict_encode

Whether to try to automatically dict-encode string / binary data.

auto_dict_max_cardinality

The maximum dictionary cardinality for auto_dict_encode.

check_utf8

Whether to check UTF8 validity of string columns.

column_types

Explicitly map column names to column types.

decimal_point

The character used as decimal point in floating-point and decimal data.

false_values

A sequence of strings that denote false booleans in the data.

include_columns

The names of columns to include in the Table.

include_missing_columns

If false, columns in include_columns but not in the CSV file will error out.

null_values

A sequence of strings that denote nulls in the data.

quoted_strings_can_be_null

Whether quoted values can be null.

strings_can_be_null

Whether string / binary columns can have null values.

timestamp_parsers

A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given).

true_values

A sequence of strings that denote true booleans in the data.

auto_dict_encode#

Whether to try to automatically dict-encode string / binary data.

auto_dict_max_cardinality#

The maximum dictionary cardinality for auto_dict_encode.

This value is per chunk.

check_utf8#

Whether to check UTF8 validity of string columns.

column_types#

Explicitly map column names to column types.

decimal_point#

The character used as decimal point in floating-point and decimal data.

equals(self, ConvertOptions other)#
Parameters:
other pyarrow.csv.ConvertOptions
Returns:
bool
false_values#

A sequence of strings that denote false booleans in the data.

include_columns#

The names of columns to include in the Table.

If empty, the Table will include all columns from the CSV file. If not empty, only these columns will be included, in this order.

include_missing_columns#

If false, columns in include_columns but not in the CSV file will error out. If true, columns in include_columns but not in the CSV file will produce a null column (whose type is selected using column_types, or null by default). This option is ignored if include_columns is empty.

null_values#

A sequence of strings that denote nulls in the data.

quoted_strings_can_be_null#

Whether quoted values can be null.

strings_can_be_null#

Whether string / binary columns can have null values.

timestamp_parsers#

A sequence of strptime()-compatible format strings, tried in order when attempting to infer or convert timestamp values (the special value ISO8601() can also be given). By default, a fast built-in ISO-8601 parser is used.

true_values#

A sequence of strings that denote true booleans in the data.

validate(self)#