pyarrow.array
- pyarrow.array(obj, type=None, mask=None, size=None, from_pandas=None, bool safe=True, MemoryPool memory_pool=None)
Create pyarrow.Array instance from a Python object.
- Parameters:
- obj sequence, iterable, ndarray, pandas.Series, Arrow-compatible array
  If both type and size are specified, may be a single-use iterable. If not strongly-typed, the Arrow type will be inferred for the resulting array. Any Arrow-compatible array that implements the Arrow PyCapsule Protocol (has an __arrow_c_array__ or __arrow_c_device_array__ method) can be passed as well.
- type pyarrow.DataType
  Explicit type to attempt to coerce to; otherwise it will be inferred from the data.
- mask array[bool], optional
  Indicate which values are null (True) or not null (False).
- size int64, optional
  Number of elements to read from the input. If the input is longer than size, conversion stops at this length. For iterators, if size is larger than the input iterator, it is treated as a “max size”, but this involves an initial allocation of size followed by a resize to the actual size (so if you know the exact size, specifying it correctly will give you better performance).
- from_pandas bool, default None
  Use pandas’s semantics for inferring nulls from values in ndarray-like data. If a mask is passed, the mask takes precedence, but a value that is unmasked (not-null) and still null according to pandas semantics is treated as null. Defaults to False if not passed explicitly by the user, or True if a pandas object is passed in.
- safe bool, default True
  Check for overflows or other unsafe conversions.
- memory_pool pyarrow.MemoryPool, optional
  If not passed, will allocate memory from the currently-set default memory pool.
- Returns:
- array pyarrow.Array or pyarrow.ChunkedArray
  A ChunkedArray instead of an Array is returned if:
  - the object data overflowed binary storage.
  - the object’s __arrow_array__ protocol method returned a chunked array.
Notes
Timezone will be preserved in the returned array for timezone-aware data, else no timezone will be returned for naive timestamps. Internally, UTC values are stored for timezone-aware data with the timezone set in the data type.
Pandas’s DateOffsets and dateutil.relativedelta.relativedelta are by default converted as MonthDayNanoIntervalArray. relativedelta leapdays are ignored, as are all absolute fields on both objects. datetime.timedelta can also be converted to MonthDayNanoIntervalArray, but this requires passing MonthDayNanoIntervalType explicitly.
Converting to dictionary array will promote to a wider integer type for indices if the number of distinct values cannot be represented, even if the index type was explicitly set. This means that if there are more than 127 values the returned dictionary array’s index type will be at least pa.int16() even if pa.int8() was passed to the function. Note that an explicit index type will not be demoted even if it is wider than required.
Examples
>>> import pandas as pd
>>> import pyarrow as pa
>>> pa.array(pd.Series([1, 2]))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  2
]

>>> pa.array(["a", "b", "a"], type=pa.dictionary(pa.int8(), pa.string()))
<pyarrow.lib.DictionaryArray object at ...>
...
-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

>>> import numpy as np
>>> pa.array(pd.Series([1, 2]), mask=np.array([0, 1], dtype=bool))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  null
]

>>> arr = pa.array(range(1024), type=pa.dictionary(pa.int8(), pa.int64()))
>>> arr.type.index_type
DataType(int16)