pyarrow.array

pyarrow.array(obj, type=None, mask=None, size=None, from_pandas=None, bool safe=True, MemoryPool memory_pool=None)

Create pyarrow.Array instance from a Python object.

Parameters:

obj (sequence, iterable, ndarray, pandas.Series, Arrow-compatible array): If both type and size are specified, obj may be a single-use iterable. If not strongly typed, the Arrow type will be inferred for the resulting array. Any Arrow-compatible array that implements the Arrow PyCapsule Protocol (has an __arrow_c_array__ or __arrow_c_device_array__ method) can be passed as well.
type (pyarrow.DataType): Explicit type to attempt to coerce to; otherwise it will be inferred from the data.
mask (array[bool], optional): Indicates which values are null (True) or not null (False).
size (int64, optional): Size of the elements. If the input is larger than size, conversion stops at this length. For iterators, if size is larger than the input iterator, it is treated as a “max size”, but this involves an initial allocation of size followed by a resize to the actual size (so if you know the exact size, specifying it correctly will give you better performance).
from_pandas (bool, default None): Use pandas’s semantics for inferring nulls from values in ndarray-like data. If passed, the mask takes precedence, but if a value is unmasked (not null) and still null according to pandas semantics, then it is null. Defaults to False if not passed explicitly by the user, or True if a pandas object is passed in.
safe (bool, default True): Check for overflows or other unsafe conversions.
memory_pool (pyarrow.MemoryPool, optional): If not passed, memory will be allocated from the currently set default memory pool.

Returns:

array (pyarrow.Array or pyarrow.ChunkedArray)

A ChunkedArray instead of an Array is returned if:

the object data overflowed binary storage.
the object’s __arrow_array__ protocol method returned a chunked array.

Notes

Timezone will be preserved in the returned array for timezone-aware data, else no timezone will be returned for naive timestamps. Internally, UTC values are stored for timezone-aware data with the timezone set in the data type.

Pandas’s DateOffsets and dateutil.relativedelta.relativedelta are by default converted as MonthDayNanoIntervalArray. relativedelta leapdays are ignored as are all absolute fields on both objects. datetime.timedelta can also be converted to MonthDayNanoIntervalArray but this requires passing MonthDayNanoIntervalType explicitly.

Converting to dictionary array will promote to a wider integer type for indices if the number of distinct values cannot be represented, even if the index type was explicitly set. This means that if there are more than 127 values the returned dictionary array’s index type will be at least pa.int16() even if pa.int8() was passed to the function. Note that an explicit index type will not be demoted even if it is wider than required.

Examples

>>> import pandas as pd
>>> import pyarrow as pa
>>> pa.array(pd.Series([1, 2]))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  2
]

>>> pa.array(["a", "b", "a"], type=pa.dictionary(pa.int8(), pa.string()))
<pyarrow.lib.DictionaryArray object at ...>
...
-- dictionary:
  [
    "a",
    "b"
  ]
-- indices:
  [
    0,
    1,
    0
  ]

>>> import numpy as np
>>> pa.array(pd.Series([1, 2]), mask=np.array([0, 1], dtype=bool))
<pyarrow.lib.Int64Array object at ...>
[
  1,
  null
]

>>> arr = pa.array(range(1024), type=pa.dictionary(pa.int8(), pa.int64()))
>>> arr.type.index_type
DataType(int16)
