15. Working with Data Frames
Python-oracledb queries can fetch directly to data frames. This can improve performance and reduce memory requirements when your application uses Python data frame libraries such as Apache PyArrow, Pandas, Polars, NumPy, Dask, or PyTorch, or writes files in Apache Parquet format. The OracleDataFrame objects fetched expose an Apache Arrow PyCapsule Interface which, in some cases, allows zero-copy data interchange with the data frame objects of other libraries.
Note
The data frame support in python-oracledb 3.3 is a pre-release and may change in a future version.
Fetching Data Frames
The method Connection.fetch_df_all() fetches all rows from a query. The method Connection.fetch_df_batches() implements an iterator for fetching batches of rows. The methods return OracleDataFrame objects.
For example, to fetch all rows from a query and print some information about the results:
```python
sql = "select * from departments"

# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=sql, arraysize=100)

print(odf.column_names())
print(f"{odf.num_columns()} columns")
print(f"{odf.num_rows()} rows")
```
With Oracle Database’s standard DEPARTMENTS table, this would display:
```
['DEPARTMENT_ID', 'DEPARTMENT_NAME', 'MANAGER_ID', 'LOCATION_ID']
4 columns
27 rows
```
To fetch in batches, use an iterator:
```python
import pyarrow

sql = "select * from departments where department_id < 80"

# Adjust "size" to tune the query fetch performance
# Here it is small to show iteration
for odf in connection.fetch_df_batches(statement=sql, size=4):
    df = pyarrow.Table.from_arrays(
        odf.column_arrays(), names=odf.column_names()
    ).to_pandas()
    print(df)
```
With Oracle Database’s standard DEPARTMENTS table, this would display:
```
   DEPARTMENT_ID  DEPARTMENT_NAME  MANAGER_ID  LOCATION_ID
0             10   Administration         200         1700
1             20        Marketing         201         1800
2             30       Purchasing         114         1700
3             40  Human Resources         203         2400
   DEPARTMENT_ID   DEPARTMENT_NAME  MANAGER_ID  LOCATION_ID
0             50          Shipping         121         1500
1             60                IT         103         1400
2             70  Public Relations         204         2700
```
Converting to other data frame formats is shown later in this chapter.
Inserting OracleDataFrames into Oracle Database
Data currently in OracleDataFrame format must be converted before it can be inserted into Oracle Database. For example, you could convert it into a Pandas DataFrame for insert with the Pandas method to_sql(). Or convert it into a Python list via the PyArrow Table.to_pylist() method and then use standard python-oracledb functionality to execute a SQL INSERT statement.
15.1. Data Frame Type Mapping
Internally, python-oracledb’s OracleDataFrame support makes use of Apache nanoarrow libraries to build data frames.
The following data type mapping occurs from Oracle Database types to the Arrow types used in OracleDataFrame objects. Querying any other data types from Oracle Database will result in an exception. Output type handlers cannot be used to map data types.
| Oracle Database Type | Arrow Data Type |
|---|---|
| DB_TYPE_BINARY_DOUBLE | DOUBLE |
| DB_TYPE_BINARY_FLOAT | FLOAT |
| DB_TYPE_BLOB | LARGE_BINARY |
| DB_TYPE_BOOLEAN | BOOLEAN |
| DB_TYPE_CHAR | STRING |
| DB_TYPE_CLOB | LARGE_STRING |
| DB_TYPE_DATE | TIMESTAMP |
| DB_TYPE_LONG | LARGE_STRING |
| DB_TYPE_LONG_RAW | LARGE_BINARY |
| DB_TYPE_NCHAR | STRING |
| DB_TYPE_NCLOB | LARGE_STRING |
| DB_TYPE_NUMBER | DECIMAL128, INT64, or DOUBLE |
| DB_TYPE_NVARCHAR | STRING |
| DB_TYPE_RAW | BINARY |
| DB_TYPE_TIMESTAMP | TIMESTAMP |
| DB_TYPE_TIMESTAMP_LTZ | TIMESTAMP |
| DB_TYPE_TIMESTAMP_TZ | TIMESTAMP |
| DB_TYPE_VARCHAR | STRING |
| DB_TYPE_VECTOR | List or struct with DOUBLE, FLOAT, INT8, or UINT8 values |
Numbers
When converting Oracle Database NUMBERs:
- If the column has been created without a precision and scale, then the Arrow data type will be DOUBLE.
- If defaults.fetch_decimals is set to True, then the Arrow data type is DECIMAL128.
- If the column has been created with a scale of 0, and a precision value that is less than or equal to 18, then the Arrow data type is INT64.
- In all other cases, the Arrow data type is DOUBLE.
Vectors
When converting Oracle Database VECTORs:
- Dense vectors are fetched as lists.
- Sparse vectors are fetched as structs with fields num_dimensions, indices, and values, similar to SparseVector objects.
- Fixed and flexible dimensions are supported for dense VECTOR columns. For sparse VECTOR columns, the dimension of each vector must be the same.
- VECTOR columns with flexible formats are not supported. Each vector value must have the same storage format data type.
Vector values are fetched as the following types:
| Oracle Database VECTOR format | Arrow data type |
|---|---|
| FLOAT64 | DOUBLE |
| FLOAT32 | FLOAT |
| INT8 | INT8 |
| BINARY | UINT8 |
See Using VECTOR data with Data Frames for more information.
LOBs
When converting Oracle Database CLOBs and BLOBs:
The LOBs must be no more than 1 GB in length.
Dates and Timestamps
When converting Oracle Database DATEs and TIMESTAMPs:
- Arrow TIMESTAMPs will not have timezone data.
- For Oracle Database DATE columns, the Arrow TIMESTAMP will have a time unit of “seconds”.
- For Oracle Database TIMESTAMP types, the Arrow TIMESTAMP time unit depends on the Oracle type’s fractional precision, as shown in the table below:
| Oracle Database TIMESTAMP fractional second precision range | Arrow TIMESTAMP time unit |
|---|---|
| 0 | seconds |
| 1 - 3 | milliseconds |
| 4 - 6 | microseconds |
| 7 - 9 | nanoseconds |
15.2. Converting OracleDataFrame to Other Data Frames
To use data frames in your chosen analysis library, OracleDataFrame objects can be converted. Examples for some libraries are shown in the following sections. Other libraries will have similar methods.
Conversion Overview
The guidelines for converting OracleDataFrame objects to data frames for other libraries are:

- To convert to a PyArrow Table, use pyarrow.Table.from_arrays(), which leverages the Arrow PyCapsule interface.
- To convert to a Pandas DataFrame, use pyarrow.Table.to_pandas().
- If you want to use a library other than Pandas or PyArrow, use the library’s from_arrow() method to convert a PyArrow Table to the applicable data frame, if your library supports this. For example, with Polars use polars.from_arrow().
- If your library does not support from_arrow(), then use from_dataframe() if the library supports it. This can be slower, depending on the implementation.

Overall, the general recommendation is to use Apache Arrow as much as possible, but if there are no options then use from_dataframe(). You should test and benchmark to find the best option for your applications.
15.2.1. Creating PyArrow Tables
An example that creates and uses a PyArrow Table is:
```python
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id, name from SampleQueryTab order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)

# Create a PyArrow table
pyarrow_table = pyarrow.Table.from_arrays(
    arrays=odf.column_arrays(), names=odf.column_names()
)

print("\nNumber of rows and columns:")
(r, c) = pyarrow_table.shape
print(f"{r} rows, {c} columns")
```
This makes use of OracleDataFrame.column_arrays(), which returns a list of OracleArrowArray objects.

Internally, pyarrow.Table.from_arrays() leverages the Apache Arrow PyCapsule interface that OracleDataFrame exposes.
See samples/dataframe_pyarrow.py for a runnable example.
15.2.2. Creating Pandas DataFrames
An example that creates and uses a Pandas DataFrame is:
```python
import pandas
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select * from mytable where id = :1"
myid = 12345  # the bind variable value
odf = connection.fetch_df_all(statement=sql, parameters=[myid], arraysize=1000)

# Get a Pandas DataFrame from the data
df = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas()

# Perform various Pandas operations on the DataFrame
print(df.T)        # transpose
print(df.tail(3))  # last three rows
```
The to_pandas() method supports arguments like types_mapper=pandas.ArrowDtype and deduplicate_objects=False, which may be useful for some data sets.
See samples/dataframe_pandas.py for a runnable example.
15.2.3. Creating Polars DataFrames
An example that creates and uses a Polars DataFrame is:
```python
import polars
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id from SampleQueryTab order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)

# Convert to a Polars DataFrame
pyarrow_table = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
)
df = polars.from_arrow(pyarrow_table)

# Perform various Polars operations on the DataFrame
r, c = df.shape
print(f"{r} rows, {c} columns")
print(df.sum())
```
See samples/dataframe_polars.py for a runnable example.
15.2.4. Writing Apache Parquet Files
To write output in Apache Parquet file format, you can use data frames as an efficient intermediary. Use the Connection.fetch_df_batches() iterator and convert to a PyArrow Table that can be written by the PyArrow library.
```python
import pyarrow
import pyarrow.parquet as pq

FILE_NAME = "sample.parquet"

# Tune the fetch batch size for your query
BATCH_SIZE = 10000

sql = "select * from mytable"

pqwriter = None
for odf in connection.fetch_df_batches(statement=sql, size=BATCH_SIZE):

    # Get a PyArrow table from the query results
    pyarrow_table = pyarrow.Table.from_arrays(
        arrays=odf.column_arrays(), names=odf.column_names()
    )

    if not pqwriter:
        pqwriter = pq.ParquetWriter(FILE_NAME, pyarrow_table.schema)

    pqwriter.write_table(pyarrow_table)

pqwriter.close()
```
See samples/dataframe_parquet_write.py for a runnable example.
15.2.5. The DLPack Protocol
The DataFrame format facilitates working with query results as tensors. Conversion can be done using the standard DLPack Protocol implemented by PyArrow.
Using NumPy Arrays
For example, to convert to NumPy ndarray format:
```python
import numpy
import pyarrow

SQL = "select id from SampleQueryTab order by id"

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)

# Convert to an ndarray via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
np = numpy.from_dlpack(pyarrow_array)

# Perform various numpy operations on the ndarray
print(numpy.sum(np))
print(numpy.log10(np))
```
See samples/dataframe_numpy.py for a runnable example.
15.2.6. Using Torch
An example of working with data as a Torch tensor is:
```python
import pyarrow
import torch

SQL = "select id from SampleQueryTab order by id"

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)

# Convert to a Torch tensor via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
tt = torch.from_dlpack(pyarrow_array)

# Perform various Torch operations on the tensor
print(torch.sum(tt))
print(torch.log10(tt))
```
See samples/dataframe_torch.py for a runnable example.
15.3. Using VECTOR data with Data Frames
Columns of the VECTOR data type can be fetched with the methods Connection.fetch_df_all() and Connection.fetch_df_batches(). VECTOR columns can have flexible dimensions, but flexible storage formats are not supported: each vector value must have the same format data type. Vectors can be dense or sparse.
See Data Frame Type Mapping for the type mapping for VECTORs.
Dense Vectors
By default, Oracle Database vectors are “dense”. These are fetched in python-oracledb as Arrow lists. For example, if the table:
```sql
create table myvec (v64 vector(3, float64));
```
contains these two vectors:
```
[4.1, 5.2, 6.3]
[7.1, 8.2, 9.3]
```
then the code:
```python
odf = connection.fetch_df_all("select v64 from myvec")
pyarrow_table = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
)
```
will result in a PyArrow table containing lists of doubles. The table can be converted to a data frame of your chosen library using functionality described earlier in this chapter. For example, to convert to Pandas:
```python
pdf = pyarrow_table.to_pandas()
print(pdf)
```
The output will be:
```
               V64
0  [4.1, 5.2, 6.3]
1  [7.1, 8.2, 9.3]
```
Sparse Vectors
Sparse vectors (where many of the values are 0) are fetched as structs with fields num_dimensions, indices, and values, similar to SparseVector objects, which are discussed in a non-data frame context in Using SPARSE Vectors.
If the table:
```sql
create table myvec (v64 vector(3, float64, sparse));
```
contains these two vectors:
```
[3, [1,2], [4.1, 5.2]]
[3, [0], [9.3]]
```
then the code to fetch as data frames:
```python
import pyarrow

odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas()

print(pdf)

print("First row:")

num_dimensions = pdf.iloc[0].V64['num_dimensions']
print(f"num_dimensions={num_dimensions}")

indices = pdf.iloc[0].V64['indices']
print(f"indices={indices}")

values = pdf.iloc[0].V64['values']
print(f"values={values}")
```
will display:
```
                                                 V64
0  {'num_dimensions': 3, 'indices': [1, 2], 'valu...
1  {'num_dimensions': 3, 'indices': [0], 'values'...
First row:
num_dimensions=3
indices=[1 2]
values=[4.1 5.2]
```
You can convert each struct as needed. One way to convert into Pandas dataframes with sparse values is via a SciPy coordinate format matrix. The Pandas method from_spmatrix() can then be used to create the final sparse dataframe:
```python
import numpy
import pandas
import pyarrow
import scipy

def convert_to_sparse_array(val):
    dimensions = val["num_dimensions"]
    col_indices = val["indices"]
    row_indices = numpy.zeros(len(col_indices))
    values = val["values"]
    sparse_matrix = scipy.sparse.coo_matrix(
        (values, (col_indices, row_indices)), shape=(dimensions, 1)
    )
    return pandas.arrays.SparseArray.from_spmatrix(sparse_matrix)

odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.Table.from_arrays(
    odf.column_arrays(), odf.column_names()
).to_pandas()

pdf["SPARSE_ARRAY_V64"] = pdf["V64"].apply(convert_to_sparse_array)

print(pdf.SPARSE_ARRAY_V64)
```
The code will print:
```
0    [0.0, 4.1, 5.2]Fill: 0.0IntIndexIndices: ar...
1    [9.3, 0.0, 0.0]Fill: 0.0IntIndexIndices: ar...
Name: SPARSE_ARRAY_V64, dtype: object
```