15. Working with Data Frames

Python-oracledb queries can fetch directly to data frames. This can improve performance and reduce memory requirements when your application uses Python data frame libraries such as Apache PyArrow, Pandas, Polars, NumPy, Dask, PyTorch, or writes files in Apache Parquet format. The OracleDataFrame objects fetched expose an Apache Arrow PyCapsule Interface which, in some cases, allows zero-copy data interchange with the data frame objects of other libraries.

Note

The data frame support in python-oracledb 3.3 is a pre-release and may change in a future version.

Fetching Data Frames

The method Connection.fetch_df_all() fetches all rows from a query. The method Connection.fetch_df_batches() implements an iterator for fetching batches of rows. The methods return OracleDataFrame objects.

For example, to fetch all rows from a query and print some information about the results:

sql="select * from departments"# Adjust arraysize to tune the query fetch performanceodf=connection.fetch_df_all(statement=sql,arraysize=100)print(odf.column_names())print(f"{odf.num_columns()} columns")print(f"{odf.num_rows()} rows")

With Oracle Database’s standard DEPARTMENTS table, this would display:

['DEPARTMENT_ID', 'DEPARTMENT_NAME', 'MANAGER_ID', 'LOCATION_ID']
4 columns
27 rows

To fetch in batches, use an iterator:

import pyarrow

sql = "select * from departments where department_id < 80"

# Adjust "size" to tune the query fetch performance
# Here it is small to show iteration
for odf in connection.fetch_df_batches(statement=sql, size=4):
    df = pyarrow.Table.from_arrays(
        odf.column_arrays(), names=odf.column_names()
    ).to_pandas()
    print(df)

With Oracle Database’s standard DEPARTMENTS table, this would display:

   DEPARTMENT_ID  DEPARTMENT_NAME  MANAGER_ID  LOCATION_ID
0             10   Administration         200         1700
1             20        Marketing         201         1800
2             30       Purchasing         114         1700
3             40  Human Resources         203         2400
   DEPARTMENT_ID   DEPARTMENT_NAME  MANAGER_ID  LOCATION_ID
0             50          Shipping         121         1500
1             60                IT         103         1400
2             70  Public Relations         204         2700

Converting to other data frame formats is shown later in this chapter.

Inserting OracleDataFrames into Oracle Database

To insert data currently in OracleDataFrame format into Oracle Database, it must first be converted. For example, you could convert it into a Pandas DataFrame for insert with the Pandas method to_sql(). Or convert it into a Python list via the PyArrow Table.to_pylist() method and then use standard python-oracledb functionality to execute a SQL INSERT statement, as sketched below.
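For illustration, below is a minimal sketch of the to_pylist() approach. The target table MY_COPY and its ID and NAME columns are hypothetical; adjust the statements to suit your own schema:

import pyarrow

# Fetch the source data as an OracleDataFrame
odf = connection.fetch_df_all("select id, name from mytable")

# Table.to_pylist() returns a list of dicts keyed by column name
data = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pylist()

# Rearrange into positional bind tuples and insert into the
# hypothetical table MY_COPY
rows = [(d["ID"], d["NAME"]) for d in data]
cursor = connection.cursor()
cursor.executemany("insert into my_copy (id, name) values (:1, :2)", rows)
connection.commit()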

15.1. Data Frame Type Mapping

Internally, python-oracledb’s OracleDataFrame support makes use of Apache nanoarrow libraries to build data frames.

The following data type mapping occurs from Oracle Database types to the Arrow types used in OracleDataFrame objects. Querying any other data types from Oracle Database will result in an exception. Output type handlers cannot be used to map data types.

Table 15.1 Mapping from Oracle Database to Arrow data types

Oracle Database Type       Arrow Data Type
DB_TYPE_BINARY_DOUBLE      DOUBLE
DB_TYPE_BINARY_FLOAT       FLOAT
DB_TYPE_BLOB               LARGE_BINARY
DB_TYPE_BOOLEAN            BOOLEAN
DB_TYPE_CHAR               STRING
DB_TYPE_CLOB               LARGE_STRING
DB_TYPE_DATE               TIMESTAMP
DB_TYPE_LONG               LARGE_STRING
DB_TYPE_LONG_RAW           LARGE_BINARY
DB_TYPE_NCHAR              STRING
DB_TYPE_NCLOB              LARGE_STRING
DB_TYPE_NUMBER             DECIMAL128, INT64, or DOUBLE
DB_TYPE_NVARCHAR           STRING
DB_TYPE_RAW                BINARY
DB_TYPE_TIMESTAMP          TIMESTAMP
DB_TYPE_TIMESTAMP_LTZ      TIMESTAMP
DB_TYPE_TIMESTAMP_TZ       TIMESTAMP
DB_TYPE_VARCHAR            STRING
DB_TYPE_VECTOR             List or struct with DOUBLE, FLOAT, INT8, or UINT8 values

Numbers

When converting Oracle Database NUMBERs:

  • If the column has been created without a precision and scale, then the Arrow data type will be DOUBLE.

  • If defaults.fetch_decimals is set to True, then the Arrow data type is DECIMAL128.

  • If the column has been created with a scale of 0, and a precision value that is less than or equal to 18, then the Arrow data type is INT64.

  • In all other cases, the Arrow data type is DOUBLE.
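For example, the following sketch shows the effect of defaults.fetch_decimals on the Arrow schema. The EMP_TAB table and its SALARY column are assumptions made for illustration:

import oracledb
import pyarrow

# Assumption: EMP_TAB is a hypothetical table whose SALARY column
# was created as NUMBER(8,2)
oracledb.defaults.fetch_decimals = True

odf = connection.fetch_df_all("select salary from emp_tab")
pyarrow_table = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
)

# With fetch_decimals enabled, SALARY maps to DECIMAL128 instead of DOUBLE
print(pyarrow_table.schema)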

Vectors

When converting Oracle Database VECTORs:

  • Dense vectors are fetched as lists.

  • Sparse vectors are fetched as structs with fields num_dimensions, indices, and values, similar to SparseVector objects.

  • Fixed and flexible dimensions are supported for dense VECTOR columns. For sparse VECTOR columns, the dimension of each vector must be the same.

  • VECTOR columns with flexible formats are not supported. Each vector value must have the same storage format data type.

  • Vector values are fetched as the following types:

    Oracle Database VECTOR format    Arrow data type
    FLOAT64                          DOUBLE
    FLOAT32                          FLOAT
    INT8                             INT8
    BINARY                           UINT8

See Using VECTOR data with Data Frames for more information.

LOBs

When converting Oracle Database CLOBs and BLOBs:

  • The LOBs must be no more than 1 GB in length.
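For example, fetching a CLOB column needs no special handling, as the following sketch shows. The DOCS table and its CONTENT column are hypothetical:

# Assumption: DOCS is a hypothetical table with a CLOB column CONTENT
# whose values are each under the 1 GB limit
odf = connection.fetch_df_all("select content from docs")

# CONTENT is represented using the Arrow LARGE_STRING type
print(odf.column_names())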

Dates and Timestamps

When converting Oracle Database DATEs and TIMESTAMPs:

  • Arrow TIMESTAMPs will not have timezone data.

  • For Oracle Database DATE columns, the Arrow TIMESTAMP will have a time unit of “seconds”.

  • For Oracle Database TIMESTAMP types, the Arrow TIMESTAMP time unit depends on the Oracle type’s fractional precision as shown in the table below:

    Oracle Database TIMESTAMP fractional second precision range    Arrow TIMESTAMP time unit
    0                                                               seconds
    1 - 3                                                           milliseconds
    4 - 6                                                           microseconds
    7 - 9                                                           nanoseconds

15.2. Converting OracleDataFrame to Other Data Frames

To use data frames in your chosen analysis library, OracleDataFrame objects can be converted. Examples for some libraries are shown in the following sections. Other libraries will have similar methods.

Conversion Overview

The guidelines for converting OracleDataFrame objects to data frames for other libraries are:

  • To convert to a PyArrow Table, use pyarrow.Table.from_arrays(), which leverages the Arrow PyCapsule interface.

  • To convert to a Pandas DataFrame, use pyarrow.Table.to_pandas().

  • If you want to use a library other than Pandas or PyArrow, use the library’s from_arrow() method to convert a PyArrow Table to the applicable data frame, if your library supports this. For example, with Polars use polars.from_arrow().

  • If your library does not support from_arrow(), then use from_dataframe() if the library supports it. This can be slower, depending on the implementation.

Overall, the general recommendation is to use Apache Arrow as much as possible but, if there are no options, then use from_dataframe(). You should test and benchmark to find the best option for your applications.
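For example, below is a minimal sketch of the from_dataframe() route to Pandas. It assumes pandas 1.5 or later, whose pandas.api.interchange.from_dataframe() function accepts objects supporting the DataFrame Interchange Protocol, and, as the guideline above implies, that OracleDataFrame can be consumed through this protocol:

import pandas

# Get an OracleDataFrame
odf = connection.fetch_df_all("select id, name from SampleQueryTab")

# Convert via the DataFrame Interchange Protocol; this may be slower
# than the Arrow-based conversions shown in the following sections
df = pandas.api.interchange.from_dataframe(odf)
print(df)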

15.2.1. Creating PyArrow Tables

An example that creates and uses a PyArrow Table is:

import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id, name from SampleQueryTab order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)

# Create a PyArrow table
pyarrow_table = pyarrow.Table.from_arrays(
    arrays=odf.column_arrays(), names=odf.column_names()
)

print("\nNumber of rows and columns:")
(r, c) = pyarrow_table.shape
print(f"{r} rows, {c} columns")

This makes use of OracleDataFrame.column_arrays(), which returns a list of OracleArrowArray objects.

Internally, pyarrow.Table.from_arrays() leverages the Apache Arrow PyCapsule interface that OracleDataFrame exposes.

See samples/dataframe_pyarrow.py for a runnable example.

15.2.2. Creating Pandas DataFrames

An example that creates and uses a Pandas DataFrame is:

import pandas
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select * from mytable where id = :1"
myid = 12345  # the bind variable value
odf = connection.fetch_df_all(statement=sql, parameters=[myid], arraysize=1000)

# Get a Pandas DataFrame from the data
df = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas()

# Perform various Pandas operations on the DataFrame
print(df.T)        # transpose
print(df.tail(3))  # last three rows

The to_pandas() method supports arguments like types_mapper=pandas.ArrowDtype and deduplicate_objects=False, which may be useful for some data sets.
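For example, the following sketch uses types_mapper=pandas.ArrowDtype to keep the resulting columns as Arrow-backed Pandas dtypes instead of converting them to NumPy-backed dtypes:

import pandas
import pyarrow

odf = connection.fetch_df_all("select * from mytable")

# Keep Arrow-backed dtypes rather than converting to NumPy-backed ones
df = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas(types_mapper=pandas.ArrowDtype)

print(df.dtypes)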

See samples/dataframe_pandas.py for a runnable example.

15.2.3. Creating Polars DataFrames

An example that creates and uses a Polars DataFrame is:

import polars
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id from SampleQueryTab order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)

# Convert to a Polars DataFrame
pyarrow_table = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
)
df = polars.from_arrow(pyarrow_table)

# Perform various Polars operations on the DataFrame
r, c = df.shape
print(f"{r} rows, {c} columns")
print(df.sum())

See samples/dataframe_polars.py for a runnable example.

15.2.4. Writing Apache Parquet Files

To write output in Apache Parquet file format, you can use data frames as an efficient intermediary. Use the Connection.fetch_df_batches() iterator and convert to a PyArrow Table that can be written by the PyArrow library.

import pyarrow
import pyarrow.parquet as pq

FILE_NAME = "sample.parquet"

# Tune the fetch batch size for your query
BATCH_SIZE = 10000

sql = "select * from mytable"

pqwriter = None
for odf in connection.fetch_df_batches(statement=sql, size=BATCH_SIZE):

    # Get a PyArrow table from the query results
    pyarrow_table = pyarrow.Table.from_arrays(
        arrays=odf.column_arrays(), names=odf.column_names()
    )

    if not pqwriter:
        pqwriter = pq.ParquetWriter(FILE_NAME, pyarrow_table.schema)

    pqwriter.write_table(pyarrow_table)

pqwriter.close()

See samples/dataframe_parquet_write.py for a runnable example.

15.2.5. The DLPack Protocol

The DataFrame format facilitates working with query results as tensors. Conversion can be done using the standard DLPack Protocol implemented by PyArrow.

Using NumPy Arrays

For example, to convert to NumPy ndarray format:

import numpy
import pyarrow

SQL = "select id from SampleQueryTab order by id"

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)

# Convert to an ndarray via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
np = numpy.from_dlpack(pyarrow_array)

# Perform various numpy operations on the ndarray
print(numpy.sum(np))
print(numpy.log10(np))

See samples/dataframe_numpy.py for a runnable example.

15.2.6. Using Torch

An example of working with data as a Torch tensor is:

import pyarrow
import torch

SQL = "select id from SampleQueryTab order by id"

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)

# Convert to a Torch tensor via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
tt = torch.from_dlpack(pyarrow_array)

# Perform various Torch operations on the tensor
print(torch.sum(tt))
print(torch.log10(tt))

See samples/dataframe_torch.py for a runnable example.

15.3. Using VECTOR data with Data Frames

Columns of the VECTOR data type can be fetched with the methods Connection.fetch_df_all() and Connection.fetch_df_batches(). VECTOR columns can have flexible dimensions, but flexible storage formats are not supported: each vector value must have the same format data type. Vectors can be dense or sparse.

See Data Frame Type Mapping for the type mapping for VECTORs.

Dense Vectors

By default, Oracle Database vectors are “dense”. These are fetched in python-oracledb as Arrow lists. For example, if the table:

create table myvec (v64 vector(3, float64));

contains these two vectors:

[4.1, 5.2, 6.3]
[7.1, 8.2, 9.3]

then the code:

import pyarrow

odf = connection.fetch_df_all("select v64 from myvec")
pyarrow_table = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
)

will result in a PyArrow table containing lists of doubles. The table can be converted to a data frame of your chosen library using functionality described earlier in this chapter. For example, to convert to Pandas:

pdf = pyarrow_table.to_pandas()
print(pdf)

The output will be:

               V64
0  [4.1, 5.2, 6.3]
1  [7.1, 8.2, 9.3]

Sparse Vectors

Sparse vectors (where many of the values are 0) are fetched as structs with fields num_dimensions, indices, and values, similar to SparseVector objects, which are discussed in a non-data frame context in Using SPARSE Vectors.

If the table:

create table myvec (v64 vector(3, float64, sparse));

contains these two vectors:

[3, [1, 2], [4.1, 5.2]]
[3, [0], [9.3]]

then the code to fetch as data frames:

import pyarrow

odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas()

print(pdf)

print("First row:")

num_dimensions = pdf.iloc[0].V64['num_dimensions']
print(f"num_dimensions={num_dimensions}")

indices = pdf.iloc[0].V64['indices']
print(f"indices={indices}")

values = pdf.iloc[0].V64['values']
print(f"values={values}")

will display:

                                                 V64
0  {'num_dimensions': 3, 'indices': [1, 2], 'valu...
1  {'num_dimensions': 3, 'indices': [0], 'values'...
First row:
num_dimensions=3
indices=[1 2]
values=[4.1 5.2]

You can convert each struct as needed. One way to convert into Pandas dataframes with sparse values is via a SciPy coordinate format matrix. The Pandas method from_spmatrix() can then be used to create the final sparse dataframe:

import numpy
import pandas
import pyarrow
import scipy.sparse

def convert_to_sparse_array(val):
    dimensions = val["num_dimensions"]
    col_indices = val["indices"]
    row_indices = numpy.zeros(len(col_indices))
    values = val["values"]
    sparse_matrix = scipy.sparse.coo_matrix(
        (values, (col_indices, row_indices)), shape=(dimensions, 1))
    return pandas.arrays.SparseArray.from_spmatrix(sparse_matrix)

odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.Table.from_arrays(
    odf.column_arrays(), odf.column_names()).to_pandas()

pdf["SPARSE_ARRAY_V64"] = pdf["V64"].apply(convert_to_sparse_array)

print(pdf.SPARSE_ARRAY_V64)

The code will print:

0    [0.0, 4.1, 5.2]\nFill: 0.0\nIntIndex\nIndices: ar...
1    [9.3, 0.0, 0.0]\nFill: 0.0\nIntIndex\nIndices: ar...
Name: SPARSE_ARRAY_V64, dtype: object