15. Working with Data Frames
Python-oracledb queries can fetch directly to data frames. This can improve performance and reduce memory requirements when your application uses Python data frame libraries such as Apache PyArrow, Pandas, Polars, NumPy, Dask, or PyTorch, or writes files in Apache Parquet format. The OracleDataFrame objects fetched expose an Apache Arrow PyCapsule Interface which, in some cases, allows zero-copy data interchange with the data frame objects of other libraries.
Note
The data frame support in python-oracledb 3.3 is a pre-release and may change in a future version.
Fetching Data Frames
The method Connection.fetch_df_all() fetches all rows from a query. The method Connection.fetch_df_batches() implements an iterator for fetching batches of rows. The methods return OracleDataFrame objects.
For example, to fetch all rows from a query and print some information about the results:
```python
sql = "select * from departments"

# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=sql, arraysize=100)

print(odf.column_names())
print(f"{odf.num_columns()} columns")
print(f"{odf.num_rows()} rows")
```
With Oracle Database’s standard DEPARTMENTS table, this would display:
```
['DEPARTMENT_ID', 'DEPARTMENT_NAME', 'MANAGER_ID', 'LOCATION_ID']
4 columns
27 rows
```
To fetch in batches, use an iterator:
```python
import pyarrow

sql = "select * from departments where department_id < 80"

# Adjust "size" to tune the query fetch performance
# Here it is small to show iteration
for odf in connection.fetch_df_batches(statement=sql, size=4):
    df = pyarrow.Table.from_arrays(
        odf.column_arrays(), names=odf.column_names()
    ).to_pandas()
    print(df)
```
With Oracle Database’s standard DEPARTMENTS table, this would display:
```
   DEPARTMENT_ID  DEPARTMENT_NAME  MANAGER_ID  LOCATION_ID
0             10   Administration         200         1700
1             20        Marketing         201         1800
2             30       Purchasing         114         1700
3             40  Human Resources         203         2400
   DEPARTMENT_ID   DEPARTMENT_NAME  MANAGER_ID  LOCATION_ID
0             50          Shipping         121         1500
1             60                IT         103         1400
2             70  Public Relations         204         2700
```
Converting to other data frame formats is shown later in this chapter.
Inserting OracleDataFrames into Oracle Database
Data currently in OracleDataFrame format must be converted before it can be inserted into Oracle Database. For example, you could convert it into a Pandas DataFrame for insert with the Pandas method to_sql(). Or convert it into a Python list via the PyArrow Table.to_pylist() method and then use standard python-oracledb functionality to execute a SQL INSERT statement.
15.1. Data Frame Type Mapping
Internally, python-oracledb’s OracleDataFrame support makes use of Apache nanoarrow libraries to build data frames.
The following data type mapping occurs from Oracle Database types to the Arrow types used in OracleDataFrame objects. Querying any other data types from Oracle Database will result in an exception. Output type handlers cannot be used to map data types.
| Oracle Database Type | Arrow Data Type |
|---|---|
| DB_TYPE_BINARY_DOUBLE | DOUBLE |
| DB_TYPE_BINARY_FLOAT | FLOAT |
| DB_TYPE_BLOB | LARGE_BINARY |
| DB_TYPE_BOOLEAN | BOOLEAN |
| DB_TYPE_CHAR | STRING |
| DB_TYPE_CLOB | LARGE_STRING |
| DB_TYPE_DATE | TIMESTAMP |
| DB_TYPE_LONG | LARGE_STRING |
| DB_TYPE_LONG_RAW | LARGE_BINARY |
| DB_TYPE_NCHAR | STRING |
| DB_TYPE_NCLOB | LARGE_STRING |
| DB_TYPE_NUMBER | DECIMAL128, INT64, or DOUBLE |
| DB_TYPE_NVARCHAR | STRING |
| DB_TYPE_RAW | BINARY |
| DB_TYPE_TIMESTAMP | TIMESTAMP |
| DB_TYPE_TIMESTAMP_LTZ | TIMESTAMP |
| DB_TYPE_TIMESTAMP_TZ | TIMESTAMP |
| DB_TYPE_VARCHAR | STRING |
| DB_TYPE_VECTOR | List or struct with DOUBLE, FLOAT, INT8, or UINT8 values |
Numbers
When converting Oracle Database NUMBERs:
- If the column has been created without a precision and scale, then the Arrow data type will be DOUBLE.
- If defaults.fetch_decimals is set to True, then the Arrow data type is DECIMAL128.
- If the column has been created with a scale of 0, and a precision value that is less than or equal to 18, then the Arrow data type is INT64.
- In all other cases, the Arrow data type is DOUBLE.
Vectors
When converting Oracle Database VECTORs:
- Dense vectors are fetched as lists.
- Sparse vectors are fetched as structs with fields num_dimensions, indices, and values, similar to SparseVector objects.
- Fixed and flexible dimensions are supported for dense VECTOR columns. For sparse VECTOR columns, the dimension of each vector must be the same.
- VECTOR columns with flexible formats are not supported. Each vector value must have the same storage format data type.
Vector values are fetched as the following types:
| Oracle Database VECTOR format | Arrow data type |
|---|---|
| FLOAT64 | DOUBLE |
| FLOAT32 | FLOAT |
| INT8 | INT8 |
| BINARY | UINT8 |
See Using VECTOR data with Data Frames for more information.
LOBs
When converting Oracle Database CLOBs and BLOBs:
The LOBs must be no more than 1 GB in length.
Dates and Timestamps
When converting Oracle Database DATEs and TIMESTAMPs:
- Arrow TIMESTAMPs will not have timezone data.
- For Oracle Database DATE columns, the Arrow TIMESTAMP will have a time unit of “seconds”.
- For Oracle Database TIMESTAMP types, the Arrow TIMESTAMP time unit depends on the Oracle type’s fractional precision, as shown in the table below:
| Oracle Database TIMESTAMP fractional second precision range | Arrow TIMESTAMP time unit |
|---|---|
| 0 | seconds |
| 1 - 3 | milliseconds |
| 4 - 6 | microseconds |
| 7 - 9 | nanoseconds |
15.2. Converting OracleDataFrame to Other Data Frames
To use data frames in your chosen analysis library, OracleDataFrame objects can be converted. Examples for some libraries are shown in the following sections. Other libraries will have similar methods.
Conversion Overview
The guidelines for converting OracleDataFrame objects to data frames for other libraries are:

- To convert to a PyArrow Table, use pyarrow.Table.from_arrays(), which leverages the Arrow PyCapsule interface.
- To convert to a Pandas DataFrame, use pyarrow.Table.to_pandas().
- If you want to use a library other than Pandas or PyArrow, use the library’s from_arrow() method to convert a PyArrow Table to the applicable data frame, if your library supports this. For example, with Polars use polars.from_arrow().
- If your library does not support from_arrow(), then use from_dataframe() if the library supports it. This can be slower, depending on the implementation.

Overall, the general recommendation is to use Apache Arrow as much as possible, but if there are no options then use from_dataframe(). You should test and benchmark to find the best option for your applications.
15.2.1. Creating PyArrow Tables
An example that creates and uses a PyArrow Table is:
```python
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id, name from SampleQueryTab order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)

# Create a PyArrow table
pyarrow_table = pyarrow.Table.from_arrays(
    arrays=odf.column_arrays(), names=odf.column_names()
)

print("\nNumber of rows and columns:")
(r, c) = pyarrow_table.shape
print(f"{r} rows, {c} columns")
```
This makes use of OracleDataFrame.column_arrays(), which returns a list of OracleArrowArray objects.

Internally, pyarrow.Table.from_arrays() leverages the Apache Arrow PyCapsule interface that OracleDataFrame exposes.
See samples/dataframe_pyarrow.py for a runnable example.
15.2.2. Creating Pandas DataFrames
An example that creates and uses a Pandas DataFrame is:
```python
import pandas
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select * from mytable where id = :1"
myid = 12345  # the bind variable value
odf = connection.fetch_df_all(statement=sql, parameters=[myid], arraysize=1000)

# Get a Pandas DataFrame from the data
df = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas()

# Perform various Pandas operations on the DataFrame
print(df.T)        # transpose
print(df.tail(3))  # last three rows
```
The to_pandas() method supports arguments like types_mapper=pandas.ArrowDtype and deduplicate_objects=False, which may be useful for some data sets.
See samples/dataframe_pandas.py for a runnable example.
15.2.3. Creating Polars DataFrames
An example that creates and uses a Polars DataFrame is:
```python
import polars
import pyarrow

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
sql = "select id from SampleQueryTab order by id"
odf = connection.fetch_df_all(statement=sql, arraysize=100)

# Convert to a Polars DataFrame
pyarrow_table = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
)
df = polars.from_arrow(pyarrow_table)

# Perform various Polars operations on the DataFrame
r, c = df.shape
print(f"{r} rows, {c} columns")
print(df.sum())
```
See samples/dataframe_polars.py for a runnable example.
15.2.4. Writing Apache Parquet Files
To write output in Apache Parquet file format, you can use data frames as an efficient intermediary. Use the Connection.fetch_df_batches() iterator and convert to a PyArrow Table that can be written by the PyArrow library.
```python
import pyarrow
import pyarrow.parquet as pq

FILE_NAME = "sample.parquet"

# Tune the fetch batch size for your query
BATCH_SIZE = 10000

sql = "select * from mytable"

pqwriter = None
for odf in connection.fetch_df_batches(statement=sql, size=BATCH_SIZE):

    # Get a PyArrow table from the query results
    pyarrow_table = pyarrow.Table.from_arrays(
        arrays=odf.column_arrays(), names=odf.column_names()
    )

    if not pqwriter:
        pqwriter = pq.ParquetWriter(FILE_NAME, pyarrow_table.schema)

    pqwriter.write_table(pyarrow_table)

pqwriter.close()
```
See samples/dataframe_parquet_write.py for a runnable example.
15.2.5. The DLPack Protocol
The DataFrame format facilitates working with query results as tensors. Conversion can be done using the standard DLPack Protocol implemented by PyArrow.
Using NumPy Arrays
For example, to convert to NumPy ndarray format:
```python
import numpy
import pyarrow

SQL = "select id from SampleQueryTab order by id"

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)

# Convert to an ndarray via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
np = numpy.from_dlpack(pyarrow_array)

# Perform various numpy operations on the ndarray
print(numpy.sum(np))
print(numpy.log10(np))
```
See samples/dataframe_numpy.py for a runnable example.
15.2.6. Using Torch
An example of working with data as a Torch tensor is:
```python
import pyarrow
import torch

SQL = "select id from SampleQueryTab order by id"

# Get an OracleDataFrame
# Adjust arraysize to tune the query fetch performance
odf = connection.fetch_df_all(statement=SQL, arraysize=100)

# Convert to a Torch tensor via the Python DLPack specification
pyarrow_array = pyarrow.array(odf.get_column_by_name("ID"))
tt = torch.from_dlpack(pyarrow_array)

# Perform various Torch operations on the tensor
print(torch.sum(tt))
print(torch.log10(tt))
```
See samples/dataframe_torch.py for a runnable example.
15.3. Using VECTOR data with Data Frames
Columns of the VECTOR data type can be fetched with the methods Connection.fetch_df_all() and Connection.fetch_df_batches(). VECTOR columns can have flexible dimensions, but flexible storage formats are not supported: each vector value must have the same format data type. Vectors can be dense or sparse.
See Data Frame Type Mapping for the type mapping for VECTORs.
Dense Vectors
By default, Oracle Database vectors are “dense”. These are fetched in python-oracledb as Arrow lists. For example, if the table:
```sql
create table myvec (v64 vector(3, float64));
```
contains these two vectors:
```
[4.1, 5.2, 6.3]
[7.1, 8.2, 9.3]
```
then the code:
```python
odf = connection.fetch_df_all("select v64 from myvec")
pyarrow_table = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
)
```
will result in a PyArrow table containing lists of doubles. The table can be converted to a data frame of your chosen library using functionality described earlier in this chapter. For example, to convert to Pandas:
```python
pdf = pyarrow_table.to_pandas()
print(pdf)
```
The output will be:
```
               V64
0  [4.1, 5.2, 6.3]
1  [7.1, 8.2, 9.3]
```
Sparse Vectors
Sparse vectors (where many of the values are 0) are fetched as structs with fields num_dimensions, indices, and values, similar to SparseVector objects, which are discussed in a non-data frame context in Using SPARSE Vectors.
If the table:
```sql
create table myvec (v64 vector(3, float64, sparse));
```
contains these two vectors:
```
[3, [1,2], [4.1, 5.2]]
[3, [0], [9.3]]
```
then the code to fetch as data frames:
```python
import pyarrow

odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.Table.from_arrays(
    odf.column_arrays(), names=odf.column_names()
).to_pandas()

print(pdf)

print("First row:")

num_dimensions = pdf.iloc[0].V64['num_dimensions']
print(f"num_dimensions={num_dimensions}")

indices = pdf.iloc[0].V64['indices']
print(f"indices={indices}")

values = pdf.iloc[0].V64['values']
print(f"values={values}")
```
will display:
```
                                                 V64
0  {'num_dimensions': 3, 'indices': [1, 2], 'valu...
1  {'num_dimensions': 3, 'indices': [0], 'values'...
First row:
num_dimensions=3
indices=[1 2]
values=[4.1 5.2]
```
You can convert each struct as needed. One way to convert into Pandas dataframes with sparse values is via a SciPy coordinate format matrix. The Pandas method from_spmatrix() can then be used to create the final sparse dataframe:
```python
import numpy
import pandas
import pyarrow
import scipy

def convert_to_sparse_array(val):
    dimensions = val["num_dimensions"]
    col_indices = val["indices"]
    row_indices = numpy.zeros(len(col_indices))
    values = val["values"]
    sparse_matrix = scipy.sparse.coo_matrix(
        (values, (col_indices, row_indices)), shape=(dimensions, 1)
    )
    return pandas.arrays.SparseArray.from_spmatrix(sparse_matrix)

odf = connection.fetch_df_all("select v64 from myvec")
pdf = pyarrow.Table.from_arrays(
    odf.column_arrays(), odf.column_names()
).to_pandas()

pdf["SPARSE_ARRAY_V64"] = pdf["V64"].apply(convert_to_sparse_array)

print(pdf.SPARSE_ARRAY_V64)
```
The code will print:
```
0    [0.0, 4.1, 5.2]Fill: 0.0IntIndexIndices: ar...
1    [9.3, 0.0, 0.0]Fill: 0.0IntIndexIndices: ar...
Name: SPARSE_ARRAY_V64, dtype: object
```