8. Batch Statement and Bulk Copy Operations

Python-oracledb is perfect for large ETL (“Extract, Transform, Load”) data operations.

This chapter focuses on efficient data ingestion. Python-oracledb lets you easily optimize batch insertion, and also allows “noisy” data (values not in a suitable format) to be filtered for review while other, correct, values are inserted.

Related topics include Tuning python-oracledb and Working with Data Frames.

8.1. Batch Statement Execution

Inserting, updating or deleting multiple rows can be performed efficiently with Cursor.executemany(), making it easy to work with large data sets with python-oracledb. This method can significantly outperform repeated calls to Cursor.execute() by reducing network transfer costs and database overheads. The executemany() method can also be used to execute a PL/SQL statement multiple times in one call.

There are examples in the GitHub examples directory.

The following tables are used in the examples in this section:

create table ParentTable (
    ParentId    number(9) not null,
    Description varchar2(60) not null,
    constraint ParentTable_pk primary key (ParentId)
);

create table ChildTable (
    ChildId     number(9) not null,
    ParentId    number(9) not null,
    Description varchar2(60) not null,
    constraint ChildTable_pk primary key (ChildId),
    constraint ChildTable_fk foreign key (ParentId) references ParentTable
);

8.1.1. Batch Execution of SQL

The following example inserts five rows into the table ParentTable:

data=[(10,"Parent 10"),(20,"Parent 20"),(30,"Parent 30"),(40,"Parent 40"),(50,"Parent 50")]cursor.executemany("insert into ParentTable values (:1, :2)",data)

Each tuple value maps to one of the bind variable placeholders.

This code requires only one round-trip from the client to the database instead of the five round-trips that would be required for repeated calls to execute(). For very large data sets, there may be an external buffer or network limits to how many rows can be processed, so repeated calls to executemany() may be required. The limits are based on both the number of rows being processed as well as the “size” of each row that is being processed. Repeated calls to executemany() are still better than repeated calls to execute().
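For example, a very large data set could be inserted in chunks. The following is a minimal sketch, assuming all_data is a large list of tuples already held in memory and that the illustrative BATCH_SIZE value suits your memory and network constraints:

# Insert the data in chunks, one executemany() call per chunk.
# "all_data" and BATCH_SIZE are illustrative; tune the batch size to
# your environment.
BATCH_SIZE = 50000
for start in range(0, len(all_data), BATCH_SIZE):
    cursor.executemany("insert into ParentTable values (:1, :2)",
                       all_data[start:start + BATCH_SIZE])
connection.commit()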

To insert a single column, make sure the bind variables are correctly created as tuples, for example:

data = [
    (10,),
    (20,),
    (30,),
]
cursor.executemany('insert into mytable (mycol) values (:1)', data)

Named binds can be performed by passing an array of dicts, where the keys match the bind variable placeholder names:

data=[{"pid":10,"pdesc":"Parent 10"},{"pid":20,"pdesc":"Parent 20"},{"pid":30,"pdesc":"Parent 30"},{"pid":40,"pdesc":"Parent 40"},{"pid":50,"pdesc":"Parent 50"}]cursor.executemany("insert into ParentTable values :pid, :pdesc)",data)

8.1.2. Predefining Memory Areas

When multiple rows of data are being processed there is the possibility that the data is not uniform in type and size. In such cases, python-oracledb makes some effort to accommodate such differences. Type determination for each column is deferred until a value that is not None is found in the column’s data. If all values in a particular column are None, then python-oracledb assumes the type is a string and has a length of 1. Python-oracledb will also adjust the size of the buffers used to store strings and bytes when a longer value is encountered in the data. These sorts of operations incur overhead as memory has to be reallocated and data copied. To eliminate this overhead, using setinputsizes() tells python-oracledb about the type and size of the data that is going to be used.

Consider the following code:

data=[(110,"Parent 110"),(2000,"Parent 2000"),(30000,"Parent 30000"),(400000,"Parent 400000"),(5000000,"Parent 5000000")]cursor.setinputsizes(None,20)cursor.executemany("""        insert into ParentTable (ParentId, Description)        values (:1, :2)""",data)

If this example did not call setinputsizes(), then python-oracledb would perform five allocations of increasing size, copying data each time it discovers a new, longer string. However, cursor.setinputsizes(None, 20) tells python-oracledb that the maximum size of the strings that will be processed is 20 characters. The first parameter of None tells python-oracledb that its default processing will be sufficient, since numeric data is already stored efficiently. Since python-oracledb allocates memory for each row based on the supplied values, do not oversize them.
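Declaring a type with setinputsizes() is also useful when a column's data is entirely None in a given batch, so that the correct database type is bound instead of 1-character strings. A minimal sketch, assuming a hypothetical table t with a numeric second column:

# Without the setinputsizes() call, python-oracledb would bind the
# all-None second column as 1-character strings.  The table "t" is
# hypothetical.
data = [(1, None), (2, None), (3, None)]
cursor.setinputsizes(None, oracledb.DB_TYPE_NUMBER)
cursor.executemany("insert into t (a, b) values (:1, :2)", data)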

If the size of the buffers allocated for any of the bind values exceeds 2 GB, you will receive the error DPI-1015: array size of <n> is too large, where <n> varies with the size of each element being allocated in the buffer. For example, binding 100,000 rows of a column buffered at 30,000 bytes each would require roughly 3 GB. If you receive this error, decrease the number of rows being inserted.

With named bind variables, use named parameters when calling setinputsizes():

data=[{"pid":110,"pdesc":"Parent 110"},{"pid":2000,"pdesc":"Parent 2000"},{"pid":30000,"pdesc":"Parent 30000"},{"pid":400000,"pdesc":"Parent 400000"},{"pid":5000000,"pdesc":"Parent 5000000"}]cursor.setinputsizes(pdesc=20)cursor.executemany("""        insert into ParentTable (ParentId, Description)        values (:pid, :pdesc)""",data)

8.1.3. Batch Execution of PL/SQL

Using executemany() can improve performance when PL/SQL functions, procedures, or anonymous blocks need to be called multiple times.

Runnable examples are in plsql_batch.py.

IN Binds

An example using bind by position for IN binds is:

data=[(10,"Parent 10"),(20,"Parent 20"),(30,"Parent 30"),(40,"Parent 40"),(50,"Parent 50")]cursor.executemany("begin mypkg.create_parent(:1, :2); end;",data)

Note that the batcherrors parameter of executemany() (discussed in Handling Data Errors) cannot be used with PL/SQL block execution.

OUT Binds

When using OUT binds in PL/SQL, the input data omits entries for the OUT bind variable placeholders. An example PL/SQL procedure that returns OUT binds is:

create or replace procedure myproc (
    p1 in number,
    p2 out number
) as
begin
    p2 := p1 * 2;
end;

This can be called in python-oracledb using positional binds like:

data = [(100,), (200,), (300,)]
outvals = cursor.var(oracledb.DB_TYPE_NUMBER, arraysize=len(data))
cursor.setinputsizes(None, outvals)
cursor.executemany("begin myproc(:1, :2); end;", data)
print(outvals.values)

The output is:

[200, 400, 600]

The equivalent code using named binds is:

data=[{"p1bv":100},{"p1bv":200},{"p1bv":300}]outvals=cursor.var(oracledb.DB_TYPE_NUMBER,arraysize=len(data))cursor.setinputsizes(p1bv=None,p2bv=outvals)cursor.executemany("begin myproc(:p1bv, :p2bv); end;",data)print(outvals.values)

Note that in python-oracledb Thick mode, when executemany() is used for PL/SQL code that returns OUT binds, it will have the same performance characteristics as repeated calls to execute().

IN/OUT Binds

An example PL/SQL procedure that returns IN/OUT binds is:

create or replace procedure myproc2 (
    p1 in number,
    p2 in out varchar2
) as
begin
    p2 := p2 || ' ' || p1;
end;

This can be called in python-oracledb using positional binds like:

data = [(440, 'Gregory'), (550, 'Haley'), (660, 'Ian')]
outvals = cursor.var(oracledb.DB_TYPE_VARCHAR, size=100, arraysize=len(data))
cursor.setinputsizes(None, outvals)
cursor.executemany("begin myproc2(:1, :2); end;", data)
print(outvals.values)

The size parameter of Cursor.var() indicates the maximum length of the string that can be returned.

Output is:

['Gregory 440', 'Haley 550', 'Ian 660']

The equivalent code using named binds is:

data=[{"p1bv":440,"p2bv":'Gregory'},{"p1bv":550,"p2bv":'Haley'},{"p1bv":660,"p2bv":'Ian'}]outvals=cursor.var(oracledb.DB_TYPE_VARCHAR,size=100,arraysize=len(data))cursor.setinputsizes(p1bv=None,p2bv=outvals)cursor.executemany("begin myproc2(:p1bv, :p2bv); end;",data)print(outvals.values)

8.1.4. Handling Data Errors

Large datasets may contain some invalid data. When using batch execution as discussed above, the entire batch will be discarded if a single error is detected, potentially eliminating the performance benefits of batch execution and increasing the complexity of the code required to handle those errors. If the parameter batcherrors is set to the value True when calling executemany(), however, processing will continue even if there are data errors in some rows, and the rows containing errors can be examined afterwards to determine what course the application should take. Note that if any errors are detected, a transaction will be started but not committed, even if Connection.autocommit is set to True. After examining the errors and deciding what to do with them, the application needs to explicitly commit or roll back the transaction with Connection.commit() or Connection.rollback(), as needed.

This example shows how data errors can be identified:

data=[(60,"Parent 60"),(70,"Parent 70"),(70,"Parent 70 (duplicate)"),(80,"Parent 80"),(80,"Parent 80 (duplicate)"),(90,"Parent 90")]cursor.executemany("insert into ParentTable values (:1, :2)",data,batcherrors=True)forerrorincursor.getbatcherrors():print("Error",error.message,"at row offset",error.offset)

The output is:

Error ORA-00001: unique constraint (PYTHONDEMO.PARENTTABLE_PK) violated at row offset 2
Error ORA-00001: unique constraint (PYTHONDEMO.PARENTTABLE_PK) violated at row offset 4

The row offset is the index into the array of the data that could not be inserted due to errors. The application could choose to commit or roll back the other rows that were successfully inserted. Alternatively, it could correct the data for the two invalid rows and attempt to insert them again before committing.
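For example, one possible recovery strategy is to collect the failed rows using the reported offsets, correct them, and retry before committing. A minimal sketch continuing the example above, in which the correction step is application-specific:

# Collect the rows that failed, using the reported offsets
failed_rows = [data[error.offset] for error in cursor.getbatcherrors()]

# ... correct failed_rows as appropriate for the application ...

cursor.executemany("insert into ParentTable values (:1, :2)", failed_rows,
                   batcherrors=True)
if cursor.getbatcherrors():
    connection.rollback()
else:
    connection.commit()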

8.1.5. Identifying Affected Rows

When executing a DML statement using execute(), the number of rows affected can be examined by looking at the attribute rowcount. When performing batch execution with Cursor.executemany(), the row count will return the total number of rows that were affected. If you want to know the number of rows affected by each row of data that is bound, set the parameter arraydmlrowcounts to True, as shown:

parent_ids_to_delete = [20, 30, 50]
cursor.executemany("delete from ChildTable where ParentId = :1",
                   [(i,) for i in parent_ids_to_delete],
                   arraydmlrowcounts=True)
row_counts = cursor.getarraydmlrowcounts()
for parent_id, count in zip(parent_ids_to_delete, row_counts):
    print("Parent ID:", parent_id, "deleted", count, "rows.")

Using the data found in the GitHub samples the output is as follows:

Parent ID: 20 deleted 3 rows.
Parent ID: 30 deleted 2 rows.
Parent ID: 50 deleted 4 rows.

8.1.6. DML RETURNING

DML statements like INSERT, UPDATE, DELETE, and MERGE can return values by using the DML RETURNING syntax. A bind variable can be created to accept this data. See Using Bind Variables for more information.

If, instead of merely deleting the rows as shown in the previous example, you also wanted to know some information about each of the rows that were deleted, you can use the following code:

parent_ids_to_delete = [20, 30, 50]

child_id_var = cursor.var(int, arraysize=len(parent_ids_to_delete))
cursor.setinputsizes(None, child_id_var)

cursor.executemany("""
    delete from ChildTable
    where ParentId = :1
    returning ChildId into :2""",
    [(i,) for i in parent_ids_to_delete])

for ix, parent_id in enumerate(parent_ids_to_delete):
    print("Child IDs deleted for parent ID", parent_id, "are",
          child_id_var.getvalue(ix))

The output will be:

Child IDs deleted for parent ID 20 are [1002, 1003, 1004]
Child IDs deleted for parent ID 30 are [1005, 1006]
Child IDs deleted for parent ID 50 are [1012, 1013, 1014, 1015]

Note that the bind variable created to accept the returned data must have an arraysize large enough to hold data for each row that is processed. Also, the call to Cursor.setinputsizes() binds this variable immediately so that it does not have to be passed in each row of data.

8.2. Bulk Copy Operations

Bulk copy operations can be performed using Cursor.executemany() together with appropriate SQL statements and standard Python modules.

Also, see Working with Data Frames and Oracle Database Pipelining.

8.2.1. Loading CSV Files into Oracle Database

The Cursor.executemany() method and Python’s csv module can be used to efficiently insert CSV (Comma Separated Values) data. For example, consider the file data.csv:

101,Abel
154,Baker
132,Charlie
199,Delta
...

And the schema:

create table test (id number, name varchar2(25));

Data loading can be done in batches of records since the number of records may prevent all data being inserted at once:

import oracledb
import csv

# CSV file
FILE_NAME = 'data.csv'

# Adjust the number of rows to be inserted in each iteration
# to meet your memory and performance requirements
BATCH_SIZE = 10000

connection = oracledb.connect(user="hr", password=userpwd,
                              dsn="dbhost.example.com/orclpdb")

with connection.cursor() as cursor:

    # Predefine the memory areas to match the table definition.
    # This can improve performance by avoiding memory reallocations.
    # Here, one parameter is passed for each of the columns.
    # "None" is used for the ID column, since the size of NUMBER isn't
    # variable.  The "25" matches the maximum expected data size for the
    # NAME column
    cursor.setinputsizes(None, 25)

    with open(FILE_NAME, 'r') as csv_file:
        csv_reader = csv.reader(csv_file, delimiter=',')
        sql = "insert into test (id, name) values (:1, :2)"
        data = []
        for line in csv_reader:
            data.append((line[0], line[1]))
            if len(data) % BATCH_SIZE == 0:
                cursor.executemany(sql, data)
                data = []
        if data:
            cursor.executemany(sql, data)
        connection.commit()

Depending on data sizes and business requirements, database changes such as temporarily disabling redo logging on the table, or disabling indexes, may also be beneficial.
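For example, the DDL below sketches this idea for the test table. The index name test_name_ix is hypothetical, NOLOGGING mainly benefits direct-path operations, and such changes affect recoverability, so review them with your DBA before use:

# Illustrative only: these statements change recoverability and index
# maintenance behavior.  "test_name_ix" is a hypothetical index name.
cursor.execute("alter table test nologging")
cursor.execute("alter index test_name_ix unusable")

# ... perform the bulk load shown above ...

cursor.execute("alter table test logging")
cursor.execute("alter index test_name_ix rebuild")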

See samples/load_csv.py for a runnable example.

8.2.2. Creating CSV Files from Oracle Database

Python’s csv module can be used to efficiently create CSV (Comma Separated Values) files. For example:

cursor.arraysize = 1000  # tune this for large queries

print(f"Writing to {FILE_NAME}")
with open(FILE_NAME, "w") as f:
    writer = csv.writer(f, lineterminator="\n",
                        quoting=csv.QUOTE_NONNUMERIC)
    cursor.execute("""select rownum, sysdate, mycol from BigTab""")
    writer.writerow(info.name for info in cursor.description)
    writer.writerows(cursor)

See samples/write_csv.py for a runnable example.

8.2.3. Bulk Copying Data between Databases

The Cursor.executemany() function is useful for copying data from one database to another, for example in an ETL (“Extract, Transform, Load”) workflow:

# Connect to both databases
source_connection = oracledb.connect(user=un1, password=pw1, dsn=cs1)
target_connection = oracledb.connect(user=un2, password=pw2, dsn=cs2)

# Setup cursors
source_cursor = source_connection.cursor()
source_cursor.arraysize = 1000         # tune this for query performance
target_cursor = target_connection.cursor()
target_cursor.setinputsizes(None, 25)  # set according to column types

# Perform bulk fetch and insertion
source_cursor.execute("select c1, c2 from MySrcTable")
while True:

    # Extract the records
    rows = source_cursor.fetchmany()
    if not rows:
        break

    # Optionally transform the records here
    # ...

    # Load the records into the target database
    target_cursor.executemany("insert into MyDestTable values (:1, :2)", rows)

target_connection.commit()

The arraysize value alters how many rows each Cursor.fetchmany() call returns, see Tuning Fetch Performance. The setinputsizes() call is used to optimize memory allocation when inserting with executemany(), see Predefining Memory Areas. You may also want to tune the SDU setting for best network performance, see Tuning python-oracledb.
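For example, a larger Session Data Unit size can be requested when creating a connection. This is a sketch only; the value shown is illustrative, and the size actually used is negotiated with the database:

# Request a larger SDU for the target connection; the effective value
# is negotiated with the database and listener configuration
target_connection = oracledb.connect(user=un2, password=pw2, dsn=cs2,
                                     sdu=65535)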

If you are inserting back into the same database that the records originally came from, you do not need to open a second connection. Instead, both cursors can be obtained from one connection, as sketched below.
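A minimal sketch of that variation, reusing the fetch-and-insert loop shown above:

# Copying between tables in the same database: both cursors come from
# the one connection, and each operates independently
source_cursor = connection.cursor()
source_cursor.arraysize = 1000
target_cursor = connection.cursor()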

Avoiding Copying Data Over the Network

When copying data to another table in the same database, it may be preferable to use INSERT INTO SELECT or CREATE TABLE AS SELECT to avoid the overhead of copying data to, and back from, the Python process. This also avoids any data type changes. For example, to create a complete copy of a table:

cursor.execute("create table new_table as select * from old_table")

Similarly, when copying to a different database, consider creating a database link between the databases and using INSERT INTO SELECT or CREATE TABLE AS SELECT.

You can control the data transfer by changing your SELECT statement.
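For example, assuming a database link called remote_db has been created in the target database (the link and table names are illustrative), a filtered copy can be done entirely in SQL:

# Hypothetical names: "remote_db" is a pre-created database link.
# The WHERE clause controls which rows are transferred.
cursor.execute("""
    insert into MyDestTable
    select c1, c2 from MySrcTable@remote_db
    where c1 > :minval""", minval=100)
connection.commit()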