Commit f7d3865

jayantsing-db authored and varun-edachali-dbx committed

Refactor decimal conversion in PyArrow tables to use direct casting (#544)

This PR replaces the previous implementation of convert_decimals_in_arrow_table() with a more efficient approach that uses PyArrow's native casting operation instead of going through pandas conversion and array creation.

- Remove conversion to pandas DataFrame via to_pandas() and apply() methods
- Remove intermediate steps of creating an array from the decimal column and setting it back
- Replace with direct type casting using PyArrow's cast() method
- Build a new table with transformed columns rather than modifying the original table
- Create a new schema based on the modified fields

The new approach is more performant because it avoids the pandas conversion overhead. The benchmark below shows substantial improvements when retrieving all rows from a table containing decimal columns, particularly when compression is disabled. Even greater gains were observed with compression enabled, showing approximately an 84% improvement (6 seconds compared to 39 seconds). Benchmarking was performed against e2-dogfood, with the client located in the us-west-2 region.

![image](https://github.com/user-attachments/assets/5407b651-8ab6-4c13-b525-cf912f503ba0)

Signed-off-by: Jayant Singh <jayant.singh@databricks.com>
Signed-off-by: varun-edachali-dbx <varun.edachali@databricks.com>

1 parent 8f7754b commit f7d3865

File tree

1 file changed: +19 -9 lines changed

src/databricks/sql/utils.py

Lines changed: 19 additions & 9 deletions
```diff
@@ -611,21 +611,31 @@ def convert_arrow_based_set_to_arrow_table(arrow_batches, lz4_compressed, schema
 
 
 def convert_decimals_in_arrow_table(table, description) -> "pyarrow.Table":
+    new_columns = []
+    new_fields = []
+
     for i, col in enumerate(table.itercolumns()):
+        field = table.field(i)
+
         if description[i][1] == "decimal":
-            decimal_col = col.to_pandas().apply(
-                lambda v: v if v is None else Decimal(v)
-            )
             precision, scale = description[i][4], description[i][5]
             assert scale is not None
             assert precision is not None
-            # Spark limits decimal to a maximum scale of 38,
-            # so 128 is guaranteed to be big enough
+            # create the target decimal type
             dtype = pyarrow.decimal128(precision, scale)
-            col_data = pyarrow.array(decimal_col, type=dtype)
-            field = table.field(i).with_type(dtype)
-            table = table.set_column(i, field, col_data)
-    return table
+
+            new_col = col.cast(dtype)
+            new_field = field.with_type(dtype)
+
+            new_columns.append(new_col)
+            new_fields.append(new_field)
+        else:
+            new_columns.append(col)
+            new_fields.append(field)
+
+    new_schema = pyarrow.schema(new_fields)
+
+    return pyarrow.Table.from_arrays(new_columns, schema=new_schema)
 
 
 def convert_to_assigned_datatypes_in_column_table(column_table, description):
```

0 commit comments
