Arrow performance optimizations #638

Merged
jprakash-db merged 5 commits into main from jprakash-db/arrow-optim
Jul 16, 2025

@jprakash-db (Contributor) commented Jul 15, 2025 (edited)

Description

This pull request introduces several performance optimizations for operations involving Apache Arrow tables within the Databricks SQL Python client.

  • Reduce overhead and improve efficiency when concatenating Arrow tables, especially when fetching data in batches.
  • Additionally, the PR streamlines HTTP download logic and improves code readability and maintainability.

Optimizations

Arrow Table Concatenation Optimizations

Batching Concatenations:
Instead of calling pyarrow.concat_tables on the running result after every batch, which repeats the concatenation overhead for each batch, partial results are now collected into a list (partial_result_chunks) and concatenated only once at the end using pyarrow.concat_tables(partial_result_chunks, use_threads=True).

CloudFetch Downloader Refactor

HTTP Client Consolidation:
Replaces direct use of requests.Session with a singleton pattern via DatabricksHttpClient. This centralizes HTTP handling and is more robust for connection management and configuration.
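
The singleton idea can be sketched as below. SharedHttpClient is a hypothetical class written for illustration and does not reproduce the real DatabricksHttpClient API; a standard-library urllib opener stands in for whatever HTTP session the real client wraps.

```python
import threading
import urllib.request


class SharedHttpClient:
    """Hypothetical singleton wrapper; not the actual DatabricksHttpClient."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # One opener shared by every caller, so HTTP configuration lives in one place.
        self.opener = urllib.request.build_opener()

    @classmethod
    def get_instance(cls):
        # Double-checked locking so concurrent download threads end up with one client.
        if cls._instance is None:
            with cls._lock:
                if cls._instance is None:
                    cls._instance = cls()
        return cls._instance

    def get(self, url: str, timeout: float = 30.0) -> bytes:
        # Fetch the URL through the shared opener and return the raw response body.
        with self.opener.open(url, timeout=timeout) as response:
            return response.read()


client_a = SharedHttpClient.get_instance()
client_b = SharedHttpClient.get_instance()
```

Every caller that goes through get_instance() receives the same object, which is the property the consolidation relies on.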

Benchmarking

Arrow concatenation optimization

Benchmarked using:
num_tables: 10000 | row_per_table: 10000 | columns_per_table: 10 | attempts: 10

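| Metric | pre - latency | post - latency | Improvement |
|--------|---------------|----------------|-------------|
| count  | 10.0          | 10.0           |             |
| mean   | 9.26s         | 1.48s          |             |
| std    | 0.78s         | 0.61s          |             |
| min    | 8.27s         | 0.005s         |             |
| 95%    | 10.43s        | 1.89s          | 81%         |
| 99%    | 10.55s        | 1.90s          | 82%         |
| max    | 10.58s        | 1.90s          |             |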
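
A summary with these metric names could be produced by a harness along the following lines. This is a sketch under assumptions, not the script used for the numbers above: time_attempts and summarize are hypothetical helpers, and the percentiles use linear interpolation (the behaviour of a pandas describe() summary, which the metric names resemble).

```python
import statistics
import time


def time_attempts(fn, attempts=10):
    """Run fn repeatedly and return the wall-clock latency of each attempt in seconds."""
    latencies = []
    for _ in range(attempts):
        start = time.perf_counter()
        fn()
        latencies.append(time.perf_counter() - start)
    return latencies


def summarize(latencies):
    """Return count, mean, std, min, p95, p99, and max for a list of latencies."""
    # quantiles(..., n=100) yields 99 cut points; index 94 is p95 and index 98 is p99.
    cuts = statistics.quantiles(latencies, n=100, method="inclusive")
    return {
        "count": len(latencies),
        "mean": statistics.mean(latencies),
        "std": statistics.stdev(latencies),
        "min": min(latencies),
        "95%": cuts[94],
        "99%": cuts[98],
        "max": max(latencies),
    }


# Example with fixed values so the summary is reproducible.
example = summarize([float(x) for x in range(1, 11)])
```

Timing the pre-change and post-change concatenation functions with time_attempts and passing each result to summarize would give two columns comparable to the table.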

End to end optimization

This end-to-end test includes both the Arrow update and the HTTP client update.
benchmarking workspace: benchmarking-staging-aws-us-east-1.staging.cloud.databricks.com
test runs: 10 per benchmark
benchmarking query : SELECT * FROM main.tpcds_sf100_delta.catalog_sales WHERE cs_ship_mode_sk <= 14 AND cs_sold_date_sk BETWEEN 2450815 AND (2450815 + 410) LIMIT {LIMIT} OFFSET {row_offset}

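| Num of Rows | p95 Pre | p95 Post | Improvement | p99 Pre | p99 Post | Improvement |
|-------------|---------|----------|-------------|---------|----------|-------------|
| 10,000      | 4.01s   | 2.71s    | 32.41%      | 4.33s   | 2.85s    | 34.1%       |
| 100,000     | 19.52s  | 14.32s   | 26%         | 22.23s  | 14.55s   | 34.5%       |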
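
The Improvement columns appear to be the relative reduction in latency, (pre - post) / pre. The sketch below recomputes that from the reported pre and post values; the formula is an assumption inferred from the numbers, and improvement_pct is a hypothetical helper.

```python
def improvement_pct(pre_s: float, post_s: float) -> float:
    """Relative latency reduction in percent, assuming (pre - post) / pre."""
    return (pre_s - post_s) / pre_s * 100.0


# (rows, percentile) -> (pre latency in s, post latency in s), taken from the table.
reported = {
    ("10,000", "p95"): (4.01, 2.71),
    ("10,000", "p99"): (4.33, 2.85),
    ("100,000", "p95"): (19.52, 14.32),
    ("100,000", "p99"): (22.23, 14.55),
}

computed = {
    key: round(improvement_pct(pre, post), 2)
    for key, (pre, post) in reported.items()
}

for (rows, percentile), value in computed.items():
    print(f"{rows} rows, {percentile}: {value}% lower latency")
```

The recomputed values land within about one percentage point of the reported figures, which is consistent with the assumed formula plus rounding.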

Summary

Arrow optimizations - 81% lower p95 concatenation latency ⚡

End-to-end optimizations - 26% to 34.5% lower p95/p99 latency ⚡

@github-actions commented:

Thanks for your contribution! To satisfy the DCO policy in our contributing guide, every commit message must include a sign-off message. One or more of your commits is missing this message. You can reword previous commit messages with an interactive rebase (git rebase -i main).

@jprakash-db force-pushed the jprakash-db/arrow-optim branch from e1484c2 to 8cdfd88, July 15, 2025 07:14
@jprakash-db marked this pull request as ready for review, July 15, 2025 11:02
@jprakash-db merged commit e0ca049 into main, Jul 16, 2025
22 of 24 checks passed

Reviewers

@vikrantpuppala approved these changes

@jayantsing-db: awaiting requested review

@gopalldb: awaiting requested review

@samikshya-db: awaiting requested review

3 participants

@jprakash-db, @vikrantpuppala
