PERF-#4494: Get partition widths/lengths in parallel instead of serially #4683
base: master
Conversation
codecov bot commented Jul 19, 2022 (edited)
Codecov Report
```
@@            Coverage Diff             @@
##           master    #4683       +/-   ##
===========================================
- Coverage   85.28%   72.15%   -13.13%
===========================================
  Files         259      259
  Lines       19378    19496      +118
===========================================
- Hits        16527    14068     -2459
- Misses       2851     5428     +2577
```
This logic is duplicated from the PartitionManager classes above, but I'm not sure how to access the correct partition manager from here.
pyrito commented Jul 21, 2022
Haven't taken a closer look at the implementation details, but do you have any benchmarks or performance measurements to compare with master?
noloerino commented Jul 21, 2022 via email
Sadly no, and I'd appreciate some suggestions on what code to run. Rehan suggested manually invalidating the ._row_lengths_cache and .length_cache fields on a dataframe and its partitions, then ensuring they're recomputed properly. It succeeds for simple examples, but I had trouble producing a Ray timeline, and I'm not sure how else to benchmark it (most API-level dataframe manipulations would probably hit the cached length/width).
modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py (outdated; resolved)
mvashishtha commented Jul 21, 2022 (edited)
I spent a while today trying to get a script that showcases the performance here without breaking anything in Modin, but I failed. Getting a reproducer is hard for a few reasons. For one thing, this optimization is only useful for unusual cases like in #4493, where the partitions' call queues include costly operations. When there is no call queue, the partitions will execute all dataframe functions eagerly, simultaneously calculating shapes. The call queues are generally meant to carry cheap operations like transpose and reindexing, but the reproducer in that issue has a frame that is very expensive to serialize, so that even the transpose was expensive.

Looking at all the serial shape computations I listed here, most are in internal length computations. I think it's good practice to get multiple ray objects in parallel (see also this note about a similar improvement).
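The "get multiple ray objects in parallel" point above can be sketched without Modin or Ray at all. The snippet below is a hypothetical stand-in: `remote_shape` and the thread pool play the role of remote shape computations and a single batched `ray.get(list_of_refs)`, which is what this PR moves toward instead of awaiting each object ID one at a time.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def remote_shape(i):
    # Stand-in for computing one partition's (length, width) remotely.
    time.sleep(0.05)
    return (100 * (i + 1), 8)

# Serial: each shape is awaited before the next request is issued,
# like calling ray.get on one object ref at a time.
t0 = time.perf_counter()
serial = [remote_shape(i) for i in range(4)]
serial_time = time.perf_counter() - t0

# Batched: all four requests are in flight at once, analogous to a
# single ray.get(list_of_refs).
t0 = time.perf_counter()
with ThreadPoolExecutor(max_workers=4) as pool:
    batched = list(pool.map(remote_shape, range(4)))
batched_time = time.perf_counter() - t0

assert serial == batched            # same answers
assert batched_time < serial_time   # roughly one wait instead of four
```

The answers are identical either way; only the number of round trips changes.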
vnlitvinov commented Jul 21, 2022
This adds a certain amount of complexity (judging by the number of lines changed; I haven't looked at the diff yet), and I haven't yet seen any performance proof for it. I would like to see some measurements before increasing our (already huge) codebase...
RehanSD left a comment
Left some comments, but great work!
Why do we need to compute dimensions here?
The length and width values of each partition are accessed in the local compute_part_size, defined immediately below. The double for loop structure where compute_part_size is called makes it hard to parallelize the computation of these dimensions, so I thought it would be simplest to precompute the relevant dimensions before the loop.
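The "precompute before the loop" idea can be sketched in isolation. This is a hypothetical illustration (the `precompute_dimensions` helper and the thread pool are stand-ins, not Modin code): all dimension computations are launched up front, so the nested loops only ever read finished values.

```python
from concurrent.futures import ThreadPoolExecutor

def precompute_dimensions(partitions, measure):
    # Launch every measurement at once so the double for loop below
    # reads completed results instead of blocking one partition at a time.
    with ThreadPoolExecutor() as pool:
        futures = [[pool.submit(measure, part) for part in row]
                   for row in partitions]
        return [[f.result() for f in row] for row in futures]

# A 2x2 grid of toy "partitions" measured by len().
grid = [[list(range(3)), list(range(5))],
        [list(range(2)), list(range(4))]]
lengths = precompute_dimensions(grid, len)

for row_lengths in lengths:
    for n in row_lengths:   # the serial loop now only reads cached values
        assert isinstance(n, int)
```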
We need to unwrap _length_cache here, since its type will be PandasDataframePartition
What do you mean by unwrap? Also, as far as I can tell, the logic for this method should be the same as it originally was (the code was just moved into try_build_length_cache), so does this mean the original code returned PandasDataframePartition as well?
modin/core/execution/dask/implementations/pandas_on_dask/partitioning/virtual_partition.py (outdated; resolved)
Shouldn't this just be i as well?
No, since new_lengths may have fewer elements than caches in the case where some length values were already computed (and are filtered out by the isinstance(cache, Future) check). The value computed at new_lengths[dask_idx] should correspond to the promise at caches[i].
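The two-counter bookkeeping described here can be shown with a toy example. The `Future` class below is a hypothetical stand-in for a pending dask Future; `dask_idx` advances only when a pending entry is consumed, so it lags behind `i` whenever an already-known length is skipped.

```python
class Future:
    """Stand-in for a pending dask Future (illustrative only)."""
    def __init__(self, value):
        self.value = value

caches = [3, Future(5), 7, Future(9)]       # mix of known lengths and promises
pending = [c for c in caches if isinstance(c, Future)]
new_lengths = [f.value for f in pending]    # "materialized" in one batched call

# dask_idx walks new_lengths while i walks caches; the two indices
# diverge whenever a non-Future (already-cached) entry is skipped.
dask_idx = 0
for i, cache in enumerate(caches):
    if isinstance(cache, Future):
        caches[i] = new_lengths[dask_idx]
        dask_idx += 1

assert caches == [3, 5, 7, 9]
```

Here new_lengths has 2 elements while caches has 4, which is why indexing new_lengths with i would be wrong.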
modin/core/execution/dask/implementations/pandas_on_dask/partitioning/virtual_partition.py (outdated; resolved)
modin/core/execution/dask/implementations/pandas_on_dask/partitioning/partition.py (outdated; resolved)
modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition.py (outdated; resolved)
modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py (outdated; resolved)
modin/core/execution/ray/implementations/pandas_on_ray/partitioning/partition_manager.py (outdated; resolved)
modin/core/execution/ray/implementations/pandas_on_ray/partitioning/virtual_partition.py (outdated; resolved)
noloerino commented Jul 26, 2022
@vnlitvinov that makes sense, I'll look into coming up with concrete benchmarks.
vnlitvinov commented Jul 27, 2022 (edited)
@pyrito please have a look at https://github.com/vnlitvinov/modin/tree/speedup-masking and #4726, it might be doing somewhat the same in terms of getting the sizes in parallel
YarShev commented Jul 27, 2022
Related discussion on handling metadata (index and columns) in #3673.
Force-pushed from 6a17fc3 to e0bb5fa.
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>
What do these changes do?
Computes widths and lengths of block partitions in parallel, as batched calls to ray.get/DaskWrapper.materialize, rather than in serial.

This adds the try_build_[length|width]_cache and try_set_[length|width]_cache methods to block partitions; the former returns a promise/future for computing the partition's length, and the latter should be called by the partition manager to inform the block partition of the computation's value. This also adds _update_partition_dimension_caches to the PartitionManager class, which will call the length/width futures returned by its constituent partitions.

- passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
- passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
- signed commit with git commit -s
- docs/development/architecture.rst is up-to-date