Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

PERF-#4494: Get partition widths/lengths in parallel instead of serially#4683

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Draft
noloerino wants to merge14 commits intomodin-project:master
base:master
Choose a base branch
Loading
fromnoloerino:parallel-dims

Conversation

@noloerino
Copy link
Collaborator

@noloerinonoloerino commentedJul 18, 2022
edited
Loading

What do these changes do?

Computes widths and lengths of block partitions in parallel as batched calls toray.get/DaskWrapper.materialize rather than in serial.

This adds thetry_build_[length|width]_cache andtry_set_[length|width]_cache methods to block partitions; the former returns a promise/future for computing the partition's length, and the latter should be called by the partition manager to inform the block partition of the computation's value. This also adds the_update_partition_dimension_caches to thePartitionManager class, which will call the length/width futures returned by its constituent partitions.

  • commit message follows format outlinedhere
  • passesflake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passesblack --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit withgit commit -s
  • ResolvesPERF: get all partition widths/lengths in parallel instead of serially. #4494
  • tests added and passing
  • module layout described atdocs/development/architecture.rst is up-to-date
  • added (Issue Number: PR title (PR Number)) and github username to release notes for next major release

@noloerinonoloerino requested a review froma team as acode ownerJuly 18, 2022 21:36
@noloerinonoloerino marked this pull request as draftJuly 18, 2022 21:36
@codecov
Copy link

codecovbot commentedJul 19, 2022
edited
Loading

Codecov Report

Merging#4683 (490778c) intomaster (8e1190c) willdecrease coverage by13.12%.
The diff coverage is67.93%.

@@             Coverage Diff             @@##           master    #4683       +/-   ##===========================================- Coverage   85.28%   72.15%   -13.13%===========================================  Files         259      259                 Lines       19378    19496      +118     ===========================================- Hits        16527    14068     -2459- Misses       2851     5428     +2577
Impacted FilesCoverage Δ
...s/pandas_on_dask/partitioning/virtual_partition.py62.99% <0.00%> (-23.74%)⬇️
...ns/pandas_on_ray/partitioning/virtual_partition.py71.66% <6.66%> (-16.07%)⬇️
...lementations/pandas_on_dask/dataframe/dataframe.py80.76% <25.00%> (-15.07%)⬇️
...dataframe/pandas/partitioning/partition_manager.py75.67% <75.00%> (-10.79%)⬇️
...entations/pandas_on_dask/partitioning/partition.py79.77% <81.81%> (-9.25%)⬇️
...plementations/pandas_on_ray/dataframe/dataframe.py84.44% <82.50%> (-15.56%)⬇️
modin/core/dataframe/pandas/dataframe/dataframe.py71.44% <100.00%> (-22.89%)⬇️
...in/core/dataframe/pandas/partitioning/partition.py100.00% <100.00%> (ø)
...mentations/pandas_on_ray/partitioning/partition.py91.66% <100.00%> (+0.51%)⬆️
...ns/pandas_on_ray/partitioning/partition_manager.py83.50% <100.00%> (+2.68%)⬆️
... and84 more

📣 Codecov can now indicate which changes are the most critical in Pull Requests.Learn more

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This logic is duplicated from thePartitionManager classes above, but I'm not sure how to access the correct partition manager from here.

@noloerinonoloerino marked this pull request as ready for reviewJuly 20, 2022 21:24
@pyrito
Copy link
Collaborator

Haven't taken a closer look at the implementation details, but do you have any benchmarks or performance measurements to compare with master?

@noloerino
Copy link
CollaboratorAuthor

noloerino commentedJul 21, 2022 via email

Sadly no, and I’d appreciate some suggestions on what code to run. Rehansuggested manually invalidating the ._row_lengths_cache and .length_cachefields on a dataframe and its partitions, then ensuring they’re recomputedproperly. It succeeds for simple examples, but I had trouble producing aRay timeline, and I’m not sure how else to benchmark it (most API-leveldataframe manipulations would probably hit the cached length/width).
On Wed, Jul 20, 2022 at 19:18 Karthik Velayutham ***@***.***> wrote: Haven't taken a closer look at the implementation details, but do you have any benchmarks or performance measurements to compare with master? — Reply to this email directly, view it on GitHub <#4683 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/AFFY4GR46CDY7NCNZ722GSDVVCXO7ANCNFSM535Z7DMA> . You are receiving this because you authored the thread.Message ID: ***@***.***>
pyrito reacted with eyes emoji

@mvashishtha
Copy link
Collaborator

mvashishtha commentedJul 21, 2022
edited
Loading

@noloerino@pyrito

Sadly no, and I’d appreciate some suggestions on what code to run.

I spent a while today trying to get a script that showcases the performance here without breaking anything in Modin, but I failed. Getting a reproducer is hard for a few reasons.

For one thing, this optimization is only useful for unusual cases like in#4493 where the partitions' call queues include costly operations. When there is no call queue, the partitions will execute all dataframe functions eagerly, simultaneously calculating shapes. The call queues are generally meant to carry cheap operations like transpose and reindexing, but thereproducer in that issue has a frame that is very expensive to serialize, so that even the transpose was expensive. There the slow code was in_copartition, which unnecessarily calculated the widths of the base frame.#4495 fixed that unnecessary recalculation, so that script no longer works. Also, everyPandasDataFrame computes all the lengths when it filters empty subframes as soon as it's constructedhere, so any Modin dataframe at rest already knows its partition shapes.

Looking at all the serial shape computations I listedhere, most are in internal length computations. One is_copartition, and I spent a while trying to get around the cache fix in#4495 with a pair of frames that really needed copartitioning, but in that case themap_axis_partitions in_copartition triggers parallel computation. The last type of length computation is inapply_func_to_indices_both_axis, which as far as I can tell is only used inmelt. We could try engineering an example that bypasses the cache formelt, but I don't think it's worth the time...

I think it's good practice to get multiple ray objects in parallel (see alsothis note about a similar improvement in_to_pandas). Also, if our caches fail for any reason later on, we can have faster length computation as a backup.

pyrito reacted with eyes emoji

@vnlitvinov
Copy link
Collaborator

This adds a certain bit of complexity (judging by the number of lines change, haven't looked at the diff yet), and I haven't yet seen any performance proof for that. I would like to see some measurements before increasing our (already huge) codebase...

Copy link
Collaborator

@RehanSDRehanSD left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Left some comments, but great work!

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why do we need to compute dimensions here?

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Thelength andwidth values of each partition are accessed in the localcompute_part_size, defined immediately below. The doublefor loop structure wherecompute_part_size is called makes it hard to parallelize the computation of these dimensions, so I thought it would be simplest to precompute the relevant dimensions before the loop.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

We need to unwrap_length_cache here, since its type will bePandasDataframePartition

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

What do you mean byunwrap? Also, as far as I can tell, the logic for this method should be the same as it originally was (the code was just moved into thetry_build_length_cache, so does this mean the original code returnedPandasDataframePartition as well?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Shouldn't this just bei as well?

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

No, sincenew_lengths may have fewer elements thancaches in the case where some length values were already computed (and are filtered out by theisinstance(cache, Future) check). The value computed atnew_lengths[dask_idx] should correspond to the promise atcaches[i].

@noloerino
Copy link
CollaboratorAuthor

@vnlitvinov that makes sense, I'll look into coming up with concrete benchmarks.

@vnlitvinov
Copy link
Collaborator

vnlitvinov commentedJul 27, 2022
edited
Loading

@pyrito please have a look athttps://github.com/vnlitvinov/modin/tree/speedup-masking and#4726, it might be doing somewhat the same in terms of getting the sizes in parallel

@YarShev
Copy link
Collaborator

Related discussion on handling metadata (index and columns) in#3673.

@noloerinonoloerinoforce-pushed theparallel-dims branch 2 times, most recently from6a17fc3 toe0bb5faCompareAugust 8, 2022 23:52
noloerinoand others added13 commitsAugust 9, 2022 11:42
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
…end)Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Co-authored-by: Rehan Sohail Durrani <rdurrani@berkeley.edu>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@RehanSDRehanSDRehanSD requested changes

@devin-petersohndevin-petersohnAwaiting requested review from devin-petersohndevin-petersohn will be requested when the pull request is marked ready for reviewdevin-petersohn is a code owner

@mvashishthamvashishthaAwaiting requested review from mvashishthamvashishtha will be requested when the pull request is marked ready for reviewmvashishtha is a code owner

@YarShevYarShevAwaiting requested review from YarShevYarShev will be requested when the pull request is marked ready for reviewYarShev is a code owner

@vnlitvinovvnlitvinovAwaiting requested review from vnlitvinovvnlitvinov will be requested when the pull request is marked ready for reviewvnlitvinov is a code owner

@anmyachevanmyachevAwaiting requested review from anmyachevanmyachev will be requested when the pull request is marked ready for reviewanmyachev is a code owner

@dchigarevdchigarevAwaiting requested review from dchigarevdchigarev will be requested when the pull request is marked ready for reviewdchigarev is a code owner

+1 more reviewer

@jeffreykennethlijeffreykennethlijeffreykennethli left review comments

Reviewers whose approvals may not affect merge requirements

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

PERF: get all partition widths/lengths in parallel instead of serially.

7 participants

@noloerino@pyrito@mvashishtha@vnlitvinov@YarShev@jeffreykennethli@RehanSD

[8]ページ先頭

©2009-2025 Movatter.jp