Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Sign in
Appearance settings

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
Appearance settings

FEAT-#4909: Properly implement map operator#5118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Draft
noloerino wants to merge10 commits intomodin-project:master
base:master
Choose a base branch
Loading
fromnoloerino:map-operator

Conversation

@noloerino
Copy link
Collaborator

@noloerinonoloerino commentedOct 12, 2022
edited
Loading

What do these changes do?

This PR cleans up the interfaces of the variousmap,apply, andbroadcast dataframe and partition manager methods.

Sincereduce andtreereduce both use these methods, these are also affected by the aforementioned changes. The changes also incidentally address#4912 and (partially)#5094, but those changes can be separated out fairly easily if this PR is too large.

Overall, the following changes have been made to the dataframe API (the partition manager changes are very similar):

Old methodNew method
map
and
broadcast_apply
map_partitions
apply_full_axis
and
broadcast_apply_full_axis
map_partition_full_axis
apply_select_indices
and
broadcast_apply_select_indices
map_select_indices
apply_func_to_indices_both_axis
map_select_indices_both_axes

A lot of logic that used to be in separate functions got moved into nested if/else chains with this refactor: suggestions on how to clean up the code would be appreciated.

Microbenchmarks

All tests were run on an EC2 t2.2xlarge instance (8 CPUs, 32 GiB RAM, 128 GB disk, Ubuntu Jammy AMD64) with the Ray backend, with int64 dataframes of size 2^16 x 2^14. Each test was run 5 times and averaged.

These benchmarks seem to indicate no appreciable performance difference on datasets of this size.

abs

The abs function is changed to map across rows rather than cell-wise.

  • PR (f5ef6f9e): 0.0354s
  • master (c070b65): 0.0352s

apply

The test randf.apply(np.sum, axis=0).

  • PR: 9.0078s
  • master: 9.0166

describe

  • PR: 32.3454s
  • master: 32.0462

@codecov
Copy link

codecovbot commentedOct 12, 2022
edited
Loading

Codecov Report

Merging#5118 (f5ef6f9) intomaster (abcf1e9) willincrease coverage by4.51%.
The diff coverage is94.87%.

@@            Coverage Diff             @@##           master    #5118      +/-   ##==========================================+ Coverage   84.56%   89.08%   +4.51%==========================================  Files         256      257       +1       Lines       19349    19613     +264     ==========================================+ Hits        16363    17472    +1109+ Misses       2986     2141     -845
Impacted FilesCoverage Δ
...ns/pandas_on_ray/partitioning/partition_manager.py71.73% <ø> (+4.34%)⬆️
modin/core/dataframe/base/dataframe/dataframe.py95.34% <60.00%> (-4.66%)⬇️
...dataframe/pandas/partitioning/partition_manager.py88.88% <91.50%> (+1.98%)⬆️
modin/core/dataframe/pandas/dataframe/dataframe.py95.18% <94.49%> (-0.06%)⬇️
modin/core/dataframe/algebra/binary.py100.00% <100.00%> (ø)
modin/core/dataframe/base/dataframe/utils.py100.00% <100.00%> (ø)
...me/pandas/interchange/dataframe_protocol/column.py93.33% <100.00%> (ø)
...pandas/interchange/dataframe_protocol/dataframe.py96.72% <100.00%> (ø)
...odin/core/storage_formats/pandas/query_compiler.py96.34% <100.00%> (+0.40%)⬆️
...ecution/ray/implementations/pandas_on_ray/io/io.py93.33% <100.00%> (ø)
... and51 more

📣 We’re building smart automated test selection to slash your CI/CD build times.Learn more

@lgtm-com
Copy link

This pull requestintroduces 1 alert when merging 5c0478cfb1740123b64fd58ccdd8b2a8604dd2ef into88f7b27 -view on LGTM.com

new alerts:

  • 1 for Wrong name for an argument in a class instantiation

@noloerinonoloerino marked this pull request as ready for reviewOctober 12, 2022 20:32
@noloerinonoloerino requested a review froma team as acode ownerOctober 12, 2022 20:32
Copy link
Collaborator

@dchigarevdchigarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Left some comments.

BTW, do we really need to combine all of the map functions into a single one? IMO some of them became really huge, complicated, and hard to read. EspeciallyPartitionManager.map_select_indices andPandasDataframe._map_axis.

I would suggest either refactoring them somehow to relax the complexity or splitting some of them into separate methods.

Comment on lines +45 to +48
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why is this needed? I mean, why would we extend internal dataframe API to also be able to acceptAxisInt when we already haveAxis enum?

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

A lot of the codebase (mostly the query compiler) is written to call dataframe methods with a literal int rather than theAxis enum. I think it would be easier to re-wrap the axis with the enum from within dataframe methods (as is done now) than to go through and fix every instance where relevant dataframe methods are called to use the enum instead.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't see why we need thisAxis enum then. I really don't like this mixing ofAxis,AxisInt, and actual integers for an axis value. I think we should pick only one of the ways of interpreting an axis and then really stick to this, not introducing a variety of axis types in order to cover an existing zoo of value types.

Comment on lines +123 to +117
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why do we offer acopy_dtypes option only for themap operator but not forreduce andtree_reduce?

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm not really sure, though my guess is that the frequently dimension-reducing nature of reduce/tree-reduce makes the argument less relevant for those cases. Here, I introducedcopy_dtypes as a replacement fordtypes="copy", which is a little hacky.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If I remember correctly, there was a discussion regarding limiting the usage of pandas entities in the base classes of Modin internals. Some executions may not require pandas at all and wouldn't like to deal with handling breaking changes introduced by some pandas updates.

May we define thedtypes type as something abstract likecollections.abc.Mapping so every execution would use whatever container they like?

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sure, that makes senes. Is there some other generic container that would acceptpandas.Series though? It seems like it's not a subclass ofMapping.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Could you elaborate on how the speed-up is achieved?

IMO the cell-wise execution should be beneficial in the general case against row/col-wise.

Copy link
CollaboratorAuthor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

My quick and dirty micro-benchmarks show no difference between specifying an axis vs. applying cell-wise, so perhaps it's best to revert back to cell-wise operations. The hope is that for certain operators, being able to apply across a whole axis rather than having to examine each cell would provide a speedup. I will see if any other benchmarks would justify this theory.

@noloerinonoloerinoforce-pushed themap-operator branch 3 times, most recently from02a1619 to363dcfdCompareNovember 1, 2022 19:42
@noloerino
Copy link
CollaboratorAuthor

noloerino commentedNov 1, 2022
edited
Loading

Updated benchmarks for this PR (02a16191, slightly older version before a rebase) vs. current master (6f0ff79).

df.abs() (2^16 x 2^14)

  • 02a16191 - 0.0352s
  • 6f0ff79 - 0.0354s

df.apply(np.sum, axis=0) (2^16 x 2^14)

df1 + df2 (2^15 x 2^14 each)

  • 02a16191 - 0.0218s
  • 6f0ff79 - 0.0212s

df.describe() (2^16 x 2^14)

df.isna().any() (2^16 x 2^14)

  • 02a16191 - 0.0782s
  • 6f0ff79 - 0.0858s

I haven't yet check the sources of speedup (e.g. whether they're from shorter code paths/less partition overhead, or from changing maps to be axis-wise).

Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
Signed-off-by: Jonathan Shi <jhshi@ponder.io>
@noloerino
Copy link
CollaboratorAuthor

CI should be passing now (I ran it on my own repository before pushing here).

Copy link
Collaborator

@dchigarevdchigarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

May I kindly ask, what was the original idea of the PR? It seems that this PR tries to solve too many problems in one piece. It's really hard to review for me and to make the changes here for you.

I feel that the PR covers the following distinct topics:

  1. Align how we useaxis argument in low-level dataframe
  2. Introduce new logic for working withdtypes/copy_dtypes parameters
  3. Combinemap andbroadcast_apply intomap_partitions
  4. Combineapply_full_axis andbroadcast_apply_full_axis intomap_partition_full_axis
  5. Combineapply_select_indices andbroadcast_apply_select_indices intomap_select_indices
  6. Reworkapply_func_to_indices_both_axis intomap_select_indices_both_axes

All of these may be solved with small different PRs (rather than one huge). They're probably un-doable in parallel as some of them may block each other, however, I think the changes would make much more sense when introduced by small iterations.

anmyachev reacted with thumbs up emoji
Comment on lines +45 to +48
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't see why we need thisAxis enum then. I really don't like this mixing ofAxis,AxisInt, and actual integers for an axis value. I think we should pick only one of the ways of interpreting an axis and then really stick to this, not introducing a variety of axis types in order to cover an existing zoo of value types.

Comment on lines +1893 to +1894
join_type : str, default: "left"
Type of join to apply.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we have a special enum for this, let's use it

classJoinType(Enum):# noqa: PR01

Comment on lines +1039 to +1043
axis=0,
other_partitions=None,
full_axis=False,
apply_indices=[0],
other_apply_indices=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

do we really want these parameters to be specified? it seems that they just duplicate default values

Suggested change
axis=0,
other_partitions=None,
full_axis=False,
apply_indices=[0],
other_apply_indices=None,
axis=0,
apply_indices=[0],

the partitions will be concatenated together before the function is called, and then re-split
after it returns.
join_type : str, default: "left"
Type of join to apply.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

can you please elaborate? something like this is expected:

Suggested change
Typeofjointoapply.
Typeofjointoapplyiftheconcatenationof`self`and`other`wouldberequired.

dtypes=dtypes,
)
ifaxis==Axis.CELL_WISE:
returnself._map_cellwise(func,dtypes)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why does cell-wise map ignore all other parameters?

Comment on lines +1914 to +1916
new_partitions=self._partition_mgr_cls.map_partitions(
self._partitions,func
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why do we ignoreaxis here? why does the.map_partitions call is inside of_map_axis that's supposed to call function axis-wise only?

*,
axis:Optional[Union[AxisInt,Axis]]=None,
other:Optional["PandasDataframe"]=None,
full_axis=False,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why do we need this parameter if we have a separate method for this (map_full_axis)?

kw=self._make_init_labels_args(new_partitions,new_index,new_columns)
ifcopy_dtypes:
kw["dtypes"]=self._dtypes
elifisinstance(dtypes,type):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

judging by the method's signature we are only supposed to allowpandas.Series to be adtype parameter, why is this logic here then? Let's either change the signature or adapt the logic somehow

Comment on lines +2080 to +2081
apply_indices=None,
numeric_indices=None,
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

do we really want these two parameters to exist at the same time? we can easily end-up in an ambiguous situation with this set of parameters:

md_df.map_select_indices(apply_indices=["a","b"],numeric_indices=[1,2,3,4,5], ...)# what's the method supposed to do?

)

defrename(
defwindow(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why we're adding this here if there's no implementation? Shouldn't it be located in the base class then?

@noloerino
Copy link
CollaboratorAuthor

Thanks for taking the time to review@dchigarev. Broadly speaking, the purpose of this PR is to make calling the various partition application methods more uniform, and remove misleading "broadcast" nomenclature from the codebase (my understanding is that when the functions were originally written, the intent was for the functions to broadcast arguments to match dimensions like in somenumpy functions).

I'll see if I can split this into several smaller PRs; your suggestions for how to break it down makes sense, although this fragmentation might cause some inconsistencies between how different mapping methods are used. I'll double check with@RehanSD (who assigned me to this task) if this is a viable approach.

@noloerino
Copy link
CollaboratorAuthor

I've decided to split this into smaller parts as you suggested, starting with#5426 and#5427. Thanks again for the advice@dchigarev.

Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment

Reviewers

@dchigarevdchigarevdchigarev requested changes

@devin-petersohndevin-petersohnAwaiting requested review from devin-petersohndevin-petersohn will be requested when the pull request is marked ready for reviewdevin-petersohn is a code owner

@mvashishthamvashishthaAwaiting requested review from mvashishthamvashishtha will be requested when the pull request is marked ready for reviewmvashishtha is a code owner

@RehanSDRehanSDAwaiting requested review from RehanSDRehanSD will be requested when the pull request is marked ready for reviewRehanSD is a code owner

@YarShevYarShevAwaiting requested review from YarShevYarShev will be requested when the pull request is marked ready for reviewYarShev is a code owner

@vnlitvinovvnlitvinovAwaiting requested review from vnlitvinovvnlitvinov will be requested when the pull request is marked ready for reviewvnlitvinov is a code owner

@anmyachevanmyachevAwaiting requested review from anmyachevanmyachev will be requested when the pull request is marked ready for reviewanmyachev is a code owner

Assignees

No one assigned

Labels

None yet

Projects

None yet

Milestone

No milestone

Development

Successfully merging this pull request may close these issues.

FEAT: Properly implementmap operator

2 participants

@noloerino@dchigarev

[8]ページ先頭

©2009-2025 Movatter.jp