NotificationsYou must be signed in to change notification settings
Fork670
Star10.3k

FEAT-#4909: Properly implement map operator#5118

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

Jump to bottom

Draft

noloerino wants to merge10 commits intomodin-project:master

base:master

Choose a base branch

fromnoloerino:map-operator

Draft

FEAT-#4909: Properly implement map operator#5118

noloerino wants to merge10 commits intomodin-project:masterfromnoloerino:map-operator

Conversation

Copy link

Collaborator

noloerino commentedOct 12, 2022•
edited
Loading

What do these changes do?

This PR cleans up the interfaces of the variousmap,apply, andbroadcast dataframe and partition manager methods.

Sincereduce andtreereduce both use these methods, these are also affected by the aforementioned changes. The changes also incidentally address#4912 and (partially)#5094, but those changes can be separated out fairly easily if this PR is too large.

Overall, the following changes have been made to the dataframe API (the partition manager changes are very similar):

Old method	New method
map and broadcast_apply	map_partitions
apply_full_axis and broadcast_apply_full_axis	map_partition_full_axis
apply_select_indices and broadcast_apply_select_indices	map_select_indices
apply_func_to_indices_both_axis	map_select_indices_both_axes

A lot of logic that used to be in separate functions got moved into nested if/else chains with this refactor: suggestions on how to clean up the code would be appreciated.

Microbenchmarks

All tests were run on an EC2 t2.2xlarge instance (8 CPUs, 32 GiB RAM, 128 GB disk, Ubuntu Jammy AMD64) with the Ray backend, with int64 dataframes of size 2^16 x 2^14. Each test was run 5 times and averaged.

These benchmarks seem to indicate no appreciable performance difference on datasets of this size.

abs

The abs function is changed to map across rows rather than cell-wise.

PR (f5ef6f9e): 0.0354s
master (c070b65): 0.0352s

apply

The test randf.apply(np.sum, axis=0).

PR: 9.0078s
master: 9.0166

describe

PR: 32.3454s
master: 32.0462

commit message follows format outlinedhere
passesflake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
passesblack --check modin/ asv_bench/benchmarks scripts/doc_checker.py
signed commit withgit commit -s
ResolvesFEAT: Properly implementmap operator #4909,BUG:first_valid_index errors on dataframe with only None/NaN values #4912,BUG: Passing string asaxis argument leads to incorrect behavior #5094
tests added and passing
module layout described atdocs/development/architecture.rst is up-to-date

Copy link

codecovbot commentedOct 12, 2022•
edited
Loading

Codecov Report

Merging#5118 (f5ef6f9) intomaster (abcf1e9) willincrease coverage by4.51%.
The diff coverage is94.87%.

@@            Coverage Diff             @@##           master    #5118      +/-   ##==========================================+ Coverage   84.56%   89.08%   +4.51%==========================================  Files         256      257       +1       Lines       19349    19613     +264     ==========================================+ Hits        16363    17472    +1109+ Misses       2986     2141     -845

Impacted Files	Coverage Δ
...ns/pandas_on_ray/partitioning/partition_manager.py	`71.73% <ø> (+4.34%)`	⬆️
modin/core/dataframe/base/dataframe/dataframe.py	`95.34% <60.00%> (-4.66%)`	⬇️
...dataframe/pandas/partitioning/partition_manager.py	`88.88% <91.50%> (+1.98%)`	⬆️
modin/core/dataframe/pandas/dataframe/dataframe.py	`95.18% <94.49%> (-0.06%)`	⬇️
modin/core/dataframe/algebra/binary.py	`100.00% <100.00%> (ø)`
modin/core/dataframe/base/dataframe/utils.py	`100.00% <100.00%> (ø)`
...me/pandas/interchange/dataframe_protocol/column.py	`93.33% <100.00%> (ø)`
...pandas/interchange/dataframe_protocol/dataframe.py	`96.72% <100.00%> (ø)`
...odin/core/storage_formats/pandas/query_compiler.py	`96.34% <100.00%> (+0.40%)`	⬆️
...ecution/ray/implementations/pandas_on_ray/io/io.py	`93.33% <100.00%> (ø)`
... and51 more

📣 We’re building smart automated test selection to slash your CI/CD build times.Learn more

Copy link

lgtm-combot commentedOct 12, 2022

This pull requestintroduces 1 alert when merging 5c0478cfb1740123b64fd58ccdd8b2a8604dd2ef into88f7b27 -view on LGTM.com

new alerts:

1 for Wrong name for an argument in a class instantiation

noloerino marked this pull request as ready for review

October 12, 2022 20:32

noloerino requested a review froma team as acode owner

October 12, 2022 20:32

dchigarev requested changes

Oct 14, 2022

View reviewed changes

Copy link

Collaborator

dchigarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Left some comments.

BTW, do we really need to combine all of the map functions into a single one? IMO some of them became really huge, complicated, and hard to read. EspeciallyPartitionManager.map_select_indices andPandasDataframe._map_axis.

I would suggest either refactoring them somehow to relax the complexity or splitting some of them into separate methods.

modin/core/dataframe/base/dataframe/utils.py Outdated

Comment on lines +45 to +48

Copy link

Collaborator

dchigarevOct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Why is this needed? I mean, why would we extend internal dataframe API to also be able to acceptAxisInt when we already haveAxis enum?

Copy link

CollaboratorAuthor

noloerinoOct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

A lot of the codebase (mostly the query compiler) is written to call dataframe methods with a literal int rather than theAxis enum. I think it would be easier to re-wrap the axis with the enum from within dataframe methods (as is done now) than to go through and fix every instance where relevant dataframe methods are called to use the enum instead.

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I don't see why we need thisAxis enum then. I really don't like this mixing ofAxis,AxisInt, and actual integers for an axis value. I think we should pick only one of the ways of interpreting an axis and then really stick to this, not introducing a variety of axis types in order to cover an existing zoo of value types.

modin/core/dataframe/base/dataframe/dataframe.py Outdated

Comment on lines +123 to +117

Copy link

Collaborator

dchigarevOct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why do we offer acopy_dtypes option only for themap operator but not forreduce andtree_reduce?

Copy link

CollaboratorAuthor

noloerinoOct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

I'm not really sure, though my guess is that the frequently dimension-reducing nature of reduce/tree-reduce makes the argument less relevant for those cases. Here, I introducedcopy_dtypes as a replacement fordtypes="copy", which is a little hacky.

modin/core/dataframe/base/dataframe/dataframe.py Outdated

Copy link

Collaborator

dchigarevOct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

If I remember correctly, there was a discussion regarding limiting the usage of pandas entities in the base classes of Modin internals. Some executions may not require pandas at all and wouldn't like to deal with handling breaking changes introduced by some pandas updates.

May we define thedtypes type as something abstract likecollections.abc.Mapping so every execution would use whatever container they like?

Copy link

CollaboratorAuthor

noloerinoOct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Sure, that makes senes. Is there some other generic container that would acceptpandas.Series though? It seems like it's not a subclass ofMapping.

modin/core/dataframe/pandas/dataframe/dataframe.py OutdatedShow resolvedHide resolved

modin/core/dataframe/base/dataframe/dataframe.py OutdatedShow resolvedHide resolved

modin/core/dataframe/pandas/partitioning/partition_manager.py OutdatedShow resolvedHide resolved

modin/core/storage_formats/pandas/query_compiler.py Outdated

Copy link

Collaborator

dchigarevOct 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

Could you elaborate on how the speed-up is achieved?

IMO the cell-wise execution should be beneficial in the general case against row/col-wise.

Copy link

CollaboratorAuthor

noloerinoOct 17, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

My quick and dirty micro-benchmarks show no difference between specifying an axis vs. applying cell-wise, so perhaps it's best to revert back to cell-wise operations. The hope is that for certain operators, being able to apply across a whole axis rather than having to examine each cell would provide a speedup. I will see if any other benchmarks would justify this theory.

noloerino force-pushed themap-operator branch 3 times, most recently from02a1619 to363dcfdCompare

November 1, 2022 19:42

Copy link

CollaboratorAuthor

noloerino commentedNov 1, 2022•
edited
Loading

Updated benchmarks for this PR (02a16191, slightly older version before a rebase) vs. current master (6f0ff79).

df.abs() (2^16 x 2^14)

02a16191 - 0.0352s
6f0ff79 - 0.0354s

df.apply(np.sum, axis=0) (2^16 x 2^14)

02a16191 - 6.531s
6f0ff79 - 8.072s

df1 + df2 (2^15 x 2^14 each)

02a16191 - 0.0218s
6f0ff79 - 0.0212s

df.describe() (2^16 x 2^14)

02a16191 - 33.1s
6f0ff79 - 32.8s

df.isna().any() (2^16 x 2^14)

02a16191 - 0.0782s
6f0ff79 - 0.0858s

I haven't yet check the sources of speedup (e.g. whether they're from shorter code paths/less partition overhead, or from changing maps to be axis-wise).

noloerino force-pushed themap-operator branch from363dcfd to429511fCompare

November 7, 2022 17:51

noloerino force-pushed themap-operator branch from429511f to5c11ab4Compare

November 21, 2022 18:08

noloerino force-pushed themap-operator branch from5c11ab4 tod1596d8Compare

November 29, 2022 18:01

noloerino added10 commits

December 6, 2022 13:16

FEAT-modin-project#4909: Properly implementmapoperator

da415ad

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

suppress mypy warning for differing imports

af1926a

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

rename unidist partition method list

be8ff60

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

fix unidist io map

ea1c425

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

as last commit

0cd70f5

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

fix apply_full_axis in unidist io

cb5c99b

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

change remaining wrong functions

f506eed

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

lint

2d987b0

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

edit doc

81afa94

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

fix str copy_dtypes

07aef83

Signed-off-by: Jonathan Shi <jhshi@ponder.io>

noloerino force-pushed themap-operator branch fromab2c1c9 to07aef83Compare

December 7, 2022 02:01

Copy link

CollaboratorAuthor

noloerino commentedDec 7, 2022

CI should be passing now (I ran it on my own repository before pushing here).

dchigarev reviewed

Dec 7, 2022

View reviewed changes

Copy link

Collaborator

dchigarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

May I kindly ask, what was the original idea of the PR? It seems that this PR tries to solve too many problems in one piece. It's really hard to review for me and to make the changes here for you.

I feel that the PR covers the following distinct topics:

Align how we useaxis argument in low-level dataframe
Introduce new logic for working withdtypes/copy_dtypes parameters
Combinemap andbroadcast_apply intomap_partitions
Combineapply_full_axis andbroadcast_apply_full_axis intomap_partition_full_axis
Combineapply_select_indices andbroadcast_apply_select_indices intomap_select_indices
Reworkapply_func_to_indices_both_axis intomap_select_indices_both_axes

All of these may be solved with small different PRs (rather than one huge). They're probably un-doable in parallel as some of them may block each other, however, I think the changes would make much more sense when introduced by small iterations.

modin/core/dataframe/base/dataframe/utils.py Outdated

Comment on lines +45 to +48

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

modin/core/dataframe/pandas/dataframe/dataframe.py

Comment on lines +1893 to +1894

		join_type : str, default: "left"
		Type of join to apply.

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

we have a special enum for this, let's use it

modin/modin/core/dataframe/base/dataframe/utils.py

Line 39 in2ebc9cf

classJoinType(Enum):# noqa: PR01

modin/core/dataframe/pandas/dataframe/dataframe.py

Comment on lines +1039 to +1043

		axis=0,
		other_partitions=None,
		full_axis=False,
		apply_indices=[0],
		other_apply_indices=None,

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

do we really want these parameters to be specified? it seems that they just duplicate default values

Suggested change

	axis=0,
	other_partitions=None,
	full_axis=False,
	apply_indices=[0],
	other_apply_indices=None,
	axis=0,
	apply_indices=[0],

modin/core/dataframe/pandas/dataframe/dataframe.py

		the partitions will be concatenated together before the function is called, and then re-split
		after it returns.
		join_type : str, default: "left"
		Type of join to apply.

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

can you please elaborate? something like this is expected:

Suggested change

	Typeofjointoapply.
	Typeofjointoapplyiftheconcatenationof`self`and`other`wouldberequired.

modin/core/dataframe/pandas/dataframe/dataframe.py

		dtypes=dtypes,
		)
		ifaxis==Axis.CELL_WISE:
		returnself._map_cellwise(func,dtypes)

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why does cell-wise map ignore all other parameters?

modin/core/dataframe/pandas/dataframe/dataframe.py

Comment on lines +1914 to +1916

		new_partitions=self._partition_mgr_cls.map_partitions(
		self._partitions,func
		)

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why do we ignoreaxis here? why does the.map_partitions call is inside of_map_axis that's supposed to call function axis-wise only?

modin/core/dataframe/pandas/dataframe/dataframe.py

		*,
		axis:Optional[Union[AxisInt,Axis]]=None,
		other:Optional["PandasDataframe"]=None,
		full_axis=False,

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why do we need this parameter if we have a separate method for this (map_full_axis)?

modin/core/dataframe/pandas/dataframe/dataframe.py

		kw=self._make_init_labels_args(new_partitions,new_index,new_columns)
		ifcopy_dtypes:
		kw["dtypes"]=self._dtypes
		elifisinstance(dtypes,type):

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

judging by the method's signature we are only supposed to allowpandas.Series to be adtype parameter, why is this logic here then? Let's either change the signature or adapt the logic somehow

modin/core/dataframe/pandas/dataframe/dataframe.py

Comment on lines +2080 to +2081

		apply_indices=None,
		numeric_indices=None,

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

do we really want these two parameters to exist at the same time? we can easily end-up in an ambiguous situation with this set of parameters:

md_df.map_select_indices(apply_indices=["a","b"],numeric_indices=[1,2,3,4,5], ...)# what's the method supposed to do?

modin/core/dataframe/pandas/dataframe/dataframe.py

		)

		defrename(
		defwindow(

Copy link

Collaborator

dchigarevDec 7, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

why we're adding this here if there's no implementation? Shouldn't it be located in the base class then?

Copy link

CollaboratorAuthor

noloerino commentedDec 7, 2022

Thanks for taking the time to review@dchigarev. Broadly speaking, the purpose of this PR is to make calling the various partition application methods more uniform, and remove misleading "broadcast" nomenclature from the codebase (my understanding is that when the functions were originally written, the intent was for the functions to broadcast arguments to match dimensions like in somenumpy functions).

I'll see if I can split this into several smaller PRs; your suggestions for how to break it down makes sense, although this fragmentation might cause some inconsistencies between how different mapping methods are used. I'll double check with@RehanSD (who assigned me to this task) if this is a viable approach.

noloerino mentioned this pull request

Dec 13, 2022

REFACTOR: Replacedtypes="copy" withcopy_dtypes flag#5424

Open

noloerino marked this pull request as draft

December 13, 2022 00:47

Copy link

CollaboratorAuthor

noloerino commentedDec 13, 2022

I've decided to split this into smaller parts as you suggested, starting with#5426 and#5427. Thanks again for the advice@dchigarev.

Reviewers

dchigarevdchigarev requested changes

devin-petersohnAwaiting requested review from devin-petersohndevin-petersohn will be requested when the pull request is marked ready for reviewdevin-petersohn is a code owner

mvashishthaAwaiting requested review from mvashishthamvashishtha will be requested when the pull request is marked ready for reviewmvashishtha is a code owner

RehanSDAwaiting requested review from RehanSDRehanSD will be requested when the pull request is marked ready for reviewRehanSD is a code owner

YarShevAwaiting requested review from YarShevYarShev will be requested when the pull request is marked ready for reviewYarShev is a code owner

vnlitvinovAwaiting requested review from vnlitvinovvnlitvinov will be requested when the pull request is marked ready for reviewvnlitvinov is a code owner

anmyachevAwaiting requested review from anmyachevanmyachev will be requested when the pull request is marked ready for reviewanmyachev is a code owner

Labels

None yet

Movatterモバイル変換

FEAT-#4909: Properly implement map operator#5118

Are you sure you want to change the base?

FEAT-#4909: Properly implement map operator#5118

Uh oh!

Conversation

noloerino commentedOct 12, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

What do these changes do?

Microbenchmarks

abs

apply

describe

Uh oh!

codecovbot commentedOct 12, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Codecov Report

Uh oh!

lgtm-combot commentedOct 12, 2022

Uh oh!

dchigarev left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

noloerino commentedNov 1, 2022• editedLoading Uh oh!There was an error while loading.Please reload this page.

Uh oh!

Uh oh!

noloerino commentedDec 7, 2022

Uh oh!

dchigarev left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

noloerino commentedDec 7, 2022

Uh oh!

noloerino commentedDec 13, 2022

Uh oh!

Reviewers

noloerino commentedOct 12, 2022•
edited
Loading

codecovbot commentedOct 12, 2022•
edited
Loading

noloerino commentedNov 1, 2022•
edited
Loading