ENH: Reimplement DataFrame.lookup #61185
Conversation
np.random.Generator.random, not np.random.Generator
stevenae commented Mar 27, 2025 (edited)
I tested out three variants of subsetting the dataframe before converting to numpy:
Optimization testing script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)
for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
```
As a result of this testing I settled on the third option.
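The three variants themselves are not shown in the thread. A hedged sketch of what column subsetting before conversion might look like; these specific variants are assumptions for illustration, not the PR's actual code:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5), "c": list("xyzxy")})
wanted = ["a", "b"]

v1 = df[wanted].to_numpy()                                  # plain label selection
v2 = df.loc[:, wanted].to_numpy()                           # .loc column slice
v3 = df.iloc[:, df.columns.get_indexer(wanted)].to_numpy()  # positional take
```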
rhshadrach left a comment (edited)
If we are to move forward, this looks good; it should get a whatsnew entry in enhancements for 3.0.
rhshadrach left a comment (edited)
Edit: Fixed link below
Is the implementation in #40140 (comment) not sufficient?
```python
size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size),
                   'b': np.random.random(size),
                   'c': 'x'})
row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
%timeit df.lookup(row_labels, col_labels)
# 22.3 ms ± 391 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- this PR
# 13.4 ms ± 17 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) <--- proposed implementation
```
The implementation
``pandas.DataFrame.lookup``
Done.
rhshadrach commented Mar 28, 2025 (edited)
@stevenae - sorry, linked to the wrong comment. I've fixed my comment above. Ah, but I think I see: this avoids a large copy when only certain columns are used.
cc @pandas-dev/pandas-core My take: this provides an implementation for what I think is a natural operation that is not straightforward for most users. It provides performance benefits that take into account columnar-based storage (subsetting columns prior to calling ``to_numpy``).
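A minimal sketch of the copy-avoidance idea being described, assuming a wide frame where only two columns are looked up; subsetting first means ``to_numpy`` only materializes the needed columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1_000, 50)),
                  columns=[f"c{i}" for i in range(50)])
needed = ["c0", "c1"]

full = df.to_numpy()           # materializes all 50 columns
small = df[needed].to_numpy()  # copies only the two columns actually used
print(full.shape, small.shape)  # (1000, 50) (1000, 2)
```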
Yes -- I ran a comparison (script at end) and found this PR implementation beats the comment you referenced on large mixed-type lookups.

Metrics
Script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)

def pd_lookup(df, row_labels, col_labels):
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
```
pandas/core/frame.py (outdated)

```python
Returns
-------
numpy.ndarray
    The found values.
"""
```
I think it would be really useful to have an example here in the docs for the API.
Added, please take a look.
expanded example
Nice example. I will let the other pandas developers handle the rest of the PR
```diff
-and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing.
-For instance:
+and column labels, and the ``lookup`` method allows for this and returns a
+NumPy array. For instance:
```
Do we have other places in our API where we return a NumPy array? With the prevalence of the Arrow type system this doesn't seem desirable to be locked into returning a NumPy array.
It looks like ``values`` also does this.
Agreed, I think this API should return an ``ExtensionArray`` or NumPy array depending on the initial type or result type.
``values`` only returns a NumPy array for NumPy types. For extension types or arrow-backed types you get something different:

```python
>>> pd.Series([1, 2, 3], dtype="int64[pyarrow]").values
<ArrowExtensionArray>
[1, 2, 3]
Length: 3, dtype: int64[pyarrow]
```
I don't think we should force a NumPy array return here; particularly for string data, that could be non-performant and expensive
Thought this through and did a bit more of a heavy-handed rewrite. Now using ``melt`` to achieve the outcome of ``values`` or ``to_numpy``. Performance does take a hit; however, we are still outperforming the naive lookup of ``to_numpy`` for mixed-type lookups.
| k, n | Old PR (s) | New PR (s) |
|---|---|---|
| 2, 100 (homogeneous) | 0.1964133749715984 | 0.5150299999950221 |
| 2, 100 (mixed) | 0.274302874924615 | 0.5055611249990761 |
| 3, 100 (homogeneous) | 0.15044220816344023 | 0.48040162499819417 |
| 3, 100 (mixed) | 0.2768622918520123 | 0.5237024579982972 |
| 4, 100 (homogeneous) | 0.15489325020462275 | 0.49075670799356885 |
| 4, 100 (mixed) | 0.26732829213142395 | 0.5079907500039553 |
| 5, 100 (homogeneous) | 0.1546538749244064 | 0.4678692500019679 |
| 5, 100 (mixed) | 0.2721201251260936 | 0.5082256250025239 |
| 2, 100000 (homogeneous) | 0.8096102089621127 | 2.114792499996838 |
| 2, 100000 (mixed) | 1.9508202918805182 | 2.619460332993185 |
| 3, 100000 (homogeneous) | 0.8242515418678522 | 2.2221941250027157 |
| 3, 100000 (mixed) | 1.9535491249989718 | 2.6292148750071647 |
| 4, 100000 (homogeneous) | 0.8302762501407415 | 2.3314981659932528 |
| 4, 100000 (mixed) | 1.9240409170743078 | 2.711707041991758 |
| 5, 100000 (homogeneous) | 0.8654224998317659 | 2.201970291993348 |
| 5, 100000 (mixed) | 2.0630989999044687 | 2.674396375005017 |
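The melt-based implementation itself is not shown in the thread. A rough sketch of how ``melt`` could be used for a lookup, assuming unique row labels and default ``melt`` column names; ``melt_lookup`` is a hypothetical name, not the PR's code:

```python
import pandas as pd

def melt_lookup(df, row_labels, col_labels):
    # Reshape to long format: one row per (row label, column label, value).
    long = df.reset_index().melt(id_vars="index", var_name="col")
    keyed = long.set_index(["index", "col"])["value"]
    # Select the requested (row, column) pairs in order.
    return keyed.loc[list(zip(row_labels, col_labels))].to_numpy()

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(melt_lookup(df, [0, 1], ["b", "a"]))  # ['x' 2]
```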
> Do we have other places in our API where we return a NumPy array?

``factorize``

> With the prevalence of the Arrow type system this doesn't seem desirable to be locked into returning a NumPy array

This function can be operating on multiple columns of different dtypes. I think the only option in such a case is to return a NumPy array.
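For context, a quick demonstration of why mixed columns force a common array type:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
# object is the only NumPy dtype that can hold both columns' values.
print(df.to_numpy().dtype)  # object
```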
That's true on factorize but that isn't 100% an equivalent comparison. For sure the indexer is a numpy array, but the values in the two-tuple are an Index that should be type-preserving.
That's also a great point on the mixed column types, but that makes me wary of re-implementing this function. With all of the work going towards clarifying our nullability handling and implementing more than just NumPy types, it seems like this function is going to have a ton of edge cases
We could also wrap the result in a ``Series``.
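A one-off sketch of that suggestion, assuming the requested row labels are the natural index for the result:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
row_labels, col_labels = [0, 1], ["b", "a"]
rows = df.index.get_indexer(row_labels)
cols = df.columns.get_indexer(col_labels)
# Wrap the looked-up values in a Series keyed by the requested row labels.
result = pd.Series(df.to_numpy()[rows, cols], index=row_labels)
print(result)
```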
Trying to make sure I understand correctly: this seems equivalent to ``df.loc[rows, cols]``?
Hi @jbrockmendel -- ``df.loc[rows, cols]`` returns all columns for all rows. Lookup only returns the values at paired columns and rows.
That makes sense, thanks. So more of a …
I think the best analogue from within pandas is a for loop of ``.at[]``.
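A minimal sketch of that analogue; ``lookup_via_at`` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def lookup_via_at(df, row_labels, col_labels):
    # One scalar .at[] access per (row, column) pair: the semantics of
    # lookup, without any bulk conversion of the frame. dtype=object keeps
    # mixed values from being coerced to strings.
    return np.array([df.at[r, c] for r, c in zip(row_labels, col_labels)],
                    dtype=object)

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(lookup_via_at(df, [0, 1], ["b", "a"]))  # ['x' 2]
```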
Overall I am -1 on adding this back in. I think the utility of this function is limited in the general case of non-homogeneous dataframes.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
I am still interested. @rhshadrach, what's the right next step?
@stevenae - I think we need agreement among core devs on whether this should be supported. While I'm sympathetic to users who found this useful prior to its removal from pandas, there are a few arguments against which I find compelling.

- Other DataFrame libraries do not offer such a method (to my knowledge).
- The implementation can be achieved using existing functionality with what I measure (see below) as a 30% decrease in performance in non-homogeneous cases, and a 400% increase in performance in the homogeneous case.
- Methods that coerce to object dtype when used on non-homogeneous DataFrames are something that I would like to see less of in the built-in methods of pandas, not more. Here it's my opinion not that users shouldn't be able to do it, but that we should avoid it being built-in to pandas.
For the benchmark in bullet 2, I ran the code in #61185 (comment) with the following modification of ``pd_lookup``:

```python
def pd_lookup(df, row_labels, col_labels):
    df = df.loc[:, sorted(set(col_labels))]
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result
```
Understood! Should I put together a recipe for the documentation then? Since it seems there's indeed a 30% performance improvement to be had when indexing heterogeneous columns.
@stevenae - yes, I think that would be uncontroversial.
Okay! Will do in the next couple of weeks.
@rhshadrach doc update is at #61471
Optimization notes:
The most important change is the removal of:

``if not self._is_mixed_type or n > thresh``

The old implementation slowed down when ``n < thresh``, with or without mixed types. Cases with ``n < thresh`` are now 10x faster.

The logic can be followed via Python operator precedence:
https://docs.python.org/3/reference/expressions.html#operator-precedence
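To spell out the precedence point: ``not`` binds tighter than ``or``, so the removed guard parsed as shown below.

```python
# The removed condition parsed as (not self._is_mixed_type) or (n > thresh),
# i.e. "take the fast path for homogeneous frames, OR whenever n exceeds
# the threshold".
mixed, n, thresh = True, 50, 100
assert (not mixed or n > thresh) == ((not mixed) or (n > thresh))
```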
Test notes:

I am unfamiliar with pytest and did not add parametrization.