ENH: Reimplement DataFrame.lookup #61185
Conversation
np.random.Generator.random, not np.random.Generator
stevenae commented Mar 27, 2025 (edited)
I tested out three variants of subsetting the dataframe before converting to numpy:
Optimization testing script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)
for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
```
As a result of this testing I settled on the third option.
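The three variants themselves are not shown in the thread. A hedged sketch of what column subsetting before conversion might look like; these specific variants are assumptions for illustration, not the PR's actual code:

```python
import pandas as pd

df = pd.DataFrame({"a": range(5), "b": range(5), "c": list("xyzxy")})
wanted = ["a", "b"]

v1 = df[wanted].to_numpy()                                  # plain label selection
v2 = df.loc[:, wanted].to_numpy()                           # .loc column slice
v3 = df.iloc[:, df.columns.get_indexer(wanted)].to_numpy()  # positional take
```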
rhshadrach left a comment (edited)
If we are to move forward, this looks good; it should get a whatsnew entry in enhancements for 3.0.
rhshadrach left a comment (edited)
Edit: Fixed link below
Is the implementation in #40140 (comment) not sufficient?
```python
size = 100_000
df = pd.DataFrame({'a': np.random.randint(0, 100, size),
                   'b': np.random.random(size),
                   'c': 'x'})
row_labels = np.repeat(np.arange(size), 2)
col_labels = np.tile(['a', 'b'], size)
%timeit df.lookup(row_labels, col_labels)
# 22.3 ms ± 391 μs per loop (mean ± std. dev. of 7 runs, 10 loops each)  <--- this PR
# 13.4 ms ± 17 μs per loop (mean ± std. dev. of 7 runs, 100 loops each) <--- proposed implementation
```
The implementation
``pandas.DataFrame.lookup``
Done.
rhshadrach commented Mar 28, 2025 (edited)
@stevenae - sorry, linked to the wrong comment. I've fixed my comment above. Ah, but I think I see: this avoids a large copy when only certain columns are used.
cc @pandas-dev/pandas-core My take: this provides an implementation for what I think is a natural operation that is not straightforward for most users. It provides performance benefits that take into account columnar-based storage (subsetting columns prior to calling ``to_numpy``).
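A minimal sketch of the copy-avoidance idea being described, assuming a wide frame where only two columns are looked up; subsetting first means ``to_numpy`` only materializes the needed columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.random((1_000, 50)),
                  columns=[f"c{i}" for i in range(50)])
needed = ["c0", "c1"]

full = df.to_numpy()           # materializes all 50 columns
small = df[needed].to_numpy()  # copies only the two columns actually used
print(full.shape, small.shape)  # (1000, 50) (1000, 2)
```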
Yes -- I ran a comparison (script at end) and found this PR implementation beats the comment you referenced on large mixed-type lookups.

Metrics
Script:

```python
import pandas as pd
import numpy as np
import timeit

np.random.seed(43)

def pd_lookup(df, row_labels, col_labels):
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result

for n in [100, 100_000]:
    for k in range(2, 6):
        print(k, n)
        cols = list('abcdef')
        df = pd.DataFrame(np.random.randint(0, 10, size=(n, len(cols))), columns=cols)
        df['col'] = np.random.choice(cols, n)
        sample_n = n // 10
        idx = np.random.choice(df['col'].index.to_numpy(), sample_n)
        cols = np.random.choice(df['col'].to_numpy(), sample_n)
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
        str_col = cols[0]
        df[str_col] = df[str_col].astype(str)
        df[str_col] = str_col
        timeit.timeit(lambda: df.drop(columns='col').lookup(idx, cols), number=1000)
        timeit.timeit(lambda: pd_lookup(df.drop(columns='col'), idx, cols), number=1000)
```
pandas/core/frame.py (outdated)

```python
Returns
-------
numpy.ndarray
    The found values.
"""
```
I think it would be really useful to have an example here in the docs for the API.
Added, please take a look.
expanded example
Nice example. I will let the other pandas developers handle the rest of the PR
```diff
-and column labels, this can be achieved by ``pandas.factorize`` and NumPy indexing.
-For instance:
+and column labels, and the ``lookup`` method allows for this and returns a
+NumPy array. For instance:
```
Do we have other places in our API where we return a NumPy array? With the prevalence of the Arrow type system this doesn't seem desirable to be locked into returning a NumPy array.
It looks like ``values`` also does this.
Agreed, I think this API should return an ``ExtensionArray`` or NumPy array depending on the initial type or result type.
``values`` only returns a NumPy array for NumPy types. For extension types or arrow-backed types you get something different:

```python
>>> pd.Series([1, 2, 3], dtype="int64[pyarrow]").values
<ArrowExtensionArray>
[1, 2, 3]
Length: 3, dtype: int64[pyarrow]
```
I don't think we should force a NumPy array return here; particularly for string data, that could be non-performant and expensive
Thought this through and did a bit more of a heavy-handed rewrite. Now using ``melt`` to achieve the outcome of ``values`` or ``to_numpy``. Performance does take a hit; however, we are still outperforming the naive lookup of ``to_numpy`` for mixed-type lookups.
| k, n | Old PR (s) | New PR (s) |
|---|---|---|
| 2, 100 (homogeneous) | 0.1964133749715984 | 0.5150299999950221 |
| 2, 100 (mixed) | 0.274302874924615 | 0.5055611249990761 |
| 3, 100 (homogeneous) | 0.15044220816344023 | 0.48040162499819417 |
| 3, 100 (mixed) | 0.2768622918520123 | 0.5237024579982972 |
| 4, 100 (homogeneous) | 0.15489325020462275 | 0.49075670799356885 |
| 4, 100 (mixed) | 0.26732829213142395 | 0.5079907500039553 |
| 5, 100 (homogeneous) | 0.1546538749244064 | 0.4678692500019679 |
| 5, 100 (mixed) | 0.2721201251260936 | 0.5082256250025239 |
| 2, 100000 (homogeneous) | 0.8096102089621127 | 2.114792499996838 |
| 2, 100000 (mixed) | 1.9508202918805182 | 2.619460332993185 |
| 3, 100000 (homogeneous) | 0.8242515418678522 | 2.2221941250027157 |
| 3, 100000 (mixed) | 1.9535491249989718 | 2.6292148750071647 |
| 4, 100000 (homogeneous) | 0.8302762501407415 | 2.3314981659932528 |
| 4, 100000 (mixed) | 1.9240409170743078 | 2.711707041991758 |
| 5, 100000 (homogeneous) | 0.8654224998317659 | 2.201970291993348 |
| 5, 100000 (mixed) | 2.0630989999044687 | 2.674396375005017 |
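The melt-based implementation itself is not shown in the thread. A rough sketch of how ``melt`` could be used for a lookup, assuming unique row labels and default ``melt`` column names; ``melt_lookup`` is a hypothetical name, not the PR's code:

```python
import pandas as pd

def melt_lookup(df, row_labels, col_labels):
    # Reshape to long format: one row per (row label, column label, value).
    long = df.reset_index().melt(id_vars="index", var_name="col")
    keyed = long.set_index(["index", "col"])["value"]
    # Select the requested (row, column) pairs in order.
    return keyed.loc[list(zip(row_labels, col_labels))].to_numpy()

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(melt_lookup(df, [0, 1], ["b", "a"]))  # ['x' 2]
```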
> Do we have other places in our API where we return a NumPy array?

``factorize``

> With the prevalence of the Arrow type system this doesn't seem desirable to be locked into returning a NumPy array

This function can be operating on multiple columns of different dtypes. I think the only option in such a case is to return a NumPy array.
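For context, a quick demonstration of why mixed columns force a common array type:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
# object is the only NumPy dtype that can hold both columns' values.
print(df.to_numpy().dtype)  # object
```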
That's true on factorize but that isn't 100% an equivalent comparison. For sure the indexer is a numpy array, but the values in the two-tuple are an Index that should be type-preserving.
That's also a great point on the mixed column types, but that makes me wary of re-implementing this function. With all of the work going towards clarifying our nullability handling and implementing more than just NumPy types, it seems like this function is going to have a ton of edge cases
We could also wrap the result in a ``Series``.
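A one-off sketch of that suggestion, assuming the requested row labels are the natural index for the result:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
row_labels, col_labels = [0, 1], ["b", "a"]
rows = df.index.get_indexer(row_labels)
cols = df.columns.get_indexer(col_labels)
# Wrap the looked-up values in a Series keyed by the requested row labels.
result = pd.Series(df.to_numpy()[rows, cols], index=row_labels)
print(result)
```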
Trying to make sure I understand correctly: this seems equivalent to ``df.loc[rows, cols]``?
Hi @jbrockmendel -- ``df.loc[rows, cols]`` returns all columns for all rows. Lookup only returns the values at paired columns and rows.
That makes sense, thanks. So more of a …
I think the best analogue from within pandas is a for loop of ``.at[]``.
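A minimal sketch of that analogue; ``lookup_via_at`` is a hypothetical helper name:

```python
import numpy as np
import pandas as pd

def lookup_via_at(df, row_labels, col_labels):
    # One scalar .at[] access per (row, column) pair: the semantics of
    # lookup, without any bulk conversion of the frame. dtype=object keeps
    # mixed values from being coerced to strings.
    return np.array([df.at[r, c] for r, c in zip(row_labels, col_labels)],
                    dtype=object)

df = pd.DataFrame({"a": [1, 2], "b": ["x", "y"]})
print(lookup_via_at(df, [0, 1], ["b", "a"]))  # ['x' 2]
```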
Overall I am -1 on adding this back in. I think the utility of this function is limited in the general case of non-homogeneous dataframes.
This pull request is stale because it has been open for thirty days with no activity. Please update and respond to this comment if you're still interested in working on this.
I am still interested. @rhshadrach, what's the right next step?
@stevenae - I think we need agreement among core devs on whether this should be supported. While I'm sympathetic to users who found this useful prior to its removal from pandas, there are a few arguments against which I find compelling.

- Other DataFrame libraries do not offer such a method (to my knowledge).
- The implementation can be achieved using existing functionality with what I measure (see below) as a 30% decrease in performance in non-homogeneous cases, and a 400% increase in performance in the homogeneous case.
- Methods that coerce to object dtype when used on non-homogeneous DataFrames are something that I would like to see less of in the built-in methods of pandas, not more. Here it's my opinion not that users shouldn't be able to do it, but that we should avoid it being built-in to pandas.
For the benchmark in bullet 2, I ran the code in #61185 (comment) with the following modification of ``pd_lookup``:

```python
def pd_lookup(df, row_labels, col_labels):
    df = df.loc[:, sorted(set(col_labels))]
    rows = df.index.get_indexer(row_labels)
    cols = df.columns.get_indexer(col_labels)
    result = df.to_numpy()[rows, cols]
    return result
```
Understood! Should I put together a recipe for the documentation then? Since it seems there's indeed a 30% performance improvement to be had when indexing heterogeneous columns.
@stevenae - yes, I think that would be uncontroversial.
Okay! Will do in the next couple of weeks.
@rhshadrach doc update is at #61471
Optimization notes:
The most important change is the removal of:

``if not self._is_mixed_type or n > thresh``

The old implementation slowed down when ``n < thresh``, with or without mixed types. Cases with ``n < thresh`` are now 10x faster.

The logic can be followed via Python operator precedence:
https://docs.python.org/3/reference/expressions.html#operator-precedence
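To spell out the precedence point: ``not`` binds tighter than ``or``, so the removed guard parsed as shown below.

```python
# The removed condition parsed as (not self._is_mixed_type) or (n > thresh),
# i.e. "take the fast path for homogeneous frames, OR whenever n exceeds
# the threshold".
mixed, n, thresh = True, 50, 100
assert (not mixed or n > thresh) == ((not mixed) or (n > thresh))
```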
Test notes:

I am unfamiliar with pytest and did not add parametrization.