Uh oh!
There was an error while loading.Please reload this page.
- Notifications
You must be signed in to change notification settings - Fork18.5k
BUG: Raise MergeError when suffixes result in duplicate column names …#61422
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
base:main
Are you sure you want to change the base?
Conversation
fe71e1a
toc9bfc5a
Comparepre-commit.ci autofix |
dc1bb47
toad744df
CompareThanks for your contribution! Just a quick note: you don't need to write
For reference, you can check the contributing guidelines here:https://pandas.pydata.org/docs/development/contributing_codebase.html#documenting-your-code |
Thanks for the pointers. I'll get those added in here soon. Trying to track down why the Unit Tests / Linux-32-bit(pull_request) is failing. I didn't change anything that should have effected Series, so it's kinda weird. I also can't get the pytest to run normally on my dev yet either, so I haven't been able to fully replicate the failure locally yet. So, still a little more work to do here. |
@Farsidetfs I believe the CI failure is not related to your changes. It appears to be caused by the cython version — pandas unit tests fail with |
nikaltipar left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Added a few comments from a bird's eye view. Thanks for the change.
pandas/core/reshape/merge.py Outdated
iflen(left_collisions)>0: | ||
raiseMergeError( | ||
"Passing 'suffixes' which cause duplicate columns " | ||
f"{set(left_collisions)} is not allowed" | ||
) | ||
iflen(right_collisions)>0: | ||
raiseMergeError( | ||
"Passing 'suffixes' which cause duplicate columns " | ||
f"{set(right_collisions)} is not allowed" | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
I would recommend you combine this into a common error to reduce repetition (bonus points for combining it with the pre-existing error just a few lines below)
pandas/core/reshape/merge.py Outdated
# Check for duplicates created by suffixes | ||
left_collisions=llabels.intersection(right.difference(to_rename)) | ||
right_collisions=rlabels.intersection(left.difference(to_rename)) | ||
iflen(left_collisions)>0: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Doesn't justif not left_collisions.empty:
work? Same for a similar check below.
pandas/core/reshape/merge.py Outdated
@@ -3058,6 +3058,20 @@ def renamer(x, suffix: str | None): | |||
llabels=left._transform_index(lrenamer) | |||
rlabels=right._transform_index(rrenamer) | |||
# Check for duplicates created by suffixes |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
For new readers of this code, the comment might not be descriptive enough. While your code is supposed to find suffixes that are caused by duplicated would-be created columns across dataframes, there is an extra section that does a duplicate checking just below your new code (but would-be created columns due to duplicity within the same dataframe).
ad744df
to9a40cd0
Comparepre-commit.ci autofix |
e9efd0d
to1476957
Comparepre-commit.ci autofix |
cb23817
to6b82c85
Comparepre-commit.ci autofix |
5aff565
toe62c70f
Comparepre-commit.ci autofix |
@nikaltipar I think this should be ready now. Please let me know if I've missed anything. I took your advice and combined the two with slight modifications to improve efficiency using sets throughout rather than just convert at the end. |
nikaltipar commentedMay 14, 2025
Thanks for taking care of that,@Farsidetfs ! It looks good to me, no other comments from my side. Thanks for adding the unit-tests, too! |
@nikaltipar Could you rebase main branch to trigger CI again? |
nikaltipar commentedMay 14, 2025
I am not able to, I'll have to wait for@Farsidetfs |
7d3837a
toc377804
CompareRebase complete. Thanks. Let me know if there's anything else needed. |
nikaltipar commentedMay 15, 2025
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Looking good!
doc/source/whatsnew/v2.3.0.rst Outdated
@@ -170,7 +170,7 @@ Groupby/resample/rolling | |||
Reshaping | |||
^^^^^^^^^ | |||
- | |||
- Bug in:meth:`DataFrame.merge` where user-provided suffixes could result in duplicate column names if the resulting names matched existing columns. Now raises a:class:`MergeError` in such cases. (:issue:`61402`) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you move this note to v3.0.0
df1=DataFrame({"col1": [1],"col2": [2]}) | ||
df2=DataFrame({"col1": [1],"col2": [2],"col2_dup": [3]}) | ||
withpytest.raises(MergeError,match="duplicate columns"): | ||
merge(df1,df2,on="col1",suffixes=("_dup","")) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you parametrize this test with
@pytest.mark.parametrize("suffixes", [("_dup", ""), ("", "_dup")])
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
@rhshadrach Done. Please let me know if I missed anything.
b7b3a95
tobc62afe
Comparepandas/core/reshape/merge.py Outdated
@@ -3058,17 +3058,19 @@ def renamer(x, suffix: str | None): | |||
llabels=left._transform_index(lrenamer) | |||
rlabels=right._transform_index(rrenamer) | |||
dups=[] | |||
dups=set() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you keep this as alist
and do the uniquenessset
call whendups
is not empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Can you elaborate why?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Small performance optimization - if there are no duplicate labels (which is probably the most common case) we are doing unnecessary work hashing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
Got it. I was thinking set was more efficient for operations, but since this should be a very rare case, it is wasted. I'll make the adjustment. Thanks for the clarity.
5b816ce
to9bc2b66
Compare
Uh oh!
There was an error while loading.Please reload this page.
closes#61402
Allcode checks passed.
Addedtype annotations to new arguments/methods/functions.
Added an entry in the latest doc/source/whatsnew/vX.X.X.rst file if fixing a bug or adding a new feature.