- Notifications
You must be signed in to change notification settings - Fork6.6k
[Data] add reset pandas index when merge sorted blocks#45326
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Uh oh!
There was an error while loading.Please reload this page.
Conversation
MissiontoMars commentedMay 14, 2024
Hi@tespent , if you could add a test for this / rebase, that'd be appreciated. |
Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>
07e4b54
to72c2c5d
Comparedef test_index_reset(ray_start_regular_shared): | ||
# pandas blocks should have index reset after processing | ||
import ray | ||
ds = ray.data.range(100, override_num_blocks=10) | ||
ds = ds.map_batches( | ||
lambda df: df.assign(val=(50.6 - df["id"]).abs(), idx=-df["id"]).set_index( | ||
"idx" | ||
), | ||
batch_format="pandas", | ||
) | ||
# ensure index is reset after map | ||
for df in ray.get(ds.get_internal_block_refs()): | ||
assert (df.index == df.reset_index(drop=True).index).all() | ||
ds = ds.sort("val") | ||
# ensure index is reset after sort | ||
for df in ray.get(ds.get_internal_block_refs()): | ||
assert (df.index == df.reset_index(drop=True).index).all() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
This test checks for implementation-- can you instead provide a test for a data pipeline that was breaking before this change?
This pull request has been automatically marked as stale because it has not had You can always ask for help on ourdiscussion forum orRay's public slack channel. If you'd like to keep this open, just leave any comment, and the stale label will be removed. |
This pull request has been automatically closed because there has been no more activity in the 14 days Please feel free to reopen or open a new pull request if you'd still like this to be addressed. Again, you can always ask for help on ourdiscussion forum orRay's public slack channel. Thanks again for your contribution! |
Uh oh!
There was an error while loading.Please reload this page.
Why are these changes needed?
The index of pandas blocks are reset after all block processing, but not for sort currently. This resulted in a lot of other code being broken. For example in get_bounds of random access dataset, it uses the index to get the first and the last row (L286), but without resetting index after sort, it gets random rows.
ray/python/ray/data/random_access_dataset.py
Lines 283 to 289 in9472ae2
Related issue number
Checks
git commit -s
) in this PR.scripts/format.sh
to lint the changes in this PR.method in Tune, I've added it in
doc/source/tune/api/
under thecorresponding
.rst
file.