Movatterモバイル変換


[0]ホーム

URL:


Skip to content

Navigation Menu

Search code, repositories, users, issues, pull requests...

Provide feedback

We read every piece of feedback, and take your input very seriously.

Saved searches

Use saved searches to filter your results more quickly

Sign up
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.

Already on GitHub?Sign in to your account

[Data] add reset pandas index when merge sorted blocks#45326

Open
tespent wants to merge1 commit intoray-project:master
base:master
Choose a base branch
Loading
fromtespent:fix/pandas-sorted-index

Conversation

tespent
Copy link
Contributor

@tespenttespent commentedMay 14, 2024
edited
Loading

Why are these changes needed?

The index of pandas blocks are reset after all block processing, but not for sort currently. This resulted in a lot of other code being broken. For example in get_bounds of random access dataset, it uses the index to get the first and the last row (L286), but without resetting index after sort, it gets random rows.

def_get_bounds(block,key):
iflen(block)==0:
returnNone
b= (block[key][0],block[key][len(block)-1])
ifisinstance(block,pa.Table):
b= (b[0].as_py(),b[1].as_py())
returnb

Related issue number

Checks

  • I've signed off every commit(by using the -s flag, i.e.,git commit -s) in this PR.
  • I've runscripts/format.sh to lint the changes in this PR.
  • I've included any doc changes needed forhttps://docs.ray.io/en/master/.
    • I've added any new APIs to the API Reference. For example, if I added a
      method in Tune, I've added it indoc/source/tune/api/ under the
      corresponding.rst file.
  • I've made sure the tests are passing. Note that there might be a few flaky tests, see the recent failures athttps://flakey-tests.ray.io/
  • Testing Strategy
    • Unit tests
    • Release tests
    • This PR is not tested :(

@MissiontoMars
Copy link

@raulchen

@anyscalesamanyscalesam added triageNeeds triage (eg: priority, bug/not-bug, and owning component) dataRay Data-related issues labelsMay 29, 2024
@anyscalesamanyscalesam added P1Issue that should be fixed within a few weeks and removed triageNeeds triage (eg: priority, bug/not-bug, and owning component) labelsMay 30, 2024
@richardliaw
Copy link
Contributor

Hi@tespent , if you could add a test for this / rebase, that'd be appreciated.

@richardliawrichardliaw added the @external-author-action-requiredAlternate tag for PRs where the author doesn't have labeling permission. labelMar 18, 2025
Signed-off-by: Wu Yufei <wuyufei.2000@bytedance.com>
@tespenttespentforce-pushed thefix/pandas-sorted-index branch from07e4b54 to72c2c5dCompareMarch 18, 2025 07:25
@tespenttespent requested a review froma team as acode ownerMarch 18, 2025 07:25
Comment on lines +447 to +465
def test_index_reset(ray_start_regular_shared):
# pandas blocks should have index reset after processing
import ray

ds = ray.data.range(100, override_num_blocks=10)
ds = ds.map_batches(
lambda df: df.assign(val=(50.6 - df["id"]).abs(), idx=-df["id"]).set_index(
"idx"
),
batch_format="pandas",
)
# ensure index is reset after map
for df in ray.get(ds.get_internal_block_refs()):
assert (df.index == df.reset_index(drop=True).index).all()

ds = ds.sort("val")
# ensure index is reset after sort
for df in ray.get(ds.get_internal_block_refs()):
assert (df.index == df.reset_index(drop=True).index).all()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others.Learn more.

This test checks for implementation-- can you instead provide a test for a data pipeline that was breaking before this change?

@hainesmichaelchainesmichaelc added the community-contributionContributed by the community labelApr 4, 2025
Sign up for freeto join this conversation on GitHub. Already have an account?Sign in to comment
Reviewers

@richardliawrichardliawrichardliaw left review comments

@ericlericlAwaiting requested review from ericl

@scv119scv119Awaiting requested review from scv119

@c21c21Awaiting requested review from c21

@amogkamamogkamAwaiting requested review from amogkam

@scottjleescottjleeAwaiting requested review from scottjlee

@bveeramanibveeramaniAwaiting requested review from bveeramani

@raulchenraulchenAwaiting requested review from raulchen

@stephanie-wangstephanie-wangAwaiting requested review from stephanie-wang

@omatthew98omatthew98Awaiting requested review from omatthew98

At least 1 approving review is required to merge this pull request.

Assignees

@richardliawrichardliaw

Labels
community-contributionContributed by the communitydataRay Data-related issues@external-author-action-requiredAlternate tag for PRs where the author doesn't have labeling permission.P1Issue that should be fixed within a few weeks
Projects
None yet
Milestone
No milestone
Development

Successfully merging this pull request may close these issues.

6 participants
@tespent@MissiontoMars@richardliaw@c21@hainesmichaelc@anyscalesam

[8]ページ先頭

©2009-2025 Movatter.jp