huggingface/dataset-viewer (Public)

[Rows] sub-rowgroup loading using libviewer #3213

Draft

lhoestq wants to merge 3 commits into main from libviewer-in-rows

lhoestq (Member) commented Jul 7, 2025 • edited

Continuation of #3199.

Pretty important PR since sub-rowgroup loading lets us:

- load Pandas datasets that are >300MB, or Parquet files from PyArrow/Spark whose row groups are >300MB
- increase the row group size in `datasets` for better Xet deduplication

TODO:

- create page index when required (row groups > 300MB)
  - it's a richer Parquet metadata file than the one from the "config-parquet-metadata" job, and it replaces the existing metadata file
- load rows using page index when required

It works using page pruning from arrow-rs.
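To illustrate the idea behind page pruning, here is a minimal pure-Python sketch. It is not the libviewer or arrow-rs implementation. It assumes a single column in one row group, where the Parquet page index gives the first row index of each page; `select_pages` and its parameters are hypothetical names used only for this example.

```python
from bisect import bisect_right


def select_pages(first_row_indexes, num_rows, offset, length):
    """Return the indexes of the pages overlapping rows [offset, offset + length).

    first_row_indexes: sorted first row index of each page in the row group
                       (hypothetical input, as found in a Parquet offset index).
    num_rows:          total number of rows in the row group.
    offset, length:    requested row slice, relative to the row group (offset >= 0).
    """
    if length <= 0 or offset >= num_rows:
        return []
    end = min(offset + length, num_rows)  # exclusive end of the slice
    # Page containing the first requested row.
    first_page = bisect_right(first_row_indexes, offset) - 1
    # Page containing the last requested row.
    last_page = bisect_right(first_row_indexes, end - 1) - 1
    return list(range(first_page, last_page + 1))


# A row group of 400 rows split into 4 pages of 100 rows each.
pages = select_pages([0, 100, 200, 300], num_rows=400, offset=150, length=100)
print(pages)
```

Only the selected pages need to be fetched and decoded, so a slice of a few rows no longer requires reading a whole row group that may exceed 300MB.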

kszucs and others added 3 commits, June 10, 2025 12:02

- feat: primitive parquet reader with page pruning (@kszucs, 2dcbb2b)
- add poetry build for libviewer (@lhoestq, 7dbd7d8)
- add libviewer to rows (@lhoestq, f6e88f0)

lhoestq mentioned this pull request

feat: primitive parquet reader with page pruning #3199

Draft

2 participants

@lhoestq

@kszucs
