IntelPython/scikit-learn_bench

feat(datasets): use scikit-learn datasets downloader #201

Draft

homksei wants to merge 1 commit into IntelPython:main from homksei:feat-datasets-verify

Conversation

homksei (Contributor) commented on Nov 26, 2025

Description

This PR updates the dataset downloading mechanism to ensure data integrity by implementing SHA256 checksum verification. It replaces the custom `retrieve` function with `sklearn.datasets._base.fetch_file`.

Changes:

sklbench/datasets/downloaders.py:
- Modified `download_and_read_csv` to accept a tuple containing `(filename, url, sha256)` instead of a raw URL.
- Replaced the local `retrieve` function with `sklearn.datasets._base.fetch_file` to handle downloads and hash validation.
- Added logging for download operations.
sklbench/datasets/loaders.py:
- Updated all dataset loading functions (e.g., `load_airline_depdelay`, `load_hepmass`, `load_higgs`, `load_sift`, etc.) to provide the specific filename, base URL, and corresponding SHA256 hash.
- Refactored `load_ann_dataset_template` to support the new metadata structure.
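
The checksum step described in the changes above can be sketched roughly as follows. This is not the code from this PR and not scikit-learn's implementation; it is a minimal standard-library illustration of verifying a downloaded file against an expected SHA256 digest. The names `DatasetFile`, `sha256_of_file`, and `verify_download` are hypothetical and introduced only for this example.

```python
import hashlib
import tempfile
from pathlib import Path
from typing import NamedTuple


class DatasetFile(NamedTuple):
    """Hypothetical container mirroring the (filename, url, sha256) tuple."""

    filename: str
    url: str
    sha256: str


def sha256_of_file(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Chunked reads keep memory use flat for multi-gigabyte datasets.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, expected_sha256):
    """Raise ValueError if the file's digest does not match the expected one."""
    actual = sha256_of_file(path)
    if actual != expected_sha256.lower():
        raise ValueError(
            f"SHA256 mismatch for {path}: expected {expected_sha256}, got {actual}"
        )
    return Path(path)
```

A loader would call something like `verify_download(local_path, source.sha256)` after the download completes and before parsing, so that a corrupted or modified file fails early instead of producing wrong benchmark inputs.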

Motivation:
To prevent the usage of corrupted or tampered data files and to standardize the downloading logic using scikit-learn's internal utilities.

Checklist:

Completeness and readability

I have commented my code, particularly in hard-to-understand areas.
I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
I have resolved any merge conflicts that might occur with the base branch.

Testing

I have run it locally and tested the changes extensively.
All CI jobs are green or I have provided justification why they aren't.
I have extended testing suite if new functionality was introduced in this PR.

Commit 13b11d9 by homksei: feat(datasets): use scikit-learn datasets downloader

david-cortes-intel (Contributor) commented on Nov 26, 2025

/intelci: run

david-cortes-intel requested review from ethanglaser and razdoburdin on November 26, 2025 15:57

david-cortes-intel (Contributor) commented on Nov 27, 2025

/intelci: run ml-benchmarks

Labels

None yet

2 participants

@homksei

@david-cortes-intel
