feat(datasets): use scikit-learn datasets downloader #201

Draft
homksei wants to merge 1 commit into IntelPython:main from homksei:feat-datasets-verify

Conversation

@homksei
Contributor

Description

This PR updates the dataset downloading mechanism to ensure data integrity by implementing SHA256 checksum verification. It replaces the custom retrieve function with sklearn.datasets._base.fetch_file.

Changes:

  • sklbench/datasets/downloaders.py:

    • Modified download_and_read_csv to accept a tuple containing (filename, url, sha256) instead of a raw URL.
    • Replaced the local retrieve function with sklearn.datasets._base.fetch_file to handle downloads and hash validation.
    • Added logging for download operations.
  • sklbench/datasets/loaders.py:

    • Updated all dataset loading functions (e.g., load_airline_depdelay, load_hepmass, load_higgs, load_sift, etc.) to provide the specific filename, base URL, and corresponding SHA256 hash.
    • Refactored load_ann_dataset_template to support the new metadata structure.
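
As an illustration of the integrity check described above, here is a minimal standard-library sketch. It is not the code from this PR, which delegates downloading and hash validation to sklearn.datasets._base.fetch_file. The names RemoteFile, sha256_of, and verify_download are hypothetical and exist only for this example. The sketch assumes the file is already on disk and that the expected digest is a hex string.

```python
import hashlib
from pathlib import Path
from typing import NamedTuple


class RemoteFile(NamedTuple):
    """Hypothetical container mirroring the (filename, url, sha256) tuple."""

    filename: str
    url: str
    sha256: str


def sha256_of(path, chunk_size=1 << 20):
    """Return the hex SHA256 digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        # Read fixed-size chunks so large datasets are never fully in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_download(path, remote):
    """Raise ValueError if the file at `path` does not match `remote.sha256`."""
    actual = sha256_of(path)
    if actual != remote.sha256.lower():
        raise ValueError(
            f"SHA256 mismatch for {remote.filename}: "
            f"expected {remote.sha256}, got {actual}"
        )
    return Path(path)
```

A loader would call verify_download right after the file is fetched and before it is parsed, so that a truncated or altered file fails loudly instead of producing silently wrong benchmark data.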

Motivation:
To prevent the usage of corrupted or tampered data files and to standardize the downloading logic using scikit-learn's internal utilities.


Checklist:

Completeness and readability

  • I have commented my code, particularly in hard-to-understand areas.
  • I have updated the documentation to reflect the changes or created a separate PR with updates and provided its number in the description, if necessary.
  • Git commit message contains an appropriate signed-off-by string (see CONTRIBUTING.md for details).
  • I have resolved any merge conflicts that might occur with the base branch.

Testing

  • I have run it locally and tested the changes extensively.
  • All CI jobs are green or I have provided justification why they aren't.
  • I have extended the testing suite if new functionality was introduced in this PR.

@david-cortes-intel
Contributor

/intelci: run

@david-cortes-intel
Contributor

/intelci: run ml-benchmarks


Reviewers

@razdoburdin: awaiting requested review

@ethanglaser: awaiting requested review

@david-cortes-intel: awaiting requested review (code owner; will be requested when the pull request is marked ready for review)

At least 1 approving review is required to merge this pull request.

2 participants

@homksei @david-cortes-intel
