
Megatron's multi-modal data loader


NVIDIA/Megatron-Energon


Megatron Energon



DISCLAIMER: This package contains research code. APIs may change.

What is this?

Megatron Energon is the multi-modal data loader of Megatron (you can also use it independently).

It's best at

  • loading large training data to train large multi-modal models
  • blending many different datasets together
  • distributing the work across many nodes and processes of a cluster
  • ensuring reproducibility and resumability
  • adapting easily to various types of data samples and processing

Try using it together with Megatron Core.

Quickstart

Megatron Energon is a pip-installable Python package that offers

  • dataset-related classes that you can import in your project
  • a command line utility for data preprocessing and conversion

This document is just a quick start. Please also check out the documentation.

Installation

To install the latest stable version:

pip install megatron-energon

Or to install the current development version:

pip install git+https://github.com/NVIDIA/Megatron-Energon.git

NOTE: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.

For more details on installing this package, see the documentation.

Usage of command line tool

After installation, the command energon will be available.

Here are some examples of things you can do:

| Command | Description |
| --- | --- |
| energon prepare DATASET_ROOT | Take an existing WebDataset and add the required yaml files to turn it into an energon-compatible dataset |
| energon lint DATASET_ROOT | Verify that the dataset complies with the energon dataset format and that all samples are loadable |
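If you want to run the lint step from a script, for example before launching a training job, a thin wrapper around the command line tool could look like the sketch below. The helper names build_lint_command and lint_dataset are hypothetical and not part of energon. The sketch also assumes that energon lint reports problems through a nonzero exit code, which this README does not state.

```python
import shutil
import subprocess


def build_lint_command(dataset_root):
    """Build the argument list for `energon lint DATASET_ROOT`."""
    return ["energon", "lint", str(dataset_root)]


def lint_dataset(dataset_root):
    """Run `energon lint` on dataset_root and return its exit code.

    Assumption: a nonzero exit code means the lint step found a problem.
    """
    if shutil.which("energon") is None:
        # The command is only available after the package has been installed.
        raise RuntimeError("energon not found on PATH; install megatron-energon first")
    completed = subprocess.run(build_lint_command(dataset_root), check=False)
    return completed.returncode
```

The command is passed as a list rather than a single string, so dataset paths containing spaces need no shell quoting.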

Usage of the library

To get started, pick a WebDataset-compliant dataset and run energon prepare DATASET_ROOT on it. This starts the interactive assistant and creates the .nv-meta folder.
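Before loading a dataset, it can be handy to check whether it has already been prepared. The sketch below only tests for the presence of the .nv-meta folder mentioned above. The helper name is_prepared is hypothetical, and the check says nothing about whether the folder's contents are valid (energon lint is the tool for that).

```python
from pathlib import Path

# Folder created by `energon prepare`, as described in this README.
META_DIR_NAME = ".nv-meta"


def is_prepared(dataset_root):
    """Return True if dataset_root contains a .nv-meta directory."""
    return (Path(dataset_root) / META_DIR_NAME).is_dir()
```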

Once done, try to load it from your Python program:

from megatron.energon import get_train_dataset, get_loader, WorkerConfig

simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

train_ds = get_train_dataset(
    '/my/dataset/path',
    batch_size=2,
    shuffle_buffer_size=None,
    max_samples_per_sequence=None,
    worker_config=simple_worker_config,
)

train_loader = get_loader(train_ds)

for batch in train_loader:
    # Do something with batch
    # Infer, gradient step, ...
    pass
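The example above hard-codes rank=0 and world_size=1, which fits a single process. In a multi-process job you would typically take these values from your launcher. The following is a minimal sketch, assuming the launcher exports the RANK and WORLD_SIZE environment variables (as torchrun does). The helper worker_config_kwargs_from_env is hypothetical and not part of energon; it only builds the keyword arguments and does not import the library.

```python
import os


def worker_config_kwargs_from_env(num_workers=2, env=None):
    """Build rank/world_size/num_workers keyword arguments from the environment.

    Assumption: the launcher sets RANK and WORLD_SIZE. If they are missing,
    fall back to the single-process values used in the example above.
    """
    env = os.environ if env is None else env
    rank = int(env.get("RANK", "0"))
    world_size = int(env.get("WORLD_SIZE", "1"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is outside the range [0, {world_size})")
    return {"rank": rank, "world_size": world_size, "num_workers": num_workers}
```

The result can then be passed on as WorkerConfig(**worker_config_kwargs_from_env()), using the same three arguments as in the example above.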

For more details, read the documentation.
