Megatron's multi-modal data loader
NVIDIA/Megatron-Energon
DISCLAIMER: This package contains research code. APIs may change.
Megatron Energon is the multi-modal data loader of Megatron (you can also use it independently).
It's best at
- loading large training data to train large multi-modal models
- blending many different datasets together
- distributing the work across many nodes and processes of a cluster
- ensuring reproducibility and resumability
- adapting easily to various types of data samples and processing
Try using it together with Megatron Core.
Megatron Energon is a pip-installable Python package that offers
- dataset-related classes that you can import in your project
- a command line utility for data preprocessing and conversion
This document is just a quick start. Please also check out the documentation.
To install the latest stable version:
pip install megatron-energon
Or to install the current development version:
pip install git+https://github.com/NVIDIA/Megatron-Energon.git
NOTE: We encourage you to install the package (and not just import a local copy). This will ensure you have all the needed dependencies and that you can use the command line tool.
For more details on installing this package, see here.
After installation, the command `energon` will be available.
Here are some examples of things you can do:
Command | Description |
---|---|
energon prepare DATASET_ROOT | Take an existing WebDataset and add the required yaml files to turn it into an energon-compatible dataset |
energon lint DATASET_ROOT | Verify that the dataset complies with the energon dataset format and that all samples are loadable |
To get started, pick a WebDataset-compliant dataset and run `energon prepare DATASET_ROOT` on it to launch the interactive assistant and create the `.nv-meta` folder.
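Before loading a dataset from Python, it can be handy to check that the preparation step has actually been run. The following is a minimal sketch, not part of the energon API: the helper name `is_prepared` is hypothetical, and it only checks that a `.nv-meta` folder exists under the dataset root. It does not validate the folder's contents; `energon lint DATASET_ROOT` remains the way to verify the dataset itself.

```python
from pathlib import Path


def is_prepared(dataset_root: str) -> bool:
    """Return True if `dataset_root` contains a `.nv-meta` folder.

    Hypothetical helper (not part of megatron-energon). It only tests for
    the folder's presence and says nothing about whether its contents
    are valid.
    """
    return (Path(dataset_root) / ".nv-meta").is_dir()


if __name__ == "__main__":
    # '/my/dataset/path' is the placeholder path used in this quick start.
    if not is_prepared("/my/dataset/path"):
        print("No .nv-meta folder found; run `energon prepare` first.")
```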
Once done, try to load it from your Python program:
    from megatron.energon import get_train_dataset, get_loader, WorkerConfig

    simple_worker_config = WorkerConfig(rank=0, world_size=1, num_workers=2)

    train_ds = get_train_dataset(
        '/my/dataset/path',
        batch_size=2,
        shuffle_buffer_size=None,
        max_samples_per_sequence=None,
        worker_config=simple_worker_config,
    )

    train_loader = get_loader(train_ds)

    for batch in train_loader:
        # Do something with batch
        # Infer, gradient step, ...
        pass
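The example above hard-codes `rank=0` and `world_size=1`, which fits a single process. For a multi-process run, those values usually come from the launcher. The sketch below is one possible way to derive them, under the assumption that the launcher exports `RANK` and `WORLD_SIZE` environment variables (as `torchrun` does). The helper name `worker_config_kwargs_from_env` is hypothetical and not part of energon; it only builds a plain dictionary with the same three keyword arguments used above.

```python
import os


def worker_config_kwargs_from_env(num_workers: int = 2) -> dict:
    """Build keyword arguments for WorkerConfig from environment variables.

    Assumption: the launcher sets RANK and WORLD_SIZE (torchrun convention).
    If they are missing, fall back to a single-process setup.
    """
    rank = int(os.environ.get("RANK", "0"))
    world_size = int(os.environ.get("WORLD_SIZE", "1"))
    if not 0 <= rank < world_size:
        raise ValueError(f"rank {rank} is out of range for world_size {world_size}")
    return {"rank": rank, "world_size": world_size, "num_workers": num_workers}


# Intended use with the quick-start code (requires megatron-energon):
#   worker_config = WorkerConfig(**worker_config_kwargs_from_env())
```

Keeping the environment parsing separate from `WorkerConfig` makes it easy to test without a cluster and to swap in a different launcher convention if yours uses other variable names.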
For more details, read the documentation.