Releases: NVIDIA/Megatron-Energon
7.1.0
cb7d4e0
What's Changed
- Add AV metadata fetching in #138
  - Adds method `AVDecoder.get_metadata` to retrieve all kinds of video/audio properties all at once
- Fix for uppercase extensions in #141
  - Now e.g. `*.WAV` files will be handled just like `*.wav`
- Add `version` to CLI and prepped dataset in #142
  - Helps track which energon version you have and which version was used to prepare a dataset
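The uppercase-extension fix can be pictured as normalizing the extension before looking up a decoder. The following is only a minimal sketch of that idea, not Energon's actual code: `normalized_extension`, `decoder_kind`, and the `DECODERS` table are hypothetical names introduced for illustration.

```python
from pathlib import PurePosixPath
from typing import Optional

# Hypothetical decoder table keyed by lowercase extension (illustration only).
DECODERS = {
    "wav": "audio",
    "flac": "audio",
    "mp4": "video",
}


def normalized_extension(name: str) -> str:
    """Return the lowercased extension of a file name, without the dot."""
    # PurePosixPath.suffix yields e.g. ".WAV"; lowercasing it makes the
    # subsequent lookup case-insensitive.
    return PurePosixPath(name).suffix.lstrip(".").lower()


def decoder_kind(name: str) -> Optional[str]:
    """Look up the decoder kind for a file name, ignoring extension case."""
    return DECODERS.get(normalized_extension(name))
```

With this normalization, `decoder_kind("clip.WAV")` and `decoder_kind("clip.wav")` resolve to the same table entry.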
Full Changelog: 7.0.0...7.1.0
7.0.0
272733d
New Features
- Introducing polylithic datasets that allow loading media data from auxiliary data sources on the fly
- Cache pools to prefetch data (especially from auxiliary datasets) in the background
- New AVDecoder based on fastseek for fast and selective video and audio decoding
- A new watchdog that shows you a trace in case sample processing is stuck
- Use `energon mount` to access your energon datasets through a virtual (FUSE) filesystem
- Much faster initialization even with millions of shards by using JSON instead of YAML
- Samples now carry their `__sources__` information along for simplified debugging
- You can now specify a custom sample decoder in your task encoder (to define how to decode each file extension)
- Improved documentation
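The watchdog idea from the list above (show a trace when sample processing is stuck) can be approximated with the standard library alone. This is a rough sketch under stated assumptions, not Energon's watchdog: the `watchdog` context manager is a hypothetical helper built on Python's `faulthandler` module, and how the real watchdog is implemented is not described in these notes.

```python
import faulthandler
import sys
import time
from contextlib import contextmanager


@contextmanager
def watchdog(timeout_s: float, file=sys.stderr):
    """Dump the tracebacks of all threads if the wrapped block takes longer
    than timeout_s seconds. `file` must be a real file with a fileno()."""
    # Arms a timer in a background watchdog thread managed by faulthandler.
    faulthandler.dump_traceback_later(timeout_s, repeat=False, file=file)
    try:
        yield
    finally:
        # Disarm the timer once the block finished (or raised).
        faulthandler.cancel_dump_traceback_later()


def process_sample(sample: int) -> int:
    """Stand-in for real sample processing (illustration only)."""
    time.sleep(0.01)
    return sample * 2


if __name__ == "__main__":
    with watchdog(timeout_s=5.0):
        print(process_sample(21))
```

The block only reports where the program is stuck; it does not interrupt the stuck call.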
Breaking Changes
ITERATION ORDER BREAKING CHANGE
- Energon now requires Python 3.10 or later
- Remove cooker condition functions (so we know cooker assignment without loading samples) in #139
- Removed the old `__subflavor__` in favor of the newer `__subflavors__`
Fixes
- Fix docs build for Python 3.12
- Fix a bug "tuple index out of range" in `self.slice_offsets` in #107
- Fix packing at dataset exhaustion in #115
- Fix Save - Restore indexing in #126
- Fix Batch base class in #131
- Fix absolute paths with protocol (i.e. msc://) in #135
New Contributors
- @Queuecumber made their first contribution in #112
Full Changelog: 6.0.1...7.0.0
6.0.1
77b3e76
New Features
- New dataset joining and pre-indexed joined tar iteration (#51, @philipp-fischer, @voegtlel)
- Allow restore with different worker configuration (refactor save/restore concept, worker dimension as outer dimension) (#80, @philipp-fischer, @voegtlel)
- Use EPath for all paths, removes fsspec (#62, @voegtlel)
- Simplify savable loader implementation (#87, @philipp-fischer, @voegtlel)
- Efficient Audio and Video decoding (#38, #93, @jon-barker, @voegtlel)
- Expose `prefetch_factor` arg for loader (#83, @philipp-fischer)
Fixes
- Fix len for `RepeatDataset` with float repeats (#89, @voegtlel)
- Fix `EPath` for relative local filesystem string paths (#95, @voegtlel)
- Fix `EPath.open` in `ITarReader` (#88, @shunjiad)
- Fix a rare bug in save/restore (#79, @philipp-fischer)
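To make the "len with float repeats" fix more concrete, here is one plausible reading of what a fractional repeat count means for a dataset's length. It is a sketch under assumptions: `FloatRepeat` is a hypothetical stand-in, not Energon's `RepeatDataset`, and the truncating rule `int(len(data) * repeats)` is assumed rather than taken from the release notes.

```python
from typing import Iterator, Sequence


class FloatRepeat:
    """Hypothetical wrapper that repeats a sequence a fractional number of times."""

    def __init__(self, data: Sequence, repeats: float) -> None:
        if repeats < 0:
            raise ValueError("repeats must be non-negative")
        self.data = data
        self.repeats = repeats

    def __len__(self) -> int:
        # Assumption: a fractional repeat contributes a truncated partial pass.
        return int(len(self.data) * self.repeats)

    def __iter__(self) -> Iterator:
        # Cycle through the data until exactly len(self) items were emitted,
        # so that len() and iteration always agree.
        n = len(self.data)
        for i in range(len(self)):
            yield self.data[i % n]
```

For example, four samples with `repeats=1.5` give one full pass plus the first two samples of a second pass.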
Internal Changes
- Toolchain upgrade: Drop black, isort and introduce ruff, uv and just (#67, @philipp-fischer)
Full Changelog: 5.2.0...6.0.1
5.2.0
a8d4894
What's Changed
New Features
- Add ability to use Multi-Storage-Client in Energon, new dependencies (@shunjiad, #50)
- Allow float for `epochized_blend` repetitions (@philipp-fischer, #56)
- Extend docs with example for interleaved data samples (@voegtlel, @philipp-fischer, #46)
Fixes
- `encode_batch` now called for grouped batching (@nvnbagrov, #64)
- GC: make `gc_collect_every_n_steps` configurable and add a new default for optimized speed (@philipp-fischer, @voegtlel, #66)
- Create docs for usage with parallelism and fix `save_state_global` and `restore_state_global` for Tensor Parallelism (@philipp-fischer, @voegtlel, #72)
- Preparing crude (@nvnbagrov, #55)
- Packing pause/burst when buffer is empty. Restore ability to fill the buffer as it is emitted (@voegtlel, #65)
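The `gc_collect_every_n_steps` setting mentioned above suggests collecting garbage at a fixed step interval instead of letting the collector run at arbitrary times. A minimal sketch of that pattern, assuming automatic collection is switched off while iterating: `iterate_with_periodic_gc` is a hypothetical helper, and only the parameter name is borrowed from the release notes (its default value is not stated there, so none is given here).

```python
import gc
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def iterate_with_periodic_gc(
    samples: Iterable[T], gc_collect_every_n_steps: int
) -> Iterator[T]:
    """Yield samples, running a full collection every n steps.

    Automatic collection is disabled while iterating so that GC pauses
    happen only at the chosen interval.
    """
    was_enabled = gc.isenabled()
    gc.disable()
    try:
        for step, sample in enumerate(samples, start=1):
            yield sample
            # A non-positive interval disables the periodic collection.
            if gc_collect_every_n_steps > 0 and step % gc_collect_every_n_steps == 0:
                gc.collect()
    finally:
        # Restore the previous collector state once iteration ends.
        if was_enabled:
            gc.enable()
```

A larger interval means fewer pauses but more memory held by uncollected cycles, which is the trade-off a configurable value exposes.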
Internal Changes
- Rename `prepare()` in metadatasets to `post_initialize()` (@philipp-fischer, #45)
- Refactored `prepare` (@nvnbagrov, #57)
- Fix restore key if there's an empty worker (@philipp-fischer, #69)
Full Changelog: 5.1.1...5.2.0
5.1.1
d8c48d1
5.1.0
1782f65
5.0.0
ea7b99c
What's Changed
- Implement epochs for blending (optionally giving the number of repetitions for each dataset for an epoch instead of sampling weight) (@voegtlel, @philipp-fischer, #41, CHECKPOINT BREAKING CHANGE)
- Implement grouped batching (e.g. for Open-Sora) (@voegtlel, #31)
- Fix distribution of samples to workers if using lots of small datasets (@philipp-fischer, #32, ITERATION ORDER BREAKING CHANGE)
- Improve and restructure documentation (@philipp-fischer, @voegtlel, #37)
- Activating `gc.freeze()` in workers on init to improve `gc.collect()` speed by a lot (@voegtlel, #40)
- Deprecated `SavableLoader.save_state` and `SavableLoader.restore_state`: Renamed to `save_state_global` and `restore_state_global`, and removed the option to not specify the `dst_rank` for saving (this is breaking but had no real use-case). Added docs for the scenarios. (@voegtlel, @philipp-fischer, #43)
- Fix size print for >1PiB (@nvnbagrov, #39)
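The `gc.freeze()` item above relies on a standard-library feature: objects frozen into the permanent generation are ignored by later collections, so `gc.collect()` has less to scan. A small sketch of how a worker init hook might use it; `worker_init` is a hypothetical function name, and where exactly Energon makes this call is not stated in these notes.

```python
import gc


def worker_init() -> int:
    """Hypothetical worker init hook: freeze everything allocated so far.

    Returns the number of objects now in the permanent generation.
    """
    # Collect first so that existing garbage is not frozen along with
    # the long-lived objects.
    gc.collect()
    # Move all currently tracked objects to the permanent generation;
    # subsequent gc.collect() calls skip them.
    gc.freeze()
    return gc.get_freeze_count()
```

Calling `gc.unfreeze()` moves the objects back into the oldest generation if the effect needs to be undone.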
Internal Changes
- All dataset wrappers now have the worker config (@philipp-fischer, #36)
- Check black, isort, license headers (@voegtlel, @philipp-fischer, #25)
4.0.0
26700bb
What's Changed
- Enable adding of additional data by joining another dataset by @voegtlel and @philipp-fischer in #20
- Replace the dataset type in the dataset.yaml by sample type directly by @voegtlel and @philipp-fischer in #29
Breaking Changes
- Dataset checkpoints from <4.0.0 will not be compatible due to the structural simplification. Everything else (e.g. randomness and the interface compatibility) should remain the same.
Full Changelog: 3.0.1...4.0.0
3.0.1
10c47c6
What's Changed
- This fixes `AttributeError: module 'fsspec' has no attribute 'asyn'`, see #26 by @philipp-fischer
Full Changelog: 3.0.0...3.0.1
3.0.0
62ea012
What's Changed
- Allow for reproducible scaling with different micro batch size in #11 by @philipp-fischer
- Introduce sequence packing and sample restore in #12 by @voegtlel and @philipp-fischer
- `energon info` command in #21 by @voegtlel
Full Changelog: 2.3.0...3.0.0
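Sequence packing, introduced in the list above, combines several short samples into one sequence of bounded length. The sketch below shows a generic greedy first-fit strategy for illustration; it is not Energon's packing implementation, and `pack_first_fit` is a hypothetical helper that works on sample lengths only.

```python
from typing import List, Sequence


def pack_first_fit(lengths: Sequence[int], max_len: int) -> List[List[int]]:
    """Greedily pack sample indices into bins whose total length is <= max_len.

    Each sample goes into the first bin that still has room; a new bin is
    opened when none fits.
    """
    bins: List[List[int]] = []
    remaining: List[int] = []  # free capacity of each bin
    for idx, length in enumerate(lengths):
        if length > max_len:
            raise ValueError(f"sample {idx} has length {length} > max_len {max_len}")
        for b, free in enumerate(remaining):
            if length <= free:
                bins[b].append(idx)
                remaining[b] -= length
                break
        else:
            # No existing bin had room: open a new one.
            bins.append([idx])
            remaining.append(max_len - length)
    return bins


if __name__ == "__main__":
    print(pack_first_fit([5, 3, 4, 2, 6], max_len=8))  # [[0, 1], [2, 3], [4]]
```

First-fit is simple and order-preserving within a bin, at the cost of sometimes leaving more padding than an optimal packing would.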