# Data Basics
## DataLoader

Read through link.
- Sampler: randomly chooses an index per iteration. It yields batches of indices when `batch_size` is not `None`.
  - For `IterableDataset`, it keeps yielding `None` per iteration using `_InfiniteConstantSampler`.
- Fetcher: takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes `collate_fn` over each batch of data and drops the remaining unfilled batch if `drop_last` is set.
  - For `IterableDataset`, it simply gets the next batch-size elements as a batch.
- Single process:

  ```
  Sampler
     |
  index/indices
     |
     V
  Fetcher
     |
  index/indices
     |
     V
  dataset
     |
     V
  collate_fn
     |
     V
  output
  ```

- Multiple processes:

  ```
  Sampler (main process)
     |
  index/indices
     |
     V
  Index Multiprocessing Queue (one per worker)
     |
  index/indices
     |
     V
  Fetcher (worker process)
     |
  index/indices
     |
     V
  dataset
     |
  batch of data
     |
     V
  collate_fn
     |
     V
  Result Multiprocessing Queue
     |
  data
     |
     V
  pin_memory_thread (main process)
     |
     V
  output
  ```

This is just the general data and control flow in DataLoader; further mechanisms such as prefetching and worker status tracking are layered on top.
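As a rough illustration of the single-process flow, the Sampler → Fetcher → `collate_fn` pipeline can be imitated with public `torch.utils.data` components. This is a minimal sketch: `SquaresDataset` is made up here, and `default_collate` is assumed to be importable from `torch.utils.data`, as in recent PyTorch releases.

```python
import torch
from torch.utils.data import BatchSampler, Dataset, RandomSampler, default_collate

class SquaresDataset(Dataset):
    """Toy map-style dataset: __getitem__ turns an index into a sample."""
    def __len__(self):
        return 10

    def __getitem__(self, idx):
        return torch.tensor(idx * idx)

ds = SquaresDataset()

# Sampler step: RandomSampler yields shuffled indices; BatchSampler groups
# them into lists of batch_size indices and drops the unfilled final batch.
batch_sampler = BatchSampler(RandomSampler(ds), batch_size=4, drop_last=True)

for indices in batch_sampler:
    # Fetcher step: look up each index in the dataset, then collate the
    # samples into one batched tensor.
    batch = default_collate([ds[i] for i in indices])
    print(indices, batch.shape)  # e.g. [3, 7, 0, 5] torch.Size([4])
```

`DataLoader(ds, batch_size=4, shuffle=True, drop_last=True)` performs the same steps internally; adding `num_workers > 0` and `pin_memory=True` switches to the multi-process flow in the second diagram.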
Most common questions about DataLoader concern multiple workers, i.e., when multiprocessing is enabled.
- The default multiprocessing start method differs across platforms; see the Python documentation on contexts and start methods (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
  - Control randomness per worker using `worker_init_fn`. Otherwise, DataLoader either becomes non-deterministic when using spawn, or shares the same random state across workers when using fork (see the seeding sketch after this list).
  - COW in fork (copy-on-write effectively becomes copy-on-access in Python because reads update object refcounts). The simplest mitigation when implementing a Dataset is to store data in Tensors or NumPy arrays instead of arbitrary Python objects such as lists and dicts (see the second sketch below).
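For the randomness point, the usual pattern is the worker-seeding recipe from the PyTorch reproducibility notes; `seed_worker` and the toy dataset below are illustrative:

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader, TensorDataset

def seed_worker(worker_id):
    # Inside a worker, torch.initial_seed() already reflects base_seed + worker_id,
    # so deriving the NumPy/random seeds from it gives each worker a distinct,
    # reproducible stream under both fork and spawn.
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)

g = torch.Generator()
g.manual_seed(0)

dataset = TensorDataset(torch.arange(100))
loader = DataLoader(
    dataset,
    batch_size=4,
    num_workers=2,
    worker_init_fn=seed_worker,
    generator=g,  # fixes the base seed, so shuffling and worker seeds are reproducible
    # multiprocessing_context="spawn",  # optionally override the platform default start method
)
```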
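For the copy-on-write point, a sketch of the recommended layout (the `ArrayBackedDataset` name is hypothetical): keeping samples in one NumPy array or Tensor means forked workers never write refcounts into the shared pages, whereas a list of Python objects is gradually copied into every worker.

```python
import numpy as np
import torch
from torch.utils.data import Dataset

class ArrayBackedDataset(Dataset):
    def __init__(self, n):
        # One flat NumPy array stays shared read-only across forked workers.
        self.data = np.arange(n, dtype=np.int64)
        # A Python container like [int(x) for x in range(n)] would instead be
        # copied page by page as each worker's reads update object refcounts.

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.as_tensor(self.data[idx])
```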
- Differences between map-style Dataset and iterable-style Dataset:
  - A map-style Dataset can utilize the indices sampled in the main process to get automatic sharding.
  - An iterable-style Dataset requires users to manually implement sharding inside the `__iter__` method using `torch.utils.data.get_worker_info()`; please check the example (a sketch follows this list).
  - Shuffling is not enabled for iterable-style Datasets; if needed, users have to implement shuffle utilities inside the `IterableDataset` class themselves (see the shuffle-buffer sketch below). This is solved by the TorchData project.
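The sharding pattern referenced above, following the `IterableDataset` example in the PyTorch documentation (the `RangeDataset` name and bounds are illustrative):

```python
import math

import torch
from torch.utils.data import DataLoader, IterableDataset, get_worker_info

class RangeDataset(IterableDataset):
    def __init__(self, start, end):
        self.start, self.end = start, end

    def __iter__(self):
        info = get_worker_info()
        if info is None:
            # Single-process loading: iterate over the full range.
            iter_start, iter_end = self.start, self.end
        else:
            # Split the range into num_workers contiguous shards so each
            # worker yields a disjoint slice instead of duplicating data.
            per_worker = int(math.ceil((self.end - self.start) / float(info.num_workers)))
            iter_start = self.start + info.id * per_worker
            iter_end = min(iter_start + per_worker, self.end)
        return iter(range(iter_start, iter_end))

loader = DataLoader(RangeDataset(0, 10), num_workers=2)
print(list(loader))  # each element appears exactly once, not once per worker
```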
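And a minimal shuffle-buffer workaround for iterable-style data, of the kind users had to write before TorchData (`ShuffledDataset` and `buffer_size` are made up for illustration; shuffling is only approximate within the buffer window):

```python
import random

from torch.utils.data import IterableDataset

class ShuffledDataset(IterableDataset):
    """Approximate shuffling: hold a bounded buffer and emit samples in random order."""

    def __init__(self, source, buffer_size=1024):
        self.source = source          # any iterable of samples
        self.buffer_size = buffer_size

    def __iter__(self):
        buffer = []
        for sample in self.source:
            buffer.append(sample)
            if len(buffer) >= self.buffer_size:
                # Swap a random element to the end and yield it.
                idx = random.randrange(len(buffer))
                buffer[idx], buffer[-1] = buffer[-1], buffer[idx]
                yield buffer.pop()
        # Drain whatever remains once the source is exhausted.
        random.shuffle(buffer)
        yield from buffer

for x in ShuffledDataset(range(10), buffer_size=4):
    print(x)
```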
## TorchData

Read through link and link.

Expected features:
- Automatic/Dynamic sharding
- Determinism Control
- Snapshotting
- DataFrame integration
- etc.
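As a taste of how TorchData covered the sharding and shuffling gaps, here is a sketch against the datapipes API shipped in torchdata 0.x releases (this API has since been deprecated, so treat it as a historical illustration rather than current guidance):

```python
from torch.utils.data import DataLoader
from torchdata.datapipes.iter import IterableWrapper

# shuffle() inserts a shuffle buffer whose seed DataLoader manages
# (determinism control); sharding_filter() marks where elements are
# split across workers (automatic sharding), replacing hand-written
# get_worker_info() logic.
dp = IterableWrapper(range(20)).shuffle().sharding_filter()

loader = DataLoader(dp, batch_size=4, num_workers=2)
for batch in loader:
    print(batch)
```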
Go to N1222094 for the Data Lab.