Data Basics


Introduction to DataLoader and Dataset

Read through the link.

Common Objects in DataLoader

  • Sampler: Randomly chooses an index per iteration. It yields a batch of indices when batch_size is not None. (A sketch of how these pieces compose follows this list.)
  • Fetcher: Takes a single index or a batch of indices and returns the corresponding data from the Dataset. It invokes collate_fn over each batch of data and drops the remaining unfilled batch if drop_last is set.
    • For IterableDataset, it simply takes the next batch_size elements as a batch.
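
As a rough illustration of how these pieces fit together, here is a minimal sketch (not the internal implementation) that wires a RandomSampler and BatchSampler to a toy map-style Dataset; DataLoader composes the same components internally:

    import torch
    from torch.utils.data import Dataset, DataLoader, RandomSampler, BatchSampler

    class SquaresDataset(Dataset):
        """Toy map-style dataset used only for illustration."""
        def __len__(self):
            return 10
        def __getitem__(self, idx):
            return torch.tensor(idx ** 2)

    ds = SquaresDataset()

    # The sampler yields one index at a time; BatchSampler groups them into
    # batches of indices and drops the last unfilled batch when drop_last=True.
    batch_sampler = BatchSampler(RandomSampler(ds), batch_size=4, drop_last=True)
    for indices in batch_sampler:
        print(indices)            # e.g. [7, 2, 9, 0]

    # DataLoader wires Sampler -> Fetcher -> collate_fn together for you; the
    # default collate_fn stacks the fetched samples into a single tensor.
    loader = DataLoader(ds, batch_size=4, shuffle=True, drop_last=True)
    for batch in loader:
        print(batch.shape)        # torch.Size([4])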

Data/Control flow in DataLoader

  • Single process:

        Sampler
           |  index/indices
           v
        Fetcher
           |  index/indices
           v
        dataset
           |
           v
        collate_fn
           |
           v
        output

  • Multiple processes:

        Sampler (main process)
           |  index/indices
           v
        Index Multiprocessing Queue (one healthy worker)
           |  index/indices
           v
        Fetcher (worker process)
           |  index/indices
           v
        dataset
           |  batch of data
           v
        collate_fn
           |
           v
        Result Multiprocessing Queue
           |  data
           v
        pin_memory_thread (main process)
           |
           v
        output

This is just the general data and control flow in DataLoader. There are further detailed mechanisms on top of it, such as prefetching, worker status tracking, etc.
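
The multi-process path above is switched on through DataLoader arguments. A minimal sketch (the dataset and values are arbitrary, chosen only for illustration):

    import torch
    from torch.utils.data import DataLoader, TensorDataset

    ds = TensorDataset(torch.arange(100).float())

    loader = DataLoader(
        ds,
        batch_size=8,
        num_workers=2,            # >0 spawns worker processes and the index/result queues
        pin_memory=True,          # enables the pin_memory_thread in the main process
        prefetch_factor=2,        # batches prefetched per worker
        persistent_workers=True,  # keep workers alive across epochs
    )

    for (batch,) in loader:
        pass  # batch arrives already collated and pinned in page-locked memory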

Common gotchas for DataLoader

Most common questions about DataLoader come from using multiple workers, i.e. when multiprocessing is enabled.

  • The default multiprocessing start method differs across platforms (https://docs.python.org/3/library/multiprocessing.html#contexts-and-start-methods).
    • Control per-worker randomness using worker_init_fn. Otherwise, DataLoader either becomes non-deterministic when using spawn, or every worker shares the same random state when using fork. (A per-worker seeding example appears in the sketch after this list.)
    • Copy-on-write in fork effectively becomes copy-on-access in Python, because touching an object updates its reference count and forces the page to be copied. The simplest fix on the Dataset side is to store data in a Tensor or NumPy array instead of arbitrary Python objects like list and dict.
  • Differences between map-style Dataset and iterable-style Dataset:
    • A map-style Dataset gets automatic sharding for free, because the indices are sampled in the main process.
    • An iterable-style Dataset requires users to implement sharding manually inside the __iter__ method using torch.utils.data.get_worker_info(). Please check the example (a minimal sketch also follows this list).
  • Shuffling is not enabled for iterable-style Datasets. If needed, users have to implement the shuffling logic inside the IterableDataset class. (This is solved by the TorchData project.)
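
A minimal sketch of the two points above: manual sharding inside an IterableDataset via get_worker_info(), plus deterministic per-worker seeding via worker_init_fn. The dataset and helper names are made up for illustration:

    import random

    import numpy as np
    import torch
    from torch.utils.data import DataLoader, IterableDataset, get_worker_info

    class RangeDataset(IterableDataset):
        """Illustrative iterable-style dataset over [start, end)."""
        def __init__(self, start, end):
            self.start, self.end = start, end

        def __iter__(self):
            info = get_worker_info()
            if info is None:
                # Single-process loading: this iterator yields everything.
                lo, hi = self.start, self.end
            else:
                # Manual sharding: each worker yields a disjoint slice.
                per_worker = (self.end - self.start + info.num_workers - 1) // info.num_workers
                lo = self.start + info.id * per_worker
                hi = min(lo + per_worker, self.end)
            yield from range(lo, hi)

    def seed_worker(worker_id):
        # Derive deterministic per-worker seeds for NumPy and random from the
        # base seed that DataLoader already assigned to this worker.
        worker_seed = torch.initial_seed() % 2**32
        np.random.seed(worker_seed)
        random.seed(worker_seed)

    loader = DataLoader(RangeDataset(0, 10), batch_size=4, num_workers=2,
                        worker_init_fn=seed_worker)
    print([b.tolist() for b in loader])  # every element appears exactly once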

Introduction to next-generation Data API (TorchData)

Read through the links.

Expected features:

  • Automatic/Dynamic sharding
  • Determinism Control
  • Snapshotting
  • DataFrame integration
  • etc.
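
As a taste of the DataPipe-based API, here is a sketch assuming an older torchdata release that still ships DataPipes (they were deprecated in later releases):

    from torchdata.datapipes.iter import IterableWrapper

    # Build a small pipeline: shuffling and sharding become composable steps
    # instead of hand-written logic inside __iter__.
    dp = IterableWrapper(range(10)).shuffle().sharding_filter()

    # Iterating directly yields the shuffled elements; when the pipeline is
    # handed to a DataLoader with num_workers > 0, sharding_filter() splits
    # the elements across workers automatically.
    print(list(dp))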

Lab for DataLoader and DataPipe

Go to N1222094 for the Data Lab.

Next

Unit 8: function transforms/Training Loops (Optional) - vmap

I would love to contribute to PyTorch!


