
Multiprocessing package - torch.multiprocessing#

Created On: Dec 23, 2016 | Last Updated On: Jun 08, 2025

torch.multiprocessing is a wrapper around the native multiprocessing module.

It registers custom reducers that use shared memory to provide shared views on the same data in different processes. Once the tensor/storage is moved to shared memory (see share_memory_()), it will be possible to send it to other processes without making any copies.
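A minimal sketch of this shared-memory move on a CPU tensor:

```python
import torch

t = torch.zeros(8)
print(t.is_shared())   # False: the storage is private to this process

t.share_memory_()      # moves the underlying storage to shared memory, in place
print(t.is_shared())   # True: the storage can now be sent to other
                       # processes without copying
```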

The API is 100% compatible with the original module - it’s enough to change import multiprocessing to import torch.multiprocessing to have all the tensors sent through the queues or shared via other mechanisms, moved to shared memory.

Because of the similarity of APIs we do not document most of this package’s contents, and we recommend referring to the very good docs of the original module.

Warning

If the main process exits abruptly (e.g. because of an incoming signal), Python’s multiprocessing sometimes fails to clean up its children. It’s a known caveat, so if you’re seeing any resource leaks after interrupting the interpreter, it probably means that this has just happened to you.

Strategy management#

torch.multiprocessing.get_all_sharing_strategies()[source]#

Return a set of sharing strategies supported on the current system.

torch.multiprocessing.get_sharing_strategy()[source]#

Return the current strategy for sharing CPU tensors.

torch.multiprocessing.set_sharing_strategy(new_strategy)[source]#

Set the strategy for sharing CPU tensors.

Parameters

new_strategy (str) – Name of the selected strategy. Should be one of the values returned by get_all_sharing_strategies().
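For instance, assuming a Linux system where both strategies are available:

```python
import torch.multiprocessing as mp

print(mp.get_all_sharing_strategies())  # e.g. {'file_descriptor', 'file_system'} on Linux
print(mp.get_sharing_strategy())        # 'file_descriptor' by default on Linux

mp.set_sharing_strategy('file_system')  # must be one of the supported strategies
print(mp.get_sharing_strategy())        # 'file_system'
```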

Sharing CUDA tensors#

Sharing CUDA tensors between processes is supported only in Python 3, using the spawn or forkserver start methods.

Unlike CPU tensors, the sending process is required to keep the original tensor as long as the receiving process retains a copy of the tensor. The refcounting is implemented under the hood but requires users to follow these best practices.

Warning

If the consumer process dies abnormally due to a fatal signal, the shared tensor could be kept in memory forever as long as the sending process is running.

  1. Release memory ASAP in the consumer.

```python
## Good
x = queue.get()
# do something with x
del x
```

```python
## Bad
x = queue.get()
# do something with x
# do everything else (producer has to keep x in memory)
```
  2. Keep the producer process running until all consumers exit. This will prevent the situation where the producer releases memory which is still in use by the consumer.

```python
## producer
# send tensors, do something
event.wait()
```

```python
## consumer
# receive tensors and use them
event.set()
```
  3. Don’t pass received tensors.

```python
# not going to work
x = queue.get()
queue_2.put(x)
```

```python
# you need to create a process-local copy
x = queue.get()
x_clone = x.clone()
queue_2.put(x_clone)
```

```python
# putting and getting from the same queue in the same process
# will likely end up with segfault
queue.put(tensor)
x = queue.get()
```

Sharing strategies#

This section provides a brief overview of how different sharing strategies work. Note that it applies only to CPU tensors - CUDA tensors will always use the CUDA API, as that’s the only way they can be shared.

File descriptor - file_descriptor#

Note

This is the default strategy (except for macOS, where it’s not supported).

This strategy will use file descriptors as shared memory handles. Whenever a storage is moved to shared memory, a file descriptor obtained from shm_open is cached with the object, and when it’s going to be sent to other processes, the file descriptor will be transferred (e.g. via UNIX sockets) to it. The receiver will also cache the file descriptor and mmap it, to obtain a shared view onto the storage data.

Note that if a lot of tensors are shared, this strategy will keep a large number of file descriptors open most of the time. If your system has low limits for the number of open file descriptors, and you can’t raise them, you should use the file_system strategy.
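A sketch of this fallback decision (the 2048 threshold is an arbitrary illustration, and the resource module is POSIX-only):

```python
import resource
import torch.multiprocessing as mp

soft, hard = resource.getrlimit(resource.RLIMIT_NOFILE)
print(f"open-file limit: soft={soft}, hard={hard}")

# If the limit is low and cannot be raised, fall back to the
# file_system strategy, which does not keep descriptors open.
if soft < 2048 and 'file_system' in mp.get_all_sharing_strategies():
    mp.set_sharing_strategy('file_system')
```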

File system - file_system#

This strategy will use file names given to shm_open to identify the shared memory regions. This has the benefit of not requiring the implementation to cache the file descriptors obtained from it, but at the same time is prone to shared memory leaks. The file can’t be deleted right after its creation, because other processes need to access it to open their views. If the processes fatally crash, or are killed, and don’t call the storage destructors, the files will remain in the system. This is very serious, because they keep using up the memory until the system is restarted, or they’re freed manually.

To counter the problem of shared memory file leaks, torch.multiprocessing will spawn a daemon named torch_shm_manager that will isolate itself from the current process group, and will keep track of all shared memory allocations. Once all processes connected to it exit, it will wait a moment to ensure there will be no new connections, and will iterate over all shared memory files allocated by the group. If it finds that any of them still exist, they will be deallocated. We’ve tested this method and it proved to be robust to various failures. Still, if your system has high enough limits, and file_descriptor is a supported strategy, we do not recommend switching to this one.

Spawning subprocesses#

Note

Available for Python >= 3.4.

This depends on the spawn start method in Python’s multiprocessing package.

Spawning a number of subprocesses to perform some function can be done by creating Process instances and calling join to wait for their completion. This approach works fine when dealing with a single subprocess but presents potential issues when dealing with multiple processes.

Namely, joining processes sequentially implies they will terminate sequentially. If they don’t, and the first process does not terminate, the process termination will go unnoticed. Also, there are no native facilities for error propagation.

The spawn function below addresses these concerns and takes care of error propagation, out of order termination, and will actively terminate processes upon detecting an error in one of them.

torch.multiprocessing.spawn.spawn(fn, args=(), nprocs=1, join=True, daemon=False, start_method='spawn')[source]#

Spawns nprocs processes that run fn with args.

If one of the processes exits with a non-zero exit status, the remaining processes are killed and an exception is raised with the cause of termination. In the case an exception was caught in the child process, it is forwarded and its traceback is included in the exception raised in the parent process.

Parameters
  • fn (function) –

    Function is called as the entrypoint of the spawned process. This function must be defined at the top level of a module so it can be pickled and spawned. This is a requirement imposed by multiprocessing.

    The function is called as fn(i, *args), where i is the process index and args is the passed-through tuple of arguments.

  • args (tuple) – Arguments passed to fn.

  • nprocs (int) – Number of processes to spawn.

  • join (bool) – Perform a blocking join on all processes.

  • daemon (bool) – The spawned processes’ daemon flag. If set to True, daemonic processes will be created.

  • start_method (str) – (deprecated) this method will always use spawn as the start method. To use a different start method use start_processes().

Returns

None if join is True, ProcessContext if join is False

class torch.multiprocessing.SpawnContext[source]#

Returned by spawn() when called with join=False.

join(timeout=None, grace_period=None)[source]#

Join one or more processes within the spawn context.

Attempt to join one or more processes in this spawn context. If one of them exited with a non-zero exit status, this function kills the remaining processes (optionally with a grace period) and raises an exception with the cause of the first process exiting.

Returns True if all processes have been joined successfully, False if there are more processes that need to be joined.

Parameters
  • timeout (float) – Wait this long (in seconds) before giving up on waiting.

  • grace_period (float) – When any processes fail, wait this long (in seconds) for others to shut down gracefully before terminating them. If they still don’t exit, wait another grace period before killing them.


© Copyright PyTorch Contributors.
