Overview

MPI for Python provides an object-oriented approach to message passing which is grounded on the standard MPI-2 C++ bindings. The interface was designed with a focus on translating the syntax and semantics of the standard MPI-2 C++ bindings to Python. Any user of the standard C/C++ MPI bindings should be able to use this module without needing to learn a new interface.

Communicating Python Objects and Array Data

The Python standard library supports different mechanisms for data persistence. Many of them rely on disk storage, but pickling and marshaling can also work with memory buffers.

The pickle module provides user-extensible facilities to serialize general Python objects using ASCII or binary formats. The marshal module provides facilities to serialize built-in Python objects using a binary format specific to Python, but independent of machine architecture issues.

MPI for Python can communicate any built-in or user-defined Python object by taking advantage of the features provided by the pickle module. These facilities are routinely used to build binary representations of objects to communicate (at sending processes), and to restore them (at receiving processes).

Although simple and general, the serialization approach (i.e., pickling and unpickling) previously discussed imposes important overheads in memory as well as processor usage, especially in the scenario of objects with large memory footprints being communicated. Pickling general Python objects, ranging from primitive or container built-in types to user-defined classes, necessarily requires computer resources. Processing is also needed for dispatching the appropriate serialization method (which depends on the type of the object) and doing the actual packing. Additional memory is always needed, and if its total amount is not known a priori, many reallocations can occur. Indeed, in the case of large numeric arrays, this is certainly unacceptable and precludes communication of objects occupying half or more of the available memory resources.

MPI for Python supports direct communication of any object exporting the single-segment buffer interface. This interface is a standard Python mechanism provided by some types (e.g., strings and numeric arrays), allowing access in the C side to a contiguous memory buffer (i.e., address and length) containing the relevant data. This feature, in conjunction with the capability of constructing user-defined MPI datatypes describing complicated memory layouts, enables the implementation of many algorithms involving multidimensional numeric arrays (e.g., image processing, fast Fourier transforms, finite difference schemes on structured Cartesian grids) directly in Python, with negligible overhead, and almost as fast as compiled Fortran, C, or C++ codes.

Communicators

In MPI for Python, Comm is the base class of communicators. The Intracomm and Intercomm classes are subclasses of the Comm class. The Comm.Is_inter method (and Comm.Is_intra, provided for convenience but not part of the MPI specification) is defined for communicator objects and can be used to determine the particular communicator class.

Two predefined intracommunicator instances are available: COMM_SELF and COMM_WORLD. From them, new communicators can be created as needed.

The number of processes in a communicator and the calling process rank can be respectively obtained with the methods Comm.Get_size and Comm.Get_rank. The associated process group can be retrieved from a communicator by calling the Comm.Get_group method, which returns an instance of the Group class. Set operations with Group objects like Group.Union, Group.Intersection and Group.Difference are fully supported, as well as the creation of new communicators from these groups using Comm.Create and Intracomm.Create_group.

New communicator instances can be obtained with the Comm.Clone, Comm.Dup and Comm.Split methods, as well as with the Intracomm.Create_intercomm and Intercomm.Merge methods.

Virtual topologies (Cartcomm, Graphcomm and Distgraphcomm classes, which are specializations of the Intracomm class) are fully supported. New instances can be obtained from intracommunicator instances with the factory methods Intracomm.Create_cart and Intracomm.Create_graph.

Point-to-Point Communications

Point-to-point communication is a fundamental capability of message passing systems. This mechanism enables the transmission of data between a pair of processes, one side sending, the other receiving.

MPI provides a set of send and receive functions allowing the communication of typed data with an associated tag. The type information enables the conversion of data representation from one architecture to another in the case of heterogeneous computing environments; additionally, it allows the representation of non-contiguous data layouts and user-defined datatypes, thus avoiding the overhead of (otherwise unavoidable) packing/unpacking operations. The tag information allows selectivity of messages at the receiving end.

Blocking Communications

MPI provides basic send and receive functions that are blocking. These functions block the caller until the data buffers involved in the communication can be safely reused by the application program.

In MPI for Python, the Comm.Send, Comm.Recv and Comm.Sendrecv methods of communicator objects provide support for blocking point-to-point communications within Intracomm and Intercomm instances. These methods can communicate memory buffers. The variants Comm.send, Comm.recv and Comm.sendrecv can communicate general Python objects.

Nonblocking Communications

On many systems, performance can be significantly increased by overlapping communication and computation. This is particularly true on systems where communication can be executed autonomously by an intelligent, dedicated communication controller.

MPI provides nonblocking send and receive functions. They allow the possible overlap of communication and computation. Nonblocking communication always comes in two parts: posting functions, which begin the requested operation; and test-for-completion functions, which allow the caller to discover whether the requested operation has completed.

In MPI for Python, the Comm.Isend and Comm.Irecv methods initiate send and receive operations, respectively. These methods return a Request instance, uniquely identifying the started operation. Its completion can be managed using the Request.Test, Request.Wait, and Request.Cancel methods. The management of Request objects and associated memory buffers involved in communication requires careful, rather low-level coordination. Users must ensure that objects exposing their memory buffers are not accessed at the Python level while they are involved in nonblocking message-passing operations.

Persistent Communications

Often a communication with the same argument list is repeatedly executed within an inner loop. In such cases, communication can be further optimized by using persistent communication, a particular case of nonblocking communication allowing the reduction of the overhead between processes and communication controllers. Furthermore, this kind of optimization can also alleviate the extra call overheads associated with interpreted, dynamic languages like Python.

In MPI for Python, the Comm.Send_init and Comm.Recv_init methods create persistent requests for a send and receive operation, respectively. These methods return an instance of the Prequest class, a subclass of the Request class. The actual communication can be effectively started using the Prequest.Start method, and its completion can be managed as previously described.

Collective Communications

Collective communications allow the transmittal of data between multiple processes of a group simultaneously. The syntax and semantics of collective functions are consistent with point-to-point communication. Collective functions communicate typed data, but messages are not paired with an associated tag; selectivity of messages is implied in the calling order.

The most commonly used collective communication operations are the following.

  • Barrier synchronization across all group members.

  • Global communication functions

    • Broadcast data from one member to all members of a group.

    • Gather data from all members to one member of a group.

    • Scatter data from one member to all members of a group.

  • Global reduction operations such as sum, maximum, minimum, etc.

In MPI for Python, the Comm.Bcast, Comm.Scatter, Comm.Gather, Comm.Allgather and Comm.Alltoall methods provide support for collective communications of memory buffers. The lower-case variants Comm.bcast, Comm.scatter, Comm.gather, Comm.allgather and Comm.alltoall can communicate general Python objects. The vector variants (which can communicate different amounts of data to each process) Comm.Scatterv, Comm.Gatherv, Comm.Allgatherv, Comm.Alltoallv and Comm.Alltoallw are also supported; they can only communicate objects exposing memory buffers.

Global reduction operations on memory buffers are accessible through the Comm.Reduce, Comm.Reduce_scatter, Comm.Allreduce, Intracomm.Scan and Intracomm.Exscan methods. The lower-case variants Comm.reduce, Comm.allreduce, Intracomm.scan and Intracomm.exscan can communicate general Python objects; however, the actual required reduction computations are performed sequentially at some process. All the predefined (i.e., SUM, PROD, MAX, etc.) reduction operations can be applied.

Support for GPU-aware MPI

Several MPI implementations, including Open MPI and MVAPICH, support passing GPU pointers to MPI calls to avoid explicit data movement between host and device. On the Python side, support for handling GPU arrays has been implemented in many libraries related to GPU computation, such as CuPy, Numba, PyTorch, and PyArrow. To maximize interoperability across library boundaries, two kinds of zero-copy data exchange protocols have been defined and agreed upon: DLPack and CUDA Array Interface (CAI).

MPI for Python provides experimental support for GPU-aware MPI. This feature requires:

  1. mpi4py is built against a GPU-aware MPI library.

  2. The Python GPU arrays are compliant with either of the protocols.

See the Tutorial section for further information. We note that:

  • Whether or not an MPI call can work for GPU arrays depends on the underlying MPI implementation, not on mpi4py.

  • This support is currently experimental and subject to change in the future.

Dynamic Process Management

In the context of the MPI-1 specification, a parallel application is static; that is, no processes can be added to or deleted from a running application after it has been started. Fortunately, this limitation was addressed in MPI-2. The new specification added a process management model providing a basic interface between an application and external resources and process managers.

This MPI-2 extension can be really useful, especially for sequential applications built on top of parallel modules, or parallel applications with a client/server model. The MPI-2 process model provides a mechanism to create new processes and establish communication between them and the existing MPI application. It also provides mechanisms to establish communication between two existing MPI applications, even when one did not start the other.

In MPI for Python, new independent process groups can be created by calling the Intracomm.Spawn method within an intracommunicator. This call returns a new intercommunicator (i.e., an Intercomm instance) at the parent process group. The child process group can retrieve the matching intercommunicator by calling the Comm.Get_parent class method. At each side, the new intercommunicator can be used to perform point-to-point and collective communications between the parent and child groups of processes.

Alternatively, disjoint groups of processes can establish communication using a client/server approach. Any server application must first call the Open_port function to open a port and the Publish_name function to publish a provided service, and next call the Intracomm.Accept method. Any client application can first find a published service by calling the Lookup_name function, which returns the port where a server can be contacted; and next call the Intracomm.Connect method. Both the Intracomm.Accept and Intracomm.Connect methods return an Intercomm instance. When the connection between client/server processes is no longer needed, all of them must cooperatively call the Comm.Disconnect method. Additionally, server applications should release resources by calling the Unpublish_name and Close_port functions.

One-Sided Communications

One-sided communications (also called Remote Memory Access, RMA) supplement the traditional two-sided, send/receive based MPI communication model with a one-sided, put/get based interface. One-sided communication can take advantage of the capabilities of highly specialized network hardware. Additionally, this extension lowers latency and software overhead in applications written using a shared-memory-like paradigm.

The MPI specification revolves around the use of objects called windows; they intuitively specify regions of a process's memory that have been made available for remote read and write operations. The published memory blocks can be accessed through three functions for put (remote write), get (remote read), and accumulate (remote update or reduction) data items. A much larger number of functions support different synchronization styles; the semantics of these synchronization operations are fairly complex.

In MPI for Python, one-sided operations are available through instances of the Win class. New window objects are created by calling the Win.Create method at all processes within a communicator and specifying a memory buffer. When a window instance is no longer needed, the Win.Free method should be called.

The three one-sided MPI operations for remote write, read and reduction are available through the methods Win.Put, Win.Get, and Win.Accumulate, respectively, of a Win instance. These methods need an integer rank identifying the target process and an integer offset relative to the base address of the remote memory block being accessed.

The one-sided operations read, write, and reduction are implicitly nonblocking, and must be synchronized using one of two primary modes. Active target synchronization requires the origin process to call the Win.Start and Win.Complete methods, while the target process cooperates by calling the Win.Post and Win.Wait methods. There is also a collective variant provided by the Win.Fence method. Passive target synchronization is more lenient: only the origin process calls the Win.Lock and Win.Unlock methods. Locks are used to protect remote accesses to the locked remote window and to protect local load/store accesses to a locked local window.

Parallel Input/Output

The POSIX standard provides a model of a widely portable filesystem. However, the optimization needed for parallel input/output cannot be achieved with this generic interface. In order to ensure efficiency and scalability, the underlying parallel input/output system must provide a high-level interface supporting partitioning of file data among processes and a collective interface supporting complete transfers of global data structures between process memories and files. Additionally, further efficiencies can be gained via support for asynchronous input/output, strided accesses to data, and control over physical file layout on storage devices. This scenario motivated the inclusion in the MPI-2 standard of a custom interface in order to support more elaborate parallel input/output operations.

The MPI specification for parallel input/output revolves around the use of objects called files. As defined by MPI, files are not just contiguous byte streams. Instead, they are regarded as ordered collections of typed data items. MPI supports sequential or random access to any integral set of these items. Furthermore, files are opened collectively by a group of processes.

The common patterns for accessing a shared file (broadcast, scatter, gather, reduction) are expressed by using user-defined datatypes. Compared to the communication patterns of point-to-point and collective communications, this approach has the advantage of added flexibility and expressiveness. Data access operations (read and write) are defined for different kinds of positioning (using explicit offsets, individual file pointers, and shared file pointers), coordination (non-collective and collective), and synchronism (blocking, nonblocking, and split collective with begin/end phases).

In MPI for Python, all MPI input/output operations are performed through instances of the File class. File handles are obtained by calling the File.Open method at all processes within a communicator and providing a file name and the intended access mode. After use, they must be closed by calling the File.Close method. Files can even be deleted by calling the File.Delete method.

After creation, files are typically associated with a per-process view. The view defines the current set of data visible and accessible from an open file as an ordered set of elementary datatypes. This data layout can be set and queried with the File.Set_view and File.Get_view methods, respectively.

Actual input/output operations are achieved by many methods combining read and write calls with different behavior regarding positioning, coordination, and synchronism. Summing up, MPI for Python provides the thirty (30) methods defined in MPI-2 for reading from or writing to files using explicit offsets or file pointers (individual or shared), in blocking or nonblocking and collective or noncollective versions.

Environmental Management

Initialization and Exit

The module functions Init or Init_thread and Finalize provide MPI initialization and finalization, respectively. The module functions Is_initialized and Is_finalized provide the respective tests for initialization and finalization.

Note

MPI_Init() or MPI_Init_thread() is actually called when you import the MPI module from the mpi4py package, but only if MPI is not already initialized. In such a case, calling Init or Init_thread from Python is expected to generate an MPI error, and in turn an exception will be raised.

Note

MPI_Finalize() is registered (by using the Python C/API function Py_AtExit()) for being automatically called when Python processes exit, but only if mpi4py actually initialized MPI. Therefore, there is no need to call Finalize from Python to ensure MPI finalization.

Implementation Information

  • The MPI version number can be retrieved from the module function Get_version. It returns a two-integer tuple (version, subversion).

  • The Get_processor_name function can be used to access the processor name.

  • The values of predefined attributes attached to the world communicator can be obtained by calling the Comm.Get_attr method within the COMM_WORLD instance.

Timers

MPI timer functionalities are available through the Wtime and Wtick functions.

Error Handling

In order to facilitate handle sharing with other Python modules interfacing MPI-based parallel libraries, the predefined MPI error handlers ERRORS_RETURN and ERRORS_ARE_FATAL can be assigned to and retrieved from communicators using the methods Comm.Set_errhandler and Comm.Get_errhandler, and similarly for windows and files. New custom error handlers can be created with Comm.Create_errhandler.

When the predefined error handler ERRORS_RETURN is set, errors returned from MPI calls within Python code will raise an instance of the exception class Exception, which is a subclass of the standard Python exception RuntimeError.

Note

After import, mpi4py overrides the default MPI rules governing inheritance of error handlers. The ERRORS_RETURN error handler is set in the predefined COMM_SELF and COMM_WORLD communicators, as well as any new Comm, Win, or File instance created through mpi4py. If you ever pass such handles to C/C++/Fortran library code, it is recommended to set the ERRORS_ARE_FATAL error handler on them to ensure MPI errors do not pass silently.

Warning

Importing with from mpi4py.MPI import * will cause a name clash with the standard Python Exception base class.