Using Parallelism¶
Note
This page uses two different syntax variants:

- Cython specific cdef syntax, which was designed to make type declarations concise and easily readable from a C/C++ perspective.

- Pure Python syntax which allows static Cython type declarations in pure Python code, following PEP-484 type hints and PEP 526 variable annotations.

To make use of C data types in Python syntax, you need to import the special cython module in the Python module that you want to compile, e.g.

import cython

If you use the pure Python syntax we strongly recommend you use a recent Cython 3 release, since significant improvements have been made here compared to the 0.29.x releases.
Cython supports native parallelism through the cython.parallel module. To use this kind of parallelism, the GIL must be released (see Releasing the GIL). It currently supports OpenMP, but later on more backends might be supported.
Note
Functionality in this module may only be used from the main thread or parallel regions due to OpenMP restrictions.
- cython.parallel.prange([start,] stop[, step][, nogil=False][, use_threads_if=CONDITION][, schedule=None[, chunksize=None]][, num_threads=None])¶
This function can be used for parallel loops. OpenMP automatically starts a thread pool and distributes the work according to the schedule used.
Thread-locality and reductions are automatically inferred for variables.
If you assign to a variable in a prange block, it becomes lastprivate, meaning that the variable will contain the value from the last iteration. If you use an inplace operator on a variable, it becomes a reduction, meaning that the values from the thread-local copies of the variable will be reduced with the operator and assigned to the original variable after the loop. The index variable is always lastprivate. Variables assigned to in a parallel with block will be private and unusable after the block, as there is no concept of a sequentially last value.
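A minimal sketch in Cython syntax contrasting the two inferences (the variable names are ours, chosen for illustration):

from cython.parallel import prange

cdef int i
cdef int last = 0   # plain assignment in the loop -> lastprivate
cdef int total = 0  # in-place operator in the loop -> reduction

for i in prange(10, nogil=True):
    last = i        # after the loop, holds the value from the last iteration (9)
    total += i      # thread-local partial sums are combined into 45 after the loop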
- Parameters:
start – The index indicating the start of the loop (same as the start argument in range).
stop – The index indicating when to stop the loop (same as the stop argument in range).
step – An integer giving the step of the sequence (same as the step argument in range). It must not be 0.
nogil – This function can only be used with the GIL released. If nogil is true, the loop will be wrapped in a nogil section.

use_threads_if – The loop is run in multiple threads only if CONDITION is evaluated as true. Otherwise the code is run sequentially. Running the loop sequentially can be handy in the cases when the cost of spawning threads is greater than the benefit of running the loop in parallel (e.g. for small data sets).

schedule – The schedule is passed to OpenMP and can be one of the following:

- static:
If a chunksize is provided, iterations are distributed to all threads ahead of time in blocks of the given chunksize. If no chunksize is given, the iteration space is divided into chunks that are approximately equal in size, and at most one chunk is assigned to each thread in advance.
This is most appropriate when the scheduling overhead matters and the problem can be cut down into equally sized chunks that are known to have approximately the same runtime.
- dynamic:
The iterations are distributed to threads as they request them, with a default chunk size of 1.
This is suitable when the runtime of each chunk differs and is not known in advance, and therefore a larger number of smaller chunks is used in order to keep all threads busy.
- guided:
As with dynamic scheduling, the iterations are distributed to threads as they request them, but with decreasing chunk size. The size of each chunk is proportional to the number of unassigned iterations divided by the number of participating threads, decreasing to 1 (or the chunksize if provided).
This has an advantage over pure dynamic scheduling when it turns out that the last chunks take more time than expected or are otherwise being badly scheduled, so that most threads start running idle while the last chunks are being worked on by only a smaller number of threads.
- runtime:
The schedule and chunk size are taken from the runtime scheduling variable, which can be set through the openmp.omp_set_schedule() function call, or the OMP_SCHEDULE environment variable. Note that this essentially disables any static compile time optimisations of the scheduling code itself and may therefore show a slightly worse performance than when the same scheduling policy is statically configured at compile time.

The default schedule is implementation defined. For more information consult the OpenMP specification [1].
num_threads – The num_threads argument indicates how many threads the team should consist of. If not given, OpenMP will decide how many threads to use. Typically this is the number of cores available on the machine. However, this may be controlled through the omp_set_num_threads() function, or through the OMP_NUM_THREADS environment variable.

chunksize – The chunksize argument indicates the chunksize to be used for dividing the iterations among threads. This is only valid for static, dynamic and guided scheduling, and is optional. Different chunksizes may give substantially different performance results, depending on the schedule, the load balance it provides, the scheduling overhead and the amount of false sharing (if any). A short sketch combining several of these parameters follows this list.
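As an illustrative sketch in Cython syntax, the following combines several of the parameters above; the concrete values are arbitrary choices for demonstration, not recommendations:

from cython.parallel import prange

cdef Py_ssize_t i
cdef double total = 0

# static schedule with an explicit chunksize, on a team of at most 4 threads
for i in prange(1000000, nogil=True, schedule='static', chunksize=1000, num_threads=4):
    total += i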
Example with a reduction:
Pure Python:

import cython
from cython.parallel import prange

i = cython.declare(cython.int)
n = cython.declare(cython.int, 30)
sum = cython.declare(cython.int, 0)

for i in prange(n, nogil=True):
    sum += i

print(sum)
Cython:

from cython.parallel import prange

cdef int i
cdef int n = 30
cdef int sum = 0

for i in prange(n, nogil=True):
    sum += i

print(sum)
Example with a typed memoryview (e.g. a NumPy array):
Pure Python:

import cython
from cython.parallel import prange

def func(x: cython.double[:], alpha: cython.double):
    i: cython.Py_ssize_t

    for i in prange(x.shape[0], nogil=True):
        x[i] = alpha * x[i]
Cython:

from cython.parallel import prange

def func(double[:] x, double alpha):
    cdef Py_ssize_t i

    for i in prange(x.shape[0], nogil=True):
        x[i] = alpha * x[i]
Example with conditional parallelism:
Pure Python:

import cython
from cython.parallel import prange

def psum(n: cython.int):
    i: cython.int
    sum: cython.int = 0

    for i in prange(n, nogil=True, use_threads_if=n > 1000):
        sum += i

    return sum

psum(30)      # Executed sequentially
psum(10000)   # Executed in parallel
Cython:

from cython.parallel import prange

def psum(int n):
    cdef int i
    cdef int sum = 0

    for i in prange(n, nogil=True, use_threads_if=n > 1000):
        sum += i

    return sum

psum(30)      # Executed sequentially
psum(10000)   # Executed in parallel
- cython.parallel.parallel(num_threads=None, use_threads_if=CONDITION)¶
This directive can be used as part of a with statement to execute code sequences in parallel. This is currently useful to set up thread-local buffers used by a prange. A contained prange will be a worksharing loop that is not parallel, so any variable assigned to in the parallel section is also private to the prange. Variables that are private in the parallel block are unavailable after the parallel block.

Example with thread-local buffers:
Pure Python:

import cython
from cython.parallel import parallel, prange
from cython.cimports.libc.stdlib import abort, malloc, free

@cython.nogil
@cython.cfunc
def func(buf: cython.p_int) -> cython.void:
    pass
    # ...

idx = cython.declare(cython.Py_ssize_t)
i = cython.declare(cython.Py_ssize_t)
j = cython.declare(cython.Py_ssize_t)
n = cython.declare(cython.Py_ssize_t, 100)
local_buf = cython.declare(cython.p_int)
size = cython.declare(cython.size_t, 10)

with cython.nogil, parallel():
    local_buf = cython.cast(cython.p_int, malloc(cython.sizeof(cython.int) * size))
    if local_buf is cython.NULL:
        abort()

    # populate our local buffer in a sequential loop
    for i in range(size):
        local_buf[i] = i * 2

    # share the work using the thread-local buffer(s)
    for j in prange(n, schedule='guided'):
        func(local_buf)

    free(local_buf)
Cython:

from cython.parallel import parallel, prange
from libc.stdlib cimport abort, malloc, free

cdef void func(int *buf) noexcept nogil:
    pass
    # ...

cdef Py_ssize_t idx, i, j, n = 100
cdef int *local_buf
cdef size_t size = 10

with nogil, parallel():
    local_buf = <int *> malloc(sizeof(int) * size)
    if local_buf is NULL:
        abort()

    # populate our local buffer in a sequential loop
    for i in range(size):
        local_buf[i] = i * 2

    # share the work using the thread-local buffer(s)
    for j in prange(n, schedule='guided'):
        func(local_buf)

    free(local_buf)
Later on sections might be supported in parallel blocks, to distribute code sections of work among threads.
- cython.parallel.threadid()¶
Returns the id of the thread. For n threads, the ids will range from 0 to n-1.
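For example, a minimal sketch in Cython syntax (the choice of num_threads=4 and the printed message are our illustration):

from cython.parallel import parallel, threadid

cdef int tid
with nogil, parallel(num_threads=4):
    tid = threadid()   # each thread sees its own id; tid is private to the block
    with gil:          # reacquire the GIL briefly to call print()
        print("hello from thread", tid)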
Compiling¶
To actually use the OpenMP support, you need to tell the C or C++ compiler to enable OpenMP. For gcc this can be done as follows in a setup.py:
Pure Python:

from setuptools import Extension, setup
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "hello",
        ["hello.py"],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
]

setup(
    name='hello-parallel-world',
    ext_modules=cythonize(ext_modules),
)
Cython:

from setuptools import Extension, setup
from Cython.Build import cythonize

ext_modules = [
    Extension(
        "hello",
        ["hello.pyx"],
        extra_compile_args=['-fopenmp'],
        extra_link_args=['-fopenmp'],
    )
]

setup(
    name='hello-parallel-world',
    ext_modules=cythonize(ext_modules),
)
For the Microsoft Visual C++ compiler, use '/openmp' instead of '-fopenmp' for the 'extra_compile_args' option. Don't add any OpenMP flags to the 'extra_link_args' option.
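One way to handle both compilers in a single setup.py is to branch on the platform; a sketch, assuming MSVC is the compiler used on Windows (the sys.platform test is our illustration, not part of the documentation):

import sys
from setuptools import Extension, setup
from Cython.Build import cythonize

if sys.platform == 'win32':
    # MSVC: enable OpenMP at compile time only; no link flag is needed
    openmp_compile_args = ['/openmp']
    openmp_link_args = []
else:
    # gcc/clang: both the compile and link steps need the flag
    openmp_compile_args = ['-fopenmp']
    openmp_link_args = ['-fopenmp']

ext_modules = [
    Extension(
        "hello",
        ["hello.pyx"],
        extra_compile_args=openmp_compile_args,
        extra_link_args=openmp_link_args,
    )
]

setup(
    name='hello-parallel-world',
    ext_modules=cythonize(ext_modules),
)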
Breaking out of loops¶
The parallel with and prange blocks support the statements break, continue and return in nogil mode. Additionally, it is valid to use a with gil block inside these blocks, and have exceptions propagate from them. However, because the blocks use OpenMP, they can not just be left, so the exiting procedure is best-effort. For prange() this means that the loop body is skipped after the first break, return or exception for any subsequent iteration in any thread. It is undefined which value shall be returned if multiple different values may be returned, as the iterations are in no particular order:
Pure Python:

import cython
from cython.parallel import prange

@cython.exceptval(-1)
@cython.cfunc
def func(n: cython.Py_ssize_t) -> cython.int:
    i: cython.Py_ssize_t

    for i in prange(n, nogil=True):
        if i == 8:
            with cython.gil:
                raise Exception()
        elif i == 4:
            break
        elif i == 2:
            return i
Cython:

from cython.parallel import prange

cdef int func(Py_ssize_t n) except -1:
    cdef Py_ssize_t i

    for i in prange(n, nogil=True):
        if i == 8:
            with gil:
                raise Exception()
        elif i == 4:
            break
        elif i == 2:
            return i
In the example above it is undefined whether an exception shall be raised, whether it will simply break, or whether it will return 2.
Using OpenMP Functions¶
OpenMP functions can be used by cimporting openmp:
Pure Python:

import cython
from cython.parallel import parallel
from cython.cimports.openmp import omp_set_dynamic, omp_get_num_threads

num_threads = cython.declare(cython.int)

omp_set_dynamic(1)
with cython.nogil, parallel():
    num_threads = omp_get_num_threads()
    # ...
Cython:

from cython.parallel cimport parallel
cimport openmp

cdef int num_threads

openmp.omp_set_dynamic(1)
with nogil, parallel():
    num_threads = openmp.omp_get_num_threads()
    # ...
References
[1] The OpenMP specification: https://www.openmp.org/specifications/