Writing parallel code with Cython¶
Note
This page uses two different syntax variants:
- Cython specific cdef syntax, which was designed to make type declarations concise and easily readable from a C/C++ perspective.
- Pure Python syntax which allows static Cython type declarations in pure Python code, following PEP-484 type hints and PEP 526 variable annotations.
To make use of C data types in Python syntax, you need to import the special cython module in the Python module that you want to compile, e.g.

import cython
If you use the pure Python syntax we strongly recommend you use a recent Cython 3 release, since significant improvements have been made here compared to the 0.29.x releases.
One method of speeding up your Cython code is parallelization: you write code that can be run on multiple cores of your CPU simultaneously. For code that lends itself to parallelization this can produce quite dramatic speed-ups, equal to the number of cores your CPU has (for example a 4× speed-up on a 4-core CPU).
This tutorial assumes that you are already familiar with Cython’s “typed memoryviews” (since code using memoryviews is often the sort of code that’s easy to parallelize with Cython), and also that you’re somewhat familiar with the pitfalls of writing parallel code in general (it aims to be a Cython tutorial rather than a complete introduction to parallel programming).
Before starting, a few notes:
- Not all code can be parallelized - for some code the algorithm simply relies on being executed in order and you should not attempt to parallelize it. A cumulative sum (sketched just after these notes) is a good example.
- Not all code is worth parallelizing. There’s a reasonable amount of overhead in starting a parallel section and so you need to make sure that you’re operating on enough data to make this overhead worthwhile. Additionally, make sure that you are doing actual work on the data! Multiple threads simply reading the same data tends not to parallelize too well. If in doubt, time it.
- Cython requires the contents of parallel blocks to be nogil. If your algorithm requires access to Python objects then it may not be suitable for parallelization.
- Cython’s inbuilt parallelization uses the OpenMP constructs omp parallel for and omp parallel. These are ideal for parallelizing relatively small, self-contained blocks of code (especially loops). However, if you want to use other models of parallelization such as spawning and waiting for tasks, or off-loading some “side work” to a continuously running secondary thread, then you might be better using other methods (such as Python’s threading module).
- Actually implementing your parallel Cython code should probably be one of the last steps in your optimization. You should start with some working serial code first. However, it’s worth planning for early since it may affect your choice of algorithm.
- This tutorial does not aim to explore all the options available to customize parallelization. See the main parallelism documentation for details. You should also be aware that a lot of the choices Cython makes about how your code is parallelized are fairly fixed and if you want specific OpenMP behaviour that Cython doesn’t provide by default you may be better writing it in C yourself.
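As a concrete illustration of the first point, below is a minimal serial cumulative sum in cdef syntax (the function name is ours, purely for illustration). Each iteration uses a value carried over from the previous one, so the iterations must run in order and prange would not produce a correct parallel version.

import numpy as np

def cumulative_sum(double[:] x):
    cdef double[:] out = np.empty_like(x)
    cdef double running = 0
    cdef Py_ssize_t i
    for i in range(x.shape[0]):
        # each iteration depends on the running total from the previous one,
        # so the loop cannot be split across threads
        running = running + x[i]
        out[i] = running
    return np.asarray(out)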
Compilation¶
OpenMP requires support from your C/C++ compiler. This support is usually enabled through a special command-line argument: on GCC this is -fopenmp while on MSVC it is /openmp. If your compiler doesn’t support OpenMP (or if you forget to pass the argument) then the code will usually still compile but will not run in parallel.
The following setup.py file can be used to compile the examples in this tutorial:
from setuptools import Extension, setup
from Cython.Build import cythonize
import sys

if sys.platform.startswith("win"):
    openmp_arg = '/openmp'
else:
    openmp_arg = '-fopenmp'

ext_modules = [
    Extension(
        "*",
        ["*.pyx"],
        extra_compile_args=[openmp_arg],
        extra_link_args=[openmp_arg],
    ),
]

setup(
    name='parallel-tutorial',
    ext_modules=cythonize(ext_modules),
)
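With this setup.py in place the examples can be built in the usual setuptools way, for example with

python setup.py build_ext --inplace

Any equivalent build front-end works; the important part is that the OpenMP argument reaches both the compile and link steps.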
Element-wise parallel operations¶
The easiest and most common parallel operation in Cython is to iterate across an array element-wise, performing the same operation on each array element. In the simple example below we calculate the sin of every element in an array (shown first in cdef syntax, then in pure Python syntax):
from cython.parallel cimport prange
cimport cython
from libc.math cimport sin
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine(double[:, :] input):
    cdef double[:, :] output = np.empty_like(input)
    cdef Py_ssize_t i, j

    for i in prange(input.shape[0], nogil=True):
        for j in range(input.shape[1]):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)
from cython.parallel import prange
import cython
from cython.cimports.libc.math import sin
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine(input: cython.double[:, :]):
    output: cython.double[:, :] = np.empty_like(input)
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t

    for i in prange(input.shape[0], nogil=True):
        for j in range(input.shape[1]):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)
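Once compiled, the function is used like any other Python function. A quick way to sanity-check it is to compare against NumPy (the module name parallel_sine below is hypothetical):

import numpy as np
from parallel_sine import do_sine  # hypothetical name of the compiled module

data = np.random.uniform(-1.0, 1.0, size=(10000, 100))
result = do_sine(data)
assert np.allclose(result, np.sin(data))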
We parallelize the outermost loop. This is usually a good idea since there is some overhead to entering and leaving a parallel block. However, you should also consider the likely size of your arrays. If input usually had a size of (2, 10000000) then parallelizing over the dimension of length 2 would likely be a worse choice.
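In that case it may be better to parallelize over the long axis instead. A minimal variation of the example above (our own sketch, in cdef syntax) swaps which loop is the prange:

from cython.parallel cimport prange
cimport cython
from libc.math cimport sin
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine_long_axis(double[:, :] input):
    # Same computation as do_sine, but the prange runs over the second
    # (long) dimension, which distributes the work better when shape[0] is tiny.
    cdef double[:, :] output = np.empty_like(input)
    cdef Py_ssize_t i, j
    for j in prange(input.shape[1], nogil=True):
        for i in range(input.shape[0]):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)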
The body of the loop itself is nogil - i.e. you cannot perform “Python” operations. This is a fairly strong limitation and if you find that you need to use the GIL then it is likely that Cython’s parallelization features are not suitable for you. It is possible to throw exceptions from within the loop, however – Cython simply regains the GIL and raises an exception, then terminates the loop on all threads.
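For example, a validity check can raise from inside the nogil loop and Cython generates the GIL handling for you. The sketch below is ours (the function name is illustrative):

from cython.parallel cimport prange
cimport cython
from libc.math cimport sqrt
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def sqrt_all(double[:] x):
    cdef double[:] out = np.empty_like(x)
    cdef Py_ssize_t i
    for i in prange(x.shape[0], nogil=True):
        if x[i] < 0:
            # Cython re-acquires the GIL to raise, then terminates the
            # loop on all threads.
            raise ValueError("negative input")
        out[i] = sqrt(x[i])
    return np.asarray(out)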
It’s necessary to explicitly type the loop variable i as a C integer. For a non-parallel loop Cython can infer this, but it does not currently infer the loop variable for parallel loops, so not typing i will lead to compile errors since it will be a Python object and so unusable without the GIL.
The C code generated is shown below, for the benefit of experienced users of OpenMP. It is simplified a little for readability here:
#pragma omp parallel
{
    #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j)
    for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
        i = __pyx_t_8;
        /* body goes here */
    }
}
Private variables¶
One useful point to note from the generated C code above - variables used in the loops like i and j are marked as firstprivate and lastprivate. Within the loop each thread has its own copy of the data, the data is initialized according to its value before the loop, and after the loop the “global” copy is set equal to the last iteration (i.e. as if the loop were run in serial).
The basic rules that Cython applies are:
- C scalar variables within a prange block are made firstprivate and lastprivate,
- C scalar variables assigned within a parallel block are private (which means they can’t be used to pass data in and out of the block),
- array variables (e.g. memoryviews) are not made private. Instead Cython assumes that you have structured your loop so that each iteration is acting on different data,
- Python objects are also not made private, although access to them is controlled via Python’s GIL.
Cython does not currently provide much opportunity to override these choices.
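A consequence of the first rule is that a C scalar can safely be used as a per-iteration temporary inside a prange loop; each thread works on its own copy. The sketch below is ours (names are illustrative):

from cython.parallel cimport prange
cimport cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def scale_and_shift(double[:] x):
    cdef double[:] out = np.empty_like(x)
    cdef double tmp       # firstprivate/lastprivate: each thread gets its own copy
    cdef Py_ssize_t i
    for i in prange(x.shape[0], nogil=True):
        tmp = 2.0 * x[i]  # no race: tmp is private to the thread running iteration i
        out[i] = tmp + 1.0
    return np.asarray(out)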
Reductions¶
The second most common parallel operation in Cython is the “reduction” operation. A common example is to accumulate a sum over the whole array, such as in the calculation of a vector norm below:
from cython.parallel cimport prange
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def l2norm(double[:] x):
    cdef double total = 0
    cdef Py_ssize_t i
    for i in prange(x.shape[0], nogil=True):
        total += x[i]*x[i]
    return sqrt(total)
from cython.parallel import prange
import cython
from cython.cimports.libc.math import sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def l2norm(x: cython.double[:]):
    total: cython.double = 0
    i: cython.Py_ssize_t
    for i in prange(x.shape[0], nogil=True):
        total += x[i]*x[i]
    return sqrt(total)
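As a quick check, the result should match NumPy (the module name parallel_norm below is hypothetical):

import numpy as np
from parallel_norm import l2norm  # hypothetical name of the compiled module

v = np.random.standard_normal(1_000_000)
assert np.isclose(l2norm(v), np.linalg.norm(v))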
Cython is able to infer reductions for +=, *=, -=, &=, |=, and ^=. These only apply to C scalar variables so you cannot easily reduce a 2D memoryview to a 1D memoryview for example.
The C code generated is approximately:
#pragma omp parallel reduction(+:total)
{
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        i = __pyx_t_2;
        total = total + /* some indexing code */;
    }
}
parallel blocks¶
Much less frequently used than prange is Cython’s parallel operator. parallel generates a block of code that is run simultaneously on multiple threads. Unlike prange, however, work is not automatically divided between threads.
Here we present three common uses for the parallel block:
Stringing together prange blocks¶
There is some overhead in entering and leaving a parallelized section. Therefore, if you have multiple parallel sections with small serial sections in between it can be more efficient to write one large parallel block. Any small serial sections are duplicated, but the overhead is reduced.
In the example below we do an in-place normalization of a vector. The first parallel loop calculates the norm, the second parallel loop applies the norm to the vector, and we avoid jumping in and out of serial code in between.
from cython.parallel cimport parallel, prange
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def normalize(double[:] x):
    cdef Py_ssize_t i
    cdef double total = 0
    cdef double norm
    with nogil, parallel():
        for i in prange(x.shape[0]):
            total += x[i]*x[i]
        norm = sqrt(total)
        for i in prange(x.shape[0]):
            x[i] /= norm
from cython.parallel import parallel, prange
import cython
from cython.cimports.libc.math import sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def normalize(x: cython.double[:]):
    i: cython.Py_ssize_t
    total: cython.double = 0
    norm: cython.double
    with cython.nogil, parallel():
        for i in prange(x.shape[0]):
            total += x[i]*x[i]
        norm = sqrt(total)
        for i in prange(x.shape[0]):
            x[i] /= norm
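Usage is like any other in-place operation; after the call the vector has unit length (the module name parallel_normalize below is hypothetical):

import numpy as np
from parallel_normalize import normalize  # hypothetical name of the compiled module

v = np.random.standard_normal(1_000_000)
normalize(v)  # modifies v in place
assert np.isclose(np.linalg.norm(v), 1.0)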
The C code is approximately:
#pragma omp parallel private(norm) reduction(+:total)
{
    /* some calculations of array size... */
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        /* ... */
    }
    norm = sqrt(total);
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        /* ... */
    }
}
Allocating “scratch space” for each thread¶
Suppose that each thread requires a small amount of scratch space to work in. They cannot share scratch space because that would lead to data races. In this case the allocation and deallocation is done in a parallel section (so occurs on a per-thread basis) surrounding a loop which then uses the scratch space.
Our example here uses C++ to find the median of each column in a 2D array (just a parallel version of numpy.median(x, axis=0)). We must reorder each column to find the median of it, but don’t want to modify the input array. Therefore, we allocate a C++ vector per thread to use as scratch space, and work in that. For efficiency the vector is allocated outside the prange loop.
# distutils: language = c++

from cython.parallel cimport parallel, prange
from libcpp.vector cimport vector
from libcpp.algorithm cimport nth_element
cimport cython
from cython.operator cimport dereference
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def median_along_axis0(const double[:,:] x):
    cdef double[::1] out = np.empty(x.shape[1])
    cdef Py_ssize_t i, j

    cdef vector[double] *scratch
    cdef vector[double].iterator median_it
    with nogil, parallel():
        # allocate scratch space per loop
        scratch = new vector[double](x.shape[0])
        try:
            for i in prange(x.shape[1]):
                # copy row into scratch space
                for j in range(x.shape[0]):
                    dereference(scratch)[j] = x[j, i]
                median_it = scratch.begin() + scratch.size()//2
                nth_element(scratch.begin(), median_it, scratch.end())
                # for the sake of a simple example, don't handle even lengths...
                out[i] = dereference(median_it)
        finally:
            del scratch
    return np.asarray(out)
# distutils: language = c++

from cython.parallel import parallel, prange
from cython.cimports.libc.stdlib import malloc, free
from cython.cimports.libcpp.algorithm import nth_element
import cython
from cython.operator import dereference
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def median_along_axis0(x: cython.double[:,:]):
    out: cython.double[::1] = np.empty(x.shape[1])
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t

    scratch: cython.p_double
    median_it: cython.p_double
    with cython.nogil, parallel():
        # allocate scratch space per loop
        scratch = cython.cast(
            cython.p_double,
            malloc(cython.sizeof(cython.double)*x.shape[0]))
        try:
            for i in prange(x.shape[1]):
                # copy row into scratch space
                for j in range(x.shape[0]):
                    scratch[j] = x[j, i]
                median_it = scratch + x.shape[0]//2
                nth_element(scratch, median_it, scratch + x.shape[0])
                # for the sake of a simple example, don't handle even lengths...
                out[i] = dereference(median_it)
        finally:
            free(scratch)
    return np.asarray(out)
Note
Pure and classic syntax examples are not quite identical since pure Python syntax does not support C++ “new”, so we allocate the scratch space slightly differently.
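Either version can be checked against NumPy directly, since it is just a parallel version of numpy.median(x, axis=0) (for odd-length columns, given the simplification noted in the code; the module name parallel_median below is hypothetical):

import numpy as np
from parallel_median import median_along_axis0  # hypothetical name of the compiled module

x = np.random.standard_normal((101, 2000))  # odd number of rows per column
assert np.allclose(median_along_axis0(x), np.median(x, axis=0))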
In the generated code the scratch variable is marked as private in the outer parallel block. A rough outline is:
#pragma omp parallel private(scratch)
{
    scratch = new std::vector<double> ((x.shape[0]))
    #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j) lastprivate(median_it)
    for (__pyx_t_9 = 0; __pyx_t_9 < __pyx_t_10; __pyx_t_9++){
        i = __pyx_t_9;
        /* implementation goes here */
    }
    /* some exception handling detail omitted */
    delete scratch;
}
Performing different tasks on each thread¶
Finally, if you manually specify the number of threads and then identify each thread using omp_get_thread_num() you can manually split work between threads. This is a fairly rare use-case in Cython, and probably suggests that the threading module is more suitable for what you’re trying to do. However it is an option.
from cython.parallel cimport parallel
from openmp cimport omp_get_thread_num

cdef void long_running_task1() noexcept nogil:
    pass

cdef void long_running_task2() noexcept nogil:
    pass

def do_two_tasks():
    cdef int thread_num
    with nogil, parallel(num_threads=2):
        thread_num = omp_get_thread_num()
        if thread_num == 0:
            long_running_task1()
        elif thread_num == 1:
            long_running_task2()
from cython.parallel import parallel
from cython.cimports.openmp import omp_get_thread_num
import cython

@cython.cfunc
@cython.nogil
def long_running_task1() -> cython.void:
    pass

@cython.cfunc
@cython.nogil
def long_running_task2() -> cython.void:
    pass

def do_two_tasks():
    thread_num: cython.int
    with cython.nogil, parallel(num_threads=2):
        thread_num = omp_get_thread_num()
        if thread_num == 0:
            long_running_task1()
        elif thread_num == 1:
            long_running_task2()
The utility of this kind of block is limited by the fact that variables assigned to in the block are private to each thread, so cannot be accessed in the serial section afterwards.
The generated C code for the example above is fairly simple:
#pragma omp parallel private(thread_num)
{
    thread_num = omp_get_thread_num();
    switch (thread_num) {
        /* ... */
    }
}