Writing parallel code with Cython

Note

This page uses two different syntax variants:

  • Cython specific cdef syntax, which was designed to make type declarations concise and easily readable from a C/C++ perspective.

  • Pure Python syntax which allows static Cython type declarations in pure Python code, following PEP-484 type hints and PEP 526 variable annotations.

    To make use of C data types in Python syntax, you need to import the special cython module in the Python module that you want to compile, e.g.

    import cython

    If you use the pure Python syntax we strongly recommend you use a recent Cython 3 release, since significant improvements have been made here compared to the 0.29.x releases.

One method of speeding up your Cython code is parallelization: you write code that can be run on multiple cores of your CPU simultaneously. For code that lends itself to parallelization this can produce quite dramatic speed-ups, equal to the number of cores your CPU has (for example a 4× speed-up on a 4-core CPU).

This tutorial assumes that you are already familiar with Cython’s “typed memoryviews” (since code using memoryviews is often the sort of code that’s easy to parallelize with Cython), and also that you’re somewhat familiar with the pitfalls of writing parallel code in general (it aims to be a Cython tutorial rather than a complete introduction to parallel programming).

Before starting, a few notes:

  • Not all code can be parallelized - for some code the algorithm simply relies on being executed in order and you should not attempt to parallelize it. A cumulative sum is a good example.

  • Not all code is worth parallelizing. There’s a reasonable amount of overhead in starting a parallel section and so you need to make sure that you’re operating on enough data to make this overhead worthwhile. Additionally, make sure that you are doing actual work on the data! Multiple threads simply reading the same data tends not to parallelize too well. If in doubt, time it.

  • Cython requires the contents of parallel blocks to be nogil. If your algorithm requires access to Python objects then it may not be suitable for parallelization.

  • Cython’s inbuilt parallelization uses the OpenMP constructs omp parallel for and omp parallel. These are ideal for parallelizing relatively small, self-contained blocks of code (especially loops). However, if you want to use other models of parallelization, such as spawning and waiting for tasks, or off-loading some “side work” to a continuously running secondary thread, then you might be better off using other methods (such as Python’s threading module).

  • Actually implementing your parallel Cython code should probably be one of the last steps in your optimization. You should start with some working serial code first. However, it’s worth planning for early since it may affect your choice of algorithm.

This tutorial does not aim to explore all the options available to customize parallelization. See the main parallelism documentation for details. You should also be aware that a lot of the choices Cython makes about how your code is parallelized are fairly fixed, and if you want specific OpenMP behaviour that Cython doesn’t provide by default you may be better off writing it in C yourself.

Compilation

OpenMP requires support from your C/C++ compiler. This support is usually enabled through a special command-line argument: on GCC this is -fopenmp while on MSVC it is /openmp. If your compiler doesn’t support OpenMP (or if you forget to pass the argument) then the code will usually still compile but will not run in parallel.

The following setup.py file can be used to compile the examples in this tutorial:

from setuptools import Extension, setup
from Cython.Build import cythonize
import sys

if sys.platform.startswith("win"):
    openmp_arg = '/openmp'
else:
    openmp_arg = '-fopenmp'

ext_modules = [
    Extension(
        "*",
        ["*.pyx"],
        extra_compile_args=[openmp_arg],
        extra_link_args=[openmp_arg],
    ),
]

setup(
    name='parallel-tutorial',
    ext_modules=cythonize(ext_modules),
)
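
Alternatively, the compiler and linker flags can be given in Cython's distutils directive comments at the top of the .pyx file itself. This is a minimal sketch assuming GCC or Clang; MSVC would need /openmp instead, so the setup.py approach above is more portable:

# distutils: extra_compile_args = -fopenmp
# distutils: extra_link_args = -fopenmp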

Element-wise parallel operations

The easiest and most common parallel operation in Cython is to iterate across an array element-wise, performing the same operation on each array element. In the simple example below we calculate the sin of every element in an array:

from cython.parallel cimport prange
cimport cython
from libc.math cimport sin
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine(double[:, :] input):
    cdef double[:, :] output = np.empty_like(input)
    cdef Py_ssize_t i, j

    for i in prange(input.shape[0], nogil=True):
        for j in range(input.shape[1]):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)
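
For comparison, a rough equivalent in pure Python syntax might look as follows (a sketch assuming a recent Cython 3 release, using cython.parallel.prange and the cython.cimports package in place of the cimport statements):

import cython
from cython.parallel import prange
from cython.cimports.libc.math import sin
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine(input: cython.double[:, :]):
    output: cython.double[:, :] = np.empty_like(input)
    i: cython.Py_ssize_t
    j: cython.Py_ssize_t

    for i in prange(input.shape[0], nogil=True):
        for j in range(input.shape[1]):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)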

We parallelize the outermost loop. This is usually a good idea since there is some overhead to entering and leaving a parallel block. However, you should also consider the likely size of your arrays. If input usually had a size of (2, 10000000) then parallelizing over the dimension of length 2 would likely be a worse choice.
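
If the outer dimension really is that small, one option is to parallelize the inner loop instead. A minimal sketch based on do_sine above (the imports and decorators are as in the previous example); the parallel region is now entered once per outer iteration, which is acceptable when the outer dimension is tiny:

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine_inner(double[:, :] input):
    cdef double[:, :] output = np.empty_like(input)
    cdef Py_ssize_t i, j

    for i in range(input.shape[0]):
        # parallelize over the long inner dimension instead
        for j in prange(input.shape[1], nogil=True):
            output[i, j] = sin(input[i, j])
    return np.asarray(output)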

The body of the loop itself is nogil - i.e. you cannot perform “Python” operations. This is a fairly strong limitation and if you find that you need to use the GIL then it is likely that Cython’s parallelization features are not suitable for you. It is possible, however, to throw exceptions from within the loop: Cython simply regains the GIL, raises the exception, and then terminates the loop on all threads.
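
As a sketch of that behaviour, the hypothetical variant below raises a ValueError for negative inputs; the loop body is still nogil, and Cython takes care of re-acquiring the GIL in order to raise:

@cython.boundscheck(False)
@cython.wraparound(False)
def do_sine_checked(double[:, :] input):
    cdef double[:, :] output = np.empty_like(input)
    cdef Py_ssize_t i, j

    for i in prange(input.shape[0], nogil=True):
        for j in range(input.shape[1]):
            if input[i, j] < 0:
                # Cython regains the GIL here, raises, and stops the loop
                raise ValueError("negative input")
            output[i, j] = sin(input[i, j])
    return np.asarray(output)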

It’s necessary to explicitly type the loop variable i as a C integer. For a non-parallel loop Cython can infer this, but it does not currently infer the loop variable for parallel loops, so not typing i will lead to compile errors since it will be a Python object and so unusable without the GIL.

The C code generated is shown below, for the benefit of experienced users of OpenMP. It is simplified a little for readability here:

#pragma omp parallel
{
    #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j)
    for (__pyx_t_8 = 0; __pyx_t_8 < __pyx_t_9; __pyx_t_8++){
        i = __pyx_t_8;
        /* body goes here */
    }
}

Private variables

One useful point to note from the generated C code above: variables used in the loops, like i and j, are marked as firstprivate and lastprivate. Within the loop each thread has its own copy of the data, that copy is initialized according to the variable's value before the loop, and after the loop the “global” copy is set equal to the value from the last iteration (i.e. as if the loop were run in serial).

The basic rules that Cython applies are:

  • C scalar variables within a prange block are made firstprivate and lastprivate,

  • C scalar variables assigned within a parallel block are private (which means they can’t be used to pass data in and out of the block),

  • array variables (e.g. memoryviews) are not made private. Instead Cython assumes that you have structured your loop so that each iteration is acting on different data,

  • Python objects are also not made private, although access to them is controlled via Python’s GIL.

Cython does not currently provide much opportunity to override these choices.
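
A small sketch of the first rule: the loop variable behaves as it would in a serial loop, because it is firstprivate/lastprivate rather than fully private (last_index is a hypothetical helper, not part of the tutorial's examples):

from cython.parallel cimport prange

def last_index(int n):
    cdef Py_ssize_t i = 0
    for i in prange(n, nogil=True):
        pass
    # i is lastprivate, so after the loop it holds the value from the
    # final iteration (n - 1 here), just as in a serial loop
    return i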

Reductions

The second most common parallel operation in Cython is the “reduction” operation. A common example is to accumulate a sum over the whole array, such as in the calculation of a vector norm below:

from cython.parallel cimport prange
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def l2norm(double[:] x):
    cdef double total = 0
    cdef Py_ssize_t i
    for i in prange(x.shape[0], nogil=True):
        total += x[i] * x[i]
    return sqrt(total)

Cython is able to infer reductions for +=, *=, -=, &=, |=, and ^=. These only apply to C scalar variables, so you cannot easily reduce a 2D memoryview to a 1D memoryview, for example.

The C code generated is approximately:

#pragma omp parallel reduction(+:total)
{
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        i = __pyx_t_2;
        total = total + /* some indexing code */;
    }
}
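
Because reductions only apply to C scalar variables, reducing along one axis of a 2D memoryview is typically done by writing each partial result to its own element of an output array instead. A minimal sketch (row_sums is a hypothetical example, not part of the tutorial):

from cython.parallel cimport prange
cimport cython
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def row_sums(double[:, :] x):
    cdef double[::1] out = np.zeros(x.shape[0])
    cdef Py_ssize_t i, j
    for i in prange(x.shape[0], nogil=True):
        # each thread writes to a different element of out,
        # so no reduction (and no locking) is needed
        for j in range(x.shape[1]):
            out[i] += x[i, j]
    return np.asarray(out)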

parallel blocks

Much less frequently used than prange is Cython’s parallel operator. parallel generates a block of code that is run simultaneously on multiple threads. Unlike prange, however, work is not automatically divided between threads.

Here we present three common uses for the parallel block:

Stringing together prange blocks

There is some overhead in entering and leaving a parallelized section. Therefore, if you have multiple parallel sections with small serial sections in between it can be more efficient to write one large parallel block. Any small serial sections are duplicated, but the overhead is reduced.

In the example below we do an in-place normalization of a vector. The first parallel loop calculates the norm, the second parallel loop applies the norm to the vector, and we avoid jumping in and out of serial code in between.

from cython.parallel cimport parallel, prange
cimport cython
from libc.math cimport sqrt

@cython.boundscheck(False)
@cython.wraparound(False)
def normalize(double[:] x):
    cdef Py_ssize_t i
    cdef double total = 0
    cdef double norm
    with nogil, parallel():
        for i in prange(x.shape[0]):
            total += x[i] * x[i]
        norm = sqrt(total)
        for i in prange(x.shape[0]):
            x[i] /= norm

The C code is approximately:

#pragma omp parallel private(norm) reduction(+:total)
{
    /* some calculations of array size... */
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        /* ... */
    }
    norm = sqrt(total);
    #pragma omp for firstprivate(i) lastprivate(i)
    for (__pyx_t_2 = 0; __pyx_t_2 < __pyx_t_3; __pyx_t_2++){
        /* ... */
    }
}

Allocating “scratch space” for each thread

Suppose that each thread requires a small amount of scratch space to work in. The threads cannot share scratch space because that would lead to data races. In this case the allocation and deallocation is done in a parallel section (so it occurs on a per-thread basis) surrounding a loop which then uses the scratch space.

Our example here uses C++ to find the median of each column in a 2D array (just a parallel version of numpy.median(x, axis=0)). We must reorder each column to find the median of it, but don’t want to modify the input array. Therefore, we allocate a C++ vector per thread to use as scratch space, and work in that. For efficiency the vector is allocated outside the prange loop.

# distutils: language = c++

from cython.parallel cimport parallel, prange
from libcpp.vector cimport vector
from libcpp.algorithm cimport nth_element
cimport cython
from cython.operator cimport dereference
import numpy as np

@cython.boundscheck(False)
@cython.wraparound(False)
def median_along_axis0(const double[:, :] x):
    cdef double[::1] out = np.empty(x.shape[1])
    cdef Py_ssize_t i, j

    cdef vector[double] *scratch
    cdef vector[double].iterator median_it
    with nogil, parallel():
        # allocate scratch space per loop
        scratch = new vector[double](x.shape[0])
        try:
            for i in prange(x.shape[1]):
                # copy row into scratch space
                for j in range(x.shape[0]):
                    dereference(scratch)[j] = x[j, i]
                median_it = scratch.begin() + scratch.size() // 2
                nth_element(scratch.begin(), median_it, scratch.end())
                # for the sake of a simple example, don't handle even lengths...
                out[i] = dereference(median_it)
        finally:
            del scratch
    return np.asarray(out)

Note

Pure and classic syntax examples are not quite identical since pure Python syntax does not support C++ “new”, so we allocate the scratch space slightly differently.

In the generated code the scratch variable is marked as private in the outer parallel block. A rough outline is:

#pragma omp parallel private(scratch)
{
    scratch = new std::vector<double> ((x.shape[0]))
    #pragma omp for firstprivate(i) lastprivate(i) lastprivate(j) lastprivate(median_it)
    for (__pyx_t_9 = 0; __pyx_t_9 < __pyx_t_10; __pyx_t_9++){
        i = __pyx_t_9;
        /* implementation goes here */
    }
    /* some exception handling detail omitted */
    delete scratch;
}

Performing different tasks on each thread

Finally, if you manually specify the number of threads and then identify each thread using omp_get_thread_num() you can manually split work between threads. This is a fairly rare use-case in Cython, and probably suggests that the threading module is more suitable for what you’re trying to do. However it is an option.

from cython.parallel cimport parallel
from openmp cimport omp_get_thread_num

cdef void long_running_task1() noexcept nogil:
    pass

cdef void long_running_task2() noexcept nogil:
    pass

def do_two_tasks():
    cdef int thread_num
    with nogil, parallel(num_threads=2):
        thread_num = omp_get_thread_num()
        if thread_num == 0:
            long_running_task1()
        elif thread_num == 1:
            long_running_task2()

The utility of this kind of block is limited by the fact that variables assigned to in the block are private to each thread, so cannot be accessed in the serial section afterwards.

The generated C code for the example above is fairly simple:

#pragma omp parallel private(thread_num)
{
    thread_num = omp_get_thread_num();
    switch (thread_num) {
        /* ... */
    }
}
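
If results do need to come out of such a block, one common workaround is to write each thread's result into a shared array indexed by the thread number, since (as noted earlier) array variables are not made private. A sketch along the lines of the example above (the task functions are placeholders and return made-up values):

from cython.parallel cimport parallel
from openmp cimport omp_get_thread_num
import numpy as np

cdef double long_running_task1() noexcept nogil:
    return 1.0  # placeholder work

cdef double long_running_task2() noexcept nogil:
    return 2.0  # placeholder work

def do_two_tasks_with_results():
    cdef double[::1] results = np.zeros(2)
    cdef int thread_num
    with nogil, parallel(num_threads=2):
        thread_num = omp_get_thread_num()
        if thread_num == 0:
            results[0] = long_running_task1()
        elif thread_num == 1:
            results[1] = long_running_task2()
    # results is a shared array, so it is still accessible here
    return np.asarray(results)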