Cython for NumPy users¶
Note
This page uses two different syntax variants:
Cython specific cdef syntax, which was designed to make type declarations concise and easily readable from a C/C++ perspective.
Pure Python syntax, which allows static Cython type declarations in pure Python code, following PEP-484 type hints and PEP 526 variable annotations.
To make use of C data types in Python syntax, you need to import the special cython module in the Python module that you want to compile, e.g.

import cython
If you use the pure Python syntax we strongly recommend you use a recent Cython 3 release, since significant improvements have been made here compared to the 0.29.x releases.
This tutorial is aimed at NumPy users who have no experience with Cython at all. If you have some knowledge of Cython you may want to skip to the ‘’Efficient indexing’’ section.
The main scenario considered is NumPy end-use rather than NumPy/SciPy development. The reason is that Cython is not (yet) able to support functions that are generic with respect to the number of dimensions in a high-level fashion. This restriction is much more severe for SciPy development than for more specific, “end-user” functions. See the last section for more information on this.
The style of this tutorial will not fit everybody, so you can also consider:
Kurt Smith’s video tutorial of Cython at SciPy 2015. The slides and notebooks of this talk are on GitHub.
Basic Cython documentation (see the Cython front page).
Cython at a glance¶
Cython is a compiler which compiles Python-like code files to C code. Still, ‘’Cython is not a Python to C translator’’. That is, it doesn’t take your full program and “turn it into C” – rather, the result makes full use of the Python runtime environment. A way of looking at it may be that your code is still Python in that it runs within the Python runtime environment, but rather than compiling to interpreted Python bytecode one compiles to native machine code (but with the addition of extra syntax for easy embedding of faster C-like code).
This has two important consequences:
Speed. How much depends very much on the program involved, though. Typical Python numerical programs tend to gain very little, as most time is spent in lower-level C that is used in a high-level fashion. However, for-loop-style programs can gain many orders of magnitude when typing information is added (and are thereby made possible as a realistic alternative).
Easy calling into C code. One of Cython’s purposes is to allow easy wrapping of C libraries. When writing code in Cython you can call into C code as easily as into Python code.
Only very few Python constructs are not yet supported, and making Cython compile all Python code is a stated goal. You can see the differences with Python in limitations.
Your Cython environment¶
Using Cython consists of these steps:
Write a .pyx source file
Run the Cython compiler to generate a C file
Run a C compiler to generate a compiled library
Run the Python interpreter and ask it to import the module
However there are several options to automate these steps:
The SAGE mathematics software system provides excellent support for using Cython and NumPy from an interactive command line or through a notebook interface (like Maple/Mathematica). See this documentation.
Cython can be used as an extension within a Jupyter notebook, making it easy to compile and use Cython code with just a %%cython at the top of a cell. For more information see Using the Jupyter Notebook.
A version of pyximport is shipped with Cython, so that you can import pyx-files dynamically into Python and have them compiled automatically (see Compiling with pyximport); a minimal sketch follows after this list.
Cython supports setuptools so that you can very easily create build scripts which automate the process; this is the preferred method for Cython implemented libraries and packages. See Basic setup.py.
Manual compilation (see below)
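For those who want to try the pyximport route, here is a minimal sketch (assuming a compute_cy.pyx file from this tutorial sits on the import path; the module name is ours, not required by pyximport):

import pyximport
pyximport.install(language_level=3)

import compute_cy  # compiled on first import, recompiled when the .pyx changes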
Note
If using another interactive command line environment than SAGE, like IPython or Python itself, it is important that you restart the process when you recompile the module. It is not enough to issue an “import” statement again.
Installation¶
If you already have a C compiler, just do:
pip install Cython
otherwise, see the installation page.
As of this writing SAGE comes with an older release of Cython than required for this tutorial. So if using SAGE you should download the newest Cython and then execute:
$ cd path/to/cython-distro
$ path-to-sage/sage -python setup.py install
This will install the newest Cython into SAGE.
Compilation¶
Manual compilation¶
As it is always important to know what is going on, I’ll describe the manual method here. First Cython is run:
$ cython yourmod.pyx
This creates yourmod.c, which is the C source for a Python extension module. A useful additional switch is -a, which will generate a document (yourmod.html) that shows which Cython code translates to which C code line by line.
Then we compile the C file. This may vary according to your system, but the C file should be built like Python was built. Python documentation for writing extensions should have some details. On Linux this often means something like (adjust the Python include path below to your installed version):
$ gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing -I/usr/include/python2.7 -o yourmod.so yourmod.c
gcc should have access to the NumPy C header files, so if they are not installed at /usr/include/numpy or similar you may need to pass another option for those. You only need to provide the NumPy headers if you write

cimport numpy

in your Cython code.
This creates yourmod.so in the same directory, which is importable by Python by using a normal import yourmod statement.
Compilation using setuptools¶
Setuptools allows us to create a setup.py file to automate the compilation of both Cython files and generated C files:
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy

extensions = [
    Extension("*", ["*.pyx"],
              include_dirs=[numpy.get_include()]),
]
setup(
    name="My hello app",
    ext_modules=cythonize(extensions),
)
The path to the NumPy headers is passed to the C compiler via the include_dirs=[numpy.get_include()] parameter.
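With such a setup.py in place, the extension modules can typically be built in place with the standard setuptools invocation (a usage sketch; see Basic setup.py for the full story):

$ python setup.py build_ext --inplace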
Note
Using memoryviews or importing NumPy with import numpy does not mean that you have to add the path to the NumPy include files. You need to add this path only if you use cimport numpy.
Despite this, you may still get warnings like the following from the compiler, because Cython is not disabling the usage of the old deprecated NumPy API:
.../include/numpy/npy_1_7_deprecated_api.h:15:2: warning: #warning "Using deprecated NumPy API, disable it by " "#defining NPY_NO_DEPRECATED_API NPY_1_7_API_VERSION" [-Wcpp]
In Cython 3.0, you can get rid of this warning by defining the C macro NPY_NO_DEPRECATED_API as NPY_1_7_API_VERSION in your build, e.g.:
# distutils: define_macros=NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION
or (see below):
Extension(
    ...,
    define_macros=[("NPY_NO_DEPRECATED_API", "NPY_1_7_API_VERSION")],
)
With older Cython releases, setting this macro will fail the C compilation, because Cython generates code that uses this deprecated C-API. However, the warning has no negative effects even in recent NumPy versions. You can ignore it until you (or your library’s users) switch to a newer NumPy version that removes this long deprecated API, in which case you also need to use Cython 3.0 or later. Thus, the earlier you switch to Cython 3.0, the better for your users.
The first Cython program¶
You can easily execute the code of this tutorial by downloading the Jupyter notebook.
The code below does the equivalent of this function in NumPy:
def compute_np(array_1, array_2, a, b, c):
    return np.clip(array_1, 2, 10) * a + array_2 * b + c
We’ll say that array_1 and array_2 are 2D NumPy arrays of integer type, and a, b and c are three Python integers.
This function uses NumPy and is already really fast, so it might be a bit overkill to do it again with Cython. This is for demonstration purposes. Nonetheless, we will show that we achieve better speed and memory efficiency than NumPy at the cost of more verbosity.
This code computes the function with the loops over the two dimensions being unrolled. It is both valid Python and valid Cython code. I’ll refer to it as both compute_py.py for the Python version and compute_cy.pyx for the Cython version – Cython uses .pyx as its file suffix (but it can also compile .py files).
import numpy as np


def clip(a, min_value, max_value):
    return min(max(a, min_value), max_value)


def compute(array_1, array_2, a, b, c):
    """
    This function must implement the formula
    np.clip(array_1, 2, 10) * a + array_2 * b + c

    array_1 and array_2 are 2D.
    """
    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert array_1.shape == array_2.shape

    result = np.zeros((x_max, y_max), dtype=array_1.dtype)

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result
This should be compiled to produce compute_cy.so for Linux systems (on Windows systems, this will be a .pyd file). We run a Python session to test both the Python version (imported from the .py file) and the compiled Cython module.
In [1]: import numpy as np
In [2]: array_1 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
In [3]: array_2 = np.random.uniform(0, 1000, size=(3000, 2000)).astype(np.intc)
In [4]: a = 4
In [5]: b = 3
In [6]: c = 9
In [7]: def compute_np(array_1, array_2, a, b, c):
   ...:     return np.clip(array_1, 2, 10) * a + array_2 * b + c
In [8]: %timeit compute_np(array_1, array_2, a, b, c)
103 ms ± 4.16 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [9]: import compute_py
In [10]: %timeit compute_py.compute(array_1, array_2, a, b, c)
1min 10s ± 844 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [11]: import compute_cy
In [12]: %timeit compute_cy.compute(array_1, array_2, a, b, c)
56.5 s ± 587 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
There’s not such a huge difference yet, because the C code still does exactly what the Python interpreter does (meaning, for instance, that a new object is allocated for each number used).
You can look at the Python interaction and the generated C code by using -a when calling Cython from the command line, %%cython -a when using a Jupyter Notebook, or by using cythonize('compute_cy.pyx', annotate=True) when using a setup.py. Look at the generated html file and see what is needed for even the simplest statements. You get the point quickly. We need to give Cython more information; we need to add types.
Adding types¶
To add types we use custom Cython syntax, so we are now breaking Python source compatibility. Read the comments!
import numpy as np
import cython

# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.intc

# @cython.cfunc means here that this function is a plain C function (so faster).
# To get all the benefits, we type the arguments and the return value.
@cython.exceptval(check=False)
@cython.cfunc
def clip(a: cython.int, min_value: cython.int, max_value: cython.int) -> cython.int:
    return min(max(a, min_value), max_value)


def compute(array_1, array_2, a: cython.int, b: cython.int, c: cython.int):
    # Annotation is also used within functions to type variables. It
    # can only be used at the top indentation level (there are non-trivial
    # problems with allowing them in other places, though we'd love to see
    # good and thought out proposals for it).
    x_max: cython.Py_ssize_t = array_1.shape[0]
    y_max: cython.Py_ssize_t = array_1.shape[1]

    assert array_1.shape == array_2.shape
    assert array_1.dtype == DTYPE
    assert array_2.dtype == DTYPE

    result = np.zeros((x_max, y_max), dtype=DTYPE)

    # It is very important to type ALL your variables. You do not get any
    # warnings if not, only much slower code (they are implicitly typed as
    # Python objects).
    # For the "tmp" variable, we want to use the same data type as is
    # stored in the array, so we use int because it corresponds to np.intc.
    # NB! An important side-effect of this is that if "tmp" overflows its
    # datatype size, it will simply wrap around like in C, rather than raise
    # an error like in Python.
    tmp: cython.int

    # cython.Py_ssize_t is the proper C type for Python array indices.
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result

import numpy as np

# We now need to fix a datatype for our arrays. I've used the variable
# DTYPE for this, which is assigned to the usual NumPy runtime
# type info object.
DTYPE = np.intc

# cdef means here that this function is a plain C function (so faster).
# To get all the benefits, we type the arguments and the return value.
cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)


def compute(array_1, array_2, int a, int b, int c):
    # The "cdef" keyword is also used within functions to type variables. It
    # can only be used at the top indentation level (there are non-trivial
    # problems with allowing them in other places, though we'd love to see
    # good and thought out proposals for it).
    cdef Py_ssize_t x_max = array_1.shape[0]
    cdef Py_ssize_t y_max = array_1.shape[1]

    assert array_1.shape == array_2.shape
    assert array_1.dtype == DTYPE
    assert array_2.dtype == DTYPE

    result = np.zeros((x_max, y_max), dtype=DTYPE)

    # It is very important to type ALL your variables. You do not get any
    # warnings if not, only much slower code (they are implicitly typed as
    # Python objects).
    # For the "tmp" variable, we want to use the same data type as is
    # stored in the array, so we use int because it corresponds to np.intc.
    # NB! An important side-effect of this is that if "tmp" overflows its
    # datatype size, it will simply wrap around like in C, rather than raise
    # an error like in Python.
    cdef int tmp

    # Py_ssize_t is the proper C type for Python array indices.
    cdef Py_ssize_t x, y

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result[x, y] = tmp + c

    return result

At this point, have a look at the generated C code for compute_cy.pyx and compute_typed.pyx. Click on the lines to expand them and see the corresponding C.
Especially have a look at the for-loops: in compute_cy.c, these are ~20 lines of C code to set up, while in compute_typed.c a normal C for loop is used.
After building this and continuing my (very informal) benchmarks, I get:
In [13]: %timeit compute_typed.compute(array_1, array_2, a, b, c)
26.5 s ± 422 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
So adding types does make the code faster, but nowhere near the speed of NumPy. Why is that?
What happened is that most of the time spent in this code is spent in the following lines, and those lines are slower to execute than in pure Python:
tmp = clip(array_1[x, y], 2, 10)
tmp = tmp * a + array_2[x, y] * b
result[x, y] = tmp + c
So what made those lines so much slower than in the pure Python version?
array_1 and array_2 are still NumPy arrays, so Python objects, and expect Python integers as indexes. Here we pass C int values. So every time Cython reaches this line, it has to convert all the C integers to Python int objects. Since this line is called very often, it outweighs the speed benefits of the pure C loops that were created from the range() earlier.
Furthermore, tmp * a + array_2[x, y] * b returns a Python integer and tmp is a C integer, so Cython has to do type conversions again. In the end those type conversions add up and make our computation really slow. But this problem can be solved easily by using memoryviews.
Efficient indexing with memoryviews¶
There are still two bottlenecks that degrade the performance: the array lookups and assignments, and the C/Python type conversions. The []-operator still uses full Python operations – what we would like to do instead is to access the data buffer directly at C speed.
What we need to do then is to type the contents of the ndarray objects. We do this with a memoryview. There is a page in the Cython documentation dedicated to it.
In short, memoryviews are C structures that can hold a pointer to the data of a NumPy array and all the necessary buffer metadata to provide efficient and safe access: dimensions, strides, item size, item type information, etc. They also support slices, so they work even if the NumPy array isn’t contiguous in memory. They can be indexed by C integers, thus allowing fast access to the NumPy array data.
Here is how to declare a memoryview of integers:
foo: cython.int[:]        # 1D memoryview
foo: cython.int[:, :]     # 2D memoryview
foo: cython.int[:, :, :]  # 3D memoryview
...                       # You get the idea.
cdef int [:] foo          # 1D memoryview
cdef int [:, :] foo       # 2D memoryview
cdef int [:, :, :] foo    # 3D memoryview
...                       # You get the idea.
No data is copied from the NumPy array to the memoryview in our example. As the name implies, it is only a “view” of the memory. So we can use the view result_view for efficient indexing and at the end return the real NumPy array result that holds the data that we operated on.
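To make the “view” semantics concrete, here is a small sketch in pure Python syntax (the function name check_view_semantics is ours, purely for illustration); once compiled, writes through the memoryview are visible through the original array:

import numpy as np
import cython

def check_view_semantics():
    arr = np.zeros((2, 2), dtype=np.intc)
    view: cython.int[:, :] = arr   # no copy: just a view into arr's buffer

    view[0, 0] = 42
    assert arr[0, 0] == 42         # the change is visible through the array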
Here is how to use them in our code:
import numpy as np
import cython

DTYPE = np.intc

@cython.cfunc
def clip(a: cython.int, min_value: cython.int, max_value: cython.int) -> cython.int:
    return min(max(a, min_value), max_value)


def compute(array_1: cython.int[:, :], array_2: cython.int[:, :],
            a: cython.int, b: cython.int, c: cython.int):

    x_max: cython.Py_ssize_t = array_1.shape[0]
    y_max: cython.Py_ssize_t = array_1.shape[1]

    # array_1.shape is now a C array, so it's not possible
    # to compare it simply by using == without a for-loop.
    # To be able to compare it to array_2.shape easily,
    # we convert them both to Python tuples.
    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    result_view: cython.int[:, :] = result

    tmp: cython.int
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
import numpy as np

DTYPE = np.intc

cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)


def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):

    cdef Py_ssize_t x_max = array_1.shape[0]
    cdef Py_ssize_t y_max = array_1.shape[1]

    # array_1.shape is now a C array, so it's not possible
    # to compare it simply by using == without a for-loop.
    # To be able to compare it to array_2.shape easily,
    # we convert them both to Python tuples.
    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    cdef int[:, :] result_view = result

    cdef int tmp
    cdef Py_ssize_t x, y

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
Let’s see how much faster accessing is now.
In [22]: %timeit compute_memview.compute(array_1, array_2, a, b, c)
22.9 ms ± 197 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
Note the importance of this change. We’re now 3081 times faster than an interpreted version of Python and 4.5 times faster than NumPy.
Memoryviews can be used with slices too, or even with Python arrays. Check out the memoryview page to see what they can do for you.
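Both points can be sketched in a few lines (the function name memoryview_extras is ours; array is the standard library module):

import array
import cython

def memoryview_extras():
    # Memoryviews accept any suitable buffer, e.g. a Python array.array.
    py_arr = array.array('i', range(10))
    view: cython.int[:] = py_arr

    # Slicing a memoryview yields another memoryview, without copying.
    sub: cython.int[:] = view[2:8:2]
    return sub[0]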
Tuning indexing further¶
The array lookups are still slowed down by two factors:
Bounds checking is performed.
Negative indices are checked for and handled correctly. The code above is explicitly coded so that it doesn’t use negative indices, and it (hopefully) always accesses within bounds.
With decorators, we can deactivate those checks:
...
import cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def compute(array_1: cython.int[:, :], array_2: cython.int[:, :],
            a: cython.int, b: cython.int, c: cython.int):
...
...
cimport cython

@cython.boundscheck(False)  # Deactivate bounds checking
@cython.wraparound(False)   # Deactivate negative indexing.
def compute(int[:, :] array_1, int[:, :] array_2, int a, int b, int c):
...
Now bounds checking is not performed (and, as a side-effect, if you ‘’do’’ happen to access out of bounds you will in the best case crash your program and in the worst case corrupt data). It is possible to switch bounds-checking mode in many ways; see Compiler directives for more information.
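For instance, instead of per-function decorators, the directives can also be set module-wide in a header comment, or globally from the build script (a sketch; compiler_directives is a documented argument of cythonize()):

# At the top of the .pyx or .py file:
# cython: boundscheck=False, wraparound=False

# Or in setup.py:
cythonize(extensions, compiler_directives={"boundscheck": False, "wraparound": False})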
In [23]: %timeit compute_index.compute(array_1, array_2, a, b, c)
16.8 ms ± 25.4 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We’re faster than the NumPy version (6.2x). NumPy is really well written, but does not perform operations lazily, resulting in a lot of intermediate copy operations in memory. Our version is very memory efficient and cache friendly because we can execute the operations in a single run over the data.
Warning
Speed comes with some cost. Especially, it can be dangerous to set typed objects (like array_1, array_2 and result_view in our sample code) to None. Setting such objects to None is entirely legal, but all you can do with them is check whether they are None. All other use (attribute lookup or indexing) can potentially segfault or corrupt data (rather than raising exceptions as they would in Python).

The actual rules are a bit more complicated, but the main message is clear: Do not use typed objects without knowing that they are not set to None.
Declaring the NumPy arrays as contiguous¶
For extra speed gains, if you know that the NumPy arrays you are providing are contiguous in memory, you can declare the memoryview as contiguous.
We give an example on an array that has 3 dimensions. If you want to give Cython the information that the data is C-contiguous you have to declare the memoryview like this:
a: cython.int[:, :, ::1]
cdef int [:, :, ::1] a
If you want to give Cython the information that the data is Fortran-contiguous you have to declare the memoryview like this:
a: cython.int[::1, :, :]
cdef int [::1, :, :] a
If all this makes no sense to you, you can skip this part; declaring arrays as contiguous constrains the usage of your functions, as it rejects array slices as input (see the sketch below). If you still want to understand what contiguous arrays are all about, you can see this answer on StackOverflow.
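A short sketch of what that rejection looks like in practice (the array values are our own; the exact ValueError message may vary between versions):

import numpy as np

arr = np.ones((8, 8), dtype=np.intc)

compute_contiguous.compute(arr, arr, 4, 3, 9)  # fine: a fresh array is C-contiguous

sliced = arr[:, ::2]  # a strided view, no longer C-contiguous
compute_contiguous.compute(sliced, sliced, 4, 3, 9)  # raises ValueError: ndarray is not C-contiguous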
For the sake of giving numbers, here are the speed gains that you should get by declaring the memoryviews as contiguous:
In [23]: %timeit compute_contiguous.compute(array_1, array_2, a, b, c)
11.1 ms ± 30.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We’re now around nine times faster than the NumPy version, and 6300 timesfaster than the pure Python version!
Making the function cleaner¶
Declaring types can make your code quite verbose. If you don’t mind Cython inferring the C types of your variables, you can use the infer_types=True compiler directive at the top of the file. It will save you quite a bit of typing.
Note that since type declarations must happen at the top indentation level, Cython won’t infer the type of variables declared for the first time in other indentation levels. It would change too much the meaning of our code. This is why we must still manually declare the types of the tmp, x and y variables.
And actually, manually giving the type of the tmp variable will be useful when using fused types.
# cython: infer_types=True
import numpy as np
import cython

DTYPE = np.intc

@cython.cfunc
def clip(a: cython.int, min_value: cython.int, max_value: cython.int) -> cython.int:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(array_1: cython.int[:, ::1], array_2: cython.int[:, ::1],
            a: cython.int, b: cython.int, c: cython.int):

    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    result_view: cython.int[:, ::1] = result

    tmp: cython.int
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
# cython: infer_types=True
import numpy as np
cimport cython

DTYPE = np.intc

cdef int clip(int a, int min_value, int max_value):
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(int[:, ::1] array_1, int[:, ::1] array_2, int a, int b, int c):

    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    result = np.zeros((x_max, y_max), dtype=DTYPE)
    cdef int[:, ::1] result_view = result

    cdef int tmp
    cdef Py_ssize_t x, y

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
We now do a speed test:
In [24]: %timeit compute_infer_types.compute(array_1, array_2, a, b, c)
11.5 ms ± 261 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
Lo and behold, the speed has not changed.
More generic code¶
All those speed gains are nice, but adding types constrains our code. At the moment, it would mean that our function can only work with NumPy arrays with the np.intc type. Is it possible to make our code work for multiple NumPy data types?
Yes, with the help of a new feature called fused types. You can learn more about it at this section of the documentation. It is similar to C++’s templates. It generates multiple function declarations at compile time, and then chooses the right one at run-time based on the types of the arguments provided. By comparing types in if-conditions, it is also possible to execute entirely different code paths depending on the specific data type.
In our example, since we no longer have access to the NumPy dtype of our input arrays, we use those if-else statements to know what NumPy data type we should use for our output array.
In this case, our function now works for ints, doubles and long longs.
# cython: infer_types=True
import numpy as np
import cython

my_type = cython.fused_type(cython.int, cython.double, cython.longlong)


@cython.exceptval(check=False)
@cython.cfunc
def clip(a: my_type, min_value: my_type, max_value: my_type) -> my_type:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(array_1: my_type[:, ::1], array_2: my_type[:, ::1],
            a: my_type, b: my_type, c: my_type):

    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is cython.int:
        dtype = np.intc
    elif my_type is cython.double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    result_view: my_type[:, ::1] = result

    tmp: my_type
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
# cython: infer_types=True
import numpy as np
cimport cython

ctypedef fused my_type:
    int
    double
    long long


cdef my_type clip(my_type a, my_type min_value, my_type max_value):
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(my_type[:, ::1] array_1, my_type[:, ::1] array_2, my_type a, my_type b, my_type c):

    x_max = array_1.shape[0]
    y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is int:
        dtype = np.intc
    elif my_type is double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    cdef my_type[:, ::1] result_view = result

    cdef my_type tmp
    cdef Py_ssize_t x, y

    for x in range(x_max):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
We can check that the output type is the right one:
>>> compute(array_1, array_2, a, b, c).dtype
dtype('int32')
>>> compute(array_1.astype(np.double), array_2.astype(np.double), a, b, c).dtype
dtype('float64')
We now do a speed test:
In [25]: %timeit compute_fused_types.compute(array_1, array_2, a, b, c)
11.5 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
More versions of the function are created at compile time. So it makes sense that, when executing this function with integers as before, the speed doesn’t change.
Using multiple threads¶
Cython has support for OpenMP. It also has some nice wrappers around it, like the function prange(). You can see more information about Cython and parallelism in Using Parallelism. Since we do elementwise operations, we can easily distribute the work among multiple threads. It’s important not to forget to pass the correct arguments to the compiler to enable OpenMP. When using the Jupyter notebook, you should use the cell magic like this:
%%cython --force
# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp
For MSVC (on Windows) you should use /openmp instead of -fopenmp. The GIL must be released (see Releasing the GIL), so this is why we declare our clip() function nogil.
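When building with a setup.py instead of the notebook, the flags can be passed per platform. A sketch, assuming the module is named compute_prange as in the benchmark below and the default MSVC compiler on Windows (which links OpenMP automatically, hence no extra link flag there):

import sys
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy

# Pick the OpenMP flag understood by the active compiler.
if sys.platform == "win32":
    compile_args, link_args = ["/openmp"], []
else:
    compile_args, link_args = ["-fopenmp"], ["-fopenmp"]

extensions = [
    Extension(
        "compute_prange",
        ["compute_prange.pyx"],
        include_dirs=[numpy.get_include()],
        extra_compile_args=compile_args,
        extra_link_args=link_args,
    ),
]
setup(ext_modules=cythonize(extensions))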
# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp

import numpy as np
import cython
from cython.parallel import prange

my_type = cython.fused_type(cython.int, cython.double, cython.longlong)


# We declare our plain c function nogil
@cython.exceptval(check=False)
@cython.nogil
@cython.cfunc
def clip(a: my_type, min_value: my_type, max_value: my_type) -> my_type:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(array_1: my_type[:, ::1], array_2: my_type[:, ::1],
            a: my_type, b: my_type, c: my_type):

    x_max: cython.Py_ssize_t = array_1.shape[0]
    y_max: cython.Py_ssize_t = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is cython.int:
        dtype = np.intc
    elif my_type is cython.double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    result_view: my_type[:, ::1] = result

    tmp: my_type
    x: cython.Py_ssize_t
    y: cython.Py_ssize_t

    # We use prange here.
    for x in prange(x_max, nogil=True):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
# distutils: extra_compile_args=-fopenmp
# distutils: extra_link_args=-fopenmp

import numpy as np
cimport cython
from cython.parallel import prange

ctypedef fused my_type:
    int
    double
    long long


# We declare our plain c function nogil
cdef my_type clip(my_type a, my_type min_value, my_type max_value) noexcept nogil:
    return min(max(a, min_value), max_value)


@cython.boundscheck(False)
@cython.wraparound(False)
def compute(my_type[:, ::1] array_1, my_type[:, ::1] array_2, my_type a, my_type b, my_type c):

    cdef Py_ssize_t x_max = array_1.shape[0]
    cdef Py_ssize_t y_max = array_1.shape[1]

    assert tuple(array_1.shape) == tuple(array_2.shape)

    if my_type is int:
        dtype = np.intc
    elif my_type is double:
        dtype = np.double
    elif my_type is cython.longlong:
        dtype = np.longlong

    result = np.zeros((x_max, y_max), dtype=dtype)
    cdef my_type[:, ::1] result_view = result

    cdef my_type tmp
    cdef Py_ssize_t x, y

    # We use prange here.
    for x in prange(x_max, nogil=True):
        for y in range(y_max):
            tmp = clip(array_1[x, y], 2, 10)
            tmp = tmp * a + array_2[x, y] * b
            result_view[x, y] = tmp + c

    return result
Note
Currently, Cython checks whether an exception was raised after every call of the function clip(). Checking for a raised exception requires the GIL to be held, which causes overhead inside a nogil loop. The need to check here is a bug with functions returning a fused type (see #5586 for the details). To avoid acquiring the GIL, the function is declared as noexcept or @cython.exceptval(check=False). See the Error return values section for more details.
We can have substantial speed gains for minimal effort:
In [25]: %timeit compute_prange.compute(array_1, array_2, a, b, c)
9.33 ms ± 412 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)
We’re now 7558 times faster than the pure Python version and 11.1 times faster than NumPy!
Where to go from here?¶
If you want to learn how to make use of BLAS or LAPACK with Cython, you can watch the presentation of Ian Henriksen at SciPy 2015.
If you want to learn how to use Pythran as a backend in Cython, you can see how in Pythran as a NumPy backend. Note that using Pythran only works with the old buffer syntax and not yet with memoryviews.