How to optimize for speed#
The following gives some practical guidelines to help you write efficient code for the scikit-learn project.
Note
While it is always useful to profile your code so as to check performance assumptions, it is also highly recommended to review the literature to ensure that the implemented algorithm is the state of the art for the task before investing into costly implementation optimization.
Time and again, hours of effort invested in optimizing complicated implementation details have been rendered irrelevant by the subsequent discovery of simple algorithmic tricks, or by using another algorithm altogether that is better suited to the problem.

The section A simple algorithmic trick: warm restarts gives an example of such a trick.
Python, Cython or C/C++?#
In general, the scikit-learn project emphasizes the readability of the source code to make it easy for the project users to dive into the source code so as to understand how the algorithm behaves on their data, but also for ease of maintainability (by the developers).

When implementing a new algorithm, it is thus recommended to start implementing it in Python using NumPy and SciPy, taking care to avoid looping code by using the vectorized idioms of those libraries. In practice this means trying to replace any nested for loops by calls to equivalent NumPy array methods. The goal is to avoid the CPU wasting time in the Python interpreter rather than crunching numbers to fit your statistical model. It is generally a good idea to consider the NumPy and SciPy performance tips: https://scipy.github.io/old-wiki/pages/PerformanceTips
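As a made-up illustration (not code from scikit-learn), replacing nested Python loops by a single vectorized NumPy expression might look like this:

```python
import numpy as np

rng = np.random.RandomState(0)
X = rng.rand(100, 20)

# Slow: nested Python loops keep the interpreter busy on every element.
def row_norms_loop(X):
    out = np.zeros(X.shape[0])
    for i in range(X.shape[0]):
        for j in range(X.shape[1]):
            out[i] += X[i, j] ** 2
    return out

# Fast: one vectorized expression, the loops run in compiled NumPy code.
def row_norms_vectorized(X):
    return np.einsum("ij,ij->i", X, X)

assert np.allclose(row_norms_loop(X), row_norms_vectorized(X))
```

Both functions compute the same squared row norms; only the vectorized one avoids per-element interpreter overhead.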
Sometimes, however, an algorithm cannot be expressed efficiently in simple vectorized NumPy code. In this case, the recommended strategy is the following:
1. Profile the Python implementation to find the main bottleneck and isolate it in a dedicated module-level function. This function will be reimplemented as a compiled extension module.

2. If there exists a well-maintained BSD or MIT C/C++ implementation of the same algorithm that is not too big, you can write a Cython wrapper for it and include a copy of the source code of the library in the scikit-learn source tree: this strategy is used for the classes svm.LinearSVC, svm.SVC and linear_model.LogisticRegression (wrappers for liblinear and libsvm). Otherwise, write an optimized version of your Python function using Cython directly. This strategy is used for the linear_model.ElasticNet and linear_model.SGDClassifier classes, for instance.

3. Move the Python version of the function to the tests and use it to check that the results of the compiled extension are consistent with the gold standard, easy-to-debug Python version.

4. Once the code is optimized (no simple bottleneck is spottable by profiling), check whether it is possible to have coarse-grained parallelism that is amenable to multi-processing by using the joblib.Parallel class.
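The consistency-check step above can be sketched as a simple test. The function names here are hypothetical stand-ins: squared_norm_python plays the role of the slow pure-Python reference, and squared_norm_fast stands in for the compiled extension:

```python
import numpy as np

# Gold standard: slow but easy-to-debug pure Python version,
# kept in the test suite as a reference implementation.
def squared_norm_python(x):
    total = 0.0
    for value in x:
        total += value * value
    return total

# Stand-in for the optimized (e.g. Cython-compiled) version.
def squared_norm_fast(x):
    x = np.asarray(x)
    return float(np.dot(x, x))

def test_squared_norm_consistency():
    rng = np.random.RandomState(42)
    x = rng.rand(1000)
    assert np.allclose(squared_norm_python(x), squared_norm_fast(x))

test_squared_norm_consistency()
```

If the compiled version ever diverges, the readable Python version makes the bug much easier to locate.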
Profiling Python code#
In order to profile Python code we recommend writing a script that loads and prepares your data, and then using the IPython integrated profiler to interactively explore the relevant parts of the code.

Suppose we want to profile the Non Negative Matrix Factorization module of scikit-learn. Let us set up a new IPython session and load the digits dataset, as in the Recognizing hand-written digits example:
```
In [1]: from sklearn.decomposition import NMF

In [2]: from sklearn.datasets import load_digits

In [3]: X, _ = load_digits(return_X_y=True)
```
Before starting the profiling session and engaging in tentative optimization iterations, it is important to measure the total execution time of the function we want to optimize without any kind of profiler overhead and save it somewhere for later reference:
```
In [4]: %timeit NMF(n_components=16, tol=1e-2).fit(X)
1 loops, best of 3: 1.7 s per loop
```
To have a look at the overall performance profile, use the %prun magic command:

```
In [5]: %prun -l nmf.py NMF(n_components=16, tol=1e-2).fit(X)
         14496 function calls in 1.682 CPU seconds

   Ordered by: internal time
   List reduced from 90 to 9 due to restriction <'nmf.py'>

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
       36    0.609    0.017    1.499    0.042  nmf.py:151(_nls_subproblem)
     1263    0.157    0.000    0.157    0.000  nmf.py:18(_pos)
        1    0.053    0.053    1.681    1.681  nmf.py:352(fit_transform)
      673    0.008    0.000    0.057    0.000  nmf.py:28(norm)
        1    0.006    0.006    0.047    0.047  nmf.py:42(_initialize_nmf)
       36    0.001    0.000    0.010    0.000  nmf.py:36(_sparseness)
       30    0.001    0.000    0.001    0.000  nmf.py:23(_neg)
        1    0.000    0.000    0.000    0.000  nmf.py:337(__init__)
        1    0.000    0.000    1.681    1.681  nmf.py:461(fit)
```
The tottime column is the most interesting: it gives the total time spent executing the code of a given function, ignoring the time spent in executing sub-functions. The real total time (local code + sub-function calls) is given by the cumtime column.

Note the use of the -l nmf.py option, which restricts the output to lines that contain the "nmf.py" string. This is useful to have a quick look at the hotspots of the nmf Python module itself, ignoring anything else.

Here is the beginning of the output of the same command without the -l nmf.py filter:
```
In [5]: %prun NMF(n_components=16, tol=1e-2).fit(X)
         16159 function calls in 1.840 CPU seconds

   Ordered by: internal time

   ncalls  tottime  percall  cumtime  percall  filename:lineno(function)
     2833    0.653    0.000    0.653    0.000  {numpy.core._dotblas.dot}
       46    0.651    0.014    1.636    0.036  nmf.py:151(_nls_subproblem)
     1397    0.171    0.000    0.171    0.000  nmf.py:18(_pos)
     2780    0.167    0.000    0.167    0.000  {method 'sum' of 'numpy.ndarray' objects}
        1    0.064    0.064    1.840    1.840  nmf.py:352(fit_transform)
     1542    0.043    0.000    0.043    0.000  {method 'flatten' of 'numpy.ndarray' objects}
      337    0.019    0.000    0.019    0.000  {method 'all' of 'numpy.ndarray' objects}
     2734    0.011    0.000    0.181    0.000  fromnumeric.py:1185(sum)
        2    0.010    0.005    0.010    0.005  {numpy.linalg.lapack_lite.dgesdd}
      748    0.009    0.000    0.065    0.000  nmf.py:28(norm)
      ...
```
The above results show that the execution is largely dominated by dot product operations (delegated to BLAS). Hence there is probably no huge gain to expect by rewriting this code in Cython or C/C++: in this case, out of the 1.7s total execution time, almost 0.7s are spent in compiled code we can consider optimal. By rewriting the rest of the Python code and assuming we could achieve a 1000% boost on this portion (which is highly unlikely given the shallowness of the Python loops), we would not gain more than a 2.4x speed-up globally.
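This reasoning is just Amdahl's law; plugging in the measured numbers from the full %prun output (about 0.65 s of BLAS time out of 1.84 s total) confirms the bound:

```python
# Figures taken from the full %prun output above.
total = 1.84          # total runtime in seconds
compiled = 0.653      # time in BLAS dot products, assumed already optimal
python_part = total - compiled

# Even with an (unrealistic) 10x speed-up of the pure Python part,
# the compiled part still dominates the achievable global speed-up.
optimistic_total = compiled + python_part / 10
speedup = total / optimistic_total
print(round(speedup, 1))  # 2.4
```

This is why the text concludes that only algorithmic improvements, not micro-optimization, can yield major gains here.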
Hence major improvements can only be achieved by algorithmic improvements in this particular example (e.g. trying to find operations that are both costly and useless, so as to avoid computing them altogether rather than trying to optimize their implementation).
It is however still interesting to check what's happening inside the _nls_subproblem function, which is the hotspot if we only consider Python code: it takes around 100% of the accumulated time of the module. In order to better understand the profile of this specific function, let us install line_profiler and wire it to IPython:

```
pip install line_profiler
```
Under IPython 0.13+, first create a configuration profile:

```
ipython profile create
```

Then register the line_profiler extension in ~/.ipython/profile_default/ipython_config.py:

```
c.TerminalIPythonApp.extensions.append('line_profiler')
c.InteractiveShellApp.extensions.append('line_profiler')
```

This will register the %lprun magic command in the IPython terminal application and the other frontends such as qtconsole and notebook.
Now restart IPython and let us use this new toy:
```
In [1]: from sklearn.datasets import load_digits

In [2]: from sklearn.decomposition import NMF
   ...: from sklearn.decomposition._nmf import _nls_subproblem

In [3]: X, _ = load_digits(return_X_y=True)

In [4]: %lprun -f _nls_subproblem NMF(n_components=16, tol=1e-2).fit(X)
Timer unit: 1e-06 s

File: sklearn/decomposition/nmf.py
Function: _nls_subproblem at line 137
Total time: 1.73153 s

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   137                                           def _nls_subproblem(V, W, H_init, tol, max_iter):
   138                                               """Non-negative least square solver
   ...
   170                                               """
   171        48         5863    122.1      0.3      if (H_init < 0).any():
   172                                                   raise ValueError("Negative values in H_init passed to NLS solver.")
   173
   174        48          139      2.9      0.0      H = H_init
   175        48       112141   2336.3      5.8      WtV = np.dot(W.T, V)
   176        48        16144    336.3      0.8      WtW = np.dot(W.T, W)
   177
   178                                               # values justified in the paper
   179        48          144      3.0      0.0      alpha = 1
   180        48          113      2.4      0.0      beta = 0.1
   181       638         1880      2.9      0.1      for n_iter in range(1, max_iter + 1):
   182       638       195133    305.9     10.2          grad = np.dot(WtW, H) - WtV
   183       638       495761    777.1     25.9          proj_gradient = norm(grad[np.logical_or(grad < 0, H > 0)])
   184       638         2449      3.8      0.1          if proj_gradient < tol:
   185        48          130      2.7      0.0              break
   186
   187      1474         4474      3.0      0.2          for inner_iter in range(1, 20):
   188      1474        83833     56.9      4.4              Hn = H - alpha * grad
   189                                                       # Hn = np.where(Hn > 0, Hn, 0)
   190      1474       194239    131.8     10.1              Hn = _pos(Hn)
   191      1474        48858     33.1      2.5              d = Hn - H
   192      1474       150407    102.0      7.8              gradd = np.sum(grad * d)
   193      1474       515390    349.7     26.9              dQd = np.sum(np.dot(WtW, d) * d)
   ...
```
By looking at the top values of the % Time column it is really easy to pinpoint the most expensive expressions that would deserve additional care.
Memory usage profiling#
You can analyze in detail the memory usage of any Python code with the help of memory_profiler. First, install the latest version:

```
pip install -U memory_profiler
```
Then, set up the magics in a manner similar to line_profiler.

Under IPython 0.11+, first create a configuration profile:

```
ipython profile create
```

Then register the extension in ~/.ipython/profile_default/ipython_config.py alongside the line profiler:

```
c.TerminalIPythonApp.extensions.append('memory_profiler')
c.InteractiveShellApp.extensions.append('memory_profiler')
```

This will register the %memit and %mprun magic commands in the IPython terminal application and the other frontends such as qtconsole and notebook.
%mprun is useful to examine, line-by-line, the memory usage of key functions in your program. It is very similar to %lprun, discussed in the previous section. For example, from the memory_profiler examples directory:

```
In [1]: from example import my_func

In [2]: %mprun -f my_func my_func()
Filename: example.py

Line #    Mem usage   Increment   Line Contents
===============================================
     3                            @profile
     4      5.97 MB     0.00 MB   def my_func():
     5     13.61 MB     7.64 MB       a = [1] * (10 ** 6)
     6    166.20 MB   152.59 MB       b = [2] * (2 * 10 ** 7)
     7     13.61 MB  -152.59 MB       del b
     8     13.61 MB     0.00 MB       return a
```
Another useful magic that memory_profiler defines is %memit, which is analogous to %timeit. It can be used as follows:

```
In [1]: import numpy as np

In [2]: %memit np.zeros(1e7)
maximum of 3: 76.402344 MB per loop
```

For more details, see the docstrings of the magics, using %memit? and %mprun?.
Using Cython#
If profiling of the Python code reveals that the Python interpreter overhead is larger by one order of magnitude or more than the cost of the actual numerical computation (e.g. for loops over vector components, nested evaluation of conditional expressions, scalar arithmetic…), it is probably adequate to extract the hotspot portion of the code as a standalone function in a .pyx file, add static type declarations and then use Cython to generate a C program suitable to be compiled as a Python extension module.

Cython's documentation contains a tutorial and reference guide for developing such a module. For more information about developing in Cython for scikit-learn, see Cython Best Practices, Conventions and Knowledge.
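As a hypothetical sketch (not an actual scikit-learn module; the file and function names are made up for illustration), such a .pyx file might look as follows:

```cython
# _fast_norm.pyx -- hypothetical example module, built with cythonize
cimport cython

@cython.boundscheck(False)  # skip bounds checks: indices are known valid
@cython.wraparound(False)   # disable negative-index handling
def squared_norm(double[::1] x):
    """Sum of squares of a contiguous 1-d double array."""
    cdef Py_ssize_t i
    cdef double total = 0.0
    for i in range(x.shape[0]):
        total += x[i] * x[i]
    return total
```

With static type declarations on the loop index and accumulator, the loop compiles to plain C and avoids interpreter overhead entirely.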
Profiling compiled extensions#
When working with compiled extensions (written in C/C++ with a wrapper or directly as a Cython extension), the default Python profiler is useless: we need a dedicated tool to introspect what's happening inside the compiled extension itself.
Using yep and gperftools#
For easy profiling without special compilation options, use yep:
Using a debugger, gdb#
It is helpful to use gdb to debug. In order to do so, one must use a Python interpreter built with debug support (debug symbols and proper optimization). To create a new conda environment (which you might need to deactivate and reactivate after building/installing) with a source-built CPython interpreter:

```
git clone https://github.com/python/cpython.git
conda create -n debug-scikit-dev
conda activate debug-scikit-dev
cd cpython
mkdir debug
cd debug
../configure --prefix=$CONDA_PREFIX --with-pydebug
make EXTRA_CFLAGS='-DPy_DEBUG' -j<num_cores>
make install
```
Using gprof#
In order to profile compiled Python extensions one could use gprof after having recompiled the project with gcc -pg and using the python-dbg variant of the interpreter on Debian/Ubuntu. However, this approach also requires having numpy and scipy recompiled with -pg, which is rather complicated to get working.
Fortunately there exist two alternative profilers that don't require you to recompile everything.
Using valgrind / callgrind / kcachegrind#
kcachegrind#
yep can be used to create a profiling report. kcachegrind provides a graphical environment to visualize this report:

```
# Run yep to profile some python script
python -m yep -c my_file.py
```

```
# Open the resulting my_file.py.prof with kcachegrind
kcachegrind my_file.py.prof
```
Note
yep can be executed with the argument --lines or -l to compile a profiling report 'line by line'.
Multi-core parallelism using joblib.Parallel#
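As a minimal sketch of coarse-grained parallelism with joblib.Parallel (the costly_fit function here is a made-up stand-in for an expensive, independent unit of work such as one model fit per parameter value):

```python
from joblib import Parallel, delayed

def costly_fit(seed):
    # Stand-in for an expensive, independent computation.
    return seed ** 2

# Coarse-grained parallelism: one task per independent unit of work,
# dispatched across worker processes.
results = Parallel(n_jobs=2)(delayed(costly_fit)(i) for i in range(4))
print(results)  # [0, 1, 4, 9]
```

This pattern only pays off when each task is expensive enough to amortize the process dispatch overhead, which is why the profiling sections above recommend optimizing the single-core hotspots first.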
A simple algorithmic trick: warm restarts#
See the glossary entry for warm_start.