Enhancing performance
In this part of the tutorial, we will investigate how to speed up certain functions operating on pandas DataFrame using Cython, Numba and pandas.eval(). Generally, using Cython and Numba can offer a larger speedup than using pandas.eval() but will require a lot more code.
Note
In addition to following the steps in this tutorial, users interested in enhancing performance are highly encouraged to install the recommended dependencies for pandas. These dependencies are often not installed by default, but will offer speed improvements if present.
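As a sketch of what that installation can look like (the "performance" extra is an assumption about your pandas version; the individual packages can always be installed directly):

# A minimal sketch, run from an IPython session.
%pip install "pandas[performance]"
# or install the accelerators individually:
%pip install numexpr bottleneck numba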
Cython (writing C extensions for pandas)
For many use cases writing pandas in pure Python and NumPy is sufficient. In some computationally heavy applications however, it can be possible to achieve sizable speed-ups by offloading work to Cython.
This tutorial assumes you have refactored as much as possible in Python, for example by trying to remove for-loops and making use of NumPy vectorization. It's always worth optimising in Python first.
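As a rough illustration of that kind of rewrite (the function names below are ours, not part of this guide), both versions compute the same result, but the second keeps the loop inside compiled NumPy code:

import numpy as np

values = np.random.randn(1_000_000)

def sum_of_squares_loop(arr):
    # One Python-level iteration (and one float boxing) per element
    total = 0.0
    for v in arr:
        total += v * v
    return total

def sum_of_squares_vectorized(arr):
    # The multiply and the reduction both run in C inside NumPy
    return np.sum(arr * arr)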
This tutorial walks through a "typical" process of cythonizing a slow computation. We use an example from the Cython documentation but in the context of pandas. Our final cythonized solution is around 100 times faster than the pure Python solution.
Pure Python
We have a DataFrame to which we want to apply a function row-wise.
In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "a": np.random.randn(1000),
   ...:         "b": np.random.randn(1000),
   ...:         "N": np.random.randint(100, 1000, (1000)),
   ...:         "x": "x",
   ...:     }
   ...: )

In [2]: df
Out[2]:
            a         b    N  x
0    0.469112 -0.218470  585  x
1   -0.282863 -0.061645  841  x
2   -1.509059 -0.723780  251  x
3   -1.135632  0.551225  972  x
4    1.212112 -0.497767  181  x
..        ...       ...  ...  ..
995 -1.512743  0.874737  374  x
996  0.933753  1.120790  246  x
997 -0.308013  0.198768  157  x
998 -0.079915  1.757555  977  x
999 -1.010589 -1.115680  770  x

[1000 rows x 4 columns]
Here’s the function in pure Python:
In [3]: def f(x):
   ...:     return x * (x - 1)
   ...:

In [4]: def integrate_f(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f(a + i * dx)
   ...:     return s * dx
   ...:
We achieve our result by using DataFrame.apply() (row-wise):
In [5]: %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
84 ms +- 1.01 ms per loop (mean +- std. dev. of 7 runs, 10 loops each)
Let's take a look and see where the time is spent during this operation using the %prun IPython magic function:
# most time consuming 4 calls
In [6]: %prun -l 4 df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)  # noqa E999
         605956 function calls (605938 primitive calls) in 0.171 seconds

   Ordered by: internal time
   List reduced from 163 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.099    0.000    0.152    0.000 <ipython-input-4-c2a74e076cf0>:1(integrate_f)
   552423    0.053    0.000    0.053    0.000 <ipython-input-3-c138bdd570e3>:1(f)
     3000    0.003    0.000    0.013    0.000 series.py:1095(__getitem__)
     3000    0.002    0.000    0.006    0.000 series.py:1220(_get_value)
By far the majority of time is spent inside either integrate_f or f, hence we'll concentrate our efforts cythonizing these two functions.
Plain Cython
First we’re going to need to import the Cython magic function to IPython:
In [7]: %load_ext Cython
Now, let’s simply copy our functions over to Cython:
In [8]: %%cython
   ...: def f_plain(x):
   ...:     return x * (x - 1)
   ...: def integrate_f_plain(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_plain(a + i * dx)
   ...:     return s * dx
   ...:
In [9]: %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
47.2 ms +- 366 us per loop (mean +- std. dev. of 7 runs, 10 loops each)
This has nearly halved the runtime compared to the pure Python approach (84 ms down to 47.2 ms).
Declaring C types
We can annotate the function variables and return types as well as use cdef and cpdef to improve performance:
In [10]: %%cython
    ...: cdef double f_typed(double x) except? -2:
    ...:     return x * (x - 1)
    ...: cpdef double integrate_f_typed(double a, double b, int N):
    ...:     cdef int i
    ...:     cdef double s, dx
    ...:     s = 0
    ...:     dx = (b - a) / N
    ...:     for i in range(N):
    ...:         s += f_typed(a + i * dx)
    ...:     return s * dx
    ...:
In [11]: %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
7.75 ms +- 23.9 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
Annotating the functions with C types yields an over ten times performance improvement compared to the original Python implementation.
Using ndarray
When re-profiling, time is spent creating a Series from each row, and calling __getitem__ from both the index and the series (three times for each row). These Python function calls are expensive and can be improved by passing an np.ndarray.
In [12]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
         52533 function calls (52515 primitive calls) in 0.019 seconds

   Ordered by: internal time
   List reduced from 161 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3000    0.003    0.000    0.012    0.000 series.py:1095(__getitem__)
     3000    0.002    0.000    0.005    0.000 series.py:1220(_get_value)
     3000    0.002    0.000    0.002    0.000 base.py:3777(get_loc)
     3000    0.002    0.000    0.002    0.000 indexing.py:2765(check_dict_or_set_indexers)
In [13]: %%cython
    ...: cimport numpy as np
    ...: import numpy as np
    ...: cdef double f_typed(double x) except? -2:
    ...:     return x * (x - 1)
    ...: cpdef double integrate_f_typed(double a, double b, int N):
    ...:     cdef int i
    ...:     cdef double s, dx
    ...:     s = 0
    ...:     dx = (b - a) / N
    ...:     for i in range(N):
    ...:         s += f_typed(a + i * dx)
    ...:     return s * dx
    ...: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
    ...:                                            np.ndarray col_N):
    ...:     assert (col_a.dtype == np.float64
    ...:             and col_b.dtype == np.float64 and col_N.dtype == np.dtype(int))
    ...:     cdef Py_ssize_t i, n = len(col_N)
    ...:     assert (len(col_a) == len(col_b) == n)
    ...:     cdef np.ndarray[double] res = np.empty(n)
    ...:     for i in range(len(col_a)):
    ...:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    ...:     return res
    ...:
This implementation creates an empty array and fills it with the result of integrate_f_typed applied over each row. Looping over an ndarray is faster in Cython than looping over a Series object.
Since apply_integrate_f is typed to accept an np.ndarray, Series.to_numpy() calls are needed to utilize this function.
In [14]: %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
834 us +- 2.87 us per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
Performance has improved from the prior implementation by almost ten times.
Disabling compiler directives
The majority of the time is now spent in apply_integrate_f. Disabling Cython's boundscheck and wraparound checks can yield more performance.
In [15]: %prun -l 4 apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
         78 function calls in 0.001 seconds

   Ordered by: internal time
   List reduced from 21 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        3    0.000    0.000    0.000    0.000 frame.py:4062(__getitem__)
        3    0.000    0.000    0.000    0.000 base.py:541(to_numpy)
In [16]: %%cython
    ...: cimport cython
    ...: cimport numpy as np
    ...: import numpy as np
    ...: cdef np.float64_t f_typed(np.float64_t x) except? -2:
    ...:     return x * (x - 1)
    ...: cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
    ...:     cdef np.int64_t i
    ...:     cdef np.float64_t s = 0.0, dx
    ...:     dx = (b - a) / N
    ...:     for i in range(N):
    ...:         s += f_typed(a + i * dx)
    ...:     return s * dx
    ...: @cython.boundscheck(False)
    ...: @cython.wraparound(False)
    ...: cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
    ...:     np.ndarray[np.float64_t] col_a,
    ...:     np.ndarray[np.float64_t] col_b,
    ...:     np.ndarray[np.int64_t] col_N
    ...: ):
    ...:     cdef np.int64_t i, n = len(col_N)
    ...:     assert len(col_a) == len(col_b) == n
    ...:     cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
    ...:     for i in range(n):
    ...:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
    ...:     return res
    ...:
In [17]: %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
622 us +- 672 ns per loop (mean +- std. dev. of 7 runs, 1,000 loops each)
However, a loop indexer i accessing an invalid location in an array would cause a segfault because memory access isn't checked. For more about boundscheck and wraparound, see the Cython docs on compiler directives.
Numba (JIT compilation)
An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with Numba.

Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran, by decorating your function with @jit.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.
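In its simplest form this looks like the sketch below (a toy function of ours, not part of this guide); the first call triggers compilation for the given argument types, and later calls reuse the machine code:

import numba
import numpy as np

@numba.jit
def sum_of_cubes(arr):
    # This explicit loop is compiled to native code by Numba
    total = 0.0
    for x in arr:
        total += x ** 3
    return total

arr = np.random.randn(1_000_000)
sum_of_cubes(arr)  # slow: includes compilation
sum_of_cubes(arr)  # fast: reuses the compiled machine code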
Note
The @jit compilation will add overhead to the runtime of the function, so performance benefits may not be realized, especially when using small data sets. Consider caching your function to avoid compilation overhead each time your function is run.
Numba can be used in two ways with pandas:

- Specify the engine="numba" keyword in select pandas methods
- Define your own Python function decorated with @jit and pass the underlying NumPy array of Series or DataFrame (using Series.to_numpy()) into the function
pandas Numba Engine
If Numba is installed, one can specify engine="numba" in select pandas methods to execute the method using Numba. Methods that support engine="numba" will also have an engine_kwargs keyword that accepts a dictionary that allows one to specify "nogil", "nopython" and "parallel" keys with boolean values to pass into the @jit decorator. If engine_kwargs is not specified, it defaults to {"nogil": False, "nopython": True, "parallel": False}.
Note
In terms of performance, the first time a function is run using the Numba engine will be slow as Numba will have some function compilation overhead. However, the JIT compiled functions are cached, and subsequent calls will be fast. In general, the Numba engine is performant with a larger amount of data points (e.g. 1+ million).
In [1]: data = pd.Series(range(1_000_000))  # noqa: E225

In [2]: roll = data.rolling(10)

In [3]: def f(x):
   ...:     return np.sum(x) + 5

# Run the first time, compilation time will affect performance
In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)
1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# Function is cached and performance will improve
In [5]: %timeit roll.apply(f, engine='numba', raw=True)
188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit roll.apply(f, engine='cython', raw=True)
3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
If your compute hardware contains multiple CPUs, the largest performance gain can be realized by setting parallel to True to leverage more than 1 CPU. Internally, pandas leverages Numba to parallelize computations over the columns of a DataFrame; therefore, this performance benefit is only beneficial for a DataFrame with a large number of columns.
In [1]: import numba

In [2]: numba.set_num_threads(1)

In [3]: df = pd.DataFrame(np.random.randn(10_000, 100))

In [4]: roll = df.rolling(100)

In [5]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
347 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: numba.set_num_threads(2)

In [7]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
201 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Custom Function Examples
A custom Python function decorated with @jit can be used with pandas objects by passing their NumPy array representations with Series.to_numpy().
import numba


@numba.jit
def f_plain(x):
    return x * (x - 1)


@numba.jit
def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx


@numba.jit
def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    result = np.empty(n, dtype="float64")
    assert len(col_a) == len(col_b) == n
    for i in range(n):
        result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return result


def compute_numba(df):
    result = apply_integrate_f_numba(
        df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy()
    )
    return pd.Series(result, index=df.index, name="result")
In [4]: %timeit compute_numba(df)
1000 loops, best of 3: 798 us per loop
In this example, using Numba was faster than Cython.
Numba can also be used to write vectorized functions that do not require the user to explicitly loop over the observations of a vector; a vectorized function will be applied to each row automatically. Consider the following example of doubling each observation:
import numba


def double_every_value_nonumba(x):
    return x * 2


@numba.vectorize
def double_every_value_withnumba(x):  # noqa E501
    return x * 2
# Custom function without numba
In [5]: %timeit df["col1_doubled"] = df["a"].apply(double_every_value_nonumba)  # noqa E501
1000 loops, best of 3: 797 us per loop

# Standard implementation (faster than a custom function)
In [6]: %timeit df["col1_doubled"] = df["a"] * 2
1000 loops, best of 3: 233 us per loop

# Custom function with numba
In [7]: %timeit df["col1_doubled"] = double_every_value_withnumba(df["a"].to_numpy())
1000 loops, best of 3: 145 us per loop
Caveats
Numba is best at accelerating functions that apply numerical functions to NumPy arrays. If you try to @jit a function that contains unsupported Python or NumPy code, compilation will fall back to object mode, which will most likely not speed up your function. If you would prefer that Numba throw an error if it cannot compile a function in a way that speeds up your code, pass Numba the argument nopython=True (e.g. @jit(nopython=True)). For more on troubleshooting Numba modes, see the Numba troubleshooting page.
Using parallel=True (e.g. @jit(parallel=True)) may result in a SIGABRT if the threading layer leads to unsafe behavior. You can first specify a safe threading layer before running a JIT function with parallel=True.
Generally, if you encounter a segfault (SIGSEGV) while using Numba, please report the issue to the Numba issue tracker.
Expression evaluation via eval()
The top-level function pandas.eval() implements performant expression evaluation of Series and DataFrame objects. Expression evaluation allows operations to be expressed as strings and can potentially provide a performance improvement by evaluating arithmetic and boolean expressions all at once for large DataFrames.
Note
You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact, eval() is many orders of magnitude slower for smaller expressions or objects than plain Python. A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.
Supported syntax
These operations are supported by pandas.eval():

- Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio
- Comparison operations, including chained comparisons, e.g., 2 < df < df2
- Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool
- list and tuple literals, e.g., [1, 2] or (1, 2)
- Attribute access, e.g., df.a
- Subscript expressions, e.g., df[0]
- Simple variable evaluation, e.g., pd.eval("df") (this is not very useful)
- Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs, arctan2 and log10.
The following Python syntax is not allowed:

- Expressions
  - Function calls other than math functions
  - is/is not operations
  - if expressions
  - lambda expressions
  - list/set/dict comprehensions
  - Literal dict and set expressions
  - yield expressions
  - Generator expressions
  - Boolean expressions consisting of only scalar values
- Statements
  - Neither simple nor compound statements are allowed
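As a quick illustration (the exact exception types are our observation and may differ across pandas versions), disallowed syntax fails at parse time rather than silently degrading:

import pandas as pd

df = pd.DataFrame({"a": range(5)})

# Each of these uses disallowed syntax and raises an error:
# pd.eval("df.a is df.a")         # 'is' comparisons
# pd.eval("[x for x in (1, 2)]")  # comprehensions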
Local variables
You must explicitly reference any local variable that you want to use in an expression by placing the @ character in front of the name. This mechanism is the same for both DataFrame.query() and DataFrame.eval(). For example,
In [18]: df = pd.DataFrame(np.random.randn(5, 2), columns=list("ab"))

In [19]: newcol = np.random.randn(len(df))

In [20]: df.eval("b + @newcol")
Out[20]:
0   -0.206122
1   -1.029587
2    0.519726
3   -2.052589
4    1.453210
dtype: float64

In [21]: df.query("b < @newcol")
Out[21]:
          a         b
1  0.160268 -0.848896
3  0.333758 -1.180355
4  0.572182  0.439895
If you don't prefix the local variable with @, pandas will raise an exception telling you the variable is undefined.
When using DataFrame.eval() and DataFrame.query(), this allows you to have a local variable and a DataFrame column with the same name in an expression.
In [22]: a = np.random.randn()

In [23]: df.query("@a < a")
Out[23]:
          a         b
0  0.473349  0.891236
1  0.160268 -0.848896
2  0.803311  1.662031
3  0.333758 -1.180355
4  0.572182  0.439895

In [24]: df.loc[a < df["a"]]  # same as the previous expression
Out[24]:
          a         b
0  0.473349  0.891236
1  0.160268 -0.848896
2  0.803311  1.662031
3  0.333758 -1.180355
4  0.572182  0.439895
Warning
pandas.eval() will raise an exception if you use the @ prefix in a top-level call, because it isn't defined in that context.
In [25]: a, b = 1, 2

In [26]: pd.eval("@a + b")
Traceback (most recent call last):
  File ~/micromamba/envs/test/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3577 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  Cell In[26], line 1
    pd.eval("@a + b")
  File ~/work/pandas/pandas/pandas/core/computation/eval.py:325 in eval
    _check_for_locals(expr, level, parser)
  File ~/work/pandas/pandas/pandas/core/computation/eval.py:167 in _check_for_locals
    raise SyntaxError(msg)
  File <string>
SyntaxError: The '@' prefix is not allowed in top-level eval calls.
please refer to your variables by name without the '@' prefix.
In this case, you should simply refer to the variables like you would in standard Python.
In [27]: pd.eval("a + b")
Out[27]: 3
pandas.eval() parsers
There are two different expression syntax parsers.
The default 'pandas' parser allows a more intuitive syntax for expressing query-like operations (comparisons, conjunctions and disjunctions). In particular, the precedence of the & and | operators is made equal to the precedence of the corresponding boolean operations and and or.
For example, the above conjunction can be written without parentheses. Alternatively, you can use the 'python' parser to enforce strict Python semantics.
In [28]: nrows, ncols = 20000, 100

In [29]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]

In [30]: expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"

In [31]: x = pd.eval(expr, parser="python")

In [32]: expr_no_parens = "df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0"

In [33]: y = pd.eval(expr_no_parens, parser="pandas")

In [34]: np.all(x == y)
Out[34]: True
The same expression can be "anded" together with the word and as well:
In [35]: expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"

In [36]: x = pd.eval(expr, parser="python")

In [37]: expr_with_ands = "df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0"

In [38]: y = pd.eval(expr_with_ands, parser="pandas")

In [39]: np.all(x == y)
Out[39]: True
The and and or operators here have the same precedence that they would in Python.
pandas.eval() engines
There are two different expression engines.
The 'numexpr' engine is the more performant engine that can yield performance improvements compared to standard Python syntax for large DataFrames. This engine requires the optional dependency numexpr to be installed.
The 'python' engine is generally not useful except for testing other evaluation engines against it. You will achieve no performance benefits using eval() with engine='python' and may incur a performance hit.
In [40]: %timeit df1 + df2 + df3 + df4
7.3 ms +- 24.9 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [41]: %timeit pd.eval("df1 + df2 + df3 + df4", engine="python")
7.92 ms +- 70.6 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
The DataFrame.eval() method
In addition to the top level pandas.eval() function you can also evaluate an expression in the "context" of a DataFrame.
In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])

In [43]: df.eval("a + b")
Out[43]:
0   -0.161099
1    0.805452
2    0.747447
3    1.189042
4   -2.057490
dtype: float64
Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with the added benefit that you don't have to prefix the name of the DataFrame to the column(s) you're interested in evaluating.
In addition, you can perform assignment of columns within an expression. This allows for formulaic evaluation. The assignment target can be a new column name or an existing column name, and it must be a valid Python identifier.
In [44]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

In [45]: df = df.eval("c = a + b")

In [46]: df = df.eval("d = a + b + c")

In [47]: df = df.eval("a = 1")

In [48]: df
Out[48]:
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26
A copy of the DataFrame with the new or modified columns is returned, and the original frame is unchanged.
In [49]: df
Out[49]:
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

In [50]: df.eval("e = a - c")
Out[50]:
   a  b   c   d   e
0  1  5   5  10  -4
1  1  6   7  14  -6
2  1  7   9  18  -8
3  1  8  11  22 -10
4  1  9  13  26 -12

In [51]: df
Out[51]:
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26
Multiple column assignments can be performed by using a multi-line string.
In [52]: df.eval(
    ...:     """
    ...: c = a + b
    ...: d = a + b + c
    ...: a = 1""",
    ...: )
Out[52]:
   a  b   c   d
0  1  5   6  12
1  1  6   7  14
2  1  7   8  16
3  1  8   9  18
4  1  9  10  20
The equivalent in standard Python would be
In [53]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

In [54]: df["c"] = df["a"] + df["b"]

In [55]: df["d"] = df["a"] + df["b"] + df["c"]

In [56]: df["a"] = 1

In [57]: df
Out[57]:
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26
eval() performance comparison
pandas.eval() works well with expressions containing large arrays.
In [58]: nrows, ncols = 20000, 100

In [59]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]
DataFrame arithmetic:
In [60]: %timeit df1 + df2 + df3 + df4
7.72 ms +- 56.9 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [61]: %timeit pd.eval("df1 + df2 + df3 + df4")
2.89 ms +- 73.7 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
DataFrame comparison:
In [62]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
6.08 ms +- 48.5 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [63]: %timeit pd.eval("(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)")
9.32 ms +- 24.1 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
DataFrame arithmetic with unaligned axes:
In [64]: s = pd.Series(np.random.randn(50))

In [65]: %timeit df1 + df2 + df3 + df4 + s
12.7 ms +- 69.2 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
In [66]: %timeit pd.eval("df1 + df2 + df3 + df4 + s")
3.61 ms +- 41.1 us per loop (mean +- std. dev. of 7 runs, 100 loops each)
Note
Operations such as

1 and 2  # would parse to 1 & 2, but should evaluate to 2
3 or 4   # would parse to 3 | 4, but should evaluate to 3
~1       # this is okay, but slower when using eval
should be performed in Python. An exception will be raised if you try to perform any boolean/bitwise operations with scalar operands that are not of type bool or np.bool_.
Here is a plot showing the running time of pandas.eval() as a function of the size of the frame involved in the computation. The two lines are two different engines.

You will only see the performance benefits of using the numexpr engine with pandas.eval() if your DataFrame has more than approximately 100,000 rows.
This plot was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().
Expression evaluation limitations with numexpr
Expressions that would result in an object dtype or involve datetime operations because of NaT must be evaluated in Python space, but part of an expression can still be evaluated with numexpr. For example:
In [67]: df = pd.DataFrame(
    ...:     {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
    ...: )

In [68]: df
Out[68]:
  strings  nums
0       c     0
1       c     0
2       c     0
3       b     1
4       b     1
5       b     1
6       a     2
7       a     2
8       a     2

In [69]: df.query("strings == 'a' and nums == 1")
Out[69]:
Empty DataFrame
Columns: [strings, nums]
Index: []
The numeric part of the comparison (nums == 1) will be evaluated by numexpr and the object part of the comparison (strings == 'a') will be evaluated by Python.