
Enhancing performance

In this part of the tutorial, we will investigate how to speed up certain functions operating on pandas DataFrame using Cython, Numba and pandas.eval(). Generally, using Cython and Numba can offer a larger speedup than using pandas.eval(), but will require a lot more code.

Note

In addition to following the steps in this tutorial, users interested in enhancing performance are highly encouraged to install the recommended dependencies for pandas. These dependencies are often not installed by default, but will offer speed improvements if present.

Cython (writing C extensions for pandas)

For many use cases writing pandas in pure Python and NumPy is sufficient. In some computationally heavy applications however, it can be possible to achieve sizable speed-ups by offloading work to Cython.

This tutorial assumes you have refactored as much as possible in Python, for example by trying to remove for-loops and making use of NumPy vectorization. It's always worth optimising in Python first.
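For instance, a minimal sketch (not part of the original example) of the kind of refactoring meant here, replacing an element-wise Python loop with a single vectorized NumPy expression:

import numpy as np

values = np.random.randn(1_000_000)

# Python-level loop: one interpreter round-trip per element.
def cube_loop(values):
    out = np.empty_like(values)
    for i in range(len(values)):
        out[i] = values[i] ** 3
    return out

# Vectorized equivalent: the loop runs inside compiled NumPy code.
def cube_vectorized(values):
    return values ** 3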

This tutorial walks through a “typical” process of cythonizing a slow computation. We use an example from the Cython documentation but in the context of pandas. Our final cythonized solution is around 100 times faster than the pure Python solution.

Pure Python

We have a DataFrame to which we want to apply a function row-wise.

In [1]: df = pd.DataFrame(
   ...:     {
   ...:         "a": np.random.randn(1000),
   ...:         "b": np.random.randn(1000),
   ...:         "N": np.random.randint(100, 1000, (1000)),
   ...:         "x": "x",
   ...:     }
   ...: )

In [2]: df
Out[2]: 
            a         b    N  x
0    0.469112 -0.218470  585  x
1   -0.282863 -0.061645  841  x
2   -1.509059 -0.723780  251  x
3   -1.135632  0.551225  972  x
4    1.212112 -0.497767  181  x
..        ...       ...  ... ..
995 -1.512743  0.874737  374  x
996  0.933753  1.120790  246  x
997 -0.308013  0.198768  157  x
998 -0.079915  1.757555  977  x
999 -1.010589 -1.115680  770  x

[1000 rows x 4 columns]

Here’s the function in pure Python:

In [3]: def f(x):
   ...:     return x * (x - 1)

In [4]: def integrate_f(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f(a + i * dx)
   ...:     return s * dx

We achieve our result by using DataFrame.apply() (row-wise):

In [5]: %timeit df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)
84 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Let's take a look and see where the time is spent during this operation using the prun IPython magic function:

# most time consuming 4 calls
In [6]: %prun -l 4 df.apply(lambda x: integrate_f(x["a"], x["b"], x["N"]), axis=1)  # noqa E999
         605956 function calls (605938 primitive calls) in 0.171 seconds

   Ordered by: internal time
   List reduced from 163 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     1000    0.099    0.000    0.152    0.000 <ipython-input-4-c2a74e076cf0>:1(integrate_f)
   552423    0.053    0.000    0.053    0.000 <ipython-input-3-c138bdd570e3>:1(f)
     3000    0.003    0.000    0.013    0.000 series.py:1095(__getitem__)
     3000    0.002    0.000    0.006    0.000 series.py:1220(_get_value)

By far the majority of time is spent inside either integrate_f or f, hence we'll concentrate our efforts on cythonizing these two functions.

Plain Cython

First we’re going to need to import the Cython magic function to IPython:

In [7]: %load_ext Cython

Now, let’s simply copy our functions over to Cython:

In [8]: %%cython
   ...: def f_plain(x):
   ...:     return x * (x - 1)
   ...: def integrate_f_plain(a, b, N):
   ...:     s = 0
   ...:     dx = (b - a) / N
   ...:     for i in range(N):
   ...:         s += f_plain(a + i * dx)
   ...:     return s * dx

In [9]: %timeit df.apply(lambda x: integrate_f_plain(x["a"], x["b"], x["N"]), axis=1)
47.2 ms ± 366 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)

This has improved the performance compared to the pure Python approach by one-third.

Declaring C types

We can annotate the function variables and return types as well as use cdef and cpdef to improve performance:

In [10]: %%cython
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx

In [11]: %timeit df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
7.75 ms ± 23.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Annotating the functions with C types yields an over ten times performance improvement compared to the original Python implementation.

Using ndarray

When re-profiling, time is spent creating a Series from each row, and calling __getitem__ from both the index and the series (three times for each row). These Python function calls are expensive and can be improved by passing an np.ndarray.

In [12]: %prun -l 4 df.apply(lambda x: integrate_f_typed(x["a"], x["b"], x["N"]), axis=1)
         52533 function calls (52515 primitive calls) in 0.019 seconds

   Ordered by: internal time
   List reduced from 161 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
     3000    0.003    0.000    0.012    0.000 series.py:1095(__getitem__)
     3000    0.002    0.000    0.005    0.000 series.py:1220(_get_value)
     3000    0.002    0.000    0.002    0.000 base.py:3777(get_loc)
     3000    0.002    0.000    0.002    0.000 indexing.py:2765(check_dict_or_set_indexers)

In [13]: %%cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef double f_typed(double x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef double integrate_f_typed(double a, double b, int N):
   ....:     cdef int i
   ....:     cdef double s, dx
   ....:     s = 0
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....: cpdef np.ndarray[double] apply_integrate_f(np.ndarray col_a, np.ndarray col_b,
   ....:                                            np.ndarray col_N):
   ....:     assert (col_a.dtype == np.float64
   ....:             and col_b.dtype == np.float64 and col_N.dtype == np.dtype(int))
   ....:     cdef Py_ssize_t i, n = len(col_N)
   ....:     assert (len(col_a) == len(col_b) == n)
   ....:     cdef np.ndarray[double] res = np.empty(n)
   ....:     for i in range(len(col_a)):
   ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
   ....:     return res

This implementation creates an empty array and inserts the result of integrate_f_typed applied over each row. Looping over an ndarray is faster in Cython than looping over a Series object.

Since apply_integrate_f is typed to accept an np.ndarray, Series.to_numpy() calls are needed to utilize this function.

In [14]: %timeit apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
834 µs ± 2.87 µs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

Performance has improved from the prior implementation by almost ten times.

Disabling compiler directives

The majority of the time is now spent in apply_integrate_f. Disabling Cython's boundscheck and wraparound checks can yield more performance.

In [15]: %prun -l 4 apply_integrate_f(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
         78 function calls in 0.001 seconds

   Ordered by: internal time
   List reduced from 21 to 4 due to restriction <4>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    0.001    0.001 <string>:1(<module>)
        1    0.000    0.000    0.001    0.001 {built-in method builtins.exec}
        3    0.000    0.000    0.000    0.000 frame.py:4062(__getitem__)
        3    0.000    0.000    0.000    0.000 base.py:541(to_numpy)

In [16]: %%cython
   ....: cimport cython
   ....: cimport numpy as np
   ....: import numpy as np
   ....: cdef np.float64_t f_typed(np.float64_t x) except? -2:
   ....:     return x * (x - 1)
   ....: cpdef np.float64_t integrate_f_typed(np.float64_t a, np.float64_t b, np.int64_t N):
   ....:     cdef np.int64_t i
   ....:     cdef np.float64_t s = 0.0, dx
   ....:     dx = (b - a) / N
   ....:     for i in range(N):
   ....:         s += f_typed(a + i * dx)
   ....:     return s * dx
   ....: @cython.boundscheck(False)
   ....: @cython.wraparound(False)
   ....: cpdef np.ndarray[np.float64_t] apply_integrate_f_wrap(
   ....:     np.ndarray[np.float64_t] col_a,
   ....:     np.ndarray[np.float64_t] col_b,
   ....:     np.ndarray[np.int64_t] col_N
   ....: ):
   ....:     cdef np.int64_t i, n = len(col_N)
   ....:     assert len(col_a) == len(col_b) == n
   ....:     cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
   ....:     for i in range(n):
   ....:         res[i] = integrate_f_typed(col_a[i], col_b[i], col_N[i])
   ....:     return res

In [17]: %timeit apply_integrate_f_wrap(df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy())
622 µs ± 672 ns per loop (mean ± std. dev. of 7 runs, 1,000 loops each)

However, a loop indexer i accessing an invalid location in an array would cause a segfault because memory access isn't checked. For more about boundscheck and wraparound, see the Cython docs on compiler directives.
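If you prefer not to decorate each function, the same directives can also be set for an entire Cython module with a directive comment at the top of the cell. A minimal sketch under that approach (the scale function here is illustrative, not part of the tutorial):

%%cython
# cython: boundscheck=False, wraparound=False
# Module-level directive comments apply to every function defined in this cell,
# so per-function @cython.boundscheck/@cython.wraparound decorators are not needed.
cimport numpy as np
import numpy as np

cpdef np.ndarray[np.float64_t] scale(np.ndarray[np.float64_t] col, np.float64_t k):
    cdef Py_ssize_t i, n = len(col)
    cdef np.ndarray[np.float64_t] res = np.empty(n, dtype=np.float64)
    for i in range(n):
        res[i] = col[i] * k
    return res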

Numba (JIT compilation)

An alternative to statically compiling Cython code is to use a dynamic just-in-time (JIT) compiler with Numba.

Numba allows you to write a pure Python function which can be JIT compiled to native machine instructions, similar in performance to C, C++ and Fortran, by decorating your function with @jit.

Numba works by generating optimized machine code using the LLVM compiler infrastructure at import time, runtime, or statically (using the included pycc tool). Numba supports compilation of Python to run on either CPU or GPU hardware and is designed to integrate with the Python scientific software stack.

Note

The @jit compilation will add overhead to the runtime of the function, so performance benefits may not be realized, especially when using small data sets. Consider caching your function to avoid compilation overhead each time your function is run.
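For example, Numba's @jit decorator accepts a cache=True option that saves the compiled machine code to a file-based cache, so the compilation cost is paid once per machine rather than on every run. A minimal sketch (the summed function is illustrative):

import numba

# cache=True persists the compiled code on disk, avoiding recompilation
# in subsequent processes.
@numba.jit(nopython=True, cache=True)
def summed(x):
    total = 0.0
    for v in x:
        total += v
    return total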

Numba can be used in 2 ways with pandas:

  1. Specify the engine="numba" keyword in select pandas methods

  2. Define your own Python function decorated with @jit and pass the underlying NumPy array of Series or DataFrame (using Series.to_numpy()) into the function

pandas Numba Engine

If Numba is installed, one can specify engine="numba" in select pandas methods to execute the method using Numba. Methods that support engine="numba" will also have an engine_kwargs keyword that accepts a dictionary allowing one to specify "nogil", "nopython" and "parallel" keys with boolean values to pass into the @jit decorator. If engine_kwargs is not specified, it defaults to {"nogil": False, "nopython": True, "parallel": False}.

Note

In terms of performance, the first time a function is run using the Numba engine will be slow as Numba will have some function compilation overhead. However, the JIT compiled functions are cached, and subsequent calls will be fast. In general, the Numba engine is performant with a larger amount of data points (e.g. 1+ million).

In [1]: data = pd.Series(range(1_000_000))  # noqa: E225

In [2]: roll = data.rolling(10)

In [3]: def f(x):
   ...:     return np.sum(x) + 5

# Run the first time, compilation time will affect performance
In [4]: %timeit -r 1 -n 1 roll.apply(f, engine='numba', raw=True)
1.23 s ± 0 ns per loop (mean ± std. dev. of 1 run, 1 loop each)

# Function is cached and performance will improve
In [5]: %timeit roll.apply(f, engine='numba', raw=True)
188 ms ± 1.93 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

In [6]: %timeit roll.apply(f, engine='cython', raw=True)
3.92 s ± 59 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

If your compute hardware contains multiple CPUs, the largest performance gain can be realized by setting parallel to True to leverage more than 1 CPU. Internally, pandas leverages Numba to parallelize computations over the columns of a DataFrame; therefore, this is only beneficial for a DataFrame with a large number of columns.

In [1]: import numba

In [2]: numba.set_num_threads(1)

In [3]: df = pd.DataFrame(np.random.randn(10_000, 100))

In [4]: roll = df.rolling(100)

In [5]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
347 ms ± 26 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

In [6]: numba.set_num_threads(2)

In [7]: %timeit roll.mean(engine="numba", engine_kwargs={"parallel": True})
201 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Custom Function Examples

A custom Python function decorated with @jit can be used with pandas objects by passing their NumPy array representations with Series.to_numpy().

import numba


@numba.jit
def f_plain(x):
    return x * (x - 1)


@numba.jit
def integrate_f_numba(a, b, N):
    s = 0
    dx = (b - a) / N
    for i in range(N):
        s += f_plain(a + i * dx)
    return s * dx


@numba.jit
def apply_integrate_f_numba(col_a, col_b, col_N):
    n = len(col_N)
    result = np.empty(n, dtype="float64")
    assert len(col_a) == len(col_b) == n
    for i in range(n):
        result[i] = integrate_f_numba(col_a[i], col_b[i], col_N[i])
    return result


def compute_numba(df):
    result = apply_integrate_f_numba(
        df["a"].to_numpy(), df["b"].to_numpy(), df["N"].to_numpy()
    )
    return pd.Series(result, index=df.index, name="result")

In [4]: %timeit compute_numba(df)
1000 loops, best of 3: 798 µs per loop

In this example, using Numba was faster than Cython.

Numba can also be used to write vectorized functions that do not require the user to explicitly loop over the observations of a vector; a vectorized function will be applied to each row automatically. Consider the following example of doubling each observation:

import numba


def double_every_value_nonumba(x):
    return x * 2


@numba.vectorize
def double_every_value_withnumba(x):  # noqa E501
    return x * 2

# Custom function without numba
In [5]: %timeit df["col1_doubled"] = df["a"].apply(double_every_value_nonumba)  # noqa E501
1000 loops, best of 3: 797 µs per loop

# Standard implementation (faster than a custom function)
In [6]: %timeit df["col1_doubled"] = df["a"] * 2
1000 loops, best of 3: 233 µs per loop

# Custom function with numba
In [7]: %timeit df["col1_doubled"] = double_every_value_withnumba(df["a"].to_numpy())
1000 loops, best of 3: 145 µs per loop

Caveats

Numba is best at accelerating functions that apply numerical functions to NumPy arrays. If you try to @jit a function that contains unsupported Python or NumPy code, compilation will revert to object mode, which will most likely not speed up your function. If you would prefer that Numba throw an error if it cannot compile a function in a way that speeds up your code, pass Numba the argument nopython=True (e.g. @jit(nopython=True)). For more on troubleshooting Numba modes, see the Numba troubleshooting page.
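As a small sketch of that failure mode (the uses_regex function is illustrative, not from the tutorial): a @jit(nopython=True) function that touches an unsupported module fails to compile at its first call rather than silently running slowly.

import re

import numba


@numba.jit(nopython=True)
def uses_regex(s):
    # The re module is not supported in nopython mode.
    return re.match(r"\d+", s)


try:
    uses_regex("123")  # compilation happens lazily, at the first call
except Exception as err:  # typically a numba TypingError
    print(type(err).__name__)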

Using parallel=True (e.g. @jit(parallel=True)) may result in a SIGABRT if the threading layer leads to unsafe behavior. You can first specify a safe threading layer before running a JIT function with parallel=True.
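A minimal sketch of selecting a threading layer up front (the "safe" layer requires Intel TBB to be installed; parallel_sum is illustrative):

import numba

# Must be set before the first parallel=True function is executed.
numba.config.THREADING_LAYER = "safe"


@numba.jit(nopython=True, parallel=True)
def parallel_sum(x):
    total = 0.0
    for i in numba.prange(len(x)):  # parallel loop; Numba recognizes the reduction on total
        total += x[i]
    return total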

Generally, if you encounter a segfault (SIGSEGV) while using Numba, please report the issue to the Numba issue tracker.

Expression evaluation via eval()

The top-level function pandas.eval() implements performant expression evaluation of Series and DataFrame. Expression evaluation allows operations to be expressed as strings and can potentially provide a performance improvement by evaluating arithmetic and boolean expressions all at once for large DataFrame objects.

Note

You should not use eval() for simple expressions or for expressions involving small DataFrames. In fact, eval() is many orders of magnitude slower for smaller expressions or objects than plain Python. A good rule of thumb is to only use eval() when you have a DataFrame with more than 10,000 rows.

Supported syntax

These operations are supported by pandas.eval():

  • Arithmetic operations except for the left shift (<<) and right shift (>>) operators, e.g., df + 2 * pi / s ** 4 % 42 - the_golden_ratio

  • Comparison operations, including chained comparisons, e.g., 2 < df < df2

  • Boolean operations, e.g., df < df2 and df3 < df4 or not df_bool

  • list and tuple literals, e.g., [1, 2] or (1, 2)

  • Attribute access, e.g., df.a

  • Subscript expressions, e.g., df[0]

  • Simple variable evaluation, e.g., pd.eval("df") (this is not very useful)

  • Math functions: sin, cos, exp, log, expm1, log1p, sqrt, sinh, cosh, tanh, arcsin, arccos, arctan, arccosh, arcsinh, arctanh, abs, arctan2 and log10.
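Several of these constructs combined, as a small sketch (the DataFrame here is illustrative; any numeric frame works):

import numpy as np
import pandas as pd

df = pd.DataFrame({"a": np.random.randn(10), "b": np.random.randn(10)})

pd.eval("df.a + 2 * df.b")                   # arithmetic and attribute access
pd.eval("-1 < df.a < 1")                     # chained comparison
pd.eval("sin(df.a) ** 2 + cos(df.a) ** 2")   # supported math functions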

The following Python syntax is not allowed:

  • Expressions

    • Function calls other than math functions.

    • is/is not operations

    • if expressions

    • lambda expressions

    • list/set/dict comprehensions

    • Literaldict andset expressions

    • yield expressions

    • Generator expressions

    • Boolean expressions consisting of only scalar values

  • Statements

    • Neither simple nor compound statements are allowed. This includes for, while, and if.

Local variables

You must explicitly reference any local variable that you want to use in an expression by placing the @ character in front of the name. This mechanism is the same for both DataFrame.query() and DataFrame.eval(). For example,

In [18]: df = pd.DataFrame(np.random.randn(5, 2), columns=list("ab"))

In [19]: newcol = np.random.randn(len(df))

In [20]: df.eval("b + @newcol")
Out[20]: 
0   -0.206122
1   -1.029587
2    0.519726
3   -2.052589
4    1.453210
dtype: float64

In [21]: df.query("b < @newcol")
Out[21]: 
          a         b
1  0.160268 -0.848896
3  0.333758 -1.180355
4  0.572182  0.439895

If you don't prefix the local variable with @, pandas will raise an exception telling you the variable is undefined.

When using DataFrame.eval() and DataFrame.query(), this allows you to have a local variable and a DataFrame column with the same name in an expression.

In [22]: a = np.random.randn()

In [23]: df.query("@a < a")
Out[23]: 
          a         b
0  0.473349  0.891236
1  0.160268 -0.848896
2  0.803311  1.662031
3  0.333758 -1.180355
4  0.572182  0.439895

In [24]: df.loc[a < df["a"]]  # same as the previous expression
Out[24]: 
          a         b
0  0.473349  0.891236
1  0.160268 -0.848896
2  0.803311  1.662031
3  0.333758 -1.180355
4  0.572182  0.439895

Warning

pandas.eval() will raise an exception if you use the @ prefix in a top-level call, because local variables aren't defined in that context.

In [25]: a, b = 1, 2

In [26]: pd.eval("@a + b")
Traceback (most recent call last):
  File ~/micromamba/envs/test/lib/python3.10/site-packages/IPython/core/interactiveshell.py:3577 in run_code
    exec(code_obj, self.user_global_ns, self.user_ns)
  Cell In[26], line 1
    pd.eval("@a + b")
  File ~/work/pandas/pandas/pandas/core/computation/eval.py:325 in eval
    _check_for_locals(expr, level, parser)
  File ~/work/pandas/pandas/pandas/core/computation/eval.py:167 in _check_for_locals
    raise SyntaxError(msg)
  File <string>
SyntaxError: The '@' prefix is not allowed in top-level eval calls.
please refer to your variables by name without the '@' prefix.

In this case, you should simply refer to the variables like you would in standard Python.

In [27]: pd.eval("a + b")
Out[27]: 3

pandas.eval() parsers

There are two different expression syntax parsers.

The default 'pandas' parser allows a more intuitive syntax for expressing query-like operations (comparisons, conjunctions and disjunctions). In particular, the precedence of the & and | operators is made equal to the precedence of the corresponding boolean operations and and or.

For example, the above conjunction can be written without parentheses. Alternatively, you can use the 'python' parser to enforce strict Python semantics.

In [28]: nrows, ncols = 20000, 100

In [29]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]

In [30]: expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"

In [31]: x = pd.eval(expr, parser="python")

In [32]: expr_no_parens = "df1 > 0 & df2 > 0 & df3 > 0 & df4 > 0"

In [33]: y = pd.eval(expr_no_parens, parser="pandas")

In [34]: np.all(x == y)
Out[34]: True

The same expression can be “anded” together with the word and as well:

In [35]: expr = "(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)"

In [36]: x = pd.eval(expr, parser="python")

In [37]: expr_with_ands = "df1 > 0 and df2 > 0 and df3 > 0 and df4 > 0"

In [38]: y = pd.eval(expr_with_ands, parser="pandas")

In [39]: np.all(x == y)
Out[39]: True

The and and or operators here have the same precedence that they would in Python.

pandas.eval() engines

There are two different expression engines.

The 'numexpr' engine is the more performant engine that can yield performance improvements compared to standard Python syntax for large DataFrame objects. This engine requires the optional dependency numexpr to be installed.

The 'python' engine is generally not useful except for testing other evaluation engines against it. You will achieve no performance benefits using eval() with engine='python' and may incur a performance hit.

In [40]: %timeit df1 + df2 + df3 + df4
7.3 ms ± 24.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [41]: %timeit pd.eval("df1 + df2 + df3 + df4", engine="python")
7.92 ms ± 70.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

The DataFrame.eval() method

In addition to the top level pandas.eval() function you can also evaluate an expression in the “context” of a DataFrame.

In [42]: df = pd.DataFrame(np.random.randn(5, 2), columns=["a", "b"])

In [43]: df.eval("a + b")
Out[43]: 
0   -0.161099
1    0.805452
2    0.747447
3    1.189042
4   -2.057490
dtype: float64

Any expression that is a valid pandas.eval() expression is also a valid DataFrame.eval() expression, with the added benefit that you don't have to prefix the name of the DataFrame to the column(s) you're interested in evaluating.

In addition, you can perform assignment of columns within an expression. This allows for formulaic evaluation. The assignment target can be a new column name or an existing column name, and it must be a valid Python identifier.

In [44]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

In [45]: df = df.eval("c = a + b")

In [46]: df = df.eval("d = a + b + c")

In [47]: df = df.eval("a = 1")

In [48]: df
Out[48]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

A copy of the DataFrame with the new or modified columns is returned, and the original frame is unchanged.

In [49]: df
Out[49]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

In [50]: df.eval("e = a - c")
Out[50]: 
   a  b   c   d   e
0  1  5   5  10  -4
1  1  6   7  14  -6
2  1  7   9  18  -8
3  1  8  11  22 -10
4  1  9  13  26 -12

In [51]: df
Out[51]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

Multiple column assignments can be performed by using a multi-line string.

In [52]: df.eval(
   ....:     """
   ....: c = a + b
   ....: d = a + b + c
   ....: a = 1""",
   ....: )
Out[52]: 
   a  b   c   d
0  1  5   6  12
1  1  6   7  14
2  1  7   8  16
3  1  8   9  18
4  1  9  10  20

The equivalent in standard Python would be

In [53]: df = pd.DataFrame(dict(a=range(5), b=range(5, 10)))

In [54]: df["c"] = df["a"] + df["b"]

In [55]: df["d"] = df["a"] + df["b"] + df["c"]

In [56]: df["a"] = 1

In [57]: df
Out[57]: 
   a  b   c   d
0  1  5   5  10
1  1  6   7  14
2  1  7   9  18
3  1  8  11  22
4  1  9  13  26

eval() performance comparison

pandas.eval() works well with expressions containing large arrays.

In [58]: nrows, ncols = 20000, 100

In [59]: df1, df2, df3, df4 = [pd.DataFrame(np.random.randn(nrows, ncols)) for _ in range(4)]

DataFrame arithmetic:

In [60]: %timeit df1 + df2 + df3 + df4
7.72 ms ± 56.9 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [61]: %timeit pd.eval("df1 + df2 + df3 + df4")
2.89 ms ± 73.7 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

DataFrame comparison:

In [62]: %timeit (df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)
6.08 ms ± 48.5 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [63]: %timeit pd.eval("(df1 > 0) & (df2 > 0) & (df3 > 0) & (df4 > 0)")
9.32 ms ± 24.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

DataFrame arithmetic with unaligned axes:

In [64]: s = pd.Series(np.random.randn(50))

In [65]: %timeit df1 + df2 + df3 + df4 + s
12.7 ms ± 69.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

In [66]: %timeit pd.eval("df1 + df2 + df3 + df4 + s")
3.61 ms ± 41.1 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Note

Operations such as

1 and 2  # would parse to 1 & 2, but should evaluate to 2
3 or 4   # would parse to 3 | 4, but should evaluate to 3
~1       # this is okay, but slower when using eval

should be performed in Python. An exception will be raised if you try to perform any boolean/bitwise operations with scalar operands that are not of type bool or np.bool_.
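For instance, a sketch of the scalar case (the exact exception type may vary by pandas version, so it is caught broadly here):

import pandas as pd

try:
    pd.eval("1 and 2")  # scalar-only boolean operation
except Exception as err:
    print(type(err).__name__, err)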

Here is a plot showing the running time of pandas.eval() as a function of the size of the frame involved in the computation. The two lines are two different engines.

[Figure: running time of pandas.eval() versus DataFrame size, one line per engine]

You will only see the performance benefits of using the numexpr engine with pandas.eval() if your DataFrame has more than approximately 100,000 rows.

This plot was created using a DataFrame with 3 columns, each containing floating point values generated using numpy.random.randn().

Expression evaluation limitations with numexpr

Expressions that would result in an object dtype or involve datetime operations because of NaT must be evaluated in Python space, but part of an expression can still be evaluated with numexpr. For example:

In [67]: df = pd.DataFrame(
   ....:     {"strings": np.repeat(list("cba"), 3), "nums": np.repeat(range(3), 3)}
   ....: )

In [68]: df
Out[68]: 
  strings  nums
0       c     0
1       c     0
2       c     0
3       b     1
4       b     1
5       b     1
6       a     2
7       a     2
8       a     2

In [69]: df.query("strings == 'a' and nums == 1")
Out[69]: 
Empty DataFrame
Columns: [strings, nums]
Index: []

The numeric part of the comparison (nums == 1) will be evaluated by numexpr and the object part of the comparison ("strings == 'a'") will be evaluated by Python.

