Commit 5b0767a

DOC: User Guide Page on user-defined functions (#61195)

1 parent 5aa78c0, commit 5b0767a

2 files changed: +306 -0 lines changed

doc/source/user_guide/index.rst

Lines changed: 1 addition & 0 deletions

@@ -78,6 +78,7 @@ Guides
     boolean
     visualization
     style
+    user_defined_functions
     groupby
     window
     timeseries

Lines changed: 305 additions & 0 deletions

@@ -0,0 +1,305 @@

.. _user_defined_functions:

{{ header }}

*****************************
User-Defined Functions (UDFs)
*****************************

In pandas, User-Defined Functions (UDFs) provide a way to extend the library's
functionality by allowing users to apply custom computations to their data. While
pandas comes with a set of built-in functions for data manipulation, UDFs offer
flexibility when built-in methods are not sufficient. These functions can be
applied at different levels (element-wise, row-wise, column-wise, or group-wise)
and behave differently depending on the method used.

Here's a simple example to illustrate a UDF applied to a Series:

.. ipython:: python

    s = pd.Series([1, 2, 3])

    # Simple UDF that adds 1 to a value
    def add_one(x):
        return x + 1

    # Apply the function element-wise using .map
    s.map(add_one)

You can also apply UDFs to an entire DataFrame. For example:

.. ipython:: python

    df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

    # UDF that takes a row and returns the sum of columns A and B
    def sum_row(row):
        return row["A"] + row["B"]

    # Apply the function row-wise (axis=1 means apply across columns per row)
    df.apply(sum_row, axis=1)


Why Not To Use User-Defined Functions
-------------------------------------

While UDFs provide flexibility, they come with significant drawbacks, primarily
related to performance and behavior. When using UDFs, pandas must perform inference
on the result (for example, on its data type), and that inference could be incorrect.
Furthermore, UDFs are slower than vectorized operations because pandas can't
optimize their computations.

.. note::
    In general, most tasks can and should be accomplished using pandas' built-in methods or vectorized operations.

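The behavioral side of this can be seen in a small sketch. With ``axis=1``, each row reaches the UDF as a single Series, so integer and float columns are forced into one common dtype before the function runs. The column names and the ``identity`` helper below are illustrative assumptions, not part of the original guide.

```python
import pandas as pd

df = pd.DataFrame({"count": [1, 2, 3], "price": [0.5, 1.5, 2.5]})

# Hypothetical UDF that returns each row unchanged.
def identity(row):
    return row

# Each row is passed as one Series with a single dtype, so the integer
# column is upcast to float64 even though the UDF changes nothing.
result = df.apply(identity, axis=1)

print(df.dtypes)      # dtypes of the original columns
print(result.dtypes)  # dtypes after the row-wise round trip
```

Even a no-op UDF can therefore change the dtypes of the result, which is one reason to prefer column-wise or vectorized operations when they are available.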
Despite their drawbacks, UDFs can be helpful when:

* **Custom Computations Are Needed**: Implementing complex logic or domain-specific calculations that pandas'
  built-in methods cannot handle.
* **Extending pandas' Functionality**: Applying external libraries or specialized algorithms unavailable in pandas.
* **Handling Complex Grouped Operations**: Performing operations on grouped data that standard methods do not support.

For example:

.. code-block:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    # Sample data
    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    })

    # Function to fit a model to each group
    def fit_model(group):
        model = LinearRegression()
        model.fit(group[['x']], group['y'])
        # Return a new DataFrame rather than mutating the group in place
        return group.assign(y_pred=model.predict(group[['x']]))

    result = df.groupby('group').apply(fit_model)


Methods that support User-Defined Functions
-------------------------------------------

User-Defined Functions can be applied across various pandas methods:

+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| Method                     | Function Input         | Function Output          | Description                                                            |
+============================+========================+==========================+========================================================================+
| :meth:`map`                | Scalar                 | Scalar                   | Apply a function to each element                                       |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| :meth:`apply` (axis=0)     | Column (Series)        | Column (Series)          | Apply a function to each column                                        |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| :meth:`apply` (axis=1)     | Row (Series)           | Row (Series)             | Apply a function to each row                                           |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| :meth:`agg`                | Series/DataFrame       | Scalar or Series         | Aggregate and summarize values, e.g., sum or custom reducer            |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| :meth:`transform` (axis=0) | Column (Series)        | Column (Series)          | Same as :meth:`apply` with ``axis=0``, but raises an exception if the  |
|                            |                        |                          | function changes the shape of the data                                 |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| :meth:`transform` (axis=1) | Row (Series)           | Row (Series)             | Same as :meth:`apply` with ``axis=1``, but raises an exception if the  |
|                            |                        |                          | function changes the shape of the data                                 |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| :meth:`filter`             | Series or DataFrame    | Boolean                  | Only accepts UDFs in groupby. The function is called for each group,   |
|                            |                        |                          | and the group is removed from the result if it returns ``False``       |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+
| :meth:`pipe`               | Series/DataFrame       | Series/DataFrame         | Chain functions together to apply to a Series or DataFrame             |
+----------------------------+------------------------+--------------------------+------------------------------------------------------------------------+

When applying UDFs in pandas, it is essential to select the appropriate method based
on your specific task. Each method has its strengths and is designed for different use
cases. Understanding the purpose and behavior of each method will help you make informed
decisions, ensuring more efficient and maintainable code.

.. note::
    Some of these methods can also be applied to groupby, resample, and various window objects.
    See :ref:`groupby`, :ref:`resample()<timeseries>`, :ref:`rolling()<window>`, :ref:`expanding()<window>`,
    and :ref:`ewm()<window>` for details.


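As a minimal sketch of the note above, the same reducing UDF can be handed to a groupby object and to a rolling window. The data and the ``value_range`` helper are hypothetical and only serve to show where the function receives its input from.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "value": [1.0, 3.0, 10.0, 14.0],
})

# Hypothetical UDF: spread between the largest and smallest value.
def value_range(values):
    return values.max() - values.min()

# GroupBy: the UDF receives one Series per group.
per_group = df.groupby("group")["value"].agg(value_range)

# Rolling window: the UDF receives each window of two consecutive rows.
per_window = df["value"].rolling(window=2).apply(value_range)

print(per_group)
print(per_window)
```

The first window is incomplete, so the rolling result starts with a missing value.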
:meth:`DataFrame.apply`
~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`apply` method allows you to apply UDFs along either rows or columns. While flexible,
it is slower than vectorized operations and should be used only when you need operations
that cannot be achieved with built-in pandas functions.

When to use: :meth:`apply` is suitable when no vectorized alternative and no more specific UDF method
is available, but consider optimizing performance with vectorized operations wherever possible.

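A short sketch of the two directions may help here. The ``spread`` helper and the sample frame are illustrative assumptions; the only point is what the UDF receives for each value of ``axis``.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Hypothetical UDF: difference between the largest and smallest value.
def spread(values):
    return values.max() - values.min()

# axis=0: the UDF is called once per column and receives that column.
by_column = df.apply(spread, axis=0)

# axis=1: the UDF is called once per row and receives that row.
by_row = df.apply(spread, axis=1)

print(by_column)
print(by_row)
```

Both calls loop in Python, so for a computation this simple the built-in ``df.max() - df.min()`` would be the better choice.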
:meth:`DataFrame.agg`
~~~~~~~~~~~~~~~~~~~~~

If you need to aggregate data, :meth:`agg` is a better choice than :meth:`apply` because it is
specifically designed for aggregation operations.

When to use: Use :meth:`agg` for performing custom aggregations, where the operation returns
a scalar value for each input.

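A custom aggregation might look like the following sketch. ``second_largest`` is a hypothetical reducer chosen because pandas has no single built-in for it on a whole DataFrame; it returns one scalar per column.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 60]})

# Hypothetical reducer: the second-largest value of a column.
def second_largest(column):
    return column.sort_values().iloc[-2]

# The UDF is called once per column and must return a scalar.
summary = df.agg(second_largest)

print(summary)
```

The result is a Series indexed by the original column names, with one aggregated value per column.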
:meth:`DataFrame.transform`
~~~~~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`transform` method is ideal for performing element-wise transformations while preserving the shape of the original DataFrame.
It is generally faster than :meth:`apply` because it can take advantage of pandas' internal optimizations.

When to use: When you need to perform element-wise transformations that retain the original structure of the DataFrame.

.. code-block:: python

    import pandas as pd
    from sklearn.linear_model import LinearRegression

    df = pd.DataFrame({
        'group': ['A', 'A', 'A', 'B', 'B', 'B'],
        'x': [1, 2, 3, 1, 2, 3],
        'y': [2, 4, 6, 1, 2, 1.5]
    }).set_index("x")

    # Function to fit a model to each group
    def fit_model(group):
        # transform passes each column of the group as a Series indexed by x
        x = group.index.to_frame()
        y = group
        model = LinearRegression()
        model.fit(x, y)
        pred = model.predict(x)
        return pred

    result = df.groupby('group').transform(fit_model)

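For a dependency-free illustration of the shape-preserving contract, the sketch below centers values within each group and then shows that a reducing function is rejected. The ``demean`` helper and the data are assumptions made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "value": [1.0, 3.0, 10.0, 14.0],
})

# Hypothetical UDF: subtract the group mean from every value in the group.
def demean(values):
    return values - values.mean()

# The result has one entry per original row, aligned with df's index.
centered = df.groupby("group")["value"].transform(demean)

# A function that collapses a column to a scalar does not preserve the shape,
# so DataFrame.transform raises instead of returning a reduced result.
rejected = False
try:
    df[["value"]].transform(lambda column: column.sum())
except ValueError:
    rejected = True

print(centered)
print(f"reducer rejected: {rejected}")
```

Because the output lines up with the input, ``centered`` can be assigned straight back as a new column.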
:meth:`DataFrame.filter`
~~~~~~~~~~~~~~~~~~~~~~~~

The :meth:`filter` method is used to select subsets of the DataFrame's
columns or rows. It is useful when you want to extract specific columns or rows that
match particular conditions.

When to use: Use :meth:`filter` when you want to use a UDF to create a subset of a DataFrame or Series.

.. note::
    :meth:`DataFrame.filter` does not accept UDFs, but can accept
    list comprehensions that have UDFs applied to them.

.. ipython:: python

    # Sample DataFrame
    df = pd.DataFrame({
        'AA': [1, 2, 3],
        'BB': [4, 5, 6],
        'C': [7, 8, 9],
        'D': [10, 11, 12]
    })

    # Function that returns True for column names longer than 1 character
    def is_long_name(column_name):
        return len(column_name) > 1

    # Keep only the columns whose names pass the check
    df_filtered = df.filter(items=[col for col in df.columns if is_long_name(col)])
    print(df_filtered)

Since filter does not directly accept a UDF, you have to apply the UDF indirectly,
for example, by using list comprehensions.

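The groupby variant of ``filter`` does take a UDF directly, as the methods table notes. The following is a minimal sketch under assumed data: ``has_large_total`` is a hypothetical predicate that is called once per group and returns a single boolean.

```python
import pandas as pd

df = pd.DataFrame({
    "group": ["A", "A", "B", "B"],
    "value": [1, 2, 30, 40],
})

# Hypothetical predicate: keep a group only if its values sum to more than 10.
def has_large_total(group):
    return group["value"].sum() > 10

# Rows of groups for which the predicate returns False are dropped.
kept = df.groupby("group").filter(has_large_total)

print(kept)
```

The surviving rows keep their original index labels and order.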
:meth:`DataFrame.map`
~~~~~~~~~~~~~~~~~~~~~

The :meth:`map` method is used specifically to apply element-wise UDFs.

When to use: Use :meth:`map` for applying element-wise UDFs to DataFrames or Series.

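An element-wise UDF sees one scalar at a time, as in this sketch. ``label_parity`` is a hypothetical helper. ``DataFrame.map`` was introduced in pandas 2.1 as the new name for ``applymap``, so the example falls back to the older name when needed.

```python
import pandas as pd

df = pd.DataFrame({"A": [1, 2, 3], "B": [10, 20, 30]})

# Hypothetical element-wise UDF: receives a single scalar value.
def label_parity(value):
    return "even" if value % 2 == 0 else "odd"

# Series.map applies the UDF to every element of one column.
series_result = df["A"].map(label_parity)

# DataFrame.map applies it to every cell; older pandas calls this applymap.
frame_map = df.map if hasattr(df, "map") else df.applymap
frame_result = frame_map(label_parity)

print(series_result)
print(frame_result)
```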
:meth:`DataFrame.pipe`
~~~~~~~~~~~~~~~~~~~~~~

The :meth:`pipe` method is useful for chaining operations together into a clean and readable pipeline.
It is a helpful tool for organizing complex data processing workflows.

When to use: Use :meth:`pipe` when you need to create a pipeline of operations and want to keep the code readable and maintainable.


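A pipeline of this kind could be sketched as follows. The two step functions and their parameter names are hypothetical; ``pipe`` simply forwards the DataFrame as the first argument and passes any extra positional or keyword arguments along.

```python
import pandas as pd

df = pd.DataFrame({"one": [1.0, 2.0, -3.0], "two": [2.0, 4.0, 6.0]})

# Hypothetical step: keep rows where the given column is positive.
def keep_positive(frame, column):
    return frame[frame[column] > 0]

# Hypothetical step: add a percentage ratio without modifying the input.
def add_ratio(frame, numerator, denominator):
    return frame.assign(ratio=100 * frame[numerator] / frame[denominator])

result = (
    df
    .pipe(keep_positive, "one")
    .pipe(add_ratio, numerator="one", denominator="two")
)

print(result)
```

Written without ``pipe``, the same logic would read inside-out as ``add_ratio(keep_positive(df, "one"), ...)``, which becomes harder to follow as steps are added.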
Performance
-----------

While UDFs provide flexibility, their use is generally discouraged as they can introduce
performance issues, especially when written in pure Python. To improve efficiency,
consider using built-in ``NumPy`` or ``pandas`` functions instead of UDFs
for common operations.

.. note::
    If performance is critical, explore **vectorized operations** before resorting
    to UDFs.

Vectorized Operations
~~~~~~~~~~~~~~~~~~~~~

Below is a comparison of using UDFs versus using vectorized operations:

.. code-block:: python

    # df is assumed to be a DataFrame with numeric columns "one" and "two"

    # User-defined function
    def calc_ratio(row):
        return 100 * (row["one"] / row["two"])

    df["new_col"] = df.apply(calc_ratio, axis=1)

    # Vectorized Operation
    df["new_col2"] = 100 * (df["one"] / df["two"])

Measuring how long each operation takes:

.. code-block:: text

    User-defined function:  5.6435 secs
    Vectorized:             0.0043 secs

Vectorized operations in pandas are significantly faster than using :meth:`DataFrame.apply`
with UDFs because they leverage highly optimized C functions
via ``NumPy`` to process entire arrays at once. This approach avoids the overhead of looping
through rows in Python and making separate function calls for each row, which is slow and
inefficient. Additionally, ``NumPy`` arrays benefit from memory efficiency and CPU-level
optimizations, making vectorized operations the preferred choice whenever possible.


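The guide does not show how its timings were taken, so the following is only one possible way to reproduce such a comparison with the standard ``timeit`` module. The frame size, the random data, and the single repetition are assumptions; absolute numbers will differ from those quoted above and from machine to machine.

```python
import timeit

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
df = pd.DataFrame({
    "one": rng.random(10_000),
    "two": rng.random(10_000) + 1.0,  # offset keeps the divisor away from zero
})

# Row-wise UDF, called once per row from Python.
def calc_ratio(row):
    return 100 * (row["one"] / row["two"])

# Both approaches compute the same values.
udf_result = df.apply(calc_ratio, axis=1)
vectorized_result = 100 * (df["one"] / df["two"])

# Time a single run of each approach.
udf_seconds = timeit.timeit(lambda: df.apply(calc_ratio, axis=1), number=1)
vectorized_seconds = timeit.timeit(lambda: 100 * (df["one"] / df["two"]), number=1)

print(f"User-defined function: {udf_seconds:.4f} secs")
print(f"Vectorized: {vectorized_seconds:.4f} secs")
```

Checking that the two results agree before comparing timings guards against accidentally benchmarking two different computations.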
Improving Performance with UDFs
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

In scenarios where UDFs are necessary, there are still ways to mitigate their performance drawbacks.
One approach is to use **Numba**, a Just-In-Time (JIT) compiler that can significantly speed up numerical
Python code by compiling Python functions to optimized machine code at runtime.

By annotating your UDFs with ``@numba.jit``, you can achieve performance closer to vectorized operations,
especially for computationally heavy tasks.

.. note::
    You may also refer to the user guide on `Enhancing performance <https://pandas.pydata.org/pandas-docs/dev/user_guide/enhancingperf.html#numba-jit-compilation>`_
    for a more detailed guide to using **Numba**.

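As a rough sketch of this approach, the loop below is compiled with ``numba.njit`` (Numba's shorthand for nopython-mode ``jit``) and is fed plain NumPy arrays taken from the DataFrame. The ``ratio_loop`` function is hypothetical, and the pass-through fallback used when Numba is not installed is an assumption added so the example still runs without it.

```python
import numpy as np
import pandas as pd

try:
    from numba import njit
except ImportError:
    # Assumption for this sketch: run as plain Python when Numba is missing.
    def njit(func):
        return func

# Hypothetical UDF written as an explicit loop over NumPy arrays,
# which is the style of code Numba compiles well.
@njit
def ratio_loop(one, two):
    out = np.empty(one.shape[0])
    for i in range(one.shape[0]):
        out[i] = 100.0 * one[i] / two[i]
    return out

df = pd.DataFrame({"one": [1.0, 2.0, 3.0], "two": [2.0, 4.0, 6.0]})

# Pass NumPy arrays, not Series, into the compiled function.
df["ratio"] = ratio_loop(df["one"].to_numpy(), df["two"].to_numpy())

print(df)
```

The first call includes compilation time, so any benefit only shows up on larger inputs or repeated calls.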
Using :meth:`DataFrame.pipe` for Composable Logic
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Another useful pattern for improving readability and composability, especially when mixing
vectorized logic with UDFs, is to use the :meth:`DataFrame.pipe` method.

:meth:`DataFrame.pipe` doesn't improve performance directly, but it enables cleaner
method chaining by passing the entire object into a function. This is especially helpful
when chaining custom transformations:

.. code-block:: python

    def add_ratio_column(df):
        # assign returns a new DataFrame instead of modifying the input in place
        return df.assign(ratio=100 * (df["one"] / df["two"]))

    df = (
        df
        .query("one > 0")
        .pipe(add_ratio_column)
        .dropna()
    )

Calling ``.pipe(add_ratio_column)`` is functionally equivalent to calling ``add_ratio_column``
directly on the filtered DataFrame, but keeps your code clean and composable. The function you
pass to :meth:`DataFrame.pipe` can use vectorized operations, row-wise UDFs, or any other logic;
:meth:`DataFrame.pipe` is agnostic.

.. note::
    While :meth:`DataFrame.pipe` does not improve performance on its own,
    it promotes clean, modular design and allows both vectorized and UDF-based logic
    to be composed in method chains.
