Deploying python ML models in pyspark using Pandas UDFs


HowardRiddiough/deploy-sklearn-in-pyspark

This repo includes a notebook that defines a versatile Python function for deploying Python ML models in PySpark. Several examples demonstrate the approach:

  • Deploying a RandomForestRegressor in PySpark
  • Deployment of ML Pipeline that scales numerical features
  • Deployment of ML Pipeline that is capable of preprocessing mixed feature types

Introducing the spark_predict function: a vessel for Python ML deployment in PySpark

The spark_predict function defined below makes it possible to generate predictions in PySpark with sophisticated Python ML models.

spark_predict is a wrapper around a pandas_udf. The wrapper is needed so that a Python ML model can be passed to the pandas_udf.

    import pandas as pd
    import pyspark
    import pyspark.sql.functions as sf
    from pyspark.sql.types import DoubleType


    def spark_predict(model, cols) -> pyspark.sql.Column:
        """Deploy a Python ML model in PySpark using the `predict` method of `model`.

        Args:
            model: Python ML model with the sklearn API.
            cols (list-like): Features used for predictions; each must be
                present as a column in the Spark DataFrame being scored.
        """
        @sf.pandas_udf(returnType=DoubleType())
        def predict_pandas_udf(*cols):
            # cols is a tuple of pandas.Series here, one per feature column.
            x = pd.concat(cols, axis=1)
            return pd.Series(model.predict(x))

        return predict_pandas_udf(*cols)
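To make the body of the UDF easier to follow, here is a minimal sketch that mimics what happens to a single batch, using pandas and sklearn only (no Spark involved). The helper name predict_batch, the column names x1 and x2, and the synthetic training data are all assumptions made for illustration; they do not come from the notebook.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data; the column names "x1" and "x2" are illustrative.
rng = np.random.default_rng(0)
train = pd.DataFrame({"x1": rng.normal(size=50), "x2": rng.normal(size=50)})
target = 2.0 * train["x1"] - train["x2"]

model = RandomForestRegressor(n_estimators=10, random_state=0)
model.fit(train, target)


def predict_batch(model, *cols):
    """Mimic the body of predict_pandas_udf for one batch of pandas Series."""
    x = pd.concat(cols, axis=1)  # one column per feature, in argument order
    return pd.Series(model.predict(x))


# One batch: a tuple of Series, one per feature column.
batch = (train["x1"].head(5), train["x2"].head(5))
predictions = predict_batch(model, *batch)
print(predictions)
```

The features are concatenated in the order they are passed, so the order of cols should match the order the model was trained on.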

Python ML Deployment in practice

The deploying-python-ml-in-pyspark notebook demonstrates how spark_predict can be used to deploy Python ML in PySpark. It shows that spark_predict can deploy simple ML models as well as more sophisticated pipelines.

I often use both categorical and numerical features in predictive models, so I have included an example built on an sklearn Pipeline designed to scale numerical data and encode categorical data. This pipeline chains two preprocessing pipelines with a random forest to create a full prediction pipeline that transforms categorical and numerical data and fits a model. This pipeline, too, is deployed in PySpark using the spark_predict function.
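A pipeline of the kind described above might look roughly like the following sketch. It is not the notebook's code: the column names (age, income, city), the synthetic data, and the choice of StandardScaler and OneHotEncoder as the scaling and encoding steps are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical feature names, chosen only for this example.
numeric_cols = ["age", "income"]
categorical_cols = ["city"]

rng = np.random.default_rng(0)
data = pd.DataFrame({
    "age": rng.integers(18, 80, size=60).astype(float),
    "income": rng.normal(50_000, 10_000, size=60),
    "city": rng.choice(["A", "B", "C"], size=60),
})
target = 0.5 * data["age"] + (data["city"] == "A") * 10.0

# Two preprocessing pipelines: one scales numbers, one encodes categories.
numeric_pipeline = Pipeline([("scale", StandardScaler())])
categorical_pipeline = Pipeline([("encode", OneHotEncoder(handle_unknown="ignore"))])

preprocess = ColumnTransformer([
    ("num", numeric_pipeline, numeric_cols),
    ("cat", categorical_pipeline, categorical_cols),
])

# Full prediction pipeline: preprocessing followed by a random forest.
full_pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
])
full_pipeline.fit(data, target)

preds = full_pipeline.predict(data.head(3))
print(preds)
```

Because this ColumnTransformer selects columns by name, the frame handed to predict needs columns with the same names as the training data.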

Requirements

See requirements.txt.

PySpark Installation

The code in the deploying-python-ml-in-pyspark notebook requires PySpark. Installing PySpark is left to the user.
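Before opening the notebook, it can help to confirm that PySpark is importable in the current environment. The check below is a small sketch using only the standard library; the helper name pyspark_available is hypothetical, and the pip command in the message is one common installation route rather than something the repo prescribes.

```python
import importlib.util


def pyspark_available() -> bool:
    """Return True if the pyspark package can be found by the import system."""
    return importlib.util.find_spec("pyspark") is not None


if pyspark_available():
    print("PySpark is importable.")
else:
    print("PySpark not found; one common route is `pip install pyspark`.")
```

Note that being importable does not guarantee a working setup, since PySpark also needs a compatible Java runtime.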
