HowardRiddiough/deploy-sklearn-in-pyspark
This repo includes a notebook that defines a versatile Python function for deploying Python ML models in PySpark. Several examples demonstrate how Python ML can be deployed in PySpark:
- Deploying a RandomForestRegressor in PySpark
- Deployment of ML Pipeline that scales numerical features
- Deployment of ML Pipeline that is capable of preprocessing mixed feature types
Making predictions in PySpark with sophisticated Python ML models is unlocked by the `spark_predict` function defined below.

`spark_predict` is a wrapper around a `pandas_udf`; the wrapper is used so that a Python ML model can be passed to the `pandas_udf`.
```python
import pandas as pd
import pyspark.sql
import pyspark.sql.functions as sf
from pyspark.sql.types import DoubleType


def spark_predict(model, cols) -> pyspark.sql.Column:
    """Deploy a Python ML model in PySpark using the `predict` method of `model`.

    Args:
        model: Python ML model with an sklearn API.
        cols (list-like): Features used for predictions, required to be present
            as columns in the Spark DataFrame used to make predictions.
    """
    @sf.pandas_udf(returnType=DoubleType())
    def predict_pandas_udf(*cols):
        # cols will be a tuple of pandas.Series here.
        x = pd.concat(cols, axis=1)
        return pd.Series(model.predict(x))

    return predict_pandas_udf(*cols)
```

The `deploying-python-ml-in-pyspark` notebook demonstrates how `spark_predict` can be used to deploy Python ML in PySpark. It shows that `spark_predict` is capable of deploying simple ML models as well as more sophisticated pipelines.
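The heart of the function is the inner `predict_pandas_udf`: Spark hands it each feature column as a `pandas.Series`, the columns are reassembled into a feature matrix, and the model predicts on that batch. A minimal sketch of that per-batch logic, using a hypothetical two-feature `RandomForestRegressor` and plain pandas in place of Spark (the data and column names `x1`, `x2` are illustrative, not from the notebook):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data standing in for the notebook's dataset.
train = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.5, 1.5, 2.5, 3.5]})
target = pd.Series([1.0, 2.0, 3.0, 4.0])

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(train, target)

# Inside predict_pandas_udf, Spark passes each feature column as a pandas.Series.
col_batches = (train["x1"], train["x2"])

x = pd.concat(col_batches, axis=1)   # reassemble the feature matrix for this batch
preds = pd.Series(model.predict(x))  # one prediction per input row
```

Because the UDF only calls `model.predict`, any fitted object with an sklearn-style `predict` method can be broadcast to the executors this way.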
I often use both categorical and numerical features in predictive models, so I have included an example with an sklearn `Pipeline` designed to scale numerical data and encode categorical data. This particular pipeline appends two preprocessing pipelines to a random forest, creating a full prediction pipeline that transforms categorical and numerical data and fits a model. And of course this pipeline is deployed in PySpark using the `spark_predict` function.
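A minimal sketch of such a mixed-type pipeline, assuming a `ColumnTransformer` that scales the numerical column and one-hot encodes the categorical one before the forest (the data and column names `size` and `colour` are illustrative, not the notebook's):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: one numerical and one categorical feature.
df = pd.DataFrame({
    "size": [10.0, 20.0, 30.0, 40.0],
    "colour": ["red", "blue", "red", "blue"],
})
y = [1.0, 2.0, 3.0, 4.0]

# Scale numerical columns, one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
])

# Preprocessing plus model form a single object with a `predict` method,
# which is all that spark_predict requires.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
]).fit(df, y)

preds = pipeline.predict(df)
```

Since the fitted `Pipeline` exposes the sklearn `predict` API, it can be passed to `spark_predict` exactly like a bare model.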
See `requirements.txt`.
The code used in the `deploying-python-ml-in-pyspark` notebook requires an installation of PySpark. We leave the installation of PySpark to the user.
- The code is based on the excellent blog post "Prediction at Scale with scikit-learn and PySpark Pandas UDFs" written by Michael Heilman.
- The sklearn documentation has more information on column transformers with mixed types.
About
Deploying python ML models in pyspark using Pandas UDFs