HowardRiddiough/deploy-sklearn-in-pyspark
This repo includes a notebook that defines a versatile Python function for deploying Python ML models in PySpark. Several examples demonstrate how Python ML can be deployed in PySpark:
- Deploying a RandomForestRegressor in PySpark
- Deployment of ML Pipeline that scales numerical features
- Deployment of ML Pipeline that is capable of preprocessing mixed feature types
Making predictions in PySpark with sophisticated Python ML models is unlocked by the `spark_predict` function defined below.

`spark_predict` is a wrapper around a `pandas_udf`; the wrapper is used so that a Python ML model can be passed to the `pandas_udf`.
```python
import pandas as pd
import pyspark.sql
import pyspark.sql.functions as sf
from pyspark.sql.types import DoubleType


def spark_predict(model, cols) -> pyspark.sql.Column:
    """Deploy a Python ML model in PySpark using the `predict` method of `model`.

    Args:
        model: Python ML model with an sklearn API.
        cols (list-like): Features used for predictions, required to be present
            as columns in the Spark DataFrame used to make predictions.
    """
    @sf.pandas_udf(returnType=DoubleType())
    def predict_pandas_udf(*cols):
        # cols will be a tuple of pandas.Series here.
        x = pd.concat(cols, axis=1)
        return pd.Series(model.predict(x))

    return predict_pandas_udf(*cols)
```

The `deploying-python-ml-in-pyspark` notebook demonstrates how `spark_predict` can be used to deploy Python ML in PySpark. It shows that `spark_predict` is capable of deploying simple ML models as well as more sophisticated pipelines.
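The heart of the function is the inner `predict_pandas_udf`: Spark hands it each feature column as a `pandas.Series`, the columns are reassembled into a feature matrix, and the model predicts on that batch. A minimal sketch of that per-batch logic, using a hypothetical two-feature `RandomForestRegressor` and plain pandas in place of Spark (the data and column names `x1`, `x2` are illustrative, not from the notebook):

```python
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

# Hypothetical training data standing in for the notebook's dataset.
train = pd.DataFrame({"x1": [1.0, 2.0, 3.0, 4.0], "x2": [0.5, 1.5, 2.5, 3.5]})
target = pd.Series([1.0, 2.0, 3.0, 4.0])

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(train, target)

# Inside predict_pandas_udf, Spark passes each feature column as a pandas.Series.
col_batches = (train["x1"], train["x2"])

x = pd.concat(col_batches, axis=1)   # reassemble the feature matrix for this batch
preds = pd.Series(model.predict(x))  # one prediction per input row
```

Because the UDF only calls `model.predict`, any fitted object with an sklearn-style `predict` method can be broadcast to the executors this way.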
I often use both categorical and numerical features in predictive models, so I have included an example with an sklearn `Pipeline` designed to scale numerical data and encode categorical data. This particular pipeline appends two preprocessing pipelines to a random forest, creating a full prediction pipeline that transforms categorical and numerical data and fits a model. And of course this pipeline is deployed in PySpark using the `spark_predict` function.
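A minimal sketch of such a mixed-type pipeline, assuming a `ColumnTransformer` that scales the numerical column and one-hot encodes the categorical one before the forest (the data and column names `size` and `colour` are illustrative, not the notebook's):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Illustrative data: one numerical and one categorical feature.
df = pd.DataFrame({
    "size": [10.0, 20.0, 30.0, 40.0],
    "colour": ["red", "blue", "red", "blue"],
})
y = [1.0, 2.0, 3.0, 4.0]

# Scale numerical columns, one-hot encode categorical columns.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["size"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["colour"]),
])

# Preprocessing plus model form a single object with a `predict` method,
# which is all that spark_predict requires.
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("forest", RandomForestRegressor(n_estimators=10, random_state=0)),
]).fit(df, y)

preds = pipeline.predict(df)
```

Since the fitted `Pipeline` exposes the sklearn `predict` API, it can be passed to `spark_predict` exactly like a bare model.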
See `requirements.txt`.
The code used in the `deploying-python-ml-in-pyspark` notebook requires an installation of PySpark. We leave the installation of PySpark to the user.
- The code is based on the excellent blog post "Prediction at Scale with scikit-learn and PySpark Pandas UDFs" written by Michael Heilman.
- The sklearn documentation has more information on column transformers with mixed types.
About
Deploying python ML models in pyspark using Pandas UDFs