Installation#
PySpark is included in the official releases of Spark available on the Apache Spark website. For Python users, PySpark also provides pip installation from PyPI. This is usually for local usage or as a client to connect to a cluster instead of setting up a cluster itself.
This page includes instructions for installing PySpark by using pip, Conda, downloading manually, and building from source.
Python Versions Supported#
Python 3.9 and above.
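If in doubt, the requirement can be checked from the interpreter itself. This is a minimal stdlib sketch, not part of PySpark:

```python
import sys

# PySpark supports Python 3.9 and above; fail fast on older interpreters.
if sys.version_info < (3, 9):
    raise RuntimeError(
        "PySpark requires Python 3.9+, found " + sys.version.split()[0]
    )
print("Python version OK:", sys.version.split()[0])
```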
Using PyPI#
PySpark installation using PyPI (pyspark) is as follows:
pip install pyspark
If you want to install extra dependencies for a specific component, you can install them as below:
# Spark SQL
pip install pyspark[sql]

# pandas API on Spark
pip install pyspark[pandas_on_spark] plotly  # to plot your data, you can install plotly together.

# Spark Connect
pip install pyspark[connect]
See Optional dependencies for more details about extra dependencies.
For PySpark with/without a specific Hadoop version, you can install it by using the PYSPARK_HADOOP_VERSION environment variable as below:
PYSPARK_HADOOP_VERSION=3 pip install pyspark
The default distribution uses Hadoop 3.3 and Hive 2.3. If users specify a different version of Hadoop, the pip installation automatically downloads a different version and uses it in PySpark. Downloading it can take a while depending on the network and the mirror chosen. PYSPARK_RELEASE_MIRROR can be set to manually choose the mirror for faster downloading.
PYSPARK_RELEASE_MIRROR=http://mirror.apache-kr.org PYSPARK_HADOOP_VERSION=3 pip install pyspark
It is recommended to use the -v option in pip to track the installation and download status.
PYSPARK_HADOOP_VERSION=3 pip install pyspark -v
Supported values in PYSPARK_HADOOP_VERSION are:
without: Spark pre-built with user-provided Apache Hadoop
3: Spark pre-built for Apache Hadoop 3.3 and later (default)
Note that this installation of PySpark with/without a specific Hadoop version is experimental. It can change or be removed between minor releases.
Making Spark Connect the Default#
If you want to make Spark Connect the default, install an additional library via PyPI (pyspark-connect) by executing the following command:
pip install pyspark-connect
It will automatically install the pyspark library as well as the dependencies that are necessary for Spark Connect. If you want to customize pyspark, you need to install pyspark with the instructions above in advance.
This package supports both spark.master (--master) with a locally running Spark Connect server, and spark.remote (--remote) including local clusters, e.g., local[*], as well as connection URIs such as sc://localhost. See also Quickstart: Spark Connect for how to use it.
Python Spark Connect Client#
The Python Spark Connect client is a pure Python library that does not rely on any non-Python dependencies such as jars and a JRE in your environment. To install the Python Spark Connect client via PyPI (pyspark-client), execute the following command:
pip install pyspark-client
This package only supports spark.remote with connection URIs, e.g., sc://localhost. See also Quickstart: Spark Connect for how to use it.
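Connection URIs follow the sc:// scheme shown above. As an illustration only (using the standard library, not any PySpark API), such a URI breaks down into a scheme, host, and port; 15002 is the Spark Connect server's default port:

```python
from urllib.parse import urlparse

# A typical Spark Connect connection URI; 15002 is the server's default port.
uri = urlparse("sc://localhost:15002")
print(uri.scheme)    # sc
print(uri.hostname)  # localhost
print(uri.port)      # 15002
```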
Using Conda#
Conda is an open-source package management and environment management system (developed by Anaconda), which is best installed through Miniconda or Miniforge. The tool is both cross-platform and language agnostic, and in practice, conda can replace both pip and virtualenv.
Conda uses so-called channels to distribute packages, and together with the default channels by Anaconda itself, the most important channel is conda-forge, which is the community-driven packaging effort that is the most extensive and the most current (and also serves as the upstream for the Anaconda channels in most cases).
To create a new conda environment from your terminal and activate it, proceed as shown below:
conda create -n pyspark_env
conda activate pyspark_env
After activating the environment, use the following command to install pyspark, a Python version of your choice, as well as other packages you want to use in the same session as pyspark (you can also install in several steps).
conda install -c conda-forge pyspark  # can also add "python=3.9 some_package [etc.]" here
Note that PySpark for conda is maintained separately by the community; while new versions generally get packaged quickly, the availability through conda(-forge) is not directly in sync with the PySpark release cycle.
While using pip in a conda environment is technically feasible (with the same command as above), this approach is discouraged, because pip does not interoperate with conda.
For a short summary of useful conda commands, see their cheat sheet.
Manually Downloading#
PySpark is included in the distributions available at the Apache Spark website. You can download the distribution you want from the site. After that, uncompress the tar file into the directory where you want to install Spark, for example, as below:
tar xzvf spark-|release|-bin-hadoop3.tgz
Ensure the SPARK_HOME environment variable points to the directory where the tar file has been extracted. Update the PYTHONPATH environment variable so that it can find PySpark and Py4J under SPARK_HOME/python/lib. One example of doing this is shown below:
cd spark-|release|-bin-hadoop3
export SPARK_HOME=`pwd`
export PYTHONPATH=$(ZIPS=("$SPARK_HOME"/python/lib/*.zip); IFS=:; echo "${ZIPS[*]}"):$PYTHONPATH
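The export line above collects every zip archive under SPARK_HOME/python/lib and joins them with : separators. The same construction can be sketched in Python for clarity; the /opt/spark fallback path below is illustrative only:

```python
import glob
import os

# Assume SPARK_HOME points at the extracted distribution
# (the /opt/spark fallback is an illustrative placeholder).
spark_home = os.environ.get("SPARK_HOME", "/opt/spark")

# Gather the PySpark and Py4J zip archives shipped under python/lib.
zips = sorted(glob.glob(os.path.join(spark_home, "python", "lib", "*.zip")))

# Prepend them to any existing PYTHONPATH, colon-separated on POSIX.
pythonpath = os.pathsep.join(zips + [os.environ.get("PYTHONPATH", "")])
print(pythonpath)
```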
Installing from Source#
To install PySpark from source, refer to Building Spark.
Dependencies#
Required dependencies#
PySpark requires the following dependencies.
| Package | Supported version | Note |
|---|---|---|
| py4j | >=0.10.9.9 | Required to interact with JVM |
Additional libraries that enhance functionality but are not included in the installation packages:
memory-profiler: Used for PySpark UDF memory profiling, spark.profile.show(...) and spark.sql.pyspark.udf.profiler.
plotly: Used for PySpark plotting, DataFrame.plot.
Note that PySpark requires Java 17 or later with JAVA_HOME properly set; refer to Downloading.
Optional dependencies#
PySpark has several optional dependencies that enhance its functionality for specific modules. These dependencies are only required for certain features and are not necessary for the basic functionality of PySpark. If these optional dependencies are not installed, PySpark will function correctly for basic operations but will raise an ImportError when you try to use features that require these dependencies.
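This behaviour can be sketched with a guarded import; require_optional below is a hypothetical helper for illustration, not PySpark's actual code:

```python
import importlib


def require_optional(package: str, feature: str):
    """Return the imported module, or raise a descriptive ImportError.

    A simplified sketch of the guarded-import pattern used for
    optional dependencies; not PySpark's actual implementation.
    """
    try:
        return importlib.import_module(package)
    except ImportError as exc:
        raise ImportError(
            f"{package} is required for {feature}; try: pip install {package}"
        ) from exc


# Basic usage works when the dependency is present...
json_mod = require_optional("json", "a feature that needs json")

# ...and a missing optional dependency fails only when first used.
try:
    require_optional("no_such_optional_pkg", "a hypothetical feature")
except ImportError as exc:
    print(exc)
```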
Spark Connect#
Installable with pip install "pyspark[connect]".
| Package | Supported version | Note |
|---|---|---|
| pandas | >=2.0.0 | Required for Spark Connect |
| pyarrow | >=11.0.0 | Required for Spark Connect |
| grpcio | >=1.67.0 | Required for Spark Connect |
| grpcio-status | >=1.67.0 | Required for Spark Connect |
| googleapis-common-protos | >=1.65.0 | Required for Spark Connect |
| graphviz | >=0.20 | Optional for Spark Connect |
Spark SQL#
Installable with pip install "pyspark[sql]".
| Package | Supported version | Note |
|---|---|---|
| pandas | >=2.0.0 | Required for Spark SQL |
| pyarrow | >=11.0.0 | Required for Spark SQL |
Additional libraries that enhance functionality but are not included in the installation packages:
flameprof: Provides the default renderer for UDF performance profiling.
Pandas API on Spark#
Installable with pip install "pyspark[pandas_on_spark]".
| Package | Supported version | Note |
|---|---|---|
| pandas | >=2.2.0 | Required for Pandas API on Spark |
| pyarrow | >=11.0.0 | Required for Pandas API on Spark |
Additional libraries that enhance functionality but are not included in the installation packages:
mlflow: Required for pyspark.pandas.mlflow.
plotly: Provides plotting for visualization. It is recommended to use plotly over matplotlib.
matplotlib: Provides plotting for visualization. The default is plotly.
MLlib DataFrame-based API#
Installable with pip install "pyspark[ml]".
| Package | Supported version | Note |
|---|---|---|
| numpy | >=1.21 | Required for MLlib DataFrame-based API |
Additional libraries that enhance functionality but are not included in the installation packages:
scipy: Required for SciPy integration.
scikit-learn: Required for implementing machine learning algorithms.
torch: Required for machine learning model training.
torchvision: Required for supporting image and video processing.
torcheval: Required for facilitating model evaluation metrics.
deepspeed: Required for providing high-performance model training optimizations. Installable on non-Darwin systems.
MLlib#
Installable with pip install "pyspark[mllib]".
| Package | Supported version | Note |
|---|---|---|
| numpy | >=1.21 | Required for MLlib |