PySpark Overview
Date: May 19, 2025 | Version: 4.0.0
Useful links: Live Notebook | GitHub | Issues | Examples | Community | Stack Overflow | Dev Mailing List | User Mailing List
PySpark is the Python API for Apache Spark. It enables you to perform real-time, large-scale data processing in a distributed environment using Python. It also provides a PySpark shell for interactively analyzing your data.
PySpark combines Python’s learnability and ease of use with the power of Apache Spark to enable processing and analysis of data at any size for everyone familiar with Python.
PySpark supports all of Spark’s features such as Spark SQL, DataFrames, Structured Streaming, Machine Learning (MLlib) and Spark Core.
Python Spark Connect Client
Spark Connect is a client-server architecture within Apache Spark that enables remote connectivity to Spark clusters from any application. PySpark provides the client for the Spark Connect server, allowing Spark to be used as a service.
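As a minimal sketch, a client can connect to a Spark Connect server by passing the server URL to the session builder; the endpoint `sc://localhost:15002` below is a placeholder and assumes a Spark Connect server is already running there.

```python
from pyspark.sql import SparkSession

# Connect to a remote Spark Connect server (placeholder URL).
spark = SparkSession.builder.remote("sc://localhost:15002").getOrCreate()

# The session behaves like a regular SparkSession, but every query
# is executed on the remote cluster rather than in the local process.
spark.range(5).show()
```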
Spark SQL and DataFrames
Spark SQL is Apache Spark’s module for working with structured data. It allows you to seamlessly mix SQL queries with Spark programs. With PySpark DataFrames you can efficiently read, write, transform, and analyze data using Python and SQL. Whether you use Python or SQL, the same underlying execution engine is used so you will always leverage the full power of Spark.
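For illustration, the same filter can be expressed either as a SQL query against a temporary view or with the DataFrame API; the column names and toy data below are made up for the example.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Create a small DataFrame and register it as a temporary view.
df = spark.createDataFrame([("Alice", 34), ("Bob", 45)], ["name", "age"])
df.createOrReplaceTempView("people")

# SQL and DataFrame queries run on the same execution engine.
spark.sql("SELECT name FROM people WHERE age > 40").show()
df.filter(df.age > 40).select("name").show()
```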
Pandas API on Spark
Pandas API on Spark allows you to scale your pandas workload to any size by running it distributed across multiple nodes. If you are already familiar with pandas and want to leverage Spark for big data, pandas API on Spark makes you immediately productive and lets you migrate your applications without modifying the code. You can have a single codebase that works both with pandas (tests, smaller datasets) and with Spark (production, distributed datasets) and you can switch between the pandas API and the Pandas API on Spark easily and without overhead.
Pandas API on Spark aims to make the transition from pandas to Spark easy but if you are new to Spark or deciding which API to use, we recommend using PySpark (see Spark SQL and DataFrames).
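A brief sketch of the pandas-style syntax, using invented toy data; the conversions to and from plain pandas shown here are one way to move between the two.

```python
import pyspark.pandas as ps

# Familiar pandas-style syntax, executed on Spark under the hood.
psdf = ps.DataFrame({"a": [1, 2, 3], "b": [4, 5, 6]})
print(psdf.describe())

# Convert to a plain pandas DataFrame and back again.
pdf = psdf.to_pandas()
psdf2 = ps.from_pandas(pdf)
```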
Structured Streaming
Structured Streaming is a scalable and fault-tolerant stream processing engine built on the Spark SQL engine. You can express your streaming computation the same way you would express a batch computation on static data. The Spark SQL engine will take care of running it incrementally and continuously and updating the final result as streaming data continues to arrive.
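A minimal sketch using the built-in `rate` source, which generates rows continuously and needs no external infrastructure; the rows-per-second setting and console sink are chosen only for demonstration.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# A streaming DataFrame backed by the built-in "rate" source.
stream = spark.readStream.format("rate").option("rowsPerSecond", 5).load()

# The aggregation is written exactly as it would be for static data;
# Spark runs it incrementally as new rows arrive.
counts = stream.groupBy().count()

query = (
    counts.writeStream
    .outputMode("complete")
    .format("console")
    .start()
)
query.awaitTermination(10)  # run briefly for the example
query.stop()
```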
Machine Learning (MLlib)
Built on top of Spark, MLlib is a scalable machine learning library that provides a uniform set of high-level APIs that help users create and tune practical machine learning pipelines.
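A small sketch of a two-stage pipeline; the feature columns and toy training data are invented for the example.

```python
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.feature import VectorAssembler
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Toy training data with two features and a binary label.
train = spark.createDataFrame(
    [(1.0, 0.0, 1.0), (0.0, 1.0, 0.0), (1.0, 1.0, 1.0), (0.0, 0.0, 0.0)],
    ["f1", "f2", "label"],
)

# Assemble raw columns into a feature vector, then fit a classifier.
assembler = VectorAssembler(inputCols=["f1", "f2"], outputCol="features")
lr = LogisticRegression(featuresCol="features", labelCol="label")
model = Pipeline(stages=[assembler, lr]).fit(train)

model.transform(train).select("features", "label", "prediction").show()
```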
Spark Core and RDDs
Spark Core is the underlying general execution engine for the Spark platform that all other functionality is built on top of. It provides RDDs (Resilient Distributed Datasets) and in-memory computing capabilities.
Note that the RDD API is a low-level API which can be difficult to use and you do not get the benefit of Spark’s automatic query optimization capabilities. We recommend using DataFrames (see Spark SQL and DataFrames above) instead of RDDs as it allows you to express what you want more easily and lets Spark automatically construct the most efficient query for you.
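For completeness, a tiny RDD example; the data and transformations are arbitrary and chosen only to show the low-level API.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
sc = spark.sparkContext

# Distribute a local collection as an RDD and apply low-level transformations.
rdd = sc.parallelize(range(10))
even_squares = rdd.map(lambda x: x * x).filter(lambda x: x % 2 == 0)
print(even_squares.collect())  # [0, 4, 16, 36, 64]
```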
Spark Streaming (Legacy)
Spark Streaming is an extension of the core Spark API that enables scalable, high-throughput, fault-tolerant stream processing of live data streams.
Note that Spark Streaming is the previous generation of Spark’s streaming engine. It is a legacy project and it is no longer being updated. There is a newer and easier to use streaming engine in Spark called Structured Streaming which you should use for your streaming applications and pipelines.