sparkmagic

Sparkmagic is a set of tools for interactively working with remote Spark clusters in Jupyter notebooks. Sparkmagic interacts with remote Spark clusters through a REST server. Currently there are three server implementations compatible with Sparkmagic:

Livy - for running interactive sessions on Yarn
Lighter - for running interactive sessions on Yarn or Kubernetes (only PySpark sessions are supported)
Ilum - for running interactive sessions on Yarn or Kubernetes

The Sparkmagic project includes a set of magics for interactively running Spark code in multiple languages, as well as some kernels that you can use to turn Jupyter into an integrated Spark environment.

Features

Run Spark code in multiple languages against any remote Spark cluster through Livy
Automatic SparkContext (sc) and HiveContext (sqlContext) creation
Easily execute SparkSQL queries with the `%%sql` magic
Automatic visualization of SQL queries in the PySpark, Spark and SparkR kernels; use an easy visual interface to interactively construct visualizations, no code required
Easy access to Spark application information and logs (%%info magic)
Ability to capture the output of SQL queries as Pandas dataframes to interact with other Python libraries (e.g. matplotlib)
Send local files or dataframes to a remote cluster (e.g. sending pretrained local ML model straight to the Spark cluster)
Authenticate to Livy via Basic Access authentication or via Kerberos

Examples

There are two ways to use sparkmagic. Head over to the examples section for a demonstration on how to use both models of execution.

1. Via the IPython kernel

The sparkmagic library provides a `%%spark` magic that you can use to easily run code against a remote Spark cluster from a normal IPython notebook. See the Spark Magics on IPython sample notebook.

2. Via the PySpark and Spark kernels

The sparkmagic library also provides a set of Scala and Python kernels that allow you to automatically connect to a remote Spark cluster, run code and SQL queries, manage your Livy server and Spark job configuration, and generate automatic visualizations. See the PySpark and Spark sample notebooks.

3. Sending local data to Spark Kernel

See the Sending Local Data to Spark notebook.

Installation

Jupyter Notebook 7.x / JupyterLab 3.x

Install the library
```
 pip install sparkmagic
```
Make sure that ipywidgets is properly installed by running
```
 pip install ipywidgets
```

(Optional) Install the wrapper kernels. Run `pip show sparkmagic` to see the path where `sparkmagic` is installed. `cd` to that location and run:

    jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
    jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

(Optional) Modify the configuration file at `~/.sparkmagic/config.json`. Look at the `example_config.json` for reference.
(Optional) Enable the server extension so that clusters can be programmatically changed:
```
 jupyter server extension enable --py sparkmagic
```
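The kernel install step above needs the directory that `pip show sparkmagic` reports as the install location. As a minimal sketch, the same directory can be looked up from Python using only the standard library. The helper name `package_parent_dir` is made up for this illustration and is not part of sparkmagic.

```python
import importlib.util
from pathlib import Path
from typing import Optional


def package_parent_dir(package_name: str) -> Optional[Path]:
    """Return the directory that contains an installed package, or None.

    For sparkmagic this is the directory to `cd` into before running the
    `jupyter-kernelspec install sparkmagic/kernels/...` commands.
    """
    spec = importlib.util.find_spec(package_name)
    if spec is None or not spec.submodule_search_locations:
        return None  # not installed, or not a package with a directory
    package_dir = Path(next(iter(spec.submodule_search_locations)))
    return package_dir.parent


if __name__ == "__main__":
    location = package_parent_dir("sparkmagic")
    if location is None:
        print("sparkmagic is not installed in this environment")
    else:
        print(f"cd {location}")
```

Run it with the same Python interpreter that Jupyter uses, otherwise the reported path may belong to a different environment.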

Jupyter Notebook 5.2 or earlier / JupyterLab 1 or 2

Install the library
```
 pip install sparkmagic
```

Make sure that ipywidgets is properly installed by running

 jupyter nbextension enable --py --sys-prefix widgetsnbextension

If you're using JupyterLab 1 or 2, you'll need to run another command:
```
 jupyter labextension install "@jupyter-widgets/jupyterlab-manager"
```

(Optional) Install the wrapper kernels. Run `pip show sparkmagic` to see the path where `sparkmagic` is installed. `cd` to that location and run:

    jupyter-kernelspec install sparkmagic/kernels/sparkkernel
    jupyter-kernelspec install sparkmagic/kernels/pysparkkernel
    jupyter-kernelspec install sparkmagic/kernels/sparkrkernel

(Optional) Modify the configuration file at `~/.sparkmagic/config.json`. Look at the `example_config.json` for reference.
(Optional) Enable the server extension so that clusters can be programmatically changed:
```
 jupyter serverextension enable --py sparkmagic
```

Authentication Methods

Sparkmagic supports:

No auth
Basic authentication
Kerberos

The Authenticator is the mechanism for authenticating to Livy. The base Authenticator used by itself supports no auth, but it can be subclassed to enable authentication via other methods. Two such examples are the Basic and Kerberos Authenticators.

Kerberos Authenticator

Kerberos support is implemented via the `requests-kerberos` package. Sparkmagic expects a Kerberos ticket to be available in the system. `requests-kerberos` will pick up the Kerberos ticket from a cache file. For the ticket to be available, the user needs to have run `kinit` to create the Kerberos ticket.

Kerberos Configuration

By default, the `HTTPKerberosAuth` constructor provided by the `requests-kerberos` package is called with the following configuration:

    HTTPKerberosAuth(mutual_authentication=REQUIRED)

This will not be the right configuration for every context, so you can pass custom arguments to this constructor through the following setting in `~/.sparkmagic/config.json`:

    {
      "kerberos_auth_configuration": {
        "mutual_authentication": 1,
        "service": "HTTP",
        "delegate": false,
        "force_preemptive": false,
        "principal": "principal",
        "hostname_override": "hostname_override",
        "sanitize_mutual_error_response": true,
        "send_cbt": true
      }
    }
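If `~/.sparkmagic/config.json` already holds other settings, the Kerberos block above has to be merged in rather than pasted over the file. A minimal sketch of such a merge follows. The helper `merge_into_config` is hypothetical (it is not shipped with sparkmagic), and the values are the placeholders from the example above.

```python
import json
from pathlib import Path

# Placeholder values copied from the example above; adjust to your setup.
KERBEROS_SETTINGS = {
    "kerberos_auth_configuration": {
        "mutual_authentication": 1,
        "service": "HTTP",
        "delegate": False,
        "force_preemptive": False,
        "principal": "principal",
        "hostname_override": "hostname_override",
        "sanitize_mutual_error_response": True,
        "send_cbt": True,
    }
}


def merge_into_config(config_path: Path, new_settings: dict) -> dict:
    """Merge top-level keys into a JSON config file and return the result."""
    config = {}
    if config_path.exists():
        config = json.loads(config_path.read_text(encoding="utf-8"))
    config.update(new_settings)  # each top-level key is replaced as a whole
    config_path.parent.mkdir(parents=True, exist_ok=True)
    config_path.write_text(json.dumps(config, indent=2), encoding="utf-8")
    return config


# Usage (writes to the config location named in this README):
# merge_into_config(Path.home() / ".sparkmagic" / "config.json", KERBEROS_SETTINGS)
```

The merge is shallow on purpose: an existing `kerberos_auth_configuration` entry is replaced entirely, while unrelated top-level keys are left untouched.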

Custom Authenticators

You can write custom Authenticator subclasses to enable authentication via other mechanisms. All Authenticator subclasses should override the `Authenticator.__call__(request)` method that attaches HTTP Authentication to the given Request object.

Authenticator subclasses that add additional class attributes to be used for the authentication, such as the [Basic](sparkmagic/sparkmagic/auth/basic.py) authenticator which adds `username` and `password` attributes, should override the `__hash__`, `__eq__`, `update_with_widget_values`, and `get_widgets` methods to work with these new attributes. This is necessary in order for the Authenticator to use these attributes in the authentication process.
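To make the override rules above concrete, here is a rough sketch of a subclass that adds one attribute. Everything in it is illustrative: `TokenAuthenticator` and its bearer-token scheme are hypothetical, and the small `Authenticator` class is only a simplified stand-in so the snippet runs without sparkmagic installed. In real use you would subclass `sparkmagic.auth.customauth.Authenticator`, follow its actual constructor, and also override `update_with_widget_values` and `get_widgets`, which this sketch leaves out.

```python
from types import SimpleNamespace


class Authenticator:
    """Simplified stand-in for sparkmagic's base Authenticator (no auth)."""

    def __init__(self, url=""):
        self.url = url

    def __call__(self, request):
        return request  # nothing attached: no authentication

    def __eq__(self, other):
        return isinstance(other, Authenticator) and self.url == other.url

    def __hash__(self):
        return hash((self.url, self.__class__.__name__))


class TokenAuthenticator(Authenticator):
    """Hypothetical authenticator that attaches a bearer token header."""

    def __init__(self, url="", token=""):
        super().__init__(url)
        self.token = token  # additional attribute used for authentication

    def __call__(self, request):
        # Attach HTTP authentication to the outgoing request.
        request.headers["Authorization"] = f"Bearer {self.token}"
        return request

    # The new attribute has to take part in equality and hashing.
    def __eq__(self, other):
        return (
            isinstance(other, TokenAuthenticator)
            and self.url == other.url
            and self.token == other.token
        )

    def __hash__(self):
        return hash((self.url, self.token, self.__class__.__name__))


# A bare object with a headers dict stands in for a real request here.
request = SimpleNamespace(headers={})
TokenAuthenticator("http://livy:8998", "example-token")(request)
print(request.headers)
```

Including `token` in `__eq__` and `__hash__` means two authenticators for the same URL with different credentials are treated as distinct.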

Using a Custom Authenticator with Sparkmagic

If your repository layout is:

    .
    ├── LICENSE
    ├── README.md
    ├── customauthenticator
    │   ├── __init__.py
    │   ├── customauthenticator.py
    └── setup.py

Then to pip install from this repository, run: `pip install git+https://git_repo_url/#egg=customauthenticator`

After installing, you need to register the custom authenticator with Sparkmagic so it can be dynamically imported. This can be done in two different ways:

Edit the configuration file at `~/.sparkmagic/config.json` with the following settings:
```
{
  "authenticators": {
    "Kerberos": "sparkmagic.auth.kerberos.Kerberos",
    "None": "sparkmagic.auth.customauth.Authenticator",
    "Basic_Access": "sparkmagic.auth.basic.Basic",
    "Custom_Auth": "customauthenticator.customauthenticator.CustomAuthenticator"
  }
}
```
This adds your `CustomAuthenticator` class in `customauthenticator.py` to Sparkmagic. `Custom_Auth` is the authentication type that will be displayed in the `%manage_spark` widget's Auth type dropdown as well as the Auth type passed as an argument to the `-t` flag in the `%spark add` session magic.

Modify the `authenticators` method in `sparkmagic/utils/configuration.py` to return your custom authenticator:

    def authenticators():
        return {
            u"Kerberos": u"sparkmagic.auth.kerberos.Kerberos",
            u"None": u"sparkmagic.auth.customauth.Authenticator",
            u"Basic_Access": u"sparkmagic.auth.basic.Basic",
            u"Custom_Auth": u"customauthenticator.customauthenticator.CustomAuthenticator",
        }
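Both registration routes map a display name to a dotted `module.ClassName` path, which is what lets the class be imported dynamically. The sketch below shows one common way to resolve such a path with `importlib`. It illustrates the general technique only and is not necessarily how sparkmagic loads authenticators. The helper `load_class` is hypothetical, and a standard-library class stands in for the custom authenticator so the snippet runs anywhere.

```python
import importlib


def load_class(dotted_path: str):
    """Import 'package.module.ClassName' and return the class object."""
    module_path, _, class_name = dotted_path.rpartition(".")
    if not module_path:
        raise ValueError(f"expected 'module.ClassName', got {dotted_path!r}")
    module = importlib.import_module(module_path)
    return getattr(module, class_name)


# Hypothetical registry in the same shape as the "authenticators" setting.
# With the package from the layout above installed, the value would be
# "customauthenticator.customauthenticator.CustomAuthenticator".
registry = {"Custom_Auth": "collections.OrderedDict"}

auth_class = load_class(registry["Custom_Auth"])
print(auth_class.__name__)
```

A typo in the dotted path surfaces as an `ImportError` or `AttributeError` from this kind of lookup, which is worth keeping in mind if a newly registered authenticator fails to load.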

Spark config settings

There are two config options for Spark settings: `session_configs_defaults` and `session_configs`. `session_configs_defaults` sets default settings that have to be explicitly overridden in order for a user to change them. `session_configs` provides defaults that are all replaced whenever a user changes them using the configure magic.
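The difference between the two options is easiest to see with values. The following toy model encodes one reading of the paragraph above. It is not sparkmagic's implementation, the function name is made up, and the merge is deliberately shallow. The keys used (`driverMemory`, `executorCores`, `executorMemory`) are ordinary Livy session parameters chosen for illustration.

```python
def effective_session_config(defaults, session_configs, user_configs=None):
    """Toy model of how the two settings combine.

    defaults        -- session_configs_defaults: kept unless the user
                       explicitly overrides a given key
    session_configs -- session_configs: used until the user supplies their
                       own configuration, then replaced as a whole
    user_configs    -- what the user passes to the configure magic, if anything
    """
    chosen = session_configs if user_configs is None else user_configs
    merged = dict(defaults)
    merged.update(chosen)
    return merged


defaults = {"driverMemory": "1000M"}
session_configs = {"executorCores": 2, "executorMemory": "2G"}

# No user configuration: both sets of defaults apply.
print(effective_session_config(defaults, session_configs))

# The user configures only executorCores: executorMemory from
# session_configs is dropped, while driverMemory from the defaults survives.
print(effective_session_config(defaults, session_configs, {"executorCores": 4}))
```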

HTTP Session Adapters

If you need to customize HTTP request behavior for specific domains by modifying headers, implementing custom logic (e.g., using mTLS, retrying requests), or handling them differently, you can use a custom adapter to gain fine-grained control over request processing.

More details on how to configure and use HTTP adapters can be found here.

To configure a custom HTTP adapter, edit `~/.sparkmagic/config.json` with the following settings:

    "http_session_config": {
      "adapters": [
        {
          "prefix": "http://",
          "adapter": "customadapter.customadapter.CustomAdapter"
        }
      ]
    },

This adds your `CustomAdapter` class in `customadapter.py` to the HTTP session that sparkmagic uses for Livy requests.

Papermill

If you want Papermill rendering to stop on a Spark error, edit `~/.sparkmagic/config.json` with the following settings:

    {
      "shutdown_session_on_spark_statement_errors": true,
      "all_errors_are_fatal": true
    }

If you want any registered livy sessions to be cleaned up on exit regardless of whether the process exits gracefully or not, you can set:

    {
      "cleanup_all_sessions_on_exit": true,
      "all_errors_are_fatal": true
    }

Conf overrides in code

In addition to the conf at `~/.sparkmagic/config.json`, sparkmagic conf can be overridden programmatically in a notebook.

For example:

    import sparkmagic.utils.configuration as conf
    conf.override('cleanup_all_sessions_on_exit', True)

Same thing, but referencing the conf member:

    conf.override(conf.cleanup_all_sessions_on_exit.__name__, True)

NOTE: the override for `cleanup_all_sessions_on_exit` must be set before initializing sparkmagic, i.e. before this:

    %load_ext sparkmagic.magics

Docker

The included `docker-compose.yml` file will let you spin up a full sparkmagic stack that includes a Jupyter notebook with the appropriate extensions installed, and a Livy server backed by a local-mode Spark instance. (This is just for testing and developing sparkmagic itself; in reality, sparkmagic is not very useful if your Spark instance is on the same machine!)

In order to use it, make sure you have Docker and Docker Compose both installed, and then simply run:

    docker compose build
    docker compose up

You will then be able to access the Jupyter notebook in your browser at http://localhost:8888. Inside this notebook, you can configure a sparkmagic endpoint at http://spark:8998. This endpoint is able to launch both Scala and Python sessions. You can also choose to start a wrapper kernel for Scala, Python, or R from the list of kernels.

To shut down the containers, you can interrupt `docker compose` with Ctrl-C, and optionally remove the containers with `docker compose down`.

If you are developing sparkmagic and want to test out your changes in the Docker container without needing to push a version to PyPI, you can set the `dev_mode` build arg in `docker-compose.yml` to `true`, and then re-build the container. This will cause the container to install your local version of autovizwidget, hdijupyterutils, and sparkmagic. The local packages are installed with the editable flag, meaning you can make edits directly to the libraries within the Jupyterlab docker service to debug issues in realtime. To make local changes available in Jupyterlab, make sure to re-run `docker compose build` before spinning up the services.

Server extension API

`/reconnectsparkmagic`:

POST: Specifies Spark cluster connection information for a notebook, given the notebook path and the cluster information. The kernel will be started or restarted and connected to the specified cluster.

Request Body example:

    { "path": "path.ipynb", "username": "username", "password": "password", "endpoint": "url", "auth": "Kerberos", "kernelname": "pysparkkernel" }

Note that `auth` can be either `None`, `Basic_Access` or `Kerberos`, based on the authentication enabled in Livy. The `kernelname` parameter is optional and defaults to the one specified in the config file, or to `pysparkkernel` if it is not set there. Returns `200` if successful; `400` if the body is not a JSON string or a key is not found; `500` if an error is encountered changing clusters.

Reply Body example:

    { "success": true, "error": null }
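As a rough illustration of calling this endpoint from a script, the sketch below builds the POST request with the standard library and maps the documented status codes to messages. The function names are made up for this example, the base URL `http://localhost:8888` is an assumption taken from the Docker setup, and any authentication that the Jupyter server itself requires (for example a token) is left out.

```python
import json
import urllib.request
from typing import Optional


def build_reconnect_request(base_url: str, path: str, endpoint: str,
                            auth: str = "None", username: str = "",
                            password: str = "",
                            kernelname: Optional[str] = None):
    """Build (but do not send) a POST request for /reconnectsparkmagic."""
    body = {
        "path": path,
        "username": username,
        "password": password,
        "endpoint": endpoint,
        "auth": auth,  # None, Basic_Access or Kerberos
    }
    if kernelname is not None:
        body["kernelname"] = kernelname  # optional per the description above
    return urllib.request.Request(
        url=base_url.rstrip("/") + "/reconnectsparkmagic",
        data=json.dumps(body).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def describe_status(status: int) -> str:
    """Translate the documented status codes into a short message."""
    return {
        200: "cluster changed successfully",
        400: "body is not a JSON string or a key is missing",
        500: "error encountered while changing clusters",
    }.get(status, f"unexpected status {status}")


request = build_reconnect_request(
    "http://localhost:8888", "path.ipynb", "http://spark:8998",
    kernelname="pysparkkernel",
)
print(request.get_method(), request.full_url)
# To actually send it: urllib.request.urlopen(request)
```

Building the body with `json.dumps` guarantees valid JSON, which avoids the `400` case caused by a malformed request body.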

Architecture

Sparkmagic uses Livy, a REST server for Spark, to remotely execute all user code. The library then automatically collects the output of your code as plain text or a JSON document, displaying the results to you as formatted text or as a Pandas dataframe as appropriate.

This architecture offers us some important advantages:

Run Spark code completely remotely; no Spark components need to be installed on the Jupyter server
Multi-language support; the Python, Python3, Scala and R kernels are equally feature-rich, and adding support for more languages will be easy
Support for multiple endpoints; you can use a single notebook to start multiple Spark jobs in different languages and against different remote clusters
Easy integration with any Python library for data science or visualization, like Pandas or Plotly

However, there are some important limitations to note:

Some overhead added by sending all code and output through Livy
Since all code is run on a remote driver through Livy, all structured data must be serialized to JSON and parsed by the Sparkmagic library so that it can be manipulated and visualized on the client side. In practice this means that you must use Python for client-side data manipulation in `%%local` mode.
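The round trip through Livy described in this section can be pictured with a small client-side sketch. It is a simplified illustration based on the shape of Livy's statements REST API, not sparkmagic's own code. The function names are hypothetical and the sample response is written by hand.

```python
import json


def statement_payload(code: str) -> bytes:
    """JSON body for POST /sessions/{id}/statements on a Livy server."""
    return json.dumps({"code": code}).encode("utf-8")


def extract_output(statement: dict):
    """Return ('json', obj) or ('text', str) from a finished statement."""
    output = statement.get("output") or {}
    if output.get("status") != "ok":
        raise RuntimeError(output.get("evalue", "statement did not succeed"))
    data = output.get("data", {})
    if "application/json" in data:
        return "json", data["application/json"]  # structured result
    return "text", data.get("text/plain", "")  # plain text result


# Hand-written sample in the shape of a Livy statement response.
sample = {
    "id": 0,
    "state": "available",
    "output": {
        "status": "ok",
        "execution_count": 0,
        "data": {"text/plain": "4"},
    },
}

kind, value = extract_output(sample)
print(kind, value)
```

Only the code string travels to the cluster and only serialized output comes back, which is why no Spark components are needed on the Jupyter side and why structured results must pass through JSON.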

Contributing

We welcome contributions from everyone. If you've made an improvement to our code, please send us a pull request.

To dev install, execute the following:

Clone the repo

    git clone https://github.com/jupyter-incubator/sparkmagic

Install local versions of packages

    pip install -e hdijupyterutils
    pip install -e autovizwidget
    pip install -e sparkmagic

Alternatively, you can use Poetry to set up a virtual environment

    poetry install
    # If you run into issues installing numpy or pandas, run
    # poetry run pip install numpy pandas
    # then re-run poetry install

Run unit tests with pytest

    # if you don't have pytest and mock installed, run
    # pip install pytest mock
    pytest

If you installed packages with Poetry, run

    poetry run pytest

If you want to see an enhancement made but don't have time to work on it yourself, feel free to submit an issue for us to deal with.
