AZTK powered by Azure Batch: On-demand, Dockerized, Spark Jobs on Azure
Azure/aztk
Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark on Docker clusters in Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.
This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.
This repository has been marked for archival. It is no longer maintained.
- Spark cluster provision time of 5 minutes on average
- Spark clusters run in Docker containers
- Run Spark on a GPU enabled cluster
- Users can bring their own Docker image
- Ability to use low-priority VMs for an 80% discount
- Mixed Mode clusters that use both low-priority and dedicated VMs
- Built-in support for Azure Blob Storage and Azure Data Lake connections
- Tailored Pythonic experience with PySpark, Jupyter, and Anaconda
- Tailored R experience with SparklyR, RStudio-Server, and Tidyverse
- Ability to run spark submit directly from your local machine's CLI
- Install aztk with pip:
pip install aztk
- Initialize the project in a directory. This will automatically create a .aztk folder with config files in your working directory:
aztk spark init
- Log in or register for an Azure Account, navigate to Azure Cloud Shell, and run:
wget -q https://raw.githubusercontent.com/Azure/aztk/v0.10.3/account_setup.sh -O account_setup.sh && chmod 755 account_setup.sh && /bin/bash account_setup.sh
- Follow the on-screen prompts to create the necessary Azure resources and copy the output into your .aztk/secrets.yaml file. For more information see Getting Started Scripts.
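Before creating a cluster, it can help to confirm that the setup steps above actually left a secrets file in place. The following is a minimal sketch, not part of AZTK: the helper name `missing_aztk_files` is hypothetical, and it only assumes what the steps above state, namely that `aztk spark init` creates a `.aztk` folder in the working directory and that the account setup output goes into `.aztk/secrets.yaml`.

```python
from pathlib import Path


def missing_aztk_files(workdir, expected=("secrets.yaml",)):
    """Return the expected config files that are absent from <workdir>/.aztk.

    Hypothetical helper: it only checks that the files exist, not that
    their contents are valid.
    """
    config_dir = Path(workdir) / ".aztk"
    return [name for name in expected if not (config_dir / name).is_file()]


if __name__ == "__main__":
    missing = missing_aztk_files(Path.cwd())
    if missing:
        print("Missing from .aztk:", ", ".join(missing))
    else:
        print(".aztk looks initialized")
```

An empty list means the expected files are present; it does not say anything about whether the credentials inside them work.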
The core experience of this package is centered around a few commands.
# create your cluster
aztk spark cluster create
aztk spark cluster add-user
# monitor and manage your clusters
aztk spark cluster get
aztk spark cluster list
aztk spark cluster delete
# login and submit applications to your cluster
aztk spark cluster ssh
aztk spark cluster submit
First, create your cluster:
aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2
- See our available VM sizes here.
- The --vm-size argument must be the official SKU name, which usually comes in the form "standard_d2_v2"
- You can create low-priority VMs at an 80% discount by using --size-low-pri instead of --size
- By default, AZTK runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info here
- By default, AZTK will create a user (with the username spark) for your cluster
- The cluster id (--id) can only contain alphanumeric characters, hyphens, and underscores, and cannot contain more than 64 characters.
- By default, you cannot create clusters of more than 20 cores in total. Visit this page to request a core quota increase.
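The cluster id rule and the default core quota above can be checked locally before calling `aztk spark cluster create`. This is a rough sketch rather than AZTK's own validation: the function names are hypothetical, "alphanumeric" is assumed to mean ASCII letters and digits, and the cores-per-VM table is an assumption filled in by hand (standard_d2_v2 is taken to have 2 cores).

```python
import re

# Rule as stated above: alphanumerics, hyphens, underscores, at most 64 characters.
# Assumption: "alphanumeric" means ASCII letters and digits only.
CLUSTER_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")

# Default total core limit mentioned above.
DEFAULT_CORE_QUOTA = 20

# Assumption: hand-maintained lookup; extend it for the VM sizes you use.
ASSUMED_CORES_PER_VM = {"standard_d2_v2": 2}


def is_valid_cluster_id(cluster_id):
    """Check a cluster id against the stated character and length rule."""
    return CLUSTER_ID_PATTERN.fullmatch(cluster_id) is not None


def total_cores(size, vm_size):
    """Total cores for `size` nodes of the given VM size."""
    return size * ASSUMED_CORES_PER_VM[vm_size.lower()]


def fits_default_quota(size, vm_size):
    """True if the cluster stays within the default 20-core limit."""
    return total_cores(size, vm_size) <= DEFAULT_CORE_QUOTA


if __name__ == "__main__":
    print(is_valid_cluster_id("my_cluster"))
    print(total_cores(5, "standard_d2_v2"), fits_default_quota(5, "standard_d2_v2"))
```

Under these assumptions the example cluster (5 nodes of standard_d2_v2) uses 10 cores, so it fits within the default quota, while 11 such nodes would not.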
More information regarding using a cluster can be found in the cluster documentation
To check your cluster status, use the get command:
aztk spark cluster get --id my_cluster
When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of the spark-submit will be streamed to your local console. Run this command from the cloned AZTK repo:
# submit a Java application
aztk spark cluster submit \
    --id my_cluster \
    --name my_java_job \
    --class org.apache.spark.examples.SparkPi \
    --executor-memory 20G \
    path/to/examples.jar 1000

# submit a Python application
aztk spark cluster submit \
    --id my_cluster \
    --name my_python_job \
    --executor-memory 20G \
    path/to/pi.py 1000
- The aztk spark cluster submit command takes the same parameters as the standard spark-submit command, except instead of specifying --master, AZTK requires that you specify your cluster --id and a unique job --name
- The job name argument, --name, must be at least 3 characters long
  - It can only contain alphanumeric characters including hyphens but excluding underscores
  - It cannot contain uppercase letters
- Each job you submit must have a unique name
- Use the --no-wait option for your command to return immediately
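The job name rules above are easy to get wrong, so a small pre-flight check can be useful when scripting submissions. The sketch below is an illustration, not AZTK code: `is_valid_job_name` and `build_submit_command` are hypothetical helpers, "alphanumeric" is assumed to mean ASCII, and the only CLI flags used are the ones shown in the examples above (`--id`, `--name`, `--executor-memory`).

```python
import re
import shlex

# Rules as stated above: at least 3 characters, lowercase letters, digits,
# and hyphens only (no underscores, no uppercase). ASCII is an assumption.
JOB_NAME_PATTERN = re.compile(r"[a-z0-9-]{3,}")


def is_valid_job_name(name):
    """Check a job name against the stated rules."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


def build_submit_command(cluster_id, job_name, app, app_args=(), extra_options=()):
    """Build the argument list for `aztk spark cluster submit`.

    Hypothetical helper: it only assembles the command; running it
    (for example with subprocess.run) is left to the caller.
    """
    if not is_valid_job_name(job_name):
        raise ValueError(f"invalid job name: {job_name!r}")
    return [
        "aztk", "spark", "cluster", "submit",
        "--id", cluster_id,
        "--name", job_name,
        *extra_options,
        app,
        *app_args,
    ]


if __name__ == "__main__":
    command = build_submit_command(
        "my_cluster",
        "my-python-job",
        "path/to/pi.py",
        app_args=["1000"],
        extra_options=["--executor-memory", "20G"],
    )
    print(shlex.join(command))
```

Under the rules as written, a name containing an underscore (for example `my_python_job`) would be rejected by this check, which is why the sketch uses a hyphenated name. Uniqueness across jobs is not checked here, since that depends on the jobs already submitted to the cluster.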
Learn more about the spark submit command here
Most users will want to work interactively with their Spark clusters. With the aztk spark cluster ssh command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:
aztk spark cluster ssh --id my_cluster --user spark
By default, we port forward the Spark Web UI to localhost:8080, Spark Jobs UI to localhost:4040, and the Spark History Server to localhost:18080.
You can configure these settings in the .aztk/ssh.yaml file.
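Once the SSH session is open, the forwarded UIs should be reachable on the local ports listed above. The following sketch checks whether anything is listening on those ports. It is not part of AZTK: the helper names are hypothetical, the port numbers are the defaults quoted above, and it assumes the forwards are bound on the IPv4 loopback address.

```python
import socket

# Default local ports quoted above; adjust if you changed them in .aztk/ssh.yaml.
DEFAULT_FORWARDS = {
    "Spark Web UI": 8080,
    "Spark Jobs UI": 4040,
    "Spark History Server": 18080,
}


def local_urls(forwards=DEFAULT_FORWARDS, host="localhost"):
    """Map each UI name to the local URL it should be forwarded to."""
    return {name: f"http://{host}:{port}" for name, port in forwards.items()}


def is_listening(port, host="127.0.0.1", timeout=0.5):
    """Return True if a TCP connection to host:port succeeds."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        return sock.connect_ex((host, port)) == 0


if __name__ == "__main__":
    for name, url in local_urls().items():
        state = "up" if is_listening(DEFAULT_FORWARDS[name]) else "not reachable"
        print(f"{name}: {url} ({state})")
```

A port that is "up" only shows that something accepted the connection; it does not confirm that the service behind it is the Spark UI.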
NOTE: When working interactively, you may want to use tools like Jupyter or RStudio-Server. To do so, you need to set up your cluster with the appropriate Docker image and plugin. See Plugins for more information.
You can also see your clusters from the CLI:
aztk spark cluster list
And get the state of any specified cluster:
aztk spark cluster get --id <my_cluster_id>
Finally, you can delete any specified cluster:
aztk spark cluster delete --id <my_cluster_id>
- How do I connect to Azure Storage (WASB)?
- I want to use a different version of Spark
- How do I SSH into my Spark cluster's master node?
- How do I interact with my Spark cluster using a password instead of an SSH-key?
- How do I change my cluster default settings?
- How do I modify my spark-env.sh, spark-defaults.conf or core-site.xml files?
- How do I use GPUs with AZTK?
- I'm a Python user and want to use PySpark, Jupyter, Anaconda packages, and have a Pythonic experience.
- I'm an R user and want to use SparklyR, RStudio, Tidyverse packages, and have an R experience.
You can find more documentation here