This repository was archived by the owner on Feb 3, 2021. It is now read-only.

Azure/aztk (public archive)

AZTK powered by Azure Batch: On-demand, Dockerized, Spark Jobs on Azure

License

MIT license

Azure Distributed Data Engineering Toolkit (AZTK)

Azure Distributed Data Engineering Toolkit (AZTK) is a Python CLI application for provisioning on-demand Spark clusters that run in Docker containers on Azure. It's a cheap and easy way to get up and running with a Spark cluster, and a great tool for Spark users who want to experiment and start testing at scale.

This toolkit is built on top of Azure Batch but does not require any Azure Batch knowledge to use.

Status

This repository has been marked for archival. It is no longer maintained.

Notable Features

Spark cluster provision time of 5 minutes on average
Spark clusters run in Docker containers
Run Spark on a GPU enabled cluster
Users can bring their own Docker image
Ability to use low-priority VMs for an 80% discount
Mixed Mode clusters that use both low-priority and dedicated VMs
Built in support for Azure Blob Storage and Azure Data Lake connection
Tailored pythonic experience with PySpark, Jupyter, and Anaconda
Tailored R experience with SparklyR, RStudio-Server, and Tidyverse
Ability to run spark submit directly from your local machine's CLI
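As a rough illustration of what the 80% low-priority discount means for a mixed-mode cluster, the arithmetic can be sketched as follows. The per-node price is a made-up placeholder, not an Azure price, and the function name is hypothetical; only the 80% figure comes from the feature list above.

```python
def hourly_cluster_cost(dedicated_nodes, low_priority_nodes,
                        price_per_node_hour, low_priority_discount=0.80):
    """Estimate the hourly VM cost of a cluster mixing both node types."""
    if dedicated_nodes < 0 or low_priority_nodes < 0:
        raise ValueError("node counts must be non-negative")
    # A low-priority node is billed at the dedicated price minus the discount.
    low_priority_price = price_per_node_hour * (1.0 - low_priority_discount)
    return (dedicated_nodes * price_per_node_hour
            + low_priority_nodes * low_priority_price)


# Hypothetical price of 0.10 per node-hour, chosen only to keep the numbers simple.
all_dedicated = hourly_cluster_cost(5, 0, 0.10)
mixed = hourly_cluster_cost(2, 3, 0.10)
print(f"all dedicated: {all_dedicated:.2f}/hour, mixed: {mixed:.2f}/hour")
```

The sketch ignores storage, networking, and the possibility that low-priority nodes are preempted, so treat it as a back-of-the-envelope comparison only.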

Setup

Install aztk with pip:

    pip install aztk

Initialize the project in a directory. This will automatically create a .aztk folder with config files in your working directory:

    aztk spark init

Then download and run the account setup script:

    wget -q https://raw.githubusercontent.com/Azure/aztk/v0.10.3/account_setup.sh -O account_setup.sh && chmod 755 account_setup.sh && /bin/bash account_setup.sh

Follow the on-screen prompts to create the necessary Azure resources and copy the output into your .aztk/secrets.yaml file. For more information see Getting Started Scripts.

Quickstart Guide

The core experience of this package is centered around a few commands.

    # create your cluster
    aztk spark cluster create
    aztk spark cluster add-user

    # monitor and manage your clusters
    aztk spark cluster get
    aztk spark cluster list
    aztk spark cluster delete

    # login and submit applications to your cluster
    aztk spark cluster ssh
    aztk spark cluster submit

1. Create and setup your cluster

First, create your cluster:

    aztk spark cluster create --id my_cluster --size 5 --vm-size standard_d2_v2

See our available VM sizes here.
The --vm-size argument must be the official SKU name, which usually comes in the form: "standard_d2_v2"
You can create low-priority VMs at an 80% discount by using --size-low-pri instead of --size
By default, AZTK runs Spark 2.2.0 on an Ubuntu 16.04 Docker image. More info here
By default, AZTK will create a user (with the username spark) for your cluster
The cluster id (--id) can only contain alphanumeric characters, hyphens, and underscores, and cannot contain more than 64 characters.
By default, you cannot create clusters of more than 20 cores in total. Visit this page to request a core quota increase.
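The constraints in the notes above can be checked locally before calling the CLI. The following is a minimal sketch, not part of AZTK: the helper names are hypothetical, and the figure of 2 cores per standard_d2_v2 node is an assumption you should confirm against the VM size list. It only builds the argument list for the create command shown earlier and does not run it.

```python
import re

# Cluster ids: alphanumeric characters, hyphens, and underscores, at most 64 characters.
CLUSTER_ID_PATTERN = re.compile(r"[A-Za-z0-9_-]{1,64}")
DEFAULT_CORE_QUOTA = 20  # default total core limit mentioned in the notes


def is_valid_cluster_id(cluster_id):
    """Return True if the id satisfies the documented character and length rules."""
    return CLUSTER_ID_PATTERN.fullmatch(cluster_id) is not None


def fits_default_quota(node_count, cores_per_node, quota=DEFAULT_CORE_QUOTA):
    """Return True if node_count * cores_per_node stays within the core quota."""
    return node_count * cores_per_node <= quota


def build_create_command(cluster_id, node_count, vm_size, low_priority=False):
    """Build the argument list for `aztk spark cluster create` without running it."""
    if not is_valid_cluster_id(cluster_id):
        raise ValueError(f"invalid cluster id: {cluster_id!r}")
    size_flag = "--size-low-pri" if low_priority else "--size"
    return ["aztk", "spark", "cluster", "create",
            "--id", cluster_id,
            size_flag, str(node_count),
            "--vm-size", vm_size]


# Assumption: a standard_d2_v2 node has 2 cores.
if fits_default_quota(node_count=5, cores_per_node=2):
    print(" ".join(build_create_command("my_cluster", 5, "standard_d2_v2")))
```

The resulting list can be passed to `subprocess.run` if you want to launch the cluster from a script rather than from the shell.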

More information regarding using a cluster can be found in the cluster documentation.

2. Check on your cluster status

To check your cluster status, use the get command:

    aztk spark cluster get --id my_cluster

3. Submit a Spark job

When your cluster is ready, you can submit jobs from your local machine to run against the cluster. The output of the spark-submit will be streamed to your local console. Run this command from the cloned AZTK repo:

    # submit a Java application
    aztk spark cluster submit \
        --id my_cluster \
        --name my-java-job \
        --class org.apache.spark.examples.SparkPi \
        --executor-memory 20G \
        path/to/examples.jar 1000

    # submit a Python application
    aztk spark cluster submit \
        --id my_cluster \
        --name my-python-job \
        --executor-memory 20G \
        path/to/pi.py 1000

The aztk spark cluster submit command takes the same parameters as the standard spark-submit command, except that instead of specifying --master, AZTK requires that you specify your cluster --id and a unique job --name
The job name argument, --name, must be at least 3 characters long
- It can only contain alphanumeric characters and hyphens; underscores are not allowed
- It cannot contain uppercase letters
Each job you submit must have a unique name
Use the --no-wait option for your command to return immediately
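The job name rules above are easy to get wrong, so it can help to validate a name before submitting. This is a hedged sketch with hypothetical helper names, not AZTK code: it checks the documented rules, refuses names already present in a caller-supplied collection (AZTK itself is the authority on uniqueness), and builds the submit argument list without running it.

```python
import re

# Job names: lowercase letters, digits, and hyphens only, at least 3 characters.
JOB_NAME_PATTERN = re.compile(r"[a-z0-9-]{3,}")


def is_valid_job_name(name):
    """Return True if the name follows the documented job name rules."""
    return JOB_NAME_PATTERN.fullmatch(name) is not None


def build_submit_command(cluster_id, job_name, app_path, app_args=(),
                         spark_options=(), no_wait=False, existing_names=()):
    """Build the argument list for `aztk spark cluster submit` without running it."""
    if not is_valid_job_name(job_name):
        raise ValueError(f"invalid job name: {job_name!r}")
    if job_name in existing_names:
        raise ValueError(f"job name already used: {job_name!r}")
    command = ["aztk", "spark", "cluster", "submit",
               "--id", cluster_id, "--name", job_name]
    command.extend(spark_options)      # e.g. ["--executor-memory", "20G"]
    if no_wait:
        command.append("--no-wait")    # return immediately instead of streaming output
    command.append(app_path)
    command.extend(str(arg) for arg in app_args)
    return command


print(" ".join(build_submit_command(
    "my_cluster", "my-python-job", "path/to/pi.py", app_args=[1000],
    spark_options=["--executor-memory", "20G"])))
```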

Learn more about the spark submit command here

4. Log in and Interact with your Spark Cluster

Most users will want to work interactively with their Spark clusters. With the aztk spark cluster ssh command, you can SSH into the cluster's master node. This command also helps you port-forward your Spark Web UI and Spark Jobs UI to your local machine:

    aztk spark cluster ssh --id my_cluster --user spark

By default, we port-forward the Spark Web UI to localhost:8080, the Spark Jobs UI to localhost:4040, and the Spark History Server to localhost:18080.

You can configure these settings in the .aztk/ssh.yaml file.
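While an SSH session is open, a small script can tell you which of the default forwarded ports are reachable. This is only a sketch: the dictionary mirrors the defaults listed above, the helper names are hypothetical, and the `http://` scheme is an assumption. If you changed the ports in your ssh config, adjust the values accordingly.

```python
import socket

# Default local ports listed above; change these if your ssh config differs.
DEFAULT_FORWARDED_PORTS = {
    "Spark Web UI": 8080,
    "Spark Jobs UI": 4040,
    "Spark History Server": 18080,
}


def local_urls(ports=DEFAULT_FORWARDED_PORTS, host="localhost"):
    """Map each UI name to the local URL it should be reachable at."""
    return {name: f"http://{host}:{port}" for name, port in ports.items()}


def is_port_open(port, host="localhost", timeout=0.5):
    """Return True if something accepts TCP connections on host:port."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False


for name, url in local_urls().items():
    port = DEFAULT_FORWARDED_PORTS[name]
    status = "reachable" if is_port_open(port) else "not reachable"
    print(f"{name}: {url} ({status})")
```

A port reported as reachable only means some process is listening locally; it does not prove that the listener is the forwarded Spark UI.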

NOTE: When working interactively, you may want to use tools like Jupyter or RStudio-Server. To do so, you need to set up your cluster with the appropriate Docker image and plugin. See Plugins for more information.

5. Manage and Monitor your Spark Cluster

You can also see your clusters from the CLI:

    aztk spark cluster list

And get the state of any specified cluster:

    aztk spark cluster get --id <my_cluster_id>

Finally, you can delete any specified cluster:

    aztk spark cluster delete --id <my_cluster_id>
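If you manage several clusters, the same three commands can be scripted. The sketch below uses hypothetical helper names and only the subcommands and the --id flag shown above. It defaults to a dry run that prints each command, and it only calls the real CLI when `dry_run=False` and an `aztk` executable is found on the PATH.

```python
import shutil
import subprocess


def management_commands(cluster_ids):
    """List all clusters, then get the state of and delete each given cluster."""
    commands = [["aztk", "spark", "cluster", "list"]]
    for cluster_id in cluster_ids:
        commands.append(["aztk", "spark", "cluster", "get", "--id", cluster_id])
        commands.append(["aztk", "spark", "cluster", "delete", "--id", cluster_id])
    return commands


def run_all(commands, dry_run=True):
    """Print each command, or execute it when dry_run is False and aztk is installed."""
    for command in commands:
        if dry_run or shutil.which(command[0]) is None:
            print("would run:", " ".join(command))
        else:
            # check=True stops at the first command that exits with an error.
            subprocess.run(command, check=True)


run_all(management_commands(["my_cluster"]))
```

Keeping the dry run as the default is a deliberate choice here, since deleting a cluster cannot be undone.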

FAQs

Next Steps

You can find more documentation here

Latest release: 0.10.3 (2020-02-13), published Feb 14, 2020