Sahara (Data Processing) UI User Guide

updated: 'Thu Jun 29 08:54:09 2017, commit 506f85b'

This guide assumes that you already have the sahara service and Horizon dashboard up and running. Don’t forget to make sure that sahara is registered in Keystone. If you require assistance with that, please see the installation guide.

The sections below give a panel-by-panel overview of setting up clusters and running jobs. For a description of using the guided cluster and job tools, look at Launching a cluster via the Cluster Creation Guide and Running a job via the Job Execution Guide.

Launching a cluster via the sahara UI¶

Registering an Image¶

Navigate to the “Project” dashboard, then the “Data Processing” tab, then click on the “Clusters” panel and finally the “Image Registry” tab.
From that page, click on the “Register Image” button at the top right
Choose the image that you’d like to register with sahara
Enter the username of the cloud-init user on the image
Choose the plugin and version to make the image available only for the intended clusters
Click the “Done” button to finish the registration

Create Node Group Templates¶

Navigate to the “Project” dashboard, then the “Data Processing” tab, then click on the “Clusters” panel and then the “Node Group Templates” tab.
From that page, click on the “Create Template” button at the top right
Choose your desired Plugin name and Version from the dropdowns and click “Next”
Give your Node Group Template a name (description is optional)
Choose a flavor for this template (based on your CPU/memory/disk needs)
Choose the storage location for your instance. This can be either “Ephemeral Drive” or “Cinder Volume”. If you choose “Cinder Volume”, you will need to add additional configuration
Switch to the Node processes tab and choose which processes should be run for all instances that are spawned from this Node Group Template
Click on the “Create” button to finish creating your Node Group Template

Create a Cluster Template¶

Navigate to the “Project” dashboard, then the “Data Processing” tab, then click on the “Clusters” panel and finally the “Cluster Templates” tab.
From that page, click on the “Create Template” button at the top right
Choose your desired Plugin name and Version from the dropdowns and click “Next”
Under the “Details” tab, you must give your template a name
Under the “Node Groups” tab, you should add one or more node groups, each based on a Node Group Template
- To do this, start by choosing a Node Group Template from the dropdown and click the “+” button
- You can adjust the number of nodes to be spawned for this node group via the text box or the “-” and “+” buttons
- Repeat these steps if you need nodes from additional node group templates
Optionally, you can adjust your configuration further by using the “General Parameters”, “HDFS Parameters” and “MapReduce Parameters” tabs
If you have the Designate DNS service, you can choose the domain name in the “DNS” tab for internal and external hostname resolution
Click on the “Create” button to finish creating your Cluster Template
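
The pieces of information collected by the Cluster Template form can be pictured as a small data structure. The sketch below is only an illustration of that shape, not sahara's own data model: the class names, field names and the example plugin, version and template names are all hypothetical.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class NodeGroup:
    template_name: str  # name of a Node Group Template (hypothetical field name)
    count: int = 1      # number of nodes to spawn for this node group


@dataclass
class ClusterTemplate:
    name: str
    plugin_name: str
    version: str
    node_groups: List[NodeGroup] = field(default_factory=list)

    def problems(self) -> List[str]:
        """List what is still missing, following the steps in this guide."""
        found = []
        if not self.name.strip():
            found.append("a template name is required")
        if not self.node_groups:
            found.append("add at least one node group")
        return found

    def total_nodes(self) -> int:
        """Total number of nodes across all node groups."""
        return sum(group.count for group in self.node_groups)


# Hypothetical values, chosen only for illustration.
template = ClusterTemplate(
    name="demo-template",
    plugin_name="example-plugin",
    version="example-version",
    node_groups=[NodeGroup("master", 1), NodeGroup("worker", 3)],
)
print(template.problems())     # []
print(template.total_nodes())  # 4
```

An empty `problems()` list corresponds to a form that has a name and at least one node group, which is what the steps above ask for.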

Launching a Cluster¶

Navigate to the “Project” dashboard, then the “Data Processing” tab, then click on the “Clusters” panel and lastly, click on the “Clusters” tab.
Click on the “Launch Cluster” button at the top right
Choose your desired Plugin name and Version from the dropdowns and click “Next”
Give your cluster a name (required)
Choose which cluster template should be used for your cluster
Choose the image that should be used for your cluster (if you do not see any options here, see Registering an Image above)
Optionally choose a keypair that can be used to authenticate to your cluster instances
Click on the “Create” button to start your cluster
- Your cluster’s status will display on the Clusters table
- It will likely take several minutes to reach the “Active” state

Scaling a Cluster¶

From the Data Processing/Clusters page (Clusters tab), click on the “Scale Cluster” button of the row that contains the cluster that you want to scale
You can adjust the numbers of instances for existing Node Group Templates
You can also add a new Node Group Template and choose a number of instances to launch
- This can be done by selecting your desired Node Group Template from the dropdown and clicking the “+” button
- Your new Node Group will appear below and you can adjust the number of instances via the text box or the “+” and “-” buttons
To confirm the scaling settings and trigger the spawning/deletion of instances, click on “Scale”
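
To make the effect of “Scale” concrete, the following sketch compares the current instance counts with the counts entered in the form and reports how many instances would be spawned or deleted per node group. It is a plain illustration of the arithmetic, not sahara code; the function name and the node group names are hypothetical.

```python
from typing import Dict


def scaling_plan(current: Dict[str, int], desired: Dict[str, int]) -> Dict[str, int]:
    """Return the per-node-group change in instance count.

    Positive numbers are instances to spawn, negative numbers are
    instances to delete. Groups whose count does not change are left out.
    """
    plan = {}
    for group, wanted in desired.items():
        if wanted < 0:
            raise ValueError(f"instance count for {group!r} cannot be negative")
        delta = wanted - current.get(group, 0)  # newly added groups start from 0
        if delta != 0:
            plan[group] = delta
    return plan


current = {"master": 1, "worker": 3}
desired = {"master": 1, "worker": 5, "extra-worker": 2}
print(scaling_plan(current, desired))  # {'worker': 2, 'extra-worker': 2}
```

Here "extra-worker" stands for a Node Group Template newly added through the dropdown, so all of its instances count as new.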

Elastic Data Processing (EDP)¶

Data Sources¶

Data Sources are where the input and output from your jobs are housed.

From the Data Processing/Jobs page (Data Sources tab), click on the “Create Data Source” button at the top right
Give your Data Source a name
Enter the URL of the Data Source
- For a swift object, enter <container>/<path> (e.g. mycontainer/inputfile). sahara will prepend swift:// for you
- For an HDFS object, enter an absolute path, a relative path or a full URL:
  - /my/absolute/path indicates an absolute path in the cluster HDFS
  - my/path indicates the path /user/hadoop/my/path in the cluster HDFS, assuming the defined HDFS user is hadoop
  - hdfs://host:port/path can be used to indicate any HDFS location
Enter the username and password for the Data Source (also see Additional Notes)
Enter an optional description
Click on “Create”
Repeat for additional Data Sources
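
The URL rules for swift and HDFS Data Sources can be summarised in a few lines of code. This is a minimal sketch of the rules as stated above, not the logic sahara itself runs; the function name and the `source_type` strings are assumptions made for the example.

```python
def resolve_data_source_url(source_type: str, url: str, hdfs_user: str = "hadoop") -> str:
    """Return the location that a Data Source URL field refers to."""
    if source_type == "swift":
        # <container>/<path> gets the swift:// prefix prepended.
        if url.startswith("swift://"):
            return url
        return "swift://" + url
    if source_type == "hdfs":
        # Full URLs and absolute paths are used as entered.
        if url.startswith("hdfs://") or url.startswith("/"):
            return url
        # Relative paths are resolved under the HDFS user's home directory.
        return f"/user/{hdfs_user}/{url}"
    raise ValueError(f"unsupported data source type: {source_type!r}")


print(resolve_data_source_url("swift", "mycontainer/inputfile"))  # swift://mycontainer/inputfile
print(resolve_data_source_url("hdfs", "/my/absolute/path"))       # /my/absolute/path
print(resolve_data_source_url("hdfs", "my/path"))                 # /user/hadoop/my/path
print(resolve_data_source_url("hdfs", "hdfs://host:port/path"))   # hdfs://host:port/path
```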

Job Binaries¶

Job Binaries are where you define/upload the source code (mains and libraries) for your job.

From the Data Processing/Jobs page (Job Binaries tab), click on the “Create Job Binary” button at the top right
Give your Job Binary a name (this can be different from the actual filename)
Choose the type of storage for your Job Binary
- For “swift”, enter the URL of your binary (<container>/<path>) as well as the username and password (also see Additional Notes)
- For “Internal database”, you can choose from “Create a script” or “Upload a new file”
Enter an optional description
Click on “Create”
Repeat for additional Job Binaries

Job Templates (Known as “Jobs” in the API)¶

Job templates are where you define the type of job you’d like to run as well as which “Job Binaries” are required.

From the Data Processing/Jobs page (Job Templates tab), click on the “Create Job Template” button at the top right
Give your Job Template a name
Choose the type of job you’d like to run
Choose the main binary from the dropdown
- This is required for Hive, Pig, and Spark jobs
- Other job types do not use a main binary
Enter an optional description for your Job Template
Click on the “Libs” tab and choose any libraries needed by your job template
- MapReduce and Java jobs require at least one library
- Other job types may optionally use libraries
Click on “Create”
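
The main binary and library requirements listed above can be expressed as a small check. The sketch below encodes only the two rules stated in this section; the function name and the way binaries are passed in are hypothetical, and the actual form may apply further validation.

```python
from typing import List, Optional, Sequence

# Job types named in this guide as requiring a main binary or a library.
MAIN_BINARY_REQUIRED = {"Hive", "Pig", "Spark"}
LIBRARY_REQUIRED = {"MapReduce", "Java"}


def check_job_template(job_type: str,
                       main_binary: Optional[str] = None,
                       libs: Sequence[str] = ()) -> List[str]:
    """Return a list of missing pieces for a job template of the given type."""
    missing = []
    if job_type in MAIN_BINARY_REQUIRED and not main_binary:
        missing.append(f"{job_type} jobs require a main binary")
    if job_type in LIBRARY_REQUIRED and not libs:
        missing.append(f"{job_type} jobs require at least one library")
    return missing


print(check_job_template("Pig", main_binary="example.pig"))  # []
print(check_job_template("Java"))  # ['Java jobs require at least one library']
```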

Jobs (Known as “Job Executions” in the API)¶

Jobs are what you get by “Launching” a job template. You can monitor the status of your job to see when it has completed its run.

From the Data Processing/Jobs page (Job Templates tab), find the row that contains the job template you want to launch and click either “Launch on New Cluster” or “Launch on Existing Cluster” on the right side of that row
Choose the cluster (already running; see Launching a Cluster above) on which you would like the job to run
Choose the Input and Output Data Sources (Data Sources defined above)
If additional configuration is required, click on the “Configure” tab
- Additional configuration properties can be defined by clicking on the “Add” button
- An example configuration entry might be mapred.mapper.class for the Name and org.apache.oozie.example.SampleMapper for the Value
Click on “Launch”. To monitor the status of your job, you can navigate to the Data Processing/Jobs panel and click on the Jobs tab.
You can relaunch a Job from the Jobs page by using the “Relaunch on New Cluster” or “Relaunch on Existing Cluster” links
- Relaunch on New Cluster will take you through the forms to start a new cluster before letting you specify input/output Data Sources and job configuration
- Relaunch on Existing Cluster will prompt you for input/output Data Sources as well as allow you to change job configuration before launching the job

Example Jobs¶

There are sample jobs located in the sahara-tests repository. In this section, we will give a walkthrough on how to run those jobs via the Horizon UI. These steps assume that you already have a cluster up and running (in the “Active” state). You may want to clone https://git.openstack.org/cgit/openstack/sahara-tests/ so that you will have all of the source code and inputs stored locally.

Sample Pig job - https://git.openstack.org/cgit/openstack/sahara-tests/tree/sahara_tests/scenario/defaults/edp-examples/edp-pig/cleanup-string/example.pig

Load the input data file from https://git.openstack.org/cgit/openstack/sahara-tests/tree/sahara_tests/scenario/defaults/edp-examples/edp-pig/cleanup-string/data/input into swift
- Click on Project/Object Store/Containers and create a container with any name (“samplecontainer” for our purposes here)
- Click on Upload Object and give the object a name (“piginput” in this case)
Navigate to Data Processing/Jobs/Data Sources, click on Create Data Source
- Name your Data Source (“pig-input-ds” in this sample)
- Type = Swift, URL = samplecontainer/piginput, fill in the Source username/password fields with your username/password and click “Create”
Create another Data Source to use as output for the job
- Name = pig-output-ds, Type = Swift, URL = samplecontainer/pigoutput, Source username/password, “Create”
Store your Job Binaries in the sahara database
- Navigate to Data Processing/Jobs/Job Binaries, click on Create Job Binary
- Name = example.pig, Storage type = Internal database, click Browse and find example.pig wherever you checked out the sahara-tests project: <sahara-tests root>/sahara_tests/scenario/defaults/edp-examples/edp-pig/cleanup-string/
- Create another Job Binary: Name = edp-pig-udf-stringcleaner.jar, Storage type = Internal database, click Browse and find edp-pig-udf-stringcleaner.jar wherever you checked out the sahara-tests project: <sahara-tests root>/sahara_tests/scenario/defaults/edp-examples/edp-pig/cleanup-string/
Create a Job Template
- Navigate to Data Processing/Jobs/Job Templates, click on Create Job Template
- Name = pigsample, Job Type = Pig, choose “example.pig” as the main binary
- Click on the “Libs” tab and choose “edp-pig-udf-stringcleaner.jar”, then hit the “Choose” button beneath the dropdown, then click on “Create”
Launch your job
- To launch your job from the Job Templates page, click on the down arrow at the far right of the screen and choose “Launch on Existing Cluster”
- For the input, choose “pig-input-ds”, for output choose “pig-output-ds”. Also choose whichever cluster you’d like to run the job on
- For this job, no additional configuration is necessary, so you can just click on “Launch”
- You will be taken to the “Jobs” page where you can see your job progress through “PENDING, RUNNING, SUCCEEDED” phases
- When your job finishes with “SUCCEEDED”, you can navigate back to Object Store/Containers and browse to the samplecontainer to see your output. It should be in the “pigoutput” folder
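
Watching a job move through its phases amounts to polling its status until it leaves the in-progress states. The sketch below shows that loop in isolation. It assumes a hypothetical `get_status` callable that returns the current status string; how that status is actually obtained (the Jobs page, an API client, and so on) is outside the scope of this example, and any status other than “PENDING” or “RUNNING” is simply treated as final.

```python
import time
from typing import Callable

# Statuses named in this guide as still in progress.
IN_PROGRESS = {"PENDING", "RUNNING"}


def wait_for_job(get_status: Callable[[], str],
                 poll_seconds: float = 5.0,
                 max_polls: int = 120) -> str:
    """Poll get_status() until the job leaves the in-progress states."""
    for _ in range(max_polls):
        status = get_status()
        if status not in IN_PROGRESS:
            return status
        time.sleep(poll_seconds)
    raise TimeoutError("job did not finish within the polling budget")


# Stand-in for a real status lookup: replays a fixed sequence of states.
fake_states = iter(["PENDING", "RUNNING", "RUNNING", "SUCCEEDED"])
final = wait_for_job(lambda: next(fake_states), poll_seconds=0)
print(final)  # SUCCEEDED
```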

Sample Spark job - https://git.openstack.org/cgit/openstack/sahara-tests/tree/sahara_tests/scenario/defaults/edp-examples/edp-spark
You can clone https://git.openstack.org/cgit/openstack/sahara-tests/ for quicker access to the files for this sample job.

Store the Job Binary in the sahara database
- Navigate to Data Processing/Jobs/Job Binaries, click on Create Job Binary
- Name = sparkexample.jar, Storage type = Internal database, browse to the location <sahara-tests root>/sahara_tests/scenario/defaults/edp-examples/edp-spark/ and choose spark-wordcount.jar, click “Create”
Create a Job Template
- Name = sparkexamplejob, Job Type = Spark, Main binary = choose sparkexample.jar, click “Create”
Launch your job
- To launch your job from the Job Templates page, click on the down arrow at the far right of the screen and choose “Launch on Existing Cluster”
- Choose whichever cluster you’d like to run the job on
- Click on the “Configure” tab
- Set the main class to be: sahara.edp.spark.SparkWordCount
- Under Arguments, click Add and enter the URL of the input file, then click Add once more and enter the URL of the output file
- Click on Launch
- You will be taken to the “Jobs” page where you can see your job progress through “PENDING, RUNNING, SUCCEEDED” phases
- When your job finishes with “SUCCEEDED”, you can see your results in your output file
- The stdout and stderr files of the command used for executing your job are located at /tmp/spark-edp/<name of job template>/<job id> on the Spark master node in case of Spark clusters, or on the Spark JobHistory node in other cases like Vanilla, CDH and so on
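
As a small aid for the Spark walkthrough, the sketch below collects the launch settings entered on the “Configure” tab and builds the directory where the stdout and stderr files are expected. The dictionary keys, function names, example URLs and job id are hypothetical; only the main class and the /tmp/spark-edp/<name of job template>/<job id> layout come from this guide.

```python
import posixpath
from typing import Dict, List, Union


def spark_wordcount_settings(input_url: str, output_url: str) -> Dict[str, Union[str, List[str]]]:
    """Collect the values entered on the Configure tab (key names are illustrative)."""
    return {
        "main_class": "sahara.edp.spark.SparkWordCount",
        "args": [input_url, output_url],  # input first, then output
    }


def spark_log_dir(job_template_name: str, job_id: str) -> str:
    """Directory holding the stdout and stderr files of the job command."""
    return posixpath.join("/tmp/spark-edp", job_template_name, job_id)


settings = spark_wordcount_settings("swift://samplecontainer/sparkinput",
                                    "swift://samplecontainer/sparkoutput")
print(settings["args"])
print(spark_log_dir("sparkexamplejob", "1234"))  # /tmp/spark-edp/sparkexamplejob/1234
```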

Additional Notes¶

Throughout the sahara UI, you will find that you cannot delete an object if another object depends on it. An example of this would be trying to delete a Job Template that has an existing Job. In order to be able to delete that Job Template, you would first need to delete any Jobs that relate to it.
In the examples above, we mention adding your username/password for the swift Data Sources. It should be noted that it is possible to configure sahara such that the username/password credentials are not required. For more information on that, please refer to the Sahara Advanced Configuration Guide.
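
The deletion rule can be pictured as a lookup of which objects still depend on the one being deleted. The sketch below is a simplified model of that idea, not how sahara tracks dependencies; the function name, the object labels and the dependency map are hypothetical.

```python
from typing import Dict, List


def blockers(obj: str, depends_on: Dict[str, List[str]]) -> List[str]:
    """Return the objects that depend on obj and must be deleted first."""
    return sorted(name for name, deps in depends_on.items() if obj in deps)


# Two Jobs launched from the same Job Template (illustrative labels).
depends_on = {
    "job:run-1": ["job-template:pigsample"],
    "job:run-2": ["job-template:pigsample"],
}

print(blockers("job-template:pigsample", depends_on))  # ['job:run-1', 'job:run-2']
print(blockers("job:run-1", depends_on))               # []
```

An empty result means nothing depends on the object, so under this model it could be deleted straight away.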

Launching a cluster via the Cluster Creation Guide¶

1. Under the Data Processing group, choose “Clusters” and then click on the “Clusters” tab. The “Cluster Creation Guide” button is above that table. Click on it.
2. Click on the “Choose Plugin” button, then select the cluster type from the Plugin Name dropdown and choose your target version. When done, click on “Select” to proceed.
3. Click on “Create a Master Node Group Template”. Give your template a name, choose a flavor and choose which processes should run on nodes launched for this node group. The processes chosen here should be things that are more server-like in nature (namenode, oozieserver, spark master, etc). Optionally, you can set other options here such as availability zone, storage, security and process-specific parameters. Click on “Create” to proceed.
4. Click on “Create a Worker Node Group Template”. Give your template a name, choose a flavor and choose which processes should run on nodes launched for this node group. Processes chosen here should be more worker-like in nature (datanode, spark slave, task tracker, etc). Optionally, you can set other options here such as availability zone, storage, security and process-specific parameters. Click on “Create” to proceed.
5. Click on “Create a Cluster Template”. Give your template a name. Next, click on the “Node Groups” tab and enter the count for each of the node groups (these are pre-populated from steps 3 and 4). It would be common to have 1 for the “master” node group type and some larger number of “worker” instances depending on your desired cluster size. Optionally, you can also set additional parameters for cluster-wide settings via the other tabs on this page. Click on “Create” to proceed.
6. Click on “Launch a Cluster”. Give your cluster a name and choose the image that you want to use for all instances in your cluster. The cluster template that you created in step 5 is already pre-populated. If you want ssh access to the instances of your cluster, select a keypair from the dropdown. Click on “Launch” to proceed. You will be taken to the Clusters panel where you can see your cluster progress toward the Active state.

Running a job via the Job Execution Guide¶

1. Under the Data Processing group, choose “Jobs” and then click on the “Jobs” tab. The “Job Execution Guide” button is above that table. Click on it.
2. Click on “Select type” and choose the type of job that you want to run.
3. If your job requires input/output data sources, you will have the option to create them via the “Create a Data Source” button (Note: This button will not be shown for job types that do not require data sources). Give your data source a name and choose the type. If you have chosen swift, you may also enter the username and password. Enter the URL for your data source. For more details on what the URL should look like, see Data Sources.
4. Click on “Create a job template”. Give your job template a name. Depending on the type of job that you’ve chosen, you may need to select your main binary and/or additional libraries (available from the “Libs” tab). If you have not yet uploaded the files to run your program, you can add them via the “+” icon next to the “Choose a main binary” select box.
5. Click on “Launch job”. Choose the active cluster where you want to run your job. Optionally, you can click on the “Configure” tab and provide any required configuration, arguments or parameters for your job. Click on “Launch” to execute your job. You will be taken to the Jobs tab where you can monitor the state of your job as it progresses.

Except where otherwise noted, this document is licensed under Creative Commons Attribution 3.0 License. See all OpenStack Legal Documents.
