Apache Spark Basics

Apache® Spark™ is a fast, general-purpose engine for large-scale data processing.

Every Spark application consists of a driver program that manages the execution of your application on a cluster. The workers on a Spark-enabled cluster are referred to as executors. The driver process runs the user code on these executors.

In a typical Spark application, your code establishes a SparkContext, creates a Resilient Distributed Dataset (RDD) from external data, and then executes methods known as transformations and actions on that RDD to arrive at the outcome of an analysis.

An RDD is the main programming abstraction in Spark and represents an immutable collection of elements partitioned across the nodes of a cluster that can be operated on in parallel. A Spark application can run locally on a single machine or on a cluster.

Spark is mainly written in Scala® and has APIs in other programming languages, including MATLAB®. The MATLAB API for Spark exposes the Spark programming model to MATLAB and enables MATLAB implementations of numerous Spark functions. Many of these MATLAB implementations of Spark functions accept function handles or anonymous functions as inputs to perform various types of analyses.

Running Against Spark

Running against Spark means executing an application against a Spark-enabled cluster using a supported cluster manager. A cluster can be local or on a network. You can run against Spark in two ways:

Execute commands in an interactive shell that is connected to Spark.
Create and execute a standalone application against a Spark cluster.

When using an interactive shell, Spark allows you to interact with data that is distributed on disk or in memory across many machines and perform ad hoc analysis. Spark takes care of the underlying distribution of work across various machines. Interactive shells are only available in Python® and Scala.

The MATLAB API for Spark in MATLAB Compiler™ provides an interactive shell similar to a Spark shell that allows you to debug your application prior to deploying it. The interactive shell only runs against a local cluster.

When creating and executing standalone applications against Spark, applications are first packaged or compiled as standalone applications before being executed against a Spark-enabled cluster. You can author standalone applications in Scala, Java®, Python, and MATLAB.

The MATLAB API for Spark in MATLAB Compiler lets you create standalone applications that can run against Spark.

Cluster Managers Supported by Spark

Local

A local cluster manager represents a pseudo-cluster and works in a nondistributed mode on a single machine. You can configure it to use one worker thread, or on a multicore machine, multiple worker threads. In applications, it is denoted by the word local.

Note

The MATLAB API for Spark, which allows you to interactively debug your applications, works only with a local cluster manager.

Standalone

A Standalone cluster manager ships with Spark. It consists of a master and multiple workers. To use a Standalone cluster manager, place a compiled version of Spark on each cluster node. A Standalone cluster manager can be started using scripts provided by Spark. In applications, it is denoted as: spark://host:port. The default port number is 7077.

Note

The Standalone cluster manager that ships with Spark is not to be confused with the standalone application that can run against Spark. MATLAB Compiler does not support the Standalone cluster manager.

YARN

A YARN cluster manager was introduced in Hadoop® 2.0. It is typically installed on the same nodes as HDFS™. Therefore, running Spark on YARN lets Spark access HDFS data easily. In applications, it is denoted using the term yarn. Two modes are available when starting applications on YARN:

In yarn-client mode, the driver runs in the client process, and the application master is used only for requesting resources from YARN.
In yarn-cluster mode, the Spark driver runs inside an application master process that is managed by YARN on the cluster, and the client can retire after initiating the application.

Note

MATLAB Compiler supports the YARN cluster manager only in yarn-client mode.
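
The difference between the two YARN modes can be summarized as a small lookup table. The following Python sketch is only an illustration of the description above; the names `YARN_MODES` and `describe_yarn_mode` are hypothetical and are not part of Spark, YARN, or MATLAB Compiler.

```python
# Hypothetical summary table built from the descriptions above; not a Spark or YARN API.
YARN_MODES = {
    "yarn-client": {
        "driver_runs_in": "client process",
        "client_can_exit_after_submit": False,
        "supported_by_matlab_compiler": True,
    },
    "yarn-cluster": {
        "driver_runs_in": "application master on the cluster",
        "client_can_exit_after_submit": True,
        "supported_by_matlab_compiler": False,
    },
}


def describe_yarn_mode(mode):
    """Return the summary entry for a YARN mode name, or raise ValueError."""
    try:
        return YARN_MODES[mode]
    except KeyError:
        raise ValueError(f"unknown YARN mode: {mode!r}") from None
```

For example, `describe_yarn_mode("yarn-client")["driver_runs_in"]` returns `"client process"`, which is why the client must stay alive for the lifetime of the application in that mode.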

Mesos

A Mesos cluster manager is an open-source cluster manager developed by Apache. In applications, it is usually denoted as: mesos://host:port. The default port number is 5050.

Note

MATLAB Compiler does not support a Mesos cluster manager.

You can use the following table to see which MATLAB Compiler deployment option is supported by each cluster manager.

Deploy Against Spark Option	Local Cluster (`local`)	Hadoop Cluster (`yarn-client`)
Deploy standalone applications containing tall arrays.	Not supported.	Supported.
Deploy standalone applications created using the MATLAB API for Spark.	Supported.	Supported.
Interactively debug your applications using the MATLAB API for Spark.	Supported.	Not supported.
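
The master strings described in this section (local, spark://host:port, yarn, and mesos://host:port) can be told apart with a few lines of code. The Python sketch below is a minimal illustration, not Spark's own parsing logic; `classify_master` and `DEFAULT_PORTS` are hypothetical names. It also accepts the bracketed `local[N]` and `local[*]` forms, which Spark uses to set the number of worker threads but which are not spelled out in the text above.

```python
import re

# Default ports quoted in the text for each URL scheme.
DEFAULT_PORTS = {"spark": 7077, "mesos": 5050}


def classify_master(master):
    """Return a dict naming the cluster manager that a master string refers to."""
    # "local", or "local[N]" / "local[*]" (assumed thread-count forms).
    if master == "local" or re.fullmatch(r"local\[(\d+|\*)\]", master):
        return {"manager": "Local"}
    if master in ("yarn", "yarn-client", "yarn-cluster"):
        return {"manager": "YARN"}
    match = re.fullmatch(r"(spark|mesos)://([^:/]+)(?::(\d+))?", master)
    if match:
        scheme, host, port = match.groups()
        return {
            "manager": "Standalone" if scheme == "spark" else "Mesos",
            "host": host,
            # Fall back to the documented default when no port is given.
            "port": int(port) if port else DEFAULT_PORTS[scheme],
        }
    raise ValueError(f"unrecognized master string: {master!r}")
```

A call such as `classify_master("mesos://node1")` fills in the default port 5050, while an unrecognized string raises `ValueError` rather than guessing.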

Relationship Between Spark and Hadoop

The relationship between Spark and Hadoop comes into play only if you want to run Spark on a cluster that has Hadoop installed. Otherwise, you do not need Hadoop to run Spark.

To run Spark on a cluster you need a shared file system. A Hadoop cluster provides access to a distributed file system via HDFS and a cluster manager in the form of YARN. Spark can use YARN as a cluster manager for distributing work and use HDFS to access data. Also, some Spark applications can use Hadoop’s MapReduce programming model, but MapReduce does not constitute the core programming model in Spark.

Hadoop is not required to run Spark on a cluster. You can also use other options such as Mesos.

Note

The deployment options in MATLAB Compiler currently support deploying only against a Spark-enabled Hadoop cluster.

Driver

Every Spark application consists of a driver program that initiates various operations on a cluster. The driver is a process in which the main() method of a program runs. The driver process runs user code that creates a SparkContext, creates RDDs, and performs transformations and actions. When a Spark driver executes, it performs two duties:

Convert a user program into tasks.
The Spark driver application is responsible for converting a user program into units of physical execution called tasks. Tasks are the smallest unit of work in Spark.
Schedule tasks on executors.
The Spark driver tries to schedule each task in an appropriate location, based on data placement. It also tracks the location of cached data, and uses it to schedule future tasks that access that data.

Once the driver terminates, the application is finished.
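
The second duty, scheduling tasks based on data placement, can be pictured with a toy scheduler. The Python sketch below is a simplified model under stated assumptions: one task per partition, each partition living on exactly one node, and ties broken by current load. The function `schedule_tasks` and its arguments are hypothetical and do not reflect Spark's actual scheduler.

```python
def schedule_tasks(partition_locations, executors):
    """Assign one task per partition, preferring an executor on the node holding the data.

    partition_locations: dict mapping partition id -> node that holds the partition
    executors: dict mapping executor id -> node the executor runs on
    """
    # Index executors by node so data-local candidates are easy to find.
    by_node = {}
    for executor_id, node in executors.items():
        by_node.setdefault(node, []).append(executor_id)

    load = {executor_id: 0 for executor_id in executors}
    assignment = {}
    for partition_id in sorted(partition_locations):
        node = partition_locations[partition_id]
        # Prefer executors co-located with the data; otherwise consider all of them.
        candidates = by_node.get(node) or list(executors)
        # Among the candidates, pick the least-loaded one (executor id breaks ties).
        chosen = min(candidates, key=lambda e: (load[e], e))
        assignment[partition_id] = chosen
        load[chosen] += 1
    return assignment
```

In this model a partition stored on a node without an executor falls back to the least-loaded executor anywhere in the cluster.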

Note

When using the MATLAB API for Spark in MATLAB Compiler, MATLAB application code becomes the Spark driver program.

Executor

A Spark executor is a worker process responsible for running the individual tasks in a given Spark job. Executors are started at the beginning of a Spark application and persist for the entire lifetime of an application. Executors perform two roles:

Run the tasks that make up the application, and return the results to the driver.
Provide in-memory storage for RDDs that are cached by user programs.
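
The two executor roles listed above can be modeled with a very small class. This Python sketch is only a mental model; `ToyExecutor` and its methods are hypothetical names and bear no relation to Spark's executor implementation.

```python
class ToyExecutor:
    """In-memory stand-in for an executor; not the Spark API."""

    def __init__(self, name):
        self.name = name
        self._cache = {}  # partition id -> cached elements

    def cache(self, partition_id, elements):
        """Role 2: keep a partition in memory for later tasks."""
        self._cache[partition_id] = list(elements)

    def has_cached(self, partition_id):
        return partition_id in self._cache

    def run_task(self, func, partition_id):
        """Role 1: run a task on a cached partition and return the result to the caller."""
        return func(self._cache[partition_id])
```

Here the caller of `run_task` plays the part of the driver, which receives the result of each task.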

RDD

A Resilient Distributed Dataset or RDD is a programming abstraction in Spark. It represents a collection of elements distributed across many nodes that can be operated on in parallel. RDDs are designed to be fault-tolerant. You can create RDDs in two ways:

By loading an external dataset.
By parallelizing a collection of objects in the driver program.

After creation, you can perform two types of operations using RDDs: transformations and actions.
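
The two ways of creating an RDD both end with the data split into partitions. The Python sketch below imitates that idea with plain tuples; `parallelize` and `load_text` are hypothetical helper names chosen for this illustration, and the even, contiguous split is an assumption rather than a description of how Spark partitions data.

```python
import io


def parallelize(items, num_partitions):
    """Split an in-memory collection into contiguous, near-equal partitions."""
    if num_partitions < 1:
        raise ValueError("num_partitions must be at least 1")
    items = list(items)
    base, extra = divmod(len(items), num_partitions)
    partitions, start = [], 0
    for index in range(num_partitions):
        # The first `extra` partitions each take one additional element.
        size = base + (1 if index < extra else 0)
        partitions.append(tuple(items[start:start + size]))
        start += size
    return tuple(partitions)


def load_text(stream, num_partitions):
    """Read lines from an external source and partition them the same way."""
    lines = [line.rstrip("\n") for line in stream]
    return parallelize(lines, num_partitions)


from_collection = parallelize(range(7), 3)
from_external = load_text(io.StringIO("a\nb\nc\n"), 2)
```

Tuples are used so that the partitions cannot be modified after creation, mirroring the immutability of an RDD.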

Transformations

Transformations are operations on an existing RDD that return a new RDD. Many, but not all, transformations are element-wise operations.

Actions

Actions compute a final result based on an RDD and either return that result to the driver program or save it to an external storage system such as HDFS.

Distinguishing Between Transformations and Actions

Check the return data type. Transformations return RDDs, whereas actions return other data types.
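
The return-type rule can be demonstrated with a toy class. The Python sketch below is not the Spark API: `ToyRDD` and `is_transformation` are hypothetical names, the data lives in a single in-memory tuple, and every operation runs immediately for simplicity.

```python
class ToyRDD:
    """A tiny in-memory stand-in for an RDD; not the Spark API."""

    def __init__(self, elements):
        self._elements = tuple(elements)  # immutable snapshot of the data

    # Transformations: each returns a new ToyRDD and leaves the original untouched.
    def map(self, func):
        return ToyRDD(func(x) for x in self._elements)

    def filter(self, predicate):
        return ToyRDD(x for x in self._elements if predicate(x))

    # Actions: each returns a value that is not a ToyRDD.
    def count(self):
        return len(self._elements)

    def collect(self):
        return list(self._elements)


def is_transformation(result):
    """Apply the rule from the text: a transformation's result is itself an RDD."""
    return isinstance(result, ToyRDD)


numbers = ToyRDD([1, 2, 3, 4])
squares = numbers.map(lambda x: x * x)            # transformation
large = squares.filter(lambda x: x > 4).collect()  # transformation, then action
```

Checking `is_transformation(squares)` gives `True`, whereas `is_transformation(squares.count())` gives `False` because `count` returns an integer.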

SparkConf

SparkConf stores the configuration parameters of the application being deployed to Spark. Every application must be configured prior to being deployed on a Spark cluster. Some of the configuration parameters define properties of the application and some are used by Spark to allocate resources on the cluster. The configuration parameters are passed on to a Spark cluster through a SparkContext.
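
The distinction between application properties and resource-allocation parameters can be sketched with a plain dictionary. In the Python example below, the property keys are standard Spark configuration names, but the helper `split_conf` and the prefix-based grouping are assumptions made for illustration; Spark itself does not divide its configuration this way.

```python
# Assumed grouping: treat executor and driver settings as resource-related.
RESOURCE_PREFIXES = ("spark.executor.", "spark.driver.")


def split_conf(conf):
    """Split configuration parameters into (application, resources) dicts."""
    application, resources = {}, {}
    for key, value in conf.items():
        target = resources if key.startswith(RESOURCE_PREFIXES) else application
        target[key] = value
    return application, resources


conf = {
    "spark.app.name": "demo",
    "spark.master": "local",
    "spark.executor.memory": "2g",
    "spark.executor.cores": "2",
}
application, resources = split_conf(conf)
```

With this input, the application name and master end up in `application`, and the executor memory and core settings end up in `resources`.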

SparkContext

A SparkContext represents a connection to a Spark cluster. It is the entry point to Spark and sets up the internal services necessary to establish a connection to the Spark execution environment.
