Saloni Goyal

Introduction to Apache Spark

MapReduce and Spark are both used for large-scale data processing. However, MapReduce has some shortcomings that make Spark more useful in a number of scenarios.


Shortcomings of MapReduce

  1. Every workflow has to go through a map and a reduce phase: it can't accommodate a join, a filter, or a more complicated workflow like map-reduce-map.
  2. MapReduce relies heavily on reading data from disk: this is a performance bottleneck, and especially bad for iterative algorithms that cycle through the data several times.
  3. Only a native Java programming interface is available: Python can be used as well, but it makes the implementation complex and is not very efficient for floating-point data.
  4. Not that easy to program and requires a lot of hand coding.

Solution — Apache Spark

  • A new framework: not a complete replacement for the Hadoop stack, just a replacement for Hadoop MapReduce and more
  • Capable of using the Hadoop ecosystem, e.g., HDFS and YARN


Solutions by Spark

  1. Spark provides over 20 highly efficient distributed operations: these can be used in combination (see the sketch after this list)
  2. Users can choose to cache data in memory: this increases performance for iterative algorithms
  3. Polyglot: native Java, Python, Scala and R interfaces, along with an interactive shell (test and explore data interactively)
  4. Easy to program and does not require that much hand coding
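A minimal PySpark sketch of points 1 and 2, assuming a hypothetical log file at `hdfs:///data/access.log`: several distributed operations chained into one workflow, with an explicit `cache()` so later passes read from memory instead of disk.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("chained-ops-demo").getOrCreate()
sc = spark.sparkContext

# Hypothetical input file; replace with a real path.
lines = sc.textFile("hdfs:///data/access.log")

# filter -> map -> reduceByKey chained freely in one workflow.
error_counts = (lines
                .filter(lambda line: "ERROR" in line)
                .map(lambda line: (line.split()[0], 1))
                .reduceByKey(lambda a, b: a + b))

# Keep the result in memory so an iterative algorithm can reuse it
# without re-reading from disk.
error_counts.cache()

print(error_counts.take(5))
spark.stop()
```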

100 TB Benchmark

In the 2014 Daytona GraySort benchmark, Spark sorted 100 TB of data three times faster than Hadoop MapReduce while using ten times fewer machines.

Spark Architecture

Apache Spark has a well-defined layered architecture in which all the Spark components and layers are loosely coupled.

Worker Node

  • General JVM executor that executes Spark workflows
  • Core on which all computation is done
  • Interfaces with the rest of the Hadoop ecosystem, e.g., HDFS

Cluster Manager

  • System to manage provisioning and starting of worker nodes
  • Cluster manager interfaces supported by Spark:
      • YARN (the Hadoop cluster manager): the same cluster can be used with both Hadoop MapReduce and Spark.
      • Standalone: a special Spark process that takes care of starting nodes at the beginning of a computation and restarts them on failure.

Driver Program

  • Interfaces with the cluster
  • Has a JVM with a Spark context: the gateway for us to connect to our Spark instance and submit jobs (see the sketch after this list)
  • Jobs can be submitted in two modes:
      • Batch mode: send a program for execution and wait for the result
      • Streaming mode: use the Spark shell and interact with the data in real time
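A minimal sketch of the driver-side gateway, with the app name and master URL as assumptions. The same calls work in batch mode (submitted as a script via `spark-submit`) or interactively in the `pyspark` shell.

```python
from pyspark.sql import SparkSession

# The Spark context lives inside the driver JVM and is our gateway
# to the cluster ("driver-demo" and the master URL are assumptions).
spark = (SparkSession.builder
         .appName("driver-demo")
         .master("local[*]")  # or "yarn" to submit to a YARN cluster
         .getOrCreate())
sc = spark.sparkContext

# Batch mode: run this file with `spark-submit driver_demo.py` and wait.
# Interactive mode: type the same calls into the pyspark shell.
print(sc.parallelize(range(10)).sum())
spark.stop()
```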


Cloudera VM Setup vs Amazon EMR

Cloudera VM Setup

  • Uses Spark in standalone mode
  • Everything runs locally on one machine
  • The worker node (executor JVM), the Spark process and the driver program are all on the same machine

Amazon EMR

  • Supports Spark natively
  • Web interface to configure the number and type of instances, memory required, etc.
  • Amazon EMR automatically runs YARN to spawn instances and prepares them to run Spark.
  • Executor JVMs run on EC2 instances
  • The driver program and YARN run on the master node

Resilient Distributed Datasets

Dataset

  • Created from data stored in HDFS, S3, HBase, JSON, text files, etc.: once Spark reads the data, it can be referenced as an RDD
  • Or created by transforming another RDD: RDDs are immutable (see the sketch after this list)
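A minimal sketch of both creation paths, with file paths and the app name as assumptions. Because RDDs are immutable, the transformation returns a new RDD rather than modifying the original.

```python
from pyspark import SparkContext

sc = SparkContext(appName="rdd-demo")  # app name is an assumption

# 1. From storage (HDFS, S3, text files, ...); path is hypothetical.
words = sc.textFile("hdfs:///data/words.txt")

# 2. From another RDD: map() returns a brand-new RDD because the
#    source RDD is immutable.
upper = words.map(lambda w: w.upper())

print(words.id(), upper.id())  # two distinct RDDs
sc.stop()
```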

Distributed

  • Distributed across a cluster of machines
  • Data is divided into partitions, and partitions are divided across machines

Resilient

  • Spark tracks the history of each partition
  • Recovers from errors such as node failures and slow processes (see the sketch after this list)
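A minimal sketch of partitioning and lineage, with the partition count and app name as assumptions. `toDebugString()` prints the history Spark keeps so that lost partitions can be recomputed after a failure.

```python
from pyspark import SparkContext

sc = SparkContext(appName="resilience-demo")  # app name is an assumption

# Ask for 8 partitions; Spark spreads them across the cluster's machines.
rdd = sc.parallelize(range(1_000_000), numSlices=8)
print(rdd.getNumPartitions())  # -> 8

derived = rdd.map(lambda x: x * 2).filter(lambda x: x % 3 == 0)
# The lineage Spark replays to rebuild a lost partition:
print(derived.toDebugString().decode())
sc.stop()
```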

Spark Transformations

RDDs are immutable, but we can transform one RDD into another. Spark transformations are lazy: execution does not start until an action is triggered, as the sketch below shows.
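A minimal sketch of lazy evaluation, with the file path and app name as assumptions. The transformations only build up a plan; nothing touches the data until the `count()` action fires.

```python
from pyspark import SparkContext

sc = SparkContext(appName="lazy-demo")  # app name is an assumption

lines = sc.textFile("hdfs:///data/events.txt")    # no I/O happens yet
errors = lines.filter(lambda l: "ERROR" in l)     # still no I/O
pairs = errors.map(lambda l: (l.split()[0], 1))   # still no I/O

print(pairs.count())  # action: the whole pipeline executes now
sc.stop()
```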

Narrow transformations


  • Like map and filter
  • Do not imply transferring data over the network
  • Performance depends on memory and CPU

Wide transformations


  • For example, groupByKey transfers data with the same key to the same partition: a shuffle operation across the network (see the sketch after this list)
  • Performance also depends on the interconnection speed between nodes
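A minimal sketch contrasting the two kinds of transformation, with made-up data and an assumed app name: `map` works partition-by-partition with no network traffic, while `groupByKey` shuffles records with the same key across the cluster.

```python
from pyspark import SparkContext

sc = SparkContext(appName="shuffle-demo")  # app name is an assumption

pairs = sc.parallelize([("a", 1), ("b", 2), ("a", 3), ("b", 4)], 4)

# Narrow: each output partition depends on one input partition.
scaled = pairs.map(lambda kv: (kv[0], kv[1] * 10))

# Wide: records with the same key must meet in the same partition,
# which forces a shuffle across the network.
grouped = pairs.groupByKey()

print(sorted((k, sorted(v)) for k, v in grouped.collect()))
sc.stop()
```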

For further reading: Apache Spark Tutorial: Get Started With Serving ML Models With Spark

Source: Introduction to Apache Spark
