I am trying to understand the effect of giving different `numSlices` to the `parallelize()` method in `SparkContext`. Given below is the syntax of the method:

```scala
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
```

I ran spark-shell in local mode:

```
spark-shell --master local
```

My understanding is that `numSlices` decides the number of partitions of the resultant RDD (after calling `sc.parallelize()`). Consider the few examples below.
Case 1
```
scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> res0.partitions.size
res2: Int = 1
```

Case 2
```
scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> res3.partitions.size
res4: Int = 2
```

Case 3
```
scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22

scala> res3.partitions.size
res6: Int = 2
```

Case 4
```
scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22

scala> res3.partitions.size
res8: Int = 2
```

Question 1: In case 3 & case 4, I was expecting the partition size to be 3 & 4 respectively, but both cases have a partition size of only 2. What is the reason for this?
Question 2: In each case there is a number associated with `ParallelCollectionRDD[no]`, i.e. in Case 1 it is `ParallelCollectionRDD[0]`, in Case 2 it is `ParallelCollectionRDD[1]`, and so on. What exactly do those numbers signify?
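For intuition on how `numSlices` maps elements to partitions, here is a simplified sketch in plain Scala of the positional splitting that `ParallelCollectionRDD` performs (an assumption on my part, not Spark's actual source; the real logic lives in `ParallelCollectionRDD.slice`):

```scala
// Hypothetical sketch of parallelize's slicing: slice i receives the
// elements between positions i*n/numSlices and (i+1)*n/numSlices,
// so you always get exactly numSlices partitions.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  val n = seq.length
  (0 until numSlices).map { i =>
    val start = (i * n.toLong / numSlices).toInt
    val end   = ((i + 1) * n.toLong / numSlices).toInt
    seq.slice(start, end)
  }
}

// 1 to 9 over 3 slices gives sizes 3, 3, 3; over 4 slices, 2, 2, 2, 3.
println(slice(1 to 9, 3).map(_.size))
println(slice(1 to 9, 4).map(_.size))
```

Under this sketch, `sc.parallelize(1 to 9, 3)` and `sc.parallelize(1 to 9, 4)` would produce 3 and 4 partitions respectively, which is why the results above looked surprising to me.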
- I would like to add that, when the number of partitions is not specified, the number of threads you specify in the Spark master configuration plays a role when Spark decides the number of slices for you. A reference: tutorialkart.com/apache-spark/spark-parallelize-example – arjun, Mar 25, 2018 at 7:46
1 Answer
Question 1: That's a typo on your part. You're calling `res3.partitions.size` instead of `res5` and `res7` respectively. When I do it with the correct number, it works as expected.
Question 2: That's the id of the RDD in the Spark context, used for keeping the graph straight. See what happens when I run the same command three times:
```
scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
```

There are now three different RDDs with three different ids. We can run the following to check:
```
scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
```
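The increasing ids can be pictured as a simple per-context counter: each RDD created in a context takes the next value. The sketch below is a simplified stand-in, not Spark's API; `MiniContext` and `newRddId` are illustrative names (Spark keeps an atomic counter inside `SparkContext` for this purpose):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Simplified sketch (assumption, not Spark source): a context-level
// counter hands each newly created RDD the next id, so ids record
// creation order within that context.
class MiniContext {
  private val nextRddId = new AtomicInteger(0)
  def newRddId(): Int = nextRddId.getAndIncrement()
}

val ctx = new MiniContext
println(Seq(ctx.newRddId(), ctx.newRddId(), ctx.newRddId()))
```

This is why three identical `parallelize` calls above still produce `ParallelCollectionRDD[0]`, `[1]`, and `[2]`: the id tracks the RDD, not its contents.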

