
I am trying to understand the effect of giving different numSlices to the parallelize() method in SparkContext. Given below is the syntax of the method:

def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]

I ran spark-shell in local mode:

spark-shell --master local

My understanding is, numSlices decides the number of partitions of the resultant RDD (after calling sc.parallelize()). Consider the examples below.

Case 1

scala> sc.parallelize(1 to 9, 1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> res0.partitions.size
res2: Int = 1

Case 2

scala> sc.parallelize(1 to 9, 2)
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> res3.partitions.size
res4: Int = 2

Case 3

scala> sc.parallelize(1 to 9, 3)
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22

scala> res3.partitions.size
res6: Int = 2

Case 4

scala> sc.parallelize(1 to 9, 4)
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22

scala> res3.partitions.size
res8: Int = 2

Question 1: In case 3 & case 4, I was expecting the partition size to be 3 & 4 respectively, but both cases have a partition size of only 2. What is the reason for this?

Question 2: In each case there is a number associated with ParallelCollectionRDD[no]. I.e. in case 1 it is ParallelCollectionRDD[0], in case 2 it is ParallelCollectionRDD[1], and so on. What exactly do those numbers signify?

asked Nov 18, 2015 at 19:24 by Raj
  • I would like to add that, when the number of partitions is not specified, the number of threads you specify in the Spark master configuration plays a role when Spark decides the number of slices for you. A reference: tutorialkart.com/apache-spark/spark-parallelize-example (Commented Mar 25, 2018 at 7:46)
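To see the default, you can inspect sc.defaultParallelism in the shell. The session below is a sketch (the exact value depends on your master setting and cores; with --master local it is typically 1, while --master local[4] would typically give 4):

```scala
scala> sc.defaultParallelism
res0: Int = 1
```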

1 Answer


Question 1: That's a typo on your part. You're calling res3.partitions.size instead of res5 and res7 respectively. When I do it with the correct numbers, it works as expected.
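For illustration, a session sketch with the corrected calls (the resN names on the result lines are assumptions; they depend on where your session is):

```scala
scala> res5.partitions.size
res9: Int = 3

scala> res7.partitions.size
res10: Int = 4
```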

Question 2: That's the id of the RDD in the Spark context, used for keeping the lineage graph straight. See what happens when I run the same command three times:

scala> sc.parallelize(1 to 9, 1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> sc.parallelize(1 to 9, 1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> sc.parallelize(1 to 9, 1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22

There are now three different RDDs with three different ids. We can run the following to check:

scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
answered Nov 18, 2015 at 19:44 by Matthew Gray

2 Comments

Thanks @Matthew Gray for the answer. So stupid on my part to make that typo :)
I'm glad I'm not the only one
