I am trying to understand the effect of giving different `numSlices` to the `parallelize()` method in `SparkContext`. Given below is the syntax of the method:

```scala
def parallelize[T](seq: Seq[T], numSlices: Int = defaultParallelism)(implicit arg0: ClassTag[T]): RDD[T]
```

I ran spark-shell in local mode:

```
spark-shell --master local
```

My understanding is that `numSlices` decides the number of partitions of the resultant RDD (after calling `sc.parallelize()`). Consider the few examples below.
Case 1
```
scala> sc.parallelize(1 to 9, 1);
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> res0.partitions.size
res2: Int = 1
```

Case 2
```
scala> sc.parallelize(1 to 9, 2);
res3: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> res3.partitions.size
res4: Int = 2
```

Case 3
```
scala> sc.parallelize(1 to 9, 3);
res5: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22

scala> res3.partitions.size
res6: Int = 2
```

Case 4
```
scala> sc.parallelize(1 to 9, 4);
res7: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[3] at parallelize at <console>:22

scala> res3.partitions.size
res8: Int = 2
```

Question 1: In case 3 & case 4, I was expecting the partition size to be 3 & 4 respectively, but both cases have a partition size of only 2. What is the reason for this?
Question 2: In each case there is a number associated with `ParallelCollectionRDD[no]`, i.e. in Case 1 it is `ParallelCollectionRDD[0]`, in Case 2 it is `ParallelCollectionRDD[1]`, and so on. What exactly do those numbers signify?
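For intuition on how `numSlices` maps elements to partitions, here is a simplified sketch in plain Scala of the positional splitting that `ParallelCollectionRDD` performs (an assumption on my part, not Spark's actual source; the real logic lives in `ParallelCollectionRDD.slice`):

```scala
// Hypothetical sketch of parallelize's slicing: slice i receives the
// elements between positions i*n/numSlices and (i+1)*n/numSlices,
// so you always get exactly numSlices partitions.
def slice[T](seq: Seq[T], numSlices: Int): Seq[Seq[T]] = {
  val n = seq.length
  (0 until numSlices).map { i =>
    val start = (i * n.toLong / numSlices).toInt
    val end   = ((i + 1) * n.toLong / numSlices).toInt
    seq.slice(start, end)
  }
}

// 1 to 9 over 3 slices gives sizes 3, 3, 3; over 4 slices, 2, 2, 2, 3.
println(slice(1 to 9, 3).map(_.size))
println(slice(1 to 9, 4).map(_.size))
```

Under this sketch, `sc.parallelize(1 to 9, 3)` and `sc.parallelize(1 to 9, 4)` would produce 3 and 4 partitions respectively, which is why the results above looked surprising to me.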
- I would like to add that, when the number of partitions is not specified, the number of threads you specify in the Spark master configuration plays a role when Spark decides the number of slices for you. A reference: tutorialkart.com/apache-spark/spark-parallelize-example – arjun, Mar 25, 2018 at 7:46
1 Answer
Question 1: That's a typo on your part. You're calling `res3.partitions.size` instead of `res5` and `res7` respectively. When I do it with the correct number, it works as expected.
Question 2: That's the id of the RDD in the Spark context, used for keeping the graph straight. See what happens when I run the same command three times:
```
scala> sc.parallelize(1 to 9,1)
res0: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res1: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[1] at parallelize at <console>:22

scala> sc.parallelize(1 to 9,1)
res2: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[2] at parallelize at <console>:22
```

There are now three different RDDs with three different ids. We can run the following to check:
```
scala> (res0.id, res1.id, res2.id)
res3: (Int, Int, Int) = (0,1,2)
```
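The increasing ids can be pictured as a simple per-context counter: each RDD created in a context takes the next value. The sketch below is a simplified stand-in, not Spark's API; `MiniContext` and `newRddId` are illustrative names (Spark keeps an atomic counter inside `SparkContext` for this purpose):

```scala
import java.util.concurrent.atomic.AtomicInteger

// Simplified sketch (assumption, not Spark source): a context-level
// counter hands each newly created RDD the next id, so ids record
// creation order within that context.
class MiniContext {
  private val nextRddId = new AtomicInteger(0)
  def newRddId(): Int = nextRddId.getAndIncrement()
}

val ctx = new MiniContext
println(Seq(ctx.newRddId(), ctx.newRddId(), ctx.newRddId()))
```

This is why three identical `parallelize` calls above still produce `ParallelCollectionRDD[0]`, `[1]`, and `[2]`: the id tracks the RDD, not its contents.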

