Core Spark functionality. org.apache.spark.SparkContext serves as the main entry point to Spark, while org.apache.spark.rdd.RDD is the data type representing a distributed collection, and provides most parallel operations.
In addition, org.apache.spark.rdd.PairRDDFunctions contains operations available only on RDDs of key-value pairs, such as groupByKey and join; org.apache.spark.rdd.DoubleRDDFunctions contains operations available only on RDDs of Doubles; and org.apache.spark.rdd.SequenceFileRDDFunctions contains operations available on RDDs that can be saved as SequenceFiles. These operations are automatically available on any RDD of the right type (e.g. RDD[(Int, Int)]) through implicit conversions.
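A minimal sketch of how these implicit conversions surface in practice, assuming a local Spark installation; the object and method names `RddFunctionsExample`, `sumByKey`, and `meanOfDoubles` are illustrative, while the Spark API calls are as documented above:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object RddFunctionsExample {
  // An RDD[(String, Int)] gains key-value operations such as reduceByKey,
  // groupByKey, and join from PairRDDFunctions via implicit conversion.
  def sumByKey(sc: SparkContext): Map[String, Int] = {
    val pairs = sc.parallelize(Seq(("a", 1), ("a", 2), ("b", 3)))
    pairs.reduceByKey(_ + _).collect().toMap
  }

  // mean comes from DoubleRDDFunctions, available because this is RDD[Double].
  def meanOfDoubles(sc: SparkContext): Double =
    sc.parallelize(Seq(1.0, 2.0, 3.0)).mean()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("rdd-functions").setMaster("local[2]"))
    try {
      println(sumByKey(sc))      // Map(a -> 3, b -> 3)
      println(meanOfDoubles(sc)) // 2.0
    } finally sc.stop()
  }
}
```

No extra import of the extension classes is needed: bringing an RDD of the right element type into scope is enough for the compiler to apply the conversions.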
Java programmers should reference the org.apache.spark.api.java package for Spark programming APIs in Java.
Classes and methods marked with Experimental are user-facing features which have not been officially adopted by the Spark project. These are subject to change or removal in minor releases.
Classes and methods marked with Developer API are intended for advanced users who want to extend Spark through lower-level interfaces. These are subject to change or removal in minor releases.
Provides several RDD implementations. See org.apache.spark.rdd.RDD.
The deterministic level of an RDD's output (i.e. what RDD#compute returns). This explains how the output can differ when Spark reruns the tasks for the RDD. There are 3 deterministic levels:

1. DETERMINATE: The RDD output is always the same data set in the same order after a rerun.
2. UNORDERED: The RDD output is always the same data set, but the order can be different after a rerun.
3. INDETERMINATE: The RDD output can be different after a rerun.
Note that the output of an RDD usually relies on its parent RDDs. When a parent RDD's output is INDETERMINATE, it is very likely that the RDD's output is also INDETERMINATE.
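The three levels can be illustrated with ordinary RDD operations; Spark's internal DeterministicLevel enum is not part of the public API, so the levels below are indicated only in comments. The object name `DeterminismExample` and its helper methods are illustrative:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DeterminismExample {
  // The source RDD is DETERMINATE: a rerun yields the same data in the
  // same order. The shuffled result is typically UNORDERED: the same data
  // set, but with no guaranteed order across reruns. Collecting into a Map
  // makes the comparison order-insensitive.
  def unorderedSums(sc: SparkContext): Map[Int, Int] = {
    val det = sc.parallelize(1 to 10, 2)           // DETERMINATE
    det.map(x => (x % 3, x)).reduceByKey(_ + _)    // UNORDERED after shuffle
      .collect().toMap
  }

  // INDETERMINATE: the output data itself can change on a rerun, e.g. when
  // the computation uses randomness.
  def indeterminateValues(sc: SparkContext): Array[Int] =
    sc.parallelize(1 to 10, 2).map(_ => scala.util.Random.nextInt()).collect()

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("determinism").setMaster("local[2]"))
    try {
      println(unorderedSums(sc)) // Map(0 -> 18, 1 -> 22, 2 -> 15)
    } finally sc.stop()
  }
}
```

This propagation rule matters for fault tolerance: if Spark must recompute a lost partition of an INDETERMINATE RDD, downstream results that already consumed the old output may no longer be consistent with the recomputed data.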