Spark Core

Public Classes

- Main entry point for Spark functionality.
- A Resilient Distributed Dataset (RDD), the basic abstraction in Spark.
- A broadcast variable created with SparkContext.broadcast().
- A shared variable that can be accumulated, i.e., has a commutative and associative "add" operation.
- Helper object that defines how to accumulate values of a given type.
- Configuration for a Spark application.
- Resolves paths to files added through SparkContext.addFile().
- Flags for controlling the storage of an RDD.
- Contextual information about a task which can be read or mutated during execution.
- Wraps an RDD in a barrier stage, which forces Spark to launch tasks of this stage together.
- A TaskContext with extra contextual info and tooling for tasks in a barrier stage.
- Carries all task infos of a barrier task.
- Thread that is recommended to be used in PySpark when the pinned thread mode is enabled.
- Provides a utility method to determine Spark versions from a given input string.
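The classes above are easiest to see in combination. Below is a minimal sketch, not taken from the reference itself, of how the main entry points fit together; the application name, master URL, and sample data are illustrative:

```python
from pyspark import SparkConf, SparkContext

# Configuration for the application (SparkConf).
conf = SparkConf().setAppName("core-classes-sketch").setMaster("local[2]")

# Main entry point for Spark functionality (SparkContext).
sc = SparkContext(conf=conf)

# The basic abstraction: a Resilient Distributed Dataset.
rdd = sc.parallelize([1, 2, 3, 4])

# A read-only broadcast variable and an accumulable shared variable.
factor = sc.broadcast(10)
total = sc.accumulator(0)

def scale(x):
    total.add(x)              # commutative and associative "add"
    return x * factor.value   # read the broadcast value on executors

print(rdd.map(scale).collect())  # [10, 20, 30, 40]
print(total.value)               # 10; readable only on the driver
sc.stop()
```

Note that accumulator updates made inside a transformation such as map() can be applied more than once if a task is retried, so production code usually updates accumulators inside actions such as foreach().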
Spark Context APIs

- Create an Accumulator with the given initial value, using a given AccumulatorParam helper object to define how to add values of the data type if provided.
- Add an archive to be downloaded with this Spark job on every node.
- Add a file to be downloaded with this Spark job on every node.
- Add a tag to be assigned to all the jobs started by this thread.
- Add a .py or .zip dependency for all tasks to be executed on this SparkContext in the future.
- A unique identifier for the Spark application.
- Read a directory of binary files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI as a byte array.
- Load data from a flat binary file, assuming each record is a set of numbers with the specified numerical format (see ByteBuffer), and the number of bytes per record is constant.
- Broadcast a read-only variable to the cluster, returning a Broadcast object for reading it in distributed functions.
- Cancel all jobs that have been scheduled or are running.
- Cancel active jobs for the specified group.
- Cancel active jobs that have the specified tag.
- Clear the current thread's job tags.
- Default minimum number of partitions for Hadoop RDDs when not given by the user.
- Default level of parallelism to use when not given by the user (e.g. for reduce tasks).
- Dump the profile stats into the given directory path.
- Create an RDD that has no partitions or elements.
- Return the directory where RDDs are checkpointed.
- Return a copy of this SparkContext's configuration (a SparkConf).
- Get the tags that are currently set to be assigned to all the jobs started by this thread.
- Get a local property set in this thread, or None if it is missing.
- Get or instantiate a SparkContext and register it as a singleton object.
- Get a Java system property, such as java.home.
- Read an 'old' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
- Read an 'old' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
- Returns a list of archive paths that are added to resources.
- Returns a list of file paths that are added to resources.
- Read a 'new API' Hadoop InputFormat with arbitrary key and value class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
- Read a 'new API' Hadoop InputFormat with arbitrary key and value class, from an arbitrary Hadoop configuration, which is passed in as a Python dict.
- Distribute a local Python collection to form an RDD.
- Load an RDD previously saved using RDD.saveAsPickleFile().
- Create a new RDD of int containing elements from start to end (exclusive), increased by step every element.
- Return the resource information of this SparkContext.
- Remove a tag previously added to be assigned to all the jobs started by this thread.
- Executes the given partitionFunc on the specified set of partitions, returning the result as an array of elements.
- Read a Hadoop SequenceFile with arbitrary key and value Writable class from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
- Set the directory under which RDDs are going to be checkpointed.
- Set the behavior of job cancellation from jobs started in this thread.
- Set a human-readable description of the current job.
- Assigns a group ID to all the jobs started by this thread until the group ID is set to a different value or cleared.
- Set a local property that affects jobs submitted from this thread, such as the Spark fair scheduler pool.
- Control the log level.
- Set a Java system property, such as spark.executor.memory.
- Print the profile stats to stdout.
- Get SPARK_USER for the user who is running SparkContext.
- Return the epoch time when the SparkContext was started.
- Return a StatusTracker object for monitoring job and stage progress.
- Shut down the SparkContext.
- Read a text file from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI, and return it as an RDD of Strings.
- Return the URL of the SparkUI instance started by this SparkContext.
- Build the union of a list of RDDs.
- The version of Spark on which this application is running.
- Read a directory of text files from HDFS, a local file system (available on all nodes), or any Hadoop-supported file system URI.
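As a quick illustration of the context APIs listed above, here is a sketch (app name and data are illustrative) that creates RDDs from a local collection and a range, unions them, inspects a few properties, and groups the resulting jobs:

```python
from pyspark import SparkContext

sc = SparkContext("local[2]", "context-apis-sketch")

# Distribute a local Python collection and build a range RDD.
letters = sc.parallelize(["a", "b", "c"], 2)
evens = sc.range(0, 10, step=2)           # 0, 2, 4, 6, 8

# Build the union of a list of RDDs.
merged = sc.union([evens, sc.range(10, 12)])

# A few introspection properties from the list above.
print(sc.applicationId, sc.defaultParallelism, sc.version)

# Assign a group ID so these jobs can be described and cancelled together.
sc.setJobGroup("tour", "jobs started from the quick tour")
print(sorted(merged.collect()))           # [0, 2, 4, 6, 8, 10, 11]
sc.cancelJobGroup("tour")                 # a no-op once the jobs finish

sc.stop()
```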
RDD APIs

- Aggregate the elements of each partition, and then the results for all the partitions, using given combine functions and a neutral "zero value".
- Aggregate the values of each key, using given combine functions and a neutral "zero value".
- Marks the current stage as a barrier stage, where Spark must launch all tasks together.
- Persist this RDD with the default storage level (MEMORY_ONLY).
- Return the Cartesian product of this RDD and another one, that is, the RDD of all pairs of elements (a, b) where a is in self and b is in other.
- Mark this RDD for checkpointing.
- Removes an RDD's shuffles and its non-persisted ancestors.
- Return a new RDD that is reduced into numPartitions partitions.
- For each key k in self or other, return a resulting RDD that contains a tuple with the list of values for that key in self as well as in other.
- Return a list that contains all the elements in this RDD.
- Return the key-value pairs in this RDD to the master as a dictionary.
- Collect this RDD, running the collect job with the specified job group.
- Generic function to combine the elements for each key using a custom set of aggregation functions.
- The SparkContext that this RDD was created on.
- Return the number of elements in this RDD.
- Approximate version of count() that returns a potentially incomplete result within a timeout, even if not all tasks have finished.
- Return the approximate number of distinct elements in the RDD.
- Count the number of elements for each key, and return the result to the master as a dictionary.
- Return the count of each unique value in this RDD as a dictionary of (value, count) pairs.
- Return a new RDD containing the distinct elements in this RDD.
- Return a new RDD containing only the elements that satisfy a predicate.
- Return the first element in this RDD.
- Return a new RDD by first applying a function to all elements of this RDD, and then flattening the results.
- Pass each value in the key-value pair RDD through a flatMap function without changing the keys; this also retains the original RDD's partitioning.
- Aggregate the elements of each partition, and then the results for all the partitions, using a given associative function and a neutral "zero value".
- Merge the values for each key using an associative function "func" and a neutral "zeroValue" which may be added to the result an arbitrary number of times, and must not change the result (e.g., 0 for addition, or 1 for multiplication).
- Applies a function to all elements of this RDD.
- Applies a function to each partition of this RDD.
- Perform a full outer join of self and other.
- Gets the name of the file to which this RDD was checkpointed.
- Returns the number of partitions in this RDD.
- Get the ResourceProfile specified with this RDD, or None if it wasn't specified.
- Get the RDD's current storage level.
- Return an RDD created by coalescing all elements within each partition into a list.
- Return an RDD of grouped items.
- Group the values for each key in the RDD into a single sequence.
- Alias for cogroup but with support for multiple RDDs.
- Compute a histogram using the provided buckets.
- A unique ID for this RDD (within its SparkContext).
- Return the intersection of this RDD and another one.
- Return whether this RDD is checkpointed and materialized, either reliably or locally.
- Returns true if and only if the RDD contains no elements at all.
- Return whether this RDD is marked for local checkpointing.
- Return an RDD containing all pairs of elements with matching keys in self and other.
- Creates tuples of the elements in this RDD by applying f.
- Return an RDD with the keys of each tuple.
- Perform a left outer join of self and other.
- Mark this RDD for local checkpointing using Spark's existing caching layer.
- Return the list of values in the RDD for the given key.
- Return a new RDD by applying a function to each element of this RDD.
- Return a new RDD by applying a function to each partition of this RDD.
- Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition.
- Return a new RDD by applying a function to each partition of this RDD, while tracking the index of the original partition (deprecated variant of the previous entry).
- Pass each value in the key-value pair RDD through a map function without changing the keys; this also retains the original RDD's partitioning.
- Find the maximum item in this RDD.
- Compute the mean of this RDD's elements.
- Approximate operation to return the mean within a timeout or meet the confidence.
- Find the minimum item in this RDD.
- Return the name of this RDD.
- Return a copy of the RDD partitioned using the specified partitioner.
- Set this RDD's storage level to persist its values across operations after the first time it is computed.
- Return an RDD created by piping elements to a forked external process.
- Randomly splits this RDD with the provided weights.
- Reduces the elements of this RDD using the specified commutative and associative binary operator.
- Merge the values for each key using an associative and commutative reduce function.
- Merge the values for each key using an associative and commutative reduce function, but return the results immediately to the master as a dictionary.
- Return a new RDD that has exactly numPartitions partitions.
- Repartition the RDD according to the given partitioner and, within each resulting partition, sort records by their keys.
- Perform a right outer join of self and other.
- Return a sampled subset of this RDD.
- Return a subset of this RDD sampled by key (via stratified sampling).
- Compute the sample standard deviation of this RDD's elements (which corrects for bias in estimating the standard deviation by dividing by N-1 instead of N).
- Compute the sample variance of this RDD's elements (which corrects for bias in estimating the variance by dividing by N-1 instead of N).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the old Hadoop OutputFormat API (mapred package).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the new Hadoop OutputFormat API (mapreduce package).
- Save this RDD as a SequenceFile of serialized objects.
- Output a Python RDD of key-value pairs (of form RDD[(K, V)]) to any Hadoop file system, using the org.apache.hadoop.io.Writable types that we convert from the RDD's key and value types.
- Save this RDD as a text file, using string representations of elements.
- Assign a name to this RDD.
- Sorts this RDD by the given keyfunc.
- Sorts this RDD, which is assumed to consist of (key, value) pairs.
- Return a StatsCounter object that captures the mean, variance and count of the RDD's elements in one operation.
- Compute the standard deviation of this RDD's elements.
- Return each value in self that is not contained in other.
- Return each (key, value) pair in self that has no pair with matching key in other.
- Add up the elements in this RDD.
- Approximate operation to return the sum within a timeout or meet the confidence.
- Take the first num elements of the RDD.
- Get the N elements from an RDD ordered in ascending order or as specified by the optional key function.
- Return a fixed-size sampled subset of this RDD.
- A description of this RDD and its recursive dependencies for debugging.
- Return an iterator that contains all of the elements in this RDD.
- Get the top N elements from an RDD.
- Aggregates the elements of this RDD in a multi-level tree pattern.
- Reduces the elements of this RDD in a multi-level tree pattern.
- Return the union of this RDD and another one.
- Mark the RDD as non-persistent, and remove all blocks for it from memory and disk.
- Return an RDD with the values of each tuple.
- Compute the variance of this RDD's elements.
- Specify a ResourceProfile to use when computing this RDD.
- Zips this RDD with another one, returning key-value pairs with the first element in each RDD, second element in each RDD, etc.
- Zips this RDD with its element indices.
- Zips this RDD with generated unique Long ids.
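A short sketch, again with illustrative data, exercising some of the most common entries above: merging values per key, joining on matching keys, persisting across actions, and aggregating with a neutral "zero value":

```python
from pyspark import SparkContext, StorageLevel

sc = SparkContext("local[2]", "rdd-apis-sketch")

pairs = sc.parallelize([("a", 1), ("b", 1), ("a", 3)])

# Merge the values for each key with an associative, commutative function.
counts = pairs.reduceByKey(lambda x, y: x + y)     # ("a", 4), ("b", 1)

# Keep the result around across the two actions below.
counts.persist(StorageLevel.MEMORY_ONLY)

# Inner join on matching keys with another pair RDD.
labels = sc.parallelize([("a", "alpha"), ("b", "beta")])
print(sorted(counts.join(labels).collect()))
# [('a', (4, 'alpha')), ('b', (1, 'beta'))]

# Aggregate with a neutral "zero value": (sum, count) -> mean.
s, n = counts.values().aggregate(
    (0, 0),
    lambda acc, v: (acc[0] + v, acc[1] + 1),   # fold values into a partition
    lambda a, b: (a[0] + b[0], a[1] + b[1]),   # combine partition results
)
print(s / n)   # 2.5

counts.unpersist()
sc.stop()
```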
Broadcast and Accumulator

- Destroy all data and metadata related to this broadcast variable.
- Write a pickled representation of value to the open file or socket.
- Read a pickled representation of value from the open file or socket.
- Read the pickled representation of an object from the open file and return the reconstituted object hierarchy specified therein.
- Delete cached copies of this broadcast on the executors.
- Return the broadcasted value.
- Adds a term to this accumulator's value.
- Get the accumulator's value; only usable in the driver program.
- Add two values of the accumulator's data type, returning a new value; for efficiency, can also update value1 in place and return it.
- Provide a "zero value" for the type, compatible in dimensions with the provided value (e.g., a zero vector).
Management

- Return a thread target wrapper recommended for use in PySpark when the pinned thread mode is enabled.
- Does this configuration contain a given key?
- Get the configured value for some key, or return a default otherwise.
- Get all values as a list of key-value pairs.
- Set a configuration property.
- Set multiple parameters, passed as a list of key-value pairs.
- Set the application name.
- Set an environment variable to be passed to executors.
- Set a configuration property, if not already set.
- Set the master URL to connect to.
- Set the path where Spark is installed on worker nodes.
- Returns a printable version of the configuration, as a list of key=value pairs, one per line.
- Get the absolute path of a file added through SparkContext.addFile().
- Get the root directory that contains files added through SparkContext.addFile().
- How many times this task has been attempted.
- CPUs allocated to the task.
- Return the currently active TaskContext.
- Get a local property set upstream in the driver, or None if it is missing.
- The ID of the RDD partition that is computed by this task.
- Resources allocated to the task.
- The ID of the stage that this task belongs to.
- An ID that is unique to this task attempt (within the same SparkContext, no two task attempts will share the same attempt ID).
- Returns a new RDD by applying a function to each partition of the wrapped RDD, where tasks are launched together in a barrier stage.
- Returns a new RDD by applying a function to each partition of the wrapped RDD, while tracking the index of the original partition.
- This function blocks until all tasks in the same stage have reached this routine.
- How many times this task has been attempted.
- Sets a global barrier and waits until all tasks in this stage hit this barrier.
- CPUs allocated to the task.
- Return the currently active BarrierTaskContext.
- Get a local property set upstream in the driver, or None if it is missing.
- Returns BarrierTaskInfo for all tasks in this barrier stage, ordered by partition ID.
- The ID of the RDD partition that is computed by this task.
- Resources allocated to the task.
- The ID of the stage that this task belongs to.
- An ID that is unique to this task attempt (within the same SparkContext, no two task attempts will share the same attempt ID).
- Given a Spark version string, return the (major version number, minor version number).
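A sketch combining several of the management APIs above: inspecting a SparkConf, resolving a distributed file through SparkFiles, and reading TaskContext from inside a running task. The configuration values and file contents are illustrative:

```python
import os
import tempfile

from pyspark import SparkConf, SparkContext, SparkFiles, TaskContext

conf = (SparkConf()
        .setAppName("management-sketch")
        .setMaster("local[2]")
        .setIfMissing("spark.ui.enabled", "false"))
print(conf.contains("spark.app.name"))   # True
print(conf.toDebugString())              # key=value pairs, one per line

sc = SparkContext(conf=conf)

# Ship a file to every node, then resolve the distributed copy by name.
path = os.path.join(tempfile.mkdtemp(), "lookup.txt")
with open(path, "w") as f:
    f.write("42")
sc.addFile(path)
print(open(SparkFiles.get("lookup.txt")).read())   # "42"

# TaskContext is populated only inside a running task on an executor.
def describe(_):
    tc = TaskContext.get()
    yield (tc.stageId(), tc.partitionId(), tc.attemptNumber())

print(sc.parallelize(range(4), 2).mapPartitions(describe).collect())
sc.stop()
```

BarrierTaskContext follows the same pattern as TaskContext, but it is obtained via BarrierTaskContext.get() inside RDDBarrier.mapPartitions() (i.e., rdd.barrier().mapPartitions(...)), where all tasks of the stage are launched together.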