pyspark.RDD.aggregate
- RDD.aggregate(zeroValue, seqOp, combOp)
Aggregate the elements of each partition, and then the results for all the partitions, using the given combine functions and a neutral "zero value."
The function op(t1, t2) is allowed to modify t1 and return it as its result value to avoid object allocation; however, it should not modify t2.
The first function (seqOp) can return a different result type, U, than the type of this RDD. Thus, we need one operation for merging a T into a U and one operation for merging two U's.
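The semantics above can be sketched in plain Python. The helper below, `local_aggregate`, is a hypothetical name for illustration only: it folds each partition with seqOp starting from zeroValue, then merges the per-partition results with combOp, mirroring what aggregate does across a cluster.

```python
from functools import reduce

def local_aggregate(partitions, zero_value, seq_op, comb_op):
    # Fold each partition with seq_op, starting from zero_value (type U).
    partials = [reduce(seq_op, part, zero_value) for part in partitions]
    # Merge the per-partition results (all of type U) with comb_op.
    return reduce(comb_op, partials, zero_value)

# Compute (sum, count) over two simulated partitions.
seq_op = lambda acc, x: (acc[0] + x, acc[1] + 1)
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])
result = local_aggregate([[1, 2], [3, 4]], (0, 0), seq_op, comb_op)
print(result)  # (10, 4)
```

Note that zeroValue is folded into every partition's result (and into the final merge), so it must be a true identity for both operations.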
New in version 1.1.0.
- Parameters
- zeroValue : U
the initial value for the accumulated result of each partition
- seqOp : function
a function used to accumulate results within a partition
- combOp : function
an associative function used to combine results from different partitions
- Returns
- U
the aggregated result
Examples
>>> seqOp = (lambda x, y: (x[0] + y, x[1] + 1))
>>> combOp = (lambda x, y: (x[0] + y[0], x[1] + y[1]))
>>> sc.parallelize([1, 2, 3, 4]).aggregate((0, 0), seqOp, combOp)
(10, 4)
>>> sc.parallelize([]).aggregate((0, 0), seqOp, combOp)
(0, 0)
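Because different partitions may be merged in any order, combOp must be associative (and in practice commutative) for the result to be deterministic. A minimal check, using the (sum, count) combiner from the example above:

```python
# combOp from the (sum, count) example: merges two partial (sum, count) pairs.
comb_op = lambda a, b: (a[0] + b[0], a[1] + b[1])

# Three hypothetical per-partition partial results.
partials = [(3, 2), (7, 2), (5, 1)]

# Associativity: grouping the merges differently must give the same answer.
left = comb_op(comb_op(partials[0], partials[1]), partials[2])
right = comb_op(partials[0], comb_op(partials[1], partials[2]))
print(left, right)  # (15, 5) (15, 5)
```

A non-associative combOp (e.g. subtraction) would yield partition-order-dependent results.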