pyspark.RDD.groupByKey
- RDD.groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)[source]
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.
New in version 0.7.0.
- Parameters
- numPartitions : int, optional
the number of partitions in the new RDD
- partitionFunc : function, optional, default portable_hash
function to compute the partition index
- Returns
- RDD
an RDD containing each key and the grouped values for that key
Notes
If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
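For instance, a per-key sum can be expressed with reduceByKey instead of grouping and then summing. This is a minimal sketch, assuming an active SparkContext sc as in the examples below:

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> # reduceByKey combines values for each key before shuffling, so whole value
>>> # sequences never need to be materialized the way groupByKey requires
>>> sorted(rdd.reduceByKey(lambda x, y: x + y).collect())
[('a', 2), ('b', 1)]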
Examples
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.groupByKey().mapValues(len).collect())
[('a', 2), ('b', 1)]
>>> sorted(rdd.groupByKey().mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]
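As an illustrative sketch of the partitioning parameters (not part of the original docstring, and again assuming an active SparkContext sc), the number and placement of output partitions can be controlled explicitly:

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> # numPartitions fixes how many partitions the grouped RDD has
>>> rdd.groupByKey(numPartitions=2).getNumPartitions()
2
>>> # partitionFunc maps a key to an int; the partition index is that value modulo numPartitions
>>> rdd.groupByKey(numPartitions=2, partitionFunc=lambda k: len(k)).getNumPartitions()
2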