pyspark.RDD.groupByKey
- RDD.groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)[source]
Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.
New in version 0.7.0.
- Parameters
- numPartitions : int, optional
the number of partitions in the new RDD
- partitionFunc : function, optional, default portable_hash
function to compute the partition index
- Returns
- RDD
an RDD containing each key and the grouped values for that key
Notes
If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
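For instance, a per-key sum can be expressed with reduceByKey instead of grouping and then summing. This is a minimal sketch, assuming an active SparkContext sc as in the examples below:

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> # reduceByKey combines values for each key before shuffling, so whole value
>>> # sequences never need to be materialized the way groupByKey requires
>>> sorted(rdd.reduceByKey(lambda x, y: x + y).collect())
[('a', 2), ('b', 1)]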
Examples
>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.groupByKey().mapValues(len).collect())
[('a', 2), ('b', 1)]
>>> sorted(rdd.groupByKey().mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]
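As an illustrative sketch of the partitioning parameters (not part of the original docstring, and again assuming an active SparkContext sc), the number and placement of output partitions can be controlled explicitly:

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> # numPartitions fixes how many partitions the grouped RDD has
>>> rdd.groupByKey(numPartitions=2).getNumPartitions()
2
>>> # partitionFunc maps a key to an int; the partition index is that value modulo numPartitions
>>> rdd.groupByKey(numPartitions=2, partitionFunc=lambda k: len(k)).getNumPartitions()
2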