
pyspark.RDD.groupByKey

RDD.groupByKey(numPartitions=None, partitionFunc=<function portable_hash>)

Group the values for each key in the RDD into a single sequence. Hash-partitions the resulting RDD with numPartitions partitions.

New in version 0.7.0.

Parameters
numPartitions : int, optional

the number of partitions in the new RDD

partitionFunc : function, optional, default portable_hash

function to compute the partition index

Returns
RDD

an RDD containing the keys and the grouped result for each key

Notes

If you are grouping in order to perform an aggregation (such as a sum or average) over each key, using reduceByKey or aggregateByKey will provide much better performance.
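
For instance, a per-key sum can be expressed directly with reduceByKey rather than groupByKey followed by a Python-side sum. This is a minimal sketch, assuming the same parallelized data as in the Examples below:

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> # combine values per key on each partition before shuffling
>>> sorted(rdd.reduceByKey(lambda x, y: x + y).collect())
[('a', 2), ('b', 1)]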

Examples

>>> rdd = sc.parallelize([("a", 1), ("b", 1), ("a", 1)])
>>> sorted(rdd.groupByKey().mapValues(len).collect())
[('a', 2), ('b', 1)]
>>> sorted(rdd.groupByKey().mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]
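
The optional arguments can also be passed explicitly. A minimal sketch, assuming the same rdd as above; the partitioning function shown here is purely illustrative:

>>> # request 2 output partitions and supply a custom partition function
>>> grouped = rdd.groupByKey(numPartitions=2, partitionFunc=lambda k: hash(k))
>>> grouped.getNumPartitions()
2
>>> sorted(grouped.mapValues(list).collect())
[('a', [1, 1]), ('b', [1])]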
