
Provide groupByKey shortcuts for groupBy.as #213


Merged

EnricoMi merged 3 commits into master from groupbykey

Dec 9, 2023


Conversation


Contributor

EnricoMi commented Dec 8, 2023

This provides shortcuts for groupBy(...).as[...] that make it easier to use column-based groupByKey.

Calling Dataset.groupBy(...).as[K, T] should be preferred over calling Dataset.groupByKey(...) whenever possible. The former allows Catalyst to exploit existing partitioning and ordering of the Dataset, while the latter hides from Catalyst which columns are used to create the keys.

Even when the dataset is already partitioned and ordered by the grouping columns, Dataset.groupByKey(...) will repartition and order the entire dataset again.

Example:

Calling ds.groupByKey(_.id) hides from Catalyst that column id is the grouping key, while ds.groupBy($"id").as[Int, V] tells Catalyst that ds is to be grouped by (partitioned and ordered by) column id.
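The two call styles can be compared side by side with a minimal sketch. It assumes a local Spark session and a hypothetical row type Val (neither appears in this PR), and it only prints the physical plans for inspection rather than asserting what they contain:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}

// Hypothetical row type, used only for illustration.
case class Val(id: Int, seq: Int, value: Double)

object GroupByAsSketch {
  // Function-based key: the key is produced by an opaque lambda.
  def countByFunction(spark: SparkSession, ds: Dataset[Val]): Dataset[(Int, Int)] = {
    import spark.implicits._
    ds.groupByKey(_.id).mapGroups((id, rows) => (id, rows.size))
  }

  // Column-based key: the grouping column is visible in the logical plan.
  def countByColumn(spark: SparkSession, ds: Dataset[Val]): Dataset[(Int, Int)] = {
    import spark.implicits._
    ds.groupBy($"id").as[Int, Val].mapGroups((id, rows) => (id, rows.size))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[2]")
      .appName("groupBy.as sketch")
      .getOrCreate()
    import spark.implicits._

    // A dataset that is already partitioned and ordered by the grouping column.
    val ds = Seq(Val(1, 1, 1.0), Val(1, 2, 2.0), Val(2, 1, 3.0)).toDS()
      .repartition($"id")
      .sortWithinPartitions($"id")

    // Print both physical plans and compare their exchange and sort steps.
    countByFunction(spark, ds).explain()
    countByColumn(spark, ds).explain()

    spark.stop()
  }
}
```

Both methods return the same per-key counts; the difference to look for is in the printed plans.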

The new column-based groupByKey methods make it easier for users to discover how to group by expressions. Looking at the Dataset API, the user finds groupByKey with Column arguments. The existing groupBy method returns a RelationalGroupedDataset, which provides the as[K, V] method; this offers the same semantics but is difficult to find.

Furthermore, the new column-based groupByKey methods do not require the user to specify the type V of the original Dataset[V], as groupByKey has access to the type / encoder:

ds.groupBy($"id").as[Int, V]

vs.

ds.groupByKey[Int]($"id")
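How such a shortcut can avoid the explicit value type may be easier to see in a small sketch. The extension method below is hypothetical: it is deliberately named groupByKeyOn so it is not mistaken for the method added by this PR, and it is not taken from the library's source. It assumes an Encoder[V] is available implicitly at the call site (for example via spark.implicits._):

```scala
import org.apache.spark.sql.{Column, Dataset, Encoder, KeyValueGroupedDataset}

// Hypothetical row type, used only for illustration.
case class Val(id: Int, seq: Int, value: Double)

object GroupByKeySketch {
  // Hypothetical extension class; not the library's implementation.
  // V is fixed by the wrapped Dataset[V], so callers only state the key type K.
  implicit class ColumnKeyedDataset[V: Encoder](ds: Dataset[V]) {
    def groupByKeyOn[K: Encoder](column: Column, columns: Column*): KeyValueGroupedDataset[K, V] =
      ds.groupBy(column +: columns: _*).as[K, V]
  }
}
```

With GroupByKeySketch._ and spark.implicits._ imported, ds.groupByKeyOn[Int]($"id") would then stand in for ds.groupBy($"id").as[Int, Val].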

github-actions bot commented Dec 8, 2023 (edited)

Test Results

566 files ±0, 566 suites ±0, duration 1h 29m 12s (+11s)
536 tests +2: 536 passed +2, 0 skipped ±0, 0 failed ±0
16 828 runs +72: 16 826 passed +72, 2 skipped ±0, 0 failed ±0

Results for commit 411afe8. ± Comparison against base commit 8314de4.

This pull request removes 28 and adds 30 tests. Note that renamed tests count towards both.

uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySortedSuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with partition num and reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key
uk.co.gresearch.spark.GroupBySuite ‑ df.groupByKeySorted should flatMapSortedGroups with tuple key and state
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups reverse
uk.co.gresearch.spark.GroupBySuite ‑ df.groupBySorted should flatMapSortedGroups with partition num
…

♻️ This comment has been updated with latest results.

Provide groupByKey shortcuts for groupBy.as

1668470

EnricoMi force-pushed the groupbykey branch from 41c5c69 to 1668470

December 8, 2023 18:44

EnricoMi added 2 commits

December 9, 2023 17:16

Add to README.md

d9df9fc

Improve wording in README.md

411afe8

EnricoMi merged commit 119c854 into master

Dec 9, 2023

EnricoMi deleted the groupbykey branch

December 9, 2023 21:51
