Optimize plan phase for foreground gcs #45208

Conversation

@PeterSolMS (Contributor)

Background:

We have an optimization for gen 0 and gen 1 collections in plan_phase where we have a list of the marked objects (the "mark list") so we can visit only the surviving objects in plan_phase rather than all objects.

When we execute a gen 0 or gen 1 collection while a background collection is in progress (we call these "foreground collections"), we don't use the mark list because we still have to turn off the background mark bits for the objects that don't survive.
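To make the mark-list idea concrete, here is a minimal, self-contained sketch (toy types and names, not CoreCLR's actual plan_phase code) of planning sliding compaction with and without a mark list:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model of a heap segment: a run of objects, each with a size and a
// mark bit. (Illustrative only -- the real GC reads an object's size via
// its method table and advances with x = x + Align (size (x)).)
struct ToyObject { size_t size; bool marked; };

// Without a mark list: the planner must step over every object, dead or
// alive, to find the survivors and assign their post-compaction offsets.
std::vector<size_t> plan_full_walk(const std::vector<ToyObject>& heap) {
    std::vector<size_t> new_offsets;
    size_t next = 0;
    for (const ToyObject& o : heap) {   // touches dead objects too
        if (o.marked) {
            new_offsets.push_back(next);
            next += o.size;
        }
    }
    return new_offsets;
}

// With a mark list (indices of marked objects, in address order): the
// planner visits only survivors and never touches a dead object.
std::vector<size_t> plan_with_mark_list(const std::vector<ToyObject>& heap,
                                        const std::vector<size_t>& mark_list) {
    std::vector<size_t> new_offsets;
    size_t next = 0;
    for (size_t i : mark_list) {
        new_offsets.push_back(next);
        next += heap[i].size;
    }
    return new_offsets;
}
```

Both functions produce the same plan; the difference is that the full walk has to read every dead object's size just to step past it, which is the cost the mark list avoids.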

The Optimization:

However, as the background mark bits are not stored in the objects themselves, but in a side table, we can turn off the background mark bits in bulk for the dead objects between surviving objects. That insight enables us to still use the mark list, and save a significant amount of execution time in foreground collections.
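Because the background mark bits live in a side table rather than in the object headers, a dead range can be cleared with a handful of word-sized stores. Here is a minimal sketch of that idea, assuming one mark bit per heap granule packed into 64-bit words (the class and method names are illustrative, not CoreCLR's actual mark_array / bgc_clear_batch_mark_array_bits implementation):

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Sketch of a side-table mark bitmap with a bulk range-clear.
class MarkBitmap {
public:
    explicit MarkBitmap(size_t granules) : words_((granules + 63) / 64, 0) {}

    void set(size_t granule)             { words_[granule / 64] |= bit(granule); }
    bool is_marked(size_t granule) const { return (words_[granule / 64] & bit(granule)) != 0; }

    // Clear all mark bits in [first, end): a masked store at each edge,
    // whole 64-bit words in the middle. This is what makes clearing the
    // bits for a run of dead objects cheap compared to visiting each one.
    void clear_range(size_t first, size_t end) {
        while (first < end) {
            size_t w  = first / 64;
            size_t lo = first % 64;
            size_t hi = std::min(end - w * 64, size_t(64)); // end bit in word
            uint64_t mask = (hi - lo == 64)
                                ? ~0ull
                                : (((1ull << (hi - lo)) - 1) << lo);
            words_[w] &= ~mask;
            first = w * 64 + hi;
        }
    }

private:
    static uint64_t bit(size_t g) { return 1ull << (g % 64); }
    std::vector<uint64_t> words_;
};
```

With the mark list in hand, the dead gap between two consecutive survivors maps to a single clear_range call; clearing a run of n dead granules then costs roughly n/64 word stores plus the two edge masks, instead of one per-object call.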

Profile Data

Here's some profile data for our GC benchmark program GCPerfSim.dll executed with parameters "-tc 250 -tagb 5000 -tlgb 2 -lohar 0 -sohsi 50 -sohSizeRange 96-256 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -pohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time" on a 128 core AMD machine (256 virtual processors). This set of parameters causes many more foreground GCs to happen than is typical.

The following shows the original source code with the CPU sample counts listed in the left column. Note in particular the high counts for the code sections handling the situation where the mark list is not being used:

plan_phase_source_profile_baseline.txt

For comparison, here's the changed source code - note that the previously expensive sections not using the mark list have become much cheaper, and the new section using bgc_clear_batch_mark_array_bits to turn off background mark bits in bulk is much cheaper than the section it replaces:

plan_phase_source_profile_optimized.txt

Here are charts for the exclusive CPU samples in plan_phase. For both the baseline and the optimization, 3 profile runs were done:

plan_phase exclusive count

Conclusions are:

  • Regular gen 0 and gen 1 GC cost in plan_phase is about the same with and without the optimization
  • Foreground gen 0 and gen 1 GC cost in plan_phase is much more consistent and typically much lower with the optimization.

The key point is to clear the background GC mark bits for objects that the foreground GC found to be dead. The existing code walked all the dead objects individually and cleared their mark bits, but as it turns out, it is significantly cheaper to turn off the mark bits in bulk.
@ghost

Tagging subscribers to this area: @dotnet/gc
See info in area-owners.md if you want to be subscribed.

Issue Details

Author: PeterSolMS
Assignees: -
Labels: area-GC-coreclr
Milestone: -

@Maoni0 (Member) commented Nov 25, 2020 (edited)
this is a great change, and the perf data/charts are greatly appreciated, thanks @PeterSolMS!

I'm also wondering about the cost of calling background_object_marked on each object vs bgc_clear_batch_mark_array_bits in the case when we don't use the mark list. I could imagine that clearing it in batch is faster than clearing individual objects. In the baseline txt we see that background_object_marked really doesn't take much time

151,0 | background_object_marked (xl, TRUE);
72,5K | xl = xl + Align (size (xl));

but I wonder if these numbers really match up with the lines. anyway, just a thought :)

@Maoni0 (Member) left a comment

LGTM (and I think it's better to take@mangod9's suggestion)

@PeterSolMS (Contributor, Author)

@Maoni0 regarding the cost of calling background_object_marked vs. bgc_clear_batch_mark_array_bits, I think we can get a clearer picture of the cost by looking at the assembly level profile - I produced this by hand from data reported by PerfView in its log file after I obtained the source level profile, and from a disassembly produced with dumpbin /disasm coreclr.dll.

Here's the one for the baseline:

plan_phase_asm_profile_baseline.txt

The cost of calls is typically reported on the following instruction (the return address), so in our case that would be 13,631 CPU samples. Interestingly enough this isn't really the most expensive part of the loop - rather, fetching the next object's method table totally dominates the loop's execution time.

The assembly level profile for the optimized case looks like this:

plan_phase_asm_profile_optimized.txt

So the conclusion is that the call to bgc_clear_batch_mark_array_bits costs 3,828 CPU samples, about 3.5 times less than calling background_object_marked on each object. But the big savings come from not having to fetch the method table of each dead object.

@PeterSolMS merged commit 88680a4 into dotnet:master on Nov 26, 2020
@ghost locked as resolved and limited conversation to collaborators on Dec 26, 2020

Reviewers

@mangod9 left review comments
@Maoni0 approved these changes

Assignees: No one assigned
Projects: None yet
Milestone: No milestone

4 participants: @PeterSolMS, @Maoni0, @mangod9, @Dotnet-GitSync-Bot
