- Notifications
You must be signed in to change notification settings - Fork5.2k
Optimize plan phase for foreground gcs#45208
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to ourterms of service andprivacy statement. We’ll occasionally send you account related emails.
Already on GitHub?Sign in to your account
Optimize plan phase for foreground gcs#45208
Uh oh!
There was an error while loading.Please reload this page.
Conversation
The key point is to clear the background GC mark bits for objects that the foreground GC found to be dead.The existing code walked all the dead objects individually and cleared their mark bits, but as it turns out, it is significantly cheaper to turn off the mark bits in bulk.
ghost commentedNov 25, 2020
Tagging subscribers to this area: @dotnet/gc Issue DetailsBackground:We have an optimization for gen 0 and gen 1 collections in plan_phase where we have a list of the marked objects (the "mark list") so we can visit only the surviving objects in plan_phase rather than all objects. When we execute a gen 0 or gen 1 collection while a background collection is in progress (we call these "foreground collections") , we don't use the mark list because we still have to turn off the background mark bits for the objects that don't survive. The Optimization:However, as the background mark bits are not stored in the objects themselves, but in a side table, we can turn off the background mark bits in bulk for the dead objects between surviving objects. That insight enables us to still use the mark list, and save a significant amount of execution time in foreground collections. Profile DataHere's some profile data for our GC benchmark program GCPerfSim.dll executed with parameters "-tc 250 -tagb 5000 -tlgb 2 -lohar 0 -sohsi 50 -sohSizeRange 96-256 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -pohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time" on a 128 core AMD machine (256 virtual processors). This set of parameters causes many more foreground GCs to happen than is typical. The below shows the original source code with the CPU samples listed in the left column. Note in particular the high counts for the code sections handling the situation where the mark list is not being used: plan_phase_source_profile_baseline.txt For comparison, here's the changed source code - note that the previously expensive sections not using the mark list have become much cheaper, and the new section using bgc_clear_batch_mark_array_bits to turn off background mark bits in bulk is much cheaper than the section it replaces: plan_phase_source_profile_optimized.txt The here are charts for the exclusive CPU samples in plan_phase. Both for the baseline and the optimization, 3 profile runs were done: Conclusions are:
|
Uh oh!
There was an error while loading.Please reload this page.
Maoni0 commentedNov 25, 2020 • edited
Loading Uh oh!
There was an error while loading.Please reload this page.
edited
Uh oh!
There was an error while loading.Please reload this page.
this is a great change, and the perf data/charts are greatly appreciated, thanks@PeterSolMS! I'm also wondering about the cost of calling but I wonder if these numbers really match up with the lines. anyway, just a thought :) |
Maoni0 left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others.Learn more.
LGTM (and I think it's better to take@mangod9's suggestion)
PeterSolMS commentedNov 26, 2020
@Maoni0 regarding the cost of calling background_object_marked vs. bgc_clear_batch_mark_array_bits, I think we can get a clearer picture of the cost by looking at the assembly level profile - I produced this by hand from data reported by PerfView in its log file after I obtained the source level profile, and from a disassembly produced with dumpbin /disasm coreclr.dll. Here's the one for the baseline: plan_phase_asm_profile_baseline.txt The cost of calls is typically reported on the following instruction (the return address), so in our case that would be 13,631 CPU samples. Interestingly enough this isn't really the most expensive part of the loop - rather, fetching the next object's method table totally dominates the loop's execution time. The assembly level profile for the optimized case looks like this: plan_phase_asm_profile_optimized.txt So the conclusion is that the call to bgc_clear_batch_mark_array_bits is 3,828 CPU samples, about 3.5 times lower than calling background_object_marked on each object. But the big savings are in not having to fetch the method table of the dead objects. |

Background:
We have an optimization for gen 0 and gen 1 collections in plan_phase where we have a list of the marked objects (the "mark list") so we can visit only the surviving objects in plan_phase rather than all objects.
When we execute a gen 0 or gen 1 collection while a background collection is in progress (we call these "foreground collections") , we don't use the mark list because we still have to turn off the background mark bits for the objects that don't survive.
The Optimization:
However, as the background mark bits are not stored in the objects themselves, but in a side table, we can turn off the background mark bits in bulk for the dead objects between surviving objects. That insight enables us to still use the mark list, and save a significant amount of execution time in foreground collections.
Profile Data
Here's some profile data for our GC benchmark program GCPerfSim.dll executed with parameters "-tc 250 -tagb 5000 -tlgb 2 -lohar 0 -sohsi 50 -sohSizeRange 96-256 -lohsi 0 -pohsi 0 -sohpi 0 -lohpi 0 -pohpi 0 -sohfi 0 -lohfi 0 -pohfi 0 -allocType reference -testKind time" on a 128 core AMD machine (256 virtual processors). This set of parameters causes many more foreground GCs to happen than is typical.
The below shows the original source code with the CPU samples listed in the left column. Note in particular the high counts for the code sections handling the situation where the mark list is not being used:
plan_phase_source_profile_baseline.txt
For comparison, here's the changed source code - note that the previously expensive sections not using the mark list have become much cheaper, and the new section using bgc_clear_batch_mark_array_bits to turn off background mark bits in bulk is much cheaper than the section it replaces:
plan_phase_source_profile_optimized.txt
The here are charts for the exclusive CPU samples in plan_phase. Both for the baseline and the optimization, 3 profile runs were done:
Conclusions are: