Posted onMar 16, 2021

Duplicate Digest Email Incident Retro From January

Sorry all that this incident retro is a couple of months late. We did the retro internally immediately after the incident occurred and then I put it on my TODO list to make it public on DEV and that's where it stayed for far too long. 😬

Background Context

Over the past few months, we have been working hard to make the Forem setup more configurable for our creators. In an effort to do this we recently updated how we create a Forem's community name. This was done by merging the following PR that appended thecollective_noun, if it was not disabled for the particular Forem, to theSiteConfig.community_name through a DataUpdateScript:

Adds collective_noun Fields Back to SiteConfig Model#12055

juliannatetreault posted onDec 28, 2020

What type of PR is this? (check all applicable)

[ ] Refactor
[ ] Feature
[x] Bug Fix
[ ] Optimization
[ ] Documentation Update

Description

This PR adds thecollective_noun andcollective_noun_disabled columns back to theSiteConfig model after running into adata_update script failure in another PR. This is necessary so that when running thedata_update script responsible for updating thecommunity_name, we have access to thecollective_noun andcollective_noun_disabled methods. Additionally, this PR adds adata_update script that conditionally updatesSiteConfig.community_nameifSiteConfig.collective_noun_disabled returnsfalse. If, however, the forem hascollective_noun_disabled set totrue, then thecommunity_name will remain as-is and untouched.

Related Tickets & Documents

Relates to PRs #11846 , PR #11973 , and PR #12054

QA Instructions, Screenshots, Recordings

To QA these changes, first ensure that all checks pass and the Travis build is green, then:

Run thedata_update script manually and ensure that your Community's name updates accordingly

UI accessibility concerns?

N/A

Added tests?

[x] Yes
[ ] No, and this is why:please replace this line with details on why testshave not been included
[ ] I need help with writing tests

Added to documentation?

[ ]Developer Docs and/orAdmin Guide
[ ] README
[x] No documentation needed

[optional] Are there any post-deployment tasks we need to perform?

Thedata_update script included in this PR needs to successfully run!

[optional] What gif best describes this PR or how it makes you feel?

View on GitHub

Data Update Script:

moduleDataUpdateScriptsclassAppendCollectiveNounToCommunityNamedefrunreturnifSiteConfig.collective_noun_disabled||SiteConfig.collective_noun.blank?SiteConfig.community_name="#{SiteConfig.community_name}#{SiteConfig.collective_noun}"endendend

The PR nicely addresses and updates all places in the code where we were usingcommunity_name and therefore went out without any problems.

However, 4 days later on Wednesday, December, 30th, we began to get reports of folks receiving duplicate digest emails.

I quickly jumped in and determined the problem to be anEmails::EnqueueDigestWorker that was continuously being restarted by deploys and sending the duplicate digest emails.

Bug Trigger*

NOTE: I did not label this section "Cause" because I believe the real, root cause of this problem came much earlier in the year. Scroll down for more details.

TheEmails::EnqueueDigestWorker is a worker that we have "turned off" for DEV because it takes too long to complete given the high number of users we have. We know it will keep restarting on deploys so we choose to use a rake task to send digest emails. To keep the worker from running we have a guard clause at the beginning of the worker that tells it to exit early if the environment is DEV. The guard clause at the time of the incident looked like this:

returnifSiteConfig.community_name=="DEV"

The problem was, when the above PR was deployed we changedcommunity_name from "DEV" to "DEV Community" causing this guard clause to no longer work which allowed the worker to start running.

Mitigation

The first thing I did to mitigate the number of those affected by this problem was to clear all of the existing digest worker jobs out of Sidekiq. Once that was done, I pushed out a quick fix to update the hardcoded string to the current community name by adding "Community" to it.

Bug Fix:Update dev.to check to use DEV Community#12082

mstruve posted onDec 30, 2020

Recently we changed our site name on dev.to from DEV to DEV Community. This is used in our code in a couple of places and needs to be updated. From a prod console

irb(main):002:0> SiteConfig.community_name == "DEV Community"=> true

Without the guard clauses working properly we enqueued our Email Digest worker which takes hours to complete for DEV bc of all the users. Since we restart Sidekiq on every deployment, this worker was continuously restarting which is why the duplicate emails were getting sent.

View on GitHub

While this fixed the immediate issue, the hardcoded string is what got us into this jam in the first place. To prevent future problems I quickly followed up with a second, more resilient PR, that used a more robust method to check for the DEV environment.

Refactor:Use dev_to? to Check for DEV#12083

mstruve posted onDec 30, 2020

What type of PR is this? (check all applicable)

[x] Refactor

Description

Rather than checking the community name which we just found out can change easily, lets check the domain like we do in other places since that is far less likely to change.

Related Tickets & Documents

https://github.com/forem/forem/pull/12082

View on GitHub

Root Cause and Learnings

Even though thethe community name PR was the trigger for this issue, it was far from the root cause. This issue was really caused by the poor decision by myself to hard code this string in the first place.

Bug Fix:Run Individual Digest Workers inline for DEV#10070

mstruve posted onAug 29, 2020

What type of PR is this? (check all applicable)

[x] Bug Fix

Description

Followup tohttps://github.com/forem/forem/pull/10065, we need to also ensure the individual workers run inline and are not piling up in Sidekiq since running them in parallel is what is stressing the database.

Added tests?

[x] yes

View on GitHub

Hardcoded strings, especially those for variables like this that can change, should be avoided at all costs when writing code. At the very least you should be writing a method or using a CONSTANT to represent the string you want to check. This allows you to easily Find/Replace that method or CONSTANT in the future so it doesn't get lost in the shuffle as it did here.

The second problem was that I deployed that fix without a PR review. I announced the problem in Slack and linked the PR, but based on the Slack message our database was in a dire state which is why I chose to expedite the fix. In the future, trying to get at least 1 PR review should always be the goal even if it means waiting an extra minute or two. Even if you don't immediately take the advice of the PR review, at least you can go back and update things later.

The first fix that I pushed out in order to quickly update the digest worker worked but was un-ideal. This was pointed on in the PR review. I was still able to quickly push the hotfix but then immediately went back and wrote more thoughtful and robust code after.That flow is how the initial bug should have been handled. Our recently renewed commitment to sticking more to our processes now is what led to us fixing this problem correctly the second time around, while the first time we left the door open for future problems.

Finally, the more we can move away from this "DEV is a special snowflake" concept the better. If every environment is treated the same then you don't have to worry about edge cases popping up unexpectedly in production for specific environments. The final goal is to remove that guard clause altogether so every Forem sends its email digests the same way. Once again, this eliminates another special case solution that lowers the risk of future surprises.

Handling Hotfixes

Always try to get a PR review even for hotfixes! Most of us are not running life-saving software. If your site remains down for an extra min or two because you did the responsible thing and got a code review, it's not the end of the world. Having a good code review will decrease your chances of shipping unreliable code that could cause future issues.

If a fix MUST go out immediately and you cannot get a review beforehand(maybe it's super late and no one is online), thenget a follow-up review. Ask someone to review the code after the fact to make sure it is sound and is not so hacky as to lead to future problems.

TL;DR

Don't hard code strings and get a PR review! And if all else fails, learn from your mistakes and strive not to make them again. Happy coding!

Top comments(0)

For further actions, you may consider blocking this person and/orreporting abuse

Movatterモバイル変換

DEV Community

Duplicate Digest Email Incident Retro From January

Background Context

Adds collective_noun Fields Back to SiteConfig Model#12055

What type of PR is this? (check all applicable)

Description

Related Tickets & Documents

QA Instructions, Screenshots, Recordings

UI accessibility concerns?

Added tests?

Added to documentation?

[optional] Are there any post-deployment tasks we need to perform?

[optional] What gif best describes this PR or how it makes you feel?

Bug Trigger*

Mitigation

Bug Fix:Update dev.to check to use DEV Community#12082

Refactor:Use dev_to? to Check for DEV#12083

What type of PR is this? (check all applicable)

Description

Related Tickets & Documents

Root Cause and Learnings

Bug Fix:Run Individual Digest Workers inline for DEV#10070

What type of PR is this? (check all applicable)

Description

Added tests?

Handling Hotfixes

TL;DR

Top comments(0)

More fromMolly Struve (she/her)