Movatterモバイル変換


[0]ホーム

URL:


Skip to content
DEV Community
Log in Create account

DEV Community

Molly Struve (she/her)
Molly Struve (she/her)

Posted on

     

Duplicate Digest Email Incident Retro From January

Sorry all that this incident retro is a couple of months late. We did the retro internally immediately after the incident occurred and then I put it on my TODO list to make it public on DEV and that's where it stayed for far too long. 😬

Background Context

Over the past few months, we have been working hard to make the Forem setup more configurable for our creators. In an effort to do this we recently updated how we create a Forem's community name. This was done by merging the following PR that appended thecollective_noun, if it was not disabled for the particular Forem, to theSiteConfig.community_name through a DataUpdateScript:

GitHub logo Adds collective_noun Fields Back to SiteConfig Model#12055

What type of PR is this? (check all applicable)

  • [ ] Refactor
  • [ ] Feature
  • [x] Bug Fix
  • [ ] Optimization
  • [ ] Documentation Update

Description

This PR adds thecollective_noun andcollective_noun_disabled columns back to theSiteConfig model after running into adata_update script failure in another PR. This is necessary so that when running thedata_update script responsible for updating thecommunity_name, we have access to thecollective_noun andcollective_noun_disabled methods. Additionally, this PR adds adata_update script that conditionally updatesSiteConfig.community_nameifSiteConfig.collective_noun_disabled returnsfalse. If, however, the forem hascollective_noun_disabled set totrue, then thecommunity_name will remain as-is and untouched.

Related Tickets & Documents

Relates to PRs #11846 , PR #11973 , and PR #12054

QA Instructions, Screenshots, Recordings

To QA these changes, first ensure that all checks pass and the Travis build is green, then:

  • Run thedata_update script manually and ensure that your Community's name updates accordingly

UI accessibility concerns?

N/A

Added tests?

  • [x] Yes
  • [ ] No, and this is why:please replace this line with details on why testshave not been included
  • [ ] I need help with writing tests

Added to documentation?

[optional] Are there any post-deployment tasks we need to perform?

Thedata_update script included in this PR needs to successfully run!

[optional] What gif best describes this PR or how it makes you feel?

The Office: Days of Nonsense

Data Update Script:

moduleDataUpdateScriptsclassAppendCollectiveNounToCommunityNamedefrunreturnifSiteConfig.collective_noun_disabled||SiteConfig.collective_noun.blank?SiteConfig.community_name="#{SiteConfig.community_name}#{SiteConfig.collective_noun}"endendend
Enter fullscreen modeExit fullscreen mode

The PR nicely addresses and updates all places in the code where we were usingcommunity_name and therefore went out without any problems.

However, 4 days later on Wednesday, December, 30th, we began to get reports of folks receiving duplicate digest emails.

I quickly jumped in and determined the problem to be anEmails::EnqueueDigestWorker that was continuously being restarted by deploys and sending the duplicate digest emails.

Bug Trigger*

  • NOTE: I did not label this section "Cause" because I believe the real, root cause of this problem came much earlier in the year. Scroll down for more details.

TheEmails::EnqueueDigestWorker is a worker that we have "turned off" for DEV because it takes too long to complete given the high number of users we have. We know it will keep restarting on deploys so we choose to use a rake task to send digest emails. To keep the worker from running we have a guard clause at the beginning of the worker that tells it to exit early if the environment is DEV. The guard clause at the time of the incident looked like this:

returnifSiteConfig.community_name=="DEV"
Enter fullscreen modeExit fullscreen mode

The problem was, when the above PR was deployed we changedcommunity_name from "DEV" to "DEV Community" causing this guard clause to no longer work which allowed the worker to start running.

Mitigation

The first thing I did to mitigate the number of those affected by this problem was to clear all of the existing digest worker jobs out of Sidekiq. Once that was done, I pushed out a quick fix to update the hardcoded string to the current community name by adding "Community" to it.

GitHub logo Bug Fix:Update dev.to check to use DEV Community#12082

Recently we changed our site name on dev.to from DEV to DEV Community. This is used in our code in a couple of places and needs to be updated. From a prod console

irb(main):002:0> SiteConfig.community_name == "DEV Community"=> true

Without the guard clauses working properly we enqueued our Email Digest worker which takes hours to complete for DEV bc of all the users. Since we restart Sidekiq on every deployment, this worker was continuously restarting which is why the duplicate emails were getting sent.

While this fixed the immediate issue, the hardcoded string is what got us into this jam in the first place. To prevent future problems I quickly followed up with a second, more resilient PR, that used a more robust method to check for the DEV environment.

GitHub logo Refactor:Use dev_to? to Check for DEV#12083

What type of PR is this? (check all applicable)

  • [x] Refactor

Description

Rather than checking the community name which we just found out can change easily, lets check the domain like we do in other places since that is far less likely to change.

Related Tickets & Documents

https://github.com/forem/forem/pull/12082

alt_text

Root Cause and Learnings

Even though thethe community name PR was the trigger for this issue, it was far from the root cause. This issue was really caused by the poor decision by myself to hard code this string in the first place.

GitHub logo Bug Fix:Run Individual Digest Workers inline for DEV#10070

What type of PR is this? (check all applicable)

  • [x] Bug Fix

Description

Followup tohttps://github.com/forem/forem/pull/10065, we need to also ensure the individual workers run inline and are not piling up in Sidekiq since running them in parallel is what is stressing the database.

Added tests?

  • [x] yes

alt_text

Hardcoded strings, especially those for variables like this that can change, should be avoided at all costs when writing code. At the very least you should be writing a method or using a CONSTANT to represent the string you want to check. This allows you to easily Find/Replace that method or CONSTANT in the future so it doesn't get lost in the shuffle as it did here.

The second problem was that I deployed that fix without a PR review. I announced the problem in Slack and linked the PR, but based on the Slack message our database was in a dire state which is why I chose to expedite the fix. In the future, trying to get at least 1 PR review should always be the goal even if it means waiting an extra minute or two. Even if you don't immediately take the advice of the PR review, at least you can go back and update things later.

The first fix that I pushed out in order to quickly update the digest worker worked but was un-ideal. This was pointed on in the PR review. I was still able to quickly push the hotfix but then immediately went back and wrote more thoughtful and robust code after.That flow is how the initial bug should have been handled. Our recently renewed commitment to sticking more to our processes now is what led to us fixing this problem correctly the second time around, while the first time we left the door open for future problems.

Finally, the more we can move away from this "DEV is a special snowflake" concept the better. If every environment is treated the same then you don't have to worry about edge cases popping up unexpectedly in production for specific environments. The final goal is to remove that guard clause altogether so every Forem sends its email digests the same way. Once again, this eliminates another special case solution that lowers the risk of future surprises.

Handling Hotfixes

Always try to get a PR review even for hotfixes! Most of us are not running life-saving software. If your site remains down for an extra min or two because you did the responsible thing and got a code review, it's not the end of the world. Having a good code review will decrease your chances of shipping unreliable code that could cause future issues.

If a fix MUST go out immediately and you cannot get a review beforehand(maybe it's super late and no one is online), thenget a follow-up review. Ask someone to review the code after the fact to make sure it is sound and is not so hacky as to lead to future problems.

TL;DR

Don't hard code strings and get a PR review! And if all else fails, learn from your mistakes and strive not to make them again. Happy coding!

Top comments(0)

Subscribe
pic
Create template

Templates let you quickly answer FAQs or store snippets for re-use.

Dismiss

Are you sure you want to hide this comment? It will become hidden in your post, but will still be visible via the comment'spermalink.

For further actions, you may consider blocking this person and/orreporting abuse

International Speaker 🗣Runner 🏃‍♀️ Always Ambitious. Never Satisfied. I ride 🦄's IRL
  • Location
    Ocala, FL
  • Education
    MIT '10 Aerospace Engineering
  • Pronouns
    she/her
  • Work
    Staff Site Reliability Engineer at Netflix
  • Joined

More fromMolly Struve (she/her)

DEV Community

We're a place where coders share, stay up-to-date and grow their careers.

Log in Create account

[8]ページ先頭

©2009-2025 Movatter.jp