So, you want to deploy continuously? We decided to take the plunge at Opendoor to reduce the human cost and error potential of manual deployments. We currently use GitHub, Circle, and Heroku to deploy our main Rails app, so implementing continuous deployment required a bit of thought to coordinate all of these services.
Continuous deployment had been a long-running desire of our engineers, but there were perennial tooling constraints that had made it just difficult enough that it never happened. Serendipitously, the release of Circle Workflows over the summer gave us the primitives within Circle to guarantee atomic lock-step releases. That, plus a recent switch to using Heroku Review Apps, made the migration more tractable.
We wanted to share some code that runs our deployments and note some best practices for deploying the specific stack we’re using.
Before we get into the nitty-gritty, here’s how a commit eventually lands on production: it’s tested on a pull request, merged to master, tested again, deployed to staging by our Circle deploy job, checked for health, and then deployed to production.
Our process biases towards redundancy and taking extra precautions to ensure production stability. Your team’s implementation might look different — for example, some deployment environments can deploy new code to a percentage of production instead of having a dedicated staging environment.
Workflows are how we kick off tests and deployments on Circle. We run a variety of tests on every pull request and on the master branch upon merging; when all of those workflow jobs finish, we kick off a deployment job. The relevant part of our Circle configuration looks like this:
workflows:
  version: 2
  pr:
    jobs:
      - lint
      - unit_test
      - ...
      - deploy:
          requires:
            - lint
            - unit_test
            - ...
          filters:
            branches:
              only: master

Having each of your “pre-deploy” tests separated into its own semantic workflow job is helpful for alerting engineers as soon as possible when a particular step fails on a pull request, without waiting for the whole workflow to finish. For example, linting takes less than a minute, and having it run independently of the longer unit and integration tests means a lint failure notification is sent minutes earlier.
The deploy job doesn’t require much setup to deploy to Heroku — we only need to check out the code and install any dependencies our deploy scripts need. That part of our Circle config looks like:
version: 2
jobs:
  deploy:
    docker:
      - ...
    parallelism: 1
    steps:
      - checkout
      - ...
      - run:
          name: Deploy
          command: bin/ci/circle-lock --branch master --job-name deploy bin/ci/deploy

In the last step, we wrap our underlying bin/ci/deploy script with a circle-lock script. The lock script enforces that only one deploy job on the master branch can be running at a time; if there are multiple, the ones occurring afterward will poll Circle until they are unblocked.
We adapted this script posted in the Circle forums to support one additional behavior — if you have three or more stacked deploy builds, the “middle” builds will exit when they detect a later build. We added this behavior to deal with stacking deploys (more on that below).
We do a number of tasks inside our deploy script. Most of our deploy communication happens over the Heroku API, which is authenticated through environment variables. The Circle docs have a script that we used verbatim to configure the build environment at the start of the deploy. We use jq heavily to parse Heroku API responses inside our bash scripts.
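For illustration, the shape of those calls looks roughly like this — a minimal sketch, assuming HEROKU_API_KEY and APP_NAME are already exported by that setup script; it lists the app’s dynos and prints any that aren’t up:

# Sketch: hit the Heroku Platform API (v3) with curl and filter the JSON with jq.
curl -fsS \
  -H "Accept: application/vnd.heroku+json; version=3" \
  -H "Authorization: Bearer ${HEROKU_API_KEY}" \
  "https://api.heroku.com/apps/${APP_NAME}/dynos" \
  | jq -r '.[] | select(.state != "up") | "\(.name) is \(.state)"'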
Once your team hits a certain size, engineers will be merging code faster than your system can deploy it. You need to consider how you want “stacking” deploys to work in your continuous deployment setup.
In one type of setup, each deploy contains exactly one merge. The benefit is that every release stays small, so a bad change is easy to spot and attribute.
However, it means that deploys can back up for hours, depending on how long each deploy takes.
The other type of setup “skips” stacked deploys, which is what we chose to get code out into the wild faster, at the cost of some deploys growing in size. We adapted the circle-lock script into our own version which automatically skips if later builds are detected. This implicitly relies on Circle build numbers incrementing as the branch moves forward, so we don’t rebuild old master-branch deploys. If a deploy flakes and there’s another one coming, we wait for that one instead.
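To make the skipping behavior concrete, here is a simplified sketch of the idea. It is not our exact script: it assumes the CircleCI 1.1 API, a CIRCLE_API_TOKEN environment variable you provide, and the built-in CIRCLE_* build variables, and it leaves out the branch/job-name flag parsing our real version does.

#!/bin/bash
# Simplified circle-lock-style wrapper: wait for older running builds on the
# branch, skip ourselves if a newer build shows up, then run the wrapped command.
set -euo pipefail

API="https://circleci.com/api/v1.1/project/github/${CIRCLE_PROJECT_USERNAME}/${CIRCLE_PROJECT_REPONAME}/tree/master"

while true; do
  BUILDS=$(curl -fsS "${API}?circle-token=${CIRCLE_API_TOKEN}&filter=running")

  # A newer build is already running: let it do the deploy and bow out.
  if [[ $(echo "${BUILDS}" | jq "[.[] | select(.build_num > ${CIRCLE_BUILD_NUM})] | length") -gt 0 ]]; then
    echo "Newer build detected; skipping this deploy"
    exit 0
  fi

  # No older build is still running: the lock is ours.
  if [[ $(echo "${BUILDS}" | jq "[.[] | select(.build_num < ${CIRCLE_BUILD_NUM})] | length") -eq 0 ]]; then
    break
  fi

  sleep 10
done

exec "$@"  # run the wrapped deploy command, e.g. bin/ci/deploy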
To notify the team (see below for more on that), at deploy time we generate notes about the commits about to be deployed. We use the Heroku API to get the “current” release, figure out the diff using vanilla git, and then format them for consumption. The code looks something like this:
# The %\" markers are placeholders that get turned into real JSON quotes later,
# after any literal quotes in commit messages have been escaped.
DEPLOY_SHA=$(git rev-parse --short HEAD)
CURRENT_SHA=$(heroku releases -a ${APP_NAME} | awk '/Deploy/ {print $3}' | head -n 1)
DEPLOY_NOTES="/tmp/deploy_notes.txt"
git --no-pager log --reverse \
  --pretty=format:"{%\"author_name%\": %\"%an%\", %\"title%\": %\"%s%\", %\"title_link%\": %\"${GITHUB_URL}/commit/%h%\", %\"text%\": %\"%b%\"}," \
  --no-merges ${DEPLOY_BRANCH}...${CURRENT_SHA} > ${DEPLOY_NOTES}

If you haven’t already, enable Preboot on your Heroku app. Without preboot, your app probably has anywhere from moments to minutes of downtime during a deploy. That might have been fine if you were deploying at non-peak hours once a day, but in a continuously deploying world you can count on multiple deploys per hour.
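Turning it on should be a single CLI command — a hedged example; check the current Heroku docs for the exact feature name:

# Enable preboot so new dynos boot and take traffic before old ones shut down.
heroku features:enable preboot -a ${APP_NAME}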
The only gotcha with preboot is that there is a time period where your new servers are starting up and your old servers are still serving traffic. This does not mean requests are getting routed to new and old code simultaneously; instead, you should be aware that any connections your new code establishes during boot-up will be opened while the old code still maintains its own connections.
We use Postgres at Opendoor, which means we have to be especially aware of concurrent database connections doubling during deploys. Using something like PgBouncer can help with this, as can any other pooling proxy for the data stores your code connects to.
Much like self-driving cars, our self-driving deploys need to have safety checks at every step of the process and allow for human intervention.
After the Heroku step of the staging deployment finishes, we first check that the servers become available using the Heroku API. Something like this should work for many Heroku apps:
require 'json'

# Poll the Heroku API until every dyno is "up", any dyno has crashed, or we
# hit the timeout. `heroku_api_get` (defined elsewhere in our tooling) makes an
# authenticated GET against the Heroku Platform API.
started = Time.now
timeout = 300 # seconds; tune to how long your dynos take to boot
down_dynos = []
while (started + timeout) > Time.now
  response = heroku_api_get("https://api.heroku.com/apps/#{app}/dynos")
  dyno_states = JSON.parse(response.body).map do |e|
    e.values_at('name', 'state')
  end
  down_dynos = dyno_states.select { |_, state| state != 'up' }
  break if down_dynos.any? { |_, state| state == 'crashed' }
  break if down_dynos.empty?
  sleep 5
end

Next, we check that certain critical endpoints like the homepage return 200 responses. We also have all of these checks in our testing suite at the unit and integration levels, but still perform one last triple-check before pushing to production.
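That endpoint check is essentially a curl with teeth; here is a rough sketch, assuming a hypothetical STAGING_URL variable:

# Fail the deploy if a critical page doesn't come back with a 200.
STATUS=$(curl -s -o /dev/null -w "%{http_code}" "${STAGING_URL}/")
if [[ "${STATUS}" != "200" ]]; then
  echo "Homepage returned ${STATUS}; aborting deploy"
  exit 1
fi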
Before deployment to production starts, we check whether the latest “release” on Heroku is a rollback. If it is, we take it as a sign that the code may not be in a releasable state (e.g., because it contains a regression). Our deploy script check looks like this:
if [[ "$CI" == "true" ]]; then
LAST_DEPLOY=$(heroku releases -a ${APP_NAME} --json | jq '.[0].description')
set +e
echo $LAST_DEPLOY | grep "Rollback to"
if [[ $? == 0 ]]; then
echo "The current release is a rollback; can't deploy"
exit 0
fi
set -e
fiIn the event of a rollback, an engineer must manually deploy from their machine to cause the latest release status to change and implicitly “re-activate” continuous deployment.
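That manual deploy can be as simple as pushing master to the Heroku git remote (a hypothetical example; your own process may wrap more steps):

# Manual deploy from a laptop. The resulting release description starts with
# "Deploy", so the rollback check above passes again on the next CI run.
git push https://git.heroku.com/${APP_NAME}.git master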
Finally, we can set a Circle environment variable to act as a kill-switch for production continuous deployment. We haven’t had to use this yet, but it’s the final safeguard before code is allowed to ship to production.
if [[ "$CI" == "true" && "$PRODUCTION_CD_ENABLED" != "true" && "$DEPLOY_ENV" == "production" ]]; then
exit 0
fiEngineers usually want to know when their code lands on production without continuously checking the status of Circle. We have a Slack-based notification systems in place to help with this.
We have a general #alerts-deploy channel for automated messages sent from the deploy script and the Heroku Slack app. The default Heroku integration is easy to set up and is the “canonical” source of when your new code is deployed; however, we also need our own Slack messages within the deploy script to report Opendoor-specific errors and status updates.
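A helper for that can be as small as a curl against a Slack incoming webhook. Here is a hypothetical sketch (SLACK_WEBHOOK_URL and the payload shape are assumptions, not a description of our internal helper):

# Hypothetical `slack` helper: post a message plus optional attachments
# (the release notes below are formatted as Slack attachment objects).
slack() {
  local text="$1"
  local attachments="${2:-}"
  curl -fsS -X POST -H 'Content-Type: application/json' \
    --data "{\"text\": \"${text}\", \"attachments\": [${attachments}]}" \
    "${SLACK_WEBHOOK_URL}"
}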
For example, when our “check whether dynos started” assertion fails, we send an @oncall-flavored alert to the Slack channel for our on-call team to investigate.
We also echo the release notes generated earlier into the Slack channel. Continuing the release-note code from earlier, we transform the notes into valid JSON and send them to Slack:
# Escape any literal quotes in commit messages, then turn the %\" placeholders
# into real JSON quotes; ${DEPLOY_NOTES_SAFE%?} below drops the trailing comma.
DEPLOY_NOTES_SAFE=$(sed 's/\"/\\"/g; s/%\\\"/\"/g' ${DEPLOY_NOTES})
slack "Deployed to production" "${DEPLOY_NOTES_SAFE%?}"

We’ve been on continuous deployment for a few months. It has freed up engineering hours from monitoring manual deployments and even caught show-stopping bugs earlier in the process.
From a typical engineer’s perspective, this process of “deploying” is decoupled from the specific platforms our code runs on. As we think ahead to if and when we want to change infrastructure and tools, we can iterate on our platform without changing day-to-day workflows.
We’re looking for engineers of all backgrounds to build products and technologies that empower everyone with the freedom to move.
Find out more about Opendoor jobs on StackShare or on our careers site.