Uploaded by Will Button

Build an Infra Product with AWS Fargate

The document discusses Aaptiv's migration from Heroku to AWS using Fargate, detailing considerations, successes, and failures of the process. The transition aimed to enhance infrastructure by breaking a monolith into microservices to accommodate company growth while maintaining ease of use. Aaptiv leveraged tools like Jenkins and CloudFormation to streamline deployments and support their developers effectively.

AWS TO FARGATE: BUILDING INFRASTRUCTURE PRODUCTS AT AAPTIV

Hey, I'm Will Button, an infrastructure engineer at Aaptiv, and today I'm going to share with you how we built an infrastructure product at Aaptiv in our migration from Heroku to AWS using Fargate. We'll talk about what that migration looked like, what our considerations and guiding principles were, and I'll share some of the successes and failures we had along the way.
AAPTIV LETS YOU WORK OUT WHEN YOU WANT, WHERE YOU WANT, THE WAY YOU WANT.

First, let me tell you what Aaptiv is. We're an audio fitness company that helps you work out whenever and wherever you are. Let me ask you this: how many of you had a New Year's resolution to get in better shape?
https://www.statista.com/statistics/378105/new-years-resolution/
- 45%
- No plan to succeed
- No idea what to do @ gym
PICK A WORKOUT AND HIT PLAY, THEN PUT YOUR PHONE AWAY
- Find a workout and hit play
- A music playlist from top artists
- A professional trainer coaching you through the entire workout
- No need to figure out what to do with your phone
INFRASTRUCTURE @ AAPTIV: THE TEAM
- Jem Altieri
- Ivan Lee (split)
- Myself
- Engineering Team
HEROKU
▸ Why Heroku in the first place?
▸ Small team
▸ Fast setup
▸ Fail fast

- Initial work with Aaptiv was only ~6 people (3 employees, 3 contractors)
- Future was unknown (early-stage startup)
- MVP built by a single person (iOS app, Android app, API)
- Built-in tools for automatic scaling and deployment
WHY MIGRATE TO AWS? JANUARY 2017
▸ Large-scale marketing efforts
▸ Driving new levels of traffic
▸ Load testing showed weakness in the monolith
▸ Inefficient scaling
▸ MVP proving successful

- Going to spend more than ever before on marketing and ads in Jan 2017
- If successful, would result in more traffic than ever before
- Load testing in December showed signs of weakness in the monolith API
- Resulted in scaling to higher numbers than acceptable
- Good news: the MVP was successful; we knew who our customers were and what they wanted
- This gave us the insight we needed to address the problems identified during load testing, if we made it through Jan 2017
CONSIDERATIONS MOVING TO AWS: WITHOUT BRINGING THE SYSTEM DOWN…
▸ Everything in Heroku is publicly exposed
▸ Monolith -> Microservices
▸ Grow into distinct teams
▸ Maintain the ease of use we had in Heroku

- Fortunately, Jan 2017 was hugely successful
- Now it's time to build a system to support the next few years
- We defined what we wanted that post-migration environment to look like
- Everything in Heroku is publicly exposed. We wanted to leverage AWS VPC to only expose endpoints that needed to be public
- Breaking the monolith API into microservices would take time, but we wanted to create an environment that supported that goal
- Because the engineering teams started work on this while we were building the AWS environment, this actually started to happen before we migrated
- Resulted in some wonky use-cases of whitelisted IPs and iptables hijinks to support running in both environments
- As we grew, we needed to break development up into teams. Individual services made this easier (vs. the monolith)
- From Jan 2017 to Jan 2019, we grew from about 4 to over 40 engineers
- The most important consideration for me was maintaining the ease of use we had with Heroku
"A SUCCESSFUL PRODUCT IS ONE THAT MEETS YOUR CUSTOMERS WHERE THEY ALREADY ARE."
— Probably some important dude

From an infrastructure perspective, our customers are the engineers who write the code that powers Aaptiv. For any business, a successful product is one that meets your customers where they already are. Engineers live in editors and GitHub, not the AWS console, so it was important to make sure that the tool we built did too.
PUTTING THE "PRODUCT" IN "INFRASTRUCTURE AS A PRODUCT": ENGINEERS WRITE CODE
▸ Define the run-time environment in the same repo
▸ Documented
▸ Audit history
▸ Follows deployment process
▸ Changes are peer-reviewed
▸ The "right stuff" happens by default

- There were a lot of things we liked about this approach:
- The runtime configuration could be defined in the same repo as the code it supported
- The runtime configuration is documented
- Which makes it auditable
- And it could follow the code through its deployment process of dev -> staging -> prod
- It can be peer-reviewed like any other code change in GitHub prior to going to production
- And most importantly: software engineers typically don't know a lot about security groups, IAM profiles, VPCs, load balancers, etc., so if there's a requirement like "services need to be load balanced across at least two availability zones in the same region", that should happen by default
MANY PATHS LEAD TO THE SAME DESTINATION: HOW ARE YOU GONNA BUILD IT?
▸ Kubernetes
▸ EC2 instances
▸ Elastic Container Service (ECS)

When it came time to start building it, we had some different options:
- Kubernetes would have been a valid option
- Didn't have a lot of Kubernetes experience in-house, and didn't know what we didn't know
- Could have used EC2 instances behind load balancers
- Felt like overkill; these were Node.js apps
- EC2 instances are slow to bring online when auto-scaling (or they end up living for a long time to avoid that)
- ECS was new, but sounded like a nice compromise between the two: you get to run your apps in Docker containers in a Kubernetes-like environment without having to manage the Kubernetes cluster itself
- This resonated with us because of the extremely small size of the infrastructure team, and our lack of desire to grow it bigger
THE INITIAL PATH: ECS ON EC2
▸ Running ECS on our own EC2 instances
▸ Jenkins for deployments
▸ (👋 to all the Jenkins-haters! Teach me what's new!)
▸ Using the AWS API to build/define ECS services
▸ then…

So we started building out an ECS cluster using our own EC2 instances. We used Jenkins for deploying everything and building our Docker images, which I'll show you in just a minute. BTW: shout out to all the Jenkins haters out there; I'd love to hear what you're using instead and how it's working for you. Our Jenkins jobs were all built using the Jenkins pipeline. This keeps our Jenkins configuration as code, and allows us to track history, audit changes, and peer-review code changes. For deploying to ECS, we used the AWS API. And then…
FARGATE

AWS announced Fargate, which caused us to stop and reconsider. At this point, we'd done a lot of framework building, test deployments, and architecting of the system, but we hadn't actually deployed any production-facing services yet. Fargate looked pretty promising because it removed one more layer of maintenance and headache for us: the EC2 instances. Normally, I'm the last dude on Earth to jump on a new product. It's not that I don't like them; it's just that they come with bugs, and my customers aren't giving me money to find and fix bugs in someone else's code. The other consideration is that AWS loves to release early and release often, which usually means they launch with a product that does about 90% of what you think it was going to do, leaving you to implement the rest. As a builder, I love this approach because you throw it out there and see what people do with it, then build what they need. It's a true MVP. As a consumer, I hate it.
THE INITIAL PATH, v0.2: ECS ON FARGATE
▸ Running ECS on Fargate (no longer on our own EC2 instances)
▸ Jenkins for deployments
▸ (👋 to all the Jenkins-haters! Teach me what's new!) 👈
▸ Using CloudFormation (instead of the raw AWS API) to build/define ECS services
▸ then…

We gave Fargate a shot. We're still using Jenkins for deployments (and I still want to hear what you're using). But we realized that there was a lot, I mean a lot, of code to support the different API calls, check for constraints and respond properly, order the API calls in the right sequence… During this time, we realized we were reinventing CloudFormation. So instead of trying to reinvent CloudFormation, we decided to just use CloudFormation, which turned into a library we now call soa-templates. And then…

We should shift gears here real quick, because I've given you a lot of background information and I want to show you what the finished (well, semi-finished) product looks like. I think that will provide the context to make the rest of this talk meaningful as we dive deeper into how this works.
It all starts when our engineers are ready to push their changes through the deployment cycle and ultimately to production. Inside their repo, they'll have a Jenkinsfile:
@Library('aaptivPipelineLib') _
nodePipeline(
    releaseBranch: 'master',
    releaseFamily: '1.3.x',
    scheme: 'internal',
    containerCounts: [
        'dev':     ['minimumCount': 1, 'desiredCount': 1, 'maximumCount': 4],
        'staging': ['minimumCount': 1, 'desiredCount': 1, 'maximumCount': 4],
        'prod':    ['minimumCount': 1, 'desiredCount': 1, 'maximumCount': 4],
    ],
    scaleOutThreshold: '1.0',
    scaleInThreshold: '0.7',
    cpu: 256,
    memory: 512,
    containerPort: 3000,
    healthCheckPath: '/health',
    healthCheckIntervalSeconds: 60,
    healthCheckTimeoutSeconds: 30,
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 10,
    agentLabel: 'nodejs'
)

It looks like this: a declarative file that defines what the runtime architecture should look like for dev, staging, and production. I'll walk through all of these fields… well, most of them. We have global GitHub webhooks set up to respond to opening and closing pull requests. So when an engineer opens a pull request, Jenkins receives a webhook notification.
Since this is a pull request, we create a fully functioning deployed environment that can be used for testing, validation, or having a conversation about the proposed code changes. Additional commits to the branch for this PR trigger a rebuild of the environment, so this service always reflects the latest code in the pull request. Here's how that works:
Jenkins starts executing the pipeline script. You can pass in a parameter for which agent you want it to run on, and then we move into the stages.

The first stage is "Configure build", where we do a bunch of housekeeping work. This is where we determine whether this is a PR branch or a deploy to staging or production; we get the repo name, the git hash, and some other stuff that we're going to need later.

Next we run the test suite for your code. If the tests fail, the deployment stops here.

From there, we build the Docker image for your application and push it to the ECR repo for ECS. There is a Docker repo in ECR for every code repo we have, which makes it easy to find Docker images for a specific project if necessary.

Then we get to the real meat of the build: we're going to deploy the service. This is where we grab the soa-templates library from its repo:
checkout([$class: 'GitSCM',
    branches: [[name: '*/master']],
    doGenerateSubmoduleConfigurations: false,
    extensions: [],
    submoduleCfg: [],
    userRemoteConfigs: []])

You can specify the branch you want from that repo. The engineers always want master, because that's where our production version is. But for those of us who work on it, it's a way for us to create a branch to iterate and test on before making changes to master and potentially affecting production. The soa-templates library contains our CloudFormation template.
myapp-pr-99.aaptiv.com

That CloudFormation template is going to create an ELB, the listeners, and the target group for this service. Everything that needs to happen, that the engineers don't know needs to happen, is done by default by this template. That includes things like:
- spanning multiple availability zones
- using the right security groups
- preventing public access unless necessary

Then it creates the ECS task definition, which includes the Docker image from the ECR repository we built in the previous step, and then the ECS service. Along with the ECS service, we create autoscaling alerts to increase and decrease the number of running tasks based on CPU, memory, or latency. All of the logs from their application are sent to CloudWatch Logs, where they get transported to Splunk. All logs in Splunk are indexed by environment, allowing us to see and trace events through every service in a single location. Finally, we create a Route53 DNS entry, so engineers always have a DNS entry of the form project name - pr - PR number to access their dynamic environment.
From here, one of two things can happen.

First: if the clock strikes midnight, all running dev services are deleted. We do this because no one is working at midnight, so it helps us control costs. The next day, if you need the environment brought back up, you can just replay the build in Jenkins and it gets recreated.

The second possibility is that the PR is approved and merged into the main branch. When that happens…
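As a sketch, that midnight cleanup can be as simple as a scheduled job that deletes every CloudFormation stack matching the PR naming convention. The talk doesn't show the actual job, so the stack-name pattern and status filters below are assumptions:

```shell
# Hypothetical sketch of the midnight dev-environment cleanup.
# Assumes PR stacks are named <env>-<project>-PR-<number>, following the
# Jenkins BRANCH_NAME convention shown later in this talk.

matches_pr_stack() {
  # True when the stack name ends in "-pr-<digits>" (case-insensitive).
  echo "$1" | grep -Eiq -e '-pr-[0-9]+$'
}

cleanup_dev_stacks() {
  # List healthy stacks and delete the ones that look like PR environments.
  aws cloudformation list-stacks \
    --stack-status-filter CREATE_COMPLETE UPDATE_COMPLETE \
    --query 'StackSummaries[].StackName' --output text |
  tr '\t' '\n' |
  while read -r stack; do
    if matches_pr_stack "$stack"; then
      echo "deleting $stack"
      aws cloudformation delete-stack --stack-name "$stack"
    fi
  done
}
```

A cron entry, or a scheduled Jenkins job (which fits the "replay the build to recreate it" workflow), would invoke cleanup_dev_stacks at 00:00.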
MERGE TO MASTER
▸ PR branch environment deleted
▸ Deploy to staging
▸ Same as PR deploy
▸ Creates release tag

The PR branch is deleted, destroying all resources that were created as part of that environment. The code is deployed to staging and follows the same build steps used for the PR branch. Once the ECS service deploy happens, the staging environment is running the new code, and if that happens successfully:
TAGGED WITH SEMVER(-ISH) AND GIT HASH

We cut a pre-release tag in GitHub. It sort of follows semver, meaning the patch level is incremented by 1 to indicate a new build. The major and minor version numbers are set in the Jenkinsfile, allowing the team responsible for the code to indicate major and minor builds. We also append the git hash to the release number. The Docker image that was built as part of this release is also tagged with that git hash, allowing us to identify the exact image being used if needed.
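To make the scheme concrete, here's a tiny sketch of how such a tag could be composed. The exact format is an assumption (the talk only says "semver-ish plus git hash"); releaseFamily is the value the team sets in the Jenkinsfile:

```shell
# Hypothetical tag builder for the semver(-ish) + git hash scheme.
make_release_tag() {
  release_family="$1"   # e.g. "1.3.x", set by the team in the Jenkinsfile
  build_number="$2"     # Jenkins BUILD_NUMBER, the "-ish" patch level
  git_hash="$3"         # short commit hash, also used as the docker image tag
  echo "${release_family%.x}.${build_number}-${git_hash}"
}

make_release_tag "1.3.x" 42 "a1b2c3d"   # prints 1.3.42-a1b2c3d
```

Because the same short hash tags the Docker image in ECR, the GitHub release and the exact image it shipped can always be matched up.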
RELEASE TO PRODUCTION
▸ Build with parameters
▸ Specify tag to release
▸ Docker image is not built
▸ GitHub tag marked

Releasing from staging to production is done via a parameterized build. Specify the environment and tag to release, and the pipeline runs again, this time for production. One difference, though, is that the Docker image isn't built: the image we created during the staging build is used, and we're able to identify it because we tagged it with the git commit hash. After the deploy completes successfully, the GitHub release tag is changed from "Pre-release" to "Latest release".
SHOW AND TELL: UNDER THE HOOD

Now that you've got an overview of how this works from the customer's perspective (i.e. the engineering teams), let's take a look under the hood at some of the major components.
GITHUB WEBHOOKS + JENKINS GITHUB PLUGIN

It starts when an engineer pushes code to GitHub. We have a global webhook that we've configured to send a notification to our Jenkins server for specific events. The payload of the webhook tells Jenkins both the repository and the action that fired the event.
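For context, the relevant fields of a GitHub pull_request webhook payload look roughly like this (trimmed heavily; the repository name is made up):

```json
{
  "action": "opened",
  "number": 99,
  "pull_request": {
    "head": { "ref": "feature-branch", "sha": "a1b2c3d4…" },
    "base": { "ref": "master" }
  },
  "repository": { "full_name": "aaptiv/myapp" }
}
```

The action and repository fields are what let Jenkins route the event to the right project and decide whether to build an environment (PR opened, new commits) or tear one down (PR closed).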
The only thing that needs to be configured in Jenkins is a project for that GitHub repo, specifying that the build is controlled by a Jenkinsfile.
@Library('aaptivPipelineLib') _
nodePipeline(
    releaseBranch: 'master',
    ...
)

Earlier, you saw what this Jenkinsfile looks like: it's fairly declarative and specifies what the runtime configuration for the service should look like. The thing I'd like to draw your attention to is this: we import a Groovy library called aaptivPipelineLib. From that library, we call a function called nodePipeline. This executes the build steps necessary for a Node.js project. We currently also have support for Python, and limited support for Lambda functions and Java projects.
The aaptivPipelineLib exists because, on our Jenkins server, we defined it as a Global Pipeline Library. That makes it available to all projects on our Jenkins server. Defining it is done by giving it a name and specifying where the source code should come from; in our case, it comes from a GitHub repo.
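Jenkins shared libraries follow a fixed repo layout: callable steps live under vars/, and files loaded via libraryResource live under resources/. Based on the functions and the libraryResource call shown in this talk, the aaptivPipelineLib repo presumably looks something like this (file names inferred, not confirmed):

```
aaptivPipelineLib/
  vars/
    nodePipeline.groovy    # entry point called from each project's Jenkinsfile
    npmTest.groovy         # runs the test suite
    buildImage.groovy      # builds and pushes the docker image
  resources/
    com/aaptiv/
      autoscaling_loadbalancer.yaml   # the CloudFormation template
```

Each file in vars/ becomes a global step, which is why a Jenkinsfile can call nodePipeline(...) directly after the @Library import.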
If we take a look at the aaptivPipelineLib project, you can see a Groovy file named nodePipeline.groovy. So in the Jenkinsfile for our project, when we call the nodePipeline function, it's actually calling this Groovy file within the library.
parameters {
    choice(name: "environment",
        choices: "staging\nprod\ndev",
        description: "Which environment should I deploy?")
    string(name: "tag",
        defaultValue: "",
        description: "Which git-tag will be deployed?")
}

Let's dig a little deeper into what that does. The first thing it does is define some parameters. These are the same parameters you saw in Jenkins when we specified the release tag and environment to release to production.
agent {
    label args.agentLabel
}

Next we have the agent specification, which allows the engineers to specify which build servers to run on. We currently have separate build servers for Node.js, Python, and Java projects.
stage('Configure build') {
    steps {
        script {
            slackChannel = args.slackChannel ?: 'jenkins-deployments'
            environment = 'dev'
            if (BRANCH_NAME == args.releaseBranch) {
                if (params.environment) {
                    environment = params.environment
                } else if (params.tag == null || params.tag.isEmpty()) {
                    environment = 'staging'
                }
            }
            if (args.proxyImage) {
                proxyRepoUri = getEcrRepo(args.proxyImage.projectName)
                proxyImageUrl = "${proxyRepoUri}:${gitHash}"
            }
            runTests = true
            if (params.tag) {
                runTests = false
            }
            runBuild = true
            if (params.tag || !(BRANCH_NAME == args.releaseBranch || BRANCH_NAME.startsWith("PR-"))) {
                runBuild = false
            }
            runDeploy = true
            if (!(BRANCH_NAME == args.releaseBranch || BRANCH_NAME.startsWith("PR-"))) {
                runDeploy = false
            }
            runTagRelease = false
            if (BRANCH_NAME == args.releaseBranch) {
                runTagRelease = true
            }
            if (args.customSecurityGroups) {
                customSecurityGroupId = args.customSecurityGroups[environment]
            }
        }
    }
}

Then we start running through the defined stages for the build. The first one is "Configure build", and I'm going to kind of just gloss over this because it just sets up some conditions for the following stages: mainly determining whether we are building a dev, staging, or production build, and setting some global variables that are used by the following stages.
stage('Run tests') {
    when { expression { return runTests } }
    steps {
        npmTest("test", projectName)
    }
}

The next stage is "Run tests", which calls another Groovy function: npmTest.
def call(def env, def project, def nodeVersion="default", reporter="true") {
    // Get a random, free port on the Jenkins build node (so we can run multiple Skyfit tests in parallel).
    def randomPort = getAvailablePort()
    // Runs npm tests
    sh """
        source ~/.bash_profile
        aaptivsecrets env_export --env ${env} ${project} --outfile env.properties
        source ${WORKSPACE}/env.properties
        nvm use ${nodeVersion}
        npm install
        PORT=$randomPort npm test
        if [ "${reporter}" == "true" ]; then
            node_modules/.bin/nyc report --reporter=cobertura --dir coverage
        fi
    """
    cobertura autoUpdateHealth: false, autoUpdateStability: false,
        coberturaReportFile: 'coverage/cobertura-coverage.xml',
        conditionalCoverageTargets: '70, 0, 0', failUnhealthy: false, failUnstable: false,
        lineCoverageTargets: '80, 0, 0', maxNumberOfBuilds: 0,
        methodCoverageTargets: '80, 0, 0', onlyStable: false,
        sourceEncoding: 'ASCII', zoomCoverageChart: false
}

If we take a look at that file, there are a couple of noteworthy things going on here. First, it's just executing some shell commands. Next, we have nvm installed, which allows the engineers to specify which version of Node.js their project uses. And then we run the npm test command. That's a pretty widely accepted convention for running tests in a Node.js project, so it makes it easy for us to run the tests without having to know the implementation details of testing. The remainder of the function publishes the test results and code coverage back to Jenkins, where they're displayed visually in the project.
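The only contract a project has to honor is that npm convention: a test script in package.json, plus nyc if it wants coverage reported. A minimal, hypothetical project config that would satisfy this pipeline might look like:

```json
{
  "name": "myapp",
  "scripts": {
    "test": "nyc mocha --exit test/"
  },
  "devDependencies": {
    "mocha": "^6.1.0",
    "nyc": "^14.1.0"
  }
}
```

With nyc wrapping the test run, the pipeline's later call to node_modules/.bin/nyc report --reporter=cobertura can produce the cobertura-coverage.xml file that Jenkins publishes.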
def call(Map args) {
    if (params.tag == null || params.tag.isEmpty()) {
        println("No tag set, so build the image")
        def projectName = args.projectName
        def repoUri = args.repoUri
        def validChars = "[^\\w^\\-^\\.^_]"
        def cleanBranchName = "${BRANCH_NAME}".replaceAll(validChars, "")
        def gitHash = args.gitHash
        def buildPath = args.buildPath
        if (buildPath == null || buildPath.isEmpty()) {
            buildPath = "."
        }
        echo "Building ${projectName}"
        sh "source ~/.bash_profile"
        def dockerLogin = sh(script: '/usr/bin/aws ecr get-login --no-include-email --region us-east-1', returnStdout: true).trim()
        sh "${dockerLogin}"
        echo "Building Docker Image"
        sh "docker build -t ${projectName} ${buildPath}"
        sh "docker tag ${projectName} ${repoUri}:${cleanBranchName}"
        sh "docker tag ${projectName} ${repoUri}:${cleanBranchName}.${BUILD_NUMBER}"
        sh "docker tag ${projectName} ${repoUri}:${gitHash}"
        sh "echo 'Pushing branch ${cleanBranchName} build ${BUILD_NUMBER} to ECR'"
        sh "docker push ${repoUri}:${cleanBranchName}"
        sh "docker push ${repoUri}:${cleanBranchName}.${BUILD_NUMBER}"
        sh "docker push ${repoUri}:${gitHash}"
        sh "docker rmi ${repoUri}:${cleanBranchName}"
        sh "docker rmi ${repoUri}:${cleanBranchName}.${BUILD_NUMBER}"
        sh "docker rmi ${repoUri}:${gitHash}"
    }
}

The "Build image" stage is very similar: it calls a function called buildImage that contains this code. The main tasks performed here are building and tagging the image, then pushing it up to the AWS ECR repository. Which sounds redundant because it is, but I'm not sure how else to refer to it…
Which brings us to the "Deploy service" stage, and it's really the meat of the deployment.
templateParamsJson = templateParamsJson + """
    {"ParameterKey": "ImageUrl", "ParameterValue": "${imageUrl}"},
    {"ParameterKey": "DdApiKey", "ParameterValue": "${DD_API_KEY}"},
    {"ParameterKey": "BranchName", "ParameterValue": "${cleanBranch}"}
"""
println(templateParamsJson)
def templateBody = libraryResource "com/aaptiv/autoscaling_loadbalancer.yaml"
writeFile(file: "template${BUILD_TAG}.yaml", text: templateBody, encoding: "UTF-8")

The first thing we do is write out all of our parameters to create the JSON file that we provide to CloudFormation to define the stack.
def getStackStatus(stackName) {
    def result = ""
    try {
        result = sh(script: """aws cloudformation describe-stacks --stack-name ${stackName} \
            --query 'Stacks[0].StackStatus'""", returnStdout: true).trim().replace("\"", "")
        println("Stack status is: ${result}")
    } catch (ex) {
        println("Stack does not exist")
    }
    return result
}

Then we check to see if the stack exists. If it doesn't, we need to create it; this is commonly the case for PR branches. If the stack does exist, we need to update it to deploy the requested changes.
def getStackChanges(stackVar, gitHash) {
    def changesetVar = stackVar.replace("ClientRequestToken", "ClientToken")
    writeFile(file: "stack_${BUILD_TAG}_changeset.json", text: changesetVar, encoding: "UTF-8")
    def changeSetId = sh(script: """aws cloudformation create-change-set \
        --template-body file://"template${BUILD_TAG}.yaml" \
        --change-set-name hash-${gitHash}${BUILD_NUMBER} \
        --change-set-type UPDATE \
        --cli-input-json file://"stack_${BUILD_TAG}_changeset.json" \
        --query 'Id'""", returnStdout: true).trim()
    try {
        println("waiting for change set creation to complete")
        sh(script: """aws cloudformation wait change-set-create-complete --change-set-name ${changeSetId}""")
    } catch (ex) {
        println("trying to wait for changeset creation threw an error, perhaps no changes")
    }
    def changes = sh(script: """aws cloudformation describe-change-set --change-set-name ${changeSetId} --query 'Changes'""", returnStdout: true).trim()
    println("changes for change set ${changeSetId}: ${changes}")
    return changes
}

In cases where the stack already exists, we create a change set that describes the changes to be applied to the stack. All of our CloudFormation commands are done using the AWS CLI, and we use the CloudFormation wait function to wait for CloudFormation to complete, wrapped in a try/catch block so that we're able to identify builds that complete successfully.
Mappings:
  SubnetByScheme:
    'internet-facing':
      AZ1SubnetId: subnet-xxxxxxxx
      AZ2SubnetId: subnet-xxxxxxxx
      AZ3SubnetId: subnet-xxxxxxxx
      AZ4SubnetId: subnet-xxxxxxxx
    'internal':
      AZ1SubnetId: subnet-xxxxxxxx
      AZ2SubnetId: subnet-xxxxxxxx
      AZ3SubnetId: subnet-xxxxxxxx
      AZ4SubnetId: subnet-xxxxxxxx
  SecurityGroupByEnvironment:
    'dev':
      'base': 'sg-xxxxxxxx'
      'external': 'sg-xxxxxxxx'
      'crossEnvAccess': 'sg-xxxxxxxx'
    'staging':
      'base': 'sg-xxxxxxxx'
      'external': 'sg-xxxxxxxx'
      'crossEnvAccess': 'bogusIdThatShouldNeverGetUsed'
    'prod':
      'base': 'sg-xxxxxxxx'
      'external': 'sg-xxxxxxxx'
      'crossEnvAccess': 'bogusIdThatShouldNeverGetUsed'
  SplunkForwarderARN:
    'dev':
      'name': 'arn:aws:lambda:us-east-1:1234567890:function:splunk-dev'
    'staging':
      'name': 'arn:aws:lambda:us-east-1:1234567890:function:splunk-staging'
    'prod':
      'name': 'arn:aws:lambda:us-east-1:1234567890:function:splunk-prod'

Our CloudFormation template starts out by defining some mappings. This is where we define things like the subnets and security groups based on the environment, to ensure that running tasks are distributed across multiple availability zones and use the correct security groups. This means that the engineers deploying the code don't have to have this knowledge, but can still deploy according to the security and availability constraints we've defined.
Resources:
  # Once we decide on how we do environments, this will need to change to a mapping rather than an if
  ServiceDNSName:
    Type: "AWS::Route53::RecordSet"
    Properties:
      HostedZoneId: !If [EnvironmentIsProd, 'xxxxxxxx', 'xxxxxxxx']
      Name: !If
        - EnvironmentIsProd
        - !Join ['', [!Ref ProjectName, '.', 'aaptiv.com', '.']]
        - !If
          - EnvironmentIsStaging
          - !Join ['', [!Ref ProjectName, '.', 'aapdev.com', '.']]  # Staging goes directly to aapdev
          - !Join ['', [!Ref ProjectName, '-', !Ref BranchName, '.', 'aapdev.com', '.']]  # Other branches get prefixes
      TTL: '300'
      Type: 'CNAME'
      ResourceRecords:
        - !GetAtt LoadBalancer.DNSName

In the Resources section of our CloudFormation template, we define the DNS name for the deployed service. Again, this makes it easier for developers to find the URL for their deployed project, because it always follows the same naming convention.
LoadBalancer:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Scheme: !If [InternetFacing, 'internet-facing', 'internal']
    IpAddressType: ipv4
    Tags:
      - Key: BranchName
        Value: !Ref BranchName
      - Key: Name
        Value: !Join
          - '-'
          - - !Ref Environment
            - !Ref ProjectName
            - !Ref BranchName
      - Key: Service
        Value: !Ref ProjectName
      - Key: Env
        Value: !Ref Environment
      - Key: Role
        Value: "Load Balancer"
      - Key: Team
        Value: !Ref Team

We define the load balancer, and by using a lookup we can correctly provision it as either internet-facing or internal. One additional thing we do here is apply tags for the load balancer name, service, environment, role, and team. We use these for cost allocation, allowing us to break down our operating costs by each of these tags and control our expenses.
LoadBalancerListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    DefaultActions:
      - TargetGroupArn: !Ref 'TargetGroup'
        Type: 'forward'
    LoadBalancerArn: !Ref 'LoadBalancer'
    Port: !If [InternetFacing, 443, 80]
    Protocol: !If [InternetFacing, 'HTTPS', 'HTTP']
    Certificates:
      - CertificateArn: !If
        - InternetFacing
        - !If
          - EnvironmentIsProd
          - 'arn:aws:acm:us-east-1:1234567890:certificate/xxxxxxxx'
          - 'arn:aws:acm:us-east-1:1234567890:certificate/xxxxxxxx'
        - !Ref "AWS::NoValue"

LoadBalancerRedirectListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Condition: InternetFacing
  Properties:
    DefaultActions:
      - Type: 'redirect'
        RedirectConfig:
          Port: '443'
          Protocol: 'HTTPS'
          StatusCode: 'HTTP_301'
    LoadBalancerArn: !Ref 'LoadBalancer'
    Port: 80
    Protocol: 'HTTP'

The load balancer has to have listeners, so we define those as well. For internet-facing load balancers, we set up SSL, configure the certificate, and create an automatic HTTP 301 redirect to HTTPS for any traffic received over HTTP.
TaskDefinition:
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    Family: !Join ['-', [!Ref Environment, !Ref ProjectName, !Ref BranchName]]
    Cpu: !Ref CPU
    Memory: !Ref Memory
    RequiresCompatibilities:
      - FARGATE
    Volumes:
      - Name: "aaptiv_logs"
    ContainerDefinitions:
      - Name: !Join ['-', [!Ref Environment, !Ref ProjectName, !Ref BranchName]]
        Cpu: !Ref CPU
        Memory: !Ref Memory
        Image: !Ref ImageUrl
        Essential: true
        Environment:
          - Name: ENV
            Value: !Ref Environment
          - Name: BRANCH_NAME
            Value: !Ref BranchName
          - Name: PROJECT_NAME
            Value: !Ref ProjectName
          - Name: NODE_ENV
            Value: "production"
          - Name: BRANCH_OVERRIDE
            Value: !If [EnvironmentIsDev, !Ref BranchOverride, ""]
        LogConfiguration:
          LogDriver: "awslogs"
          Options:
            "awslogs-group": !Join
              - '-'
              - - !Ref Environment
                - !Ref ProjectName
                - !Ref BranchName
            "awslogs-region": "us-east-1"
            "awslogs-stream-prefix": !Ref ProjectName
        PortMappings:
          - ContainerPort: !Ref ContainerPort
            HostPort: !Ref ContainerPort
            Protocol: 'TCP'

Then we create the ECS task definition. The task definition is the description of the ECS environment for this service. It includes the definition of the Docker containers you want to run, memory and CPU requirements, and whether your task runs on EC2 or Fargate. This is largely just a variable-substitution exercise, setting the parameters for the task based on the values supplied by the Jenkinsfile and the environment. Remember that we got these values into the CloudFormation template by writing them out to a JSON file in the Jenkins stage, then supplying that JSON file as a CLI argument when we called the CloudFormation command. Things like memory, CPU, and the exposed port are specified by the Jenkinsfile. Environment, branch, and project name are calculated by the pipeline library.
Service:
  Type: 'AWS::ECS::Service'
  DependsOn: LoadBalancerRule
  Properties:
    ServiceName: !Join ['-', [!Ref Environment, !Ref ProjectName, !Ref BranchName]]
    Cluster: !Ref ClusterName
    LaunchType: FARGATE
    DeploymentConfiguration:
      MaximumPercent: 200
      MinimumHealthyPercent: 50
    DesiredCount: !If [EnvironmentIsDev, 1, !Ref DesiredCount]
    HealthCheckGracePeriodSeconds: !Ref HealthCheckGracePeriodSeconds
    NetworkConfiguration:
      AwsvpcConfiguration:
        AssignPublicIp: DISABLED
        SecurityGroups:
          - !FindInMap ["SecurityGroupByEnvironment", !Ref Environment, 'base']  # Lets it talk to its own env
          - !If  # Extra security group, only added for dev access to staging
            - RequiresCrossEnvAccess
            - !FindInMap ["SecurityGroupByEnvironment", !Ref Environment, 'crossEnvAccess']
            - !Ref "AWS::NoValue"
          - !If
            - UseCustomSecurityGroup
            - !Ref CustomSecurityGroupId
            - !Ref "AWS::NoValue"
        Subnets:
          # Services should always be on the DMZ, regardless of
          # whether the load balancer is internet-facing
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ1SubnetId']
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ2SubnetId']
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ3SubnetId']
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ4SubnetId']
    TaskDefinition: !Ref TaskDefinition

And then we define the service. In ECS, a service is a running instance of a task definition. One of the key things we do here is set the desired count: the number of running tasks the service should have. For production environments, this defaults to a minimum of 3 to ensure there aren't single-point-of-failure services. In dev, we always set it to 1. We also define the network configuration for the service here, with most of it being determined by the environment: dev, staging, or prod.
CLOUDFORMATION RESOURCES▸ Load balancer target group▸ CloudWatch alarms for scale in/out▸ Kinesis Firehose delivery▸ CloudWatch subscription filters for logs
And this continues for the remaining stack resources, but it's largely the same process: define the resource and populate the configuration values based on parameters specified in the Jenkinsfile or the environment.
The resources include the load balancer target group, CloudWatch alarms for scaling, and Kinesis Firehose delivery subscriptions for the logs.
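For context on what "alarms for scale in/out" wire into: before CloudWatch alarms can change a Fargate service's task count, the service's DesiredCount has to be registered as a scalable target with Application Auto Scaling. A sketch of building those parameters (the helper name is mine, and the real template does this in CloudFormation rather than through boto3):

```python
def scalable_target_params(cluster, service, min_count, max_count):
    """Parameters for Application Auto Scaling's RegisterScalableTarget,
    which lets CloudWatch alarms drive the ECS service's DesiredCount."""
    return {
        "ServiceNamespace": "ecs",
        "ResourceId": f"service/{cluster}/{service}",
        "ScalableDimension": "ecs:service:DesiredCount",
        "MinCapacity": min_count,
        "MaxCapacity": max_count,
    }

params = scalable_target_params("prod-cluster", "prod-myapp-master", 3, 12)
# boto3.client("application-autoscaling").register_scalable_target(**params)
print(params["ResourceId"])  # service/prod-cluster/prod-myapp-master
```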
post {
  always {
    notifySlack(currentBuild.currentResult, slackChannel, environment)
    notifyInflux()
  }
  success {
    script {
      if (deployServiceOutput && deploymentId) {
        def deploymentUrl = "http://${deployServiceOutput.hostname}"
        githubHelper.updateDeployment(…)
      }
    }
  }
  failure {
    script {
      if (deploymentId) {
        githubHelper.updateDeployment(…)
      }
    }
  }
  unstable {
    script {
      if (deploymentId) {
        githubHelper.updateDeployment(…)
      }
    }
  }
  cleanup {
    deleteDir()
  }
}
At the end of our build, we now know whether it was successful or not, so:
- We post a notification in Slack
- We send the build metrics to InfluxDB
- We update GitHub with the deployment url and build status
- Finally, we clean up our working directory to free up the space on the build servers
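The success, failure, and unstable branches above all end in the same GitHub call with a different status. That mapping could be restated as the following sketch; the slide elides updateDeployment's real arguments, and treating an unstable build as the GitHub "error" state is my assumption, not something shown in the pipeline:

```python
def deployment_state(build_result):
    """Map a Jenkins build result onto a GitHub deployment state,
    mirroring the success/failure/unstable post conditions."""
    return {
        "SUCCESS": "success",
        "FAILURE": "failure",
        "UNSTABLE": "error",
    }.get(build_result, "error")

print(deployment_state("SUCCESS"))   # success
print(deployment_state("UNSTABLE"))  # error
```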
COMPLETING THE MIGRATION FROM HEROKU TO AWSMOVING DAY▸ Deploy to AWS▸ Change URL in mobile clients▸ 🍻 🍻 🍻 🍻
So with all of that built, we could finally migrate from Heroku to AWS.
Building that took a long time, but we felt like it was the right thing to do. One of the important considerations for me was to ensure the engineers had the same user experience or better after the migration, so building that user experience took time.
Once it was done, we deployed the services running in Heroku to AWS using the new pipeline, then the mobile engineers shipped a new version of our app with the URLs for the new service locations.
NOT EVERYONE UPDATES▸ Some users never update▸ Shutting down Heroku would break the app for them▸ Not supporting 2 versions of API▸ Deploy nginx on Heroku▸ Someday, we'll get to take it down
There was one remaining step. Some users don't update their app. This meant we couldn't shut down Heroku entirely, and we weren't willing to support 2 versions of our API.
Fortunately, Heroku lets you deploy docker images as well as code. So we deployed an nginx container that simply redirected all the traffic it received to our new API URL. This had the effect of sending all traffic to our new API, even from the users who didn't update.
Eventually, we'll be able to turn off Heroku completely, either because everyone has upgraded or because we release a breaking change incompatible with the older clients.
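A redirect-only nginx container like the one described needs almost no configuration. A minimal sketch, where the hostname is a placeholder rather than Aaptiv's real API URL, and the listen port would in practice come from Heroku's $PORT environment variable:

```nginx
# Redirect every request from the legacy Heroku app to the new API,
# preserving the path and query string for clients that never updated.
server {
    listen 80;
    location / {
        return 301 https://api.example.com$request_uri;
    }
}
```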
SIX MONTHS LATER▸ 1 Monolith —> 9 Microservices▸ Services behind VPC▸ Improved scalability▸ Improved security▸ Improved agility
It's been about six months since we made the cutover. Since then, we've successfully broken our single monolith API into about 9 different microservices. All of the services are protected behind the VPC, including the databases they talk to, something we didn't have with Heroku.
We are much more scalable, and this results both from separating our services out, allowing them to scale independently, and from allowing each service to iterate and grow on that team's schedule.
I already mentioned the security improvements from moving everything behind the VPC, but we gained additional security by setting sensible defaults for security groups and public access in the pipeline, removing that burden from the engineers.
And we're much more agile: engineers can bring environments up and down through a simple GitHub pull request.
DECEMBER 2016 / DECEMBER 2017 / DECEMBER 2019
Just to provide some perspective: here was an average day in December 2016, ranging from 50 rpm and peaking somewhere just over 200 rpm. In December 2017, we were between 700 and 1400 rpm. And last month we ranged from just under 2000 rpm and peaked just over 13000 rpm.
FWIW, I have no idea what happened to December 2018.
One final metric on our agility: this is the number of builds per day in Jenkins. You can see it ranges from a low of around 40 on December 30th to almost 150 on 12/20 and 1/8. Now, most of these are PR builds, but I think it speaks pretty well to the confidence in, and usefulness of, our pipeline.
The last slide I want to show you is on the costs associated with this project.
We did our initial cutover in May. As we increased our utilization of AWS, you can see the blue line representing increased AWS costs, and the red line showing decreasing costs in Heroku as we turned things off.
The really interesting line to me is the yellow one: between May and August, we added several new microservices to AWS. The cost of those is reflected in the blue line. The yellow line shows what those same services would have cost us in Heroku had we chosen to build them there instead. The difference between the yellow line and the blue line is roughly $10k/month, which adds up to some nice savings really quickly.
WHERE DO WE GO FROM HERE?‣ Pull requests‣ Datadog metrics —> InfluxDB‣ AWS Service Tags
So the next steps for us haven't really been decided yet, but I think two areas that will be on our radar shortly are Datadog metrics and AWS service tags.
DATADOG METRICS
We use Datadog for a lot of our infrastructure metrics, including ECS. The really interesting thing we've stumbled upon, due to the way ECS and Datadog work, is that you can get metrics for your running tasks, but once a task has exited (either intentionally or due to a crash), getting the metrics becomes difficult.
You saw in the pipeline that we're sending build stats to InfluxDB, and we use it in a lot of other places as well. I think it might make sense for us to have a container agent that sends CPU and memory metrics to InfluxDB as well, if doing so gives us more insight and persistence of our metrics.
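One plausible shape for such an agent: Fargate exposes a task metadata endpoint (the ECS_CONTAINER_METADATA_URI_V4 environment variable, whose /task/stats path serves docker-stats-style JSON), and InfluxDB accepts line protocol over HTTP. A sketch of the formatting half, with a measurement name and tag set that are entirely made up for illustration:

```python
def to_line_protocol(service, environment, cpu_percent, mem_bytes, timestamp_ns):
    """Format one CPU/memory sample as an InfluxDB line-protocol record:
    measurement,tags fields timestamp. Integer fields get an 'i' suffix.
    The 'container_stats' schema here is illustrative, not a real one."""
    tags = f"service={service},env={environment}"
    fields = f"cpu_percent={cpu_percent},mem_bytes={mem_bytes}i"
    return f"container_stats,{tags} {fields} {timestamp_ns}"

line = to_line_protocol("myapp", "prod", 12.5, 104857600, 1577836800000000000)
print(line)
# container_stats,service=myapp,env=prod cpu_percent=12.5,mem_bytes=104857600i 1577836800000000000
```

An agent sidecar would periodically poll the metadata endpoint, format each sample like this, and POST the batch to InfluxDB's write endpoint.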
AWS TAGGING
Another area I'd like to see us pursue is additional AWS tags. AWS recently added the ability to tag ECS services, which would allow us to eliminate some of these "No Tag Key" costs in our AWS bill. Unfortunately, you have to change to a new resource id format to get that feature, which means deleting and recreating some resources, like our production API.
So we need to evaluate what that would look like and what options we have to do so without taking a downtime hit.
WRAPPING IT UP▸ Move to AWS for security, scalability, agility, costs▸ Built custom pipeline for ECS Fargate using Jenkins Pipelines and CloudFormation▸ Feature requirements primarily driven by creating a great user experience for the engineers▸ Pipeline modularized to take advantage of shared components
QUESTIONS? @WFBUTTON


Build an Infra Product with AWS Fargate

  • 1.
    AWS TO FARGATEBUILDINGINFRASTRUCTURE PRODUCTSAT AAPTIVHey,I’m Will Button, an infrastructure engineer at Aaptiv and today I’m going to share with you how we built an infrastructure product at Aaptiv in our migration from Heroku toAWS using Fargate.We’ll talk about what that migration looked like, what our considerations and guiding principles were, and I’ll share some of the successes and failures we had along theway.
  • 3.
    AAPTIV LETS YOU WORKOUTWHENYOU WANT, WHERE YOU WANT, THEWAY YOU WANT.First, let me tell you what Aaptiv is.We’re an audio fitness company that helps you workout whenever and where ever you are.Let me ask you this:How many of you had a new year’s resolution to get in better shape?
  • 4.
  • 5.
    THEN PUT YOURPHONEAWAYPICK A WORKOUTAND HIT PLAY- Find a workout and hit play- A music playlist from top artists- A professional trainer coaching you through the entire workout- No need to figure out what to do with your phone
  • 6.
    INFRASTRUCTURE @AAPTIVTHE TEAMJemAltieriIvan Lee (split)MyselfEngineering TeamInfrastructure team @Aaptiv
  • 7.
    HEROKU▸ Why Herokuin the first place?▸ Small team▸ Fast setup▸ Fail fast- Initial work with Aaptiv was only ~6 people (3 employees, 3 contractors)- Future was unknown (early stage startup)- MVP built by a single person (iOS app, Android app, API)- Built-in tools for automatic scaling and deployment
  • 8.
    WHY MIGRATE TOAWS?JANUARY 2017▸ Large scale marketing efforts▸ Driving new levels of traffic▸ Load testing showed weakness in the monolith▸ Inefficient scaling▸ MVP proving successful- Going to spend more than ever before on marketing and ads in Jan 2017 —>- If successful, would result in more traffic than ever before —>- Load testing in December showed signs of weakness in the monolith API —>- Resulted in scaling to higher numbers than acceptable —>- Good news: the MVP was successful, we knew who our customers were and what they wanted —>- This gave us the insight we needed to address the problems identified during load testing if we made it through Jan 2017
  • 9.
    CONSIDERATIONS MOVING TOAWSWITHOUT BRINGING THE SYSTEM DOWN…▸ Everything in Heroku is publicly exposed▸ Monolith —> Microservices▸ Grow into distinct teams▸ Maintain the ease of use we had in Heroku- Fortunately, Jan 2017 was hugely successful- Now it’s time to build a system to support the next few years- We defined what we wanted that post-migration environment to look like —>- Everything in Heroku is publicly exposed. We wanted to leverage AWS VPC to only expose endpoints that needed to be public —>- Breaking the monolith API into Microservices would take time, but we wanted to create an environment that supported that goal.- Because the engineering teams started work on this while we were building the AWS environment, this actually started to happen before we migrated- Resulted in some wonky use-cases of whitelisted IPs and iptables hi jinx to support running in both environments- As we grew, we needed to break development up into teams. Individual services made this easier (vs. monolith) —>- From Jan 2017 - Jan 2019, we grew from about 4 to over 40 engineers —>- The most important consideration for me was maintaining the ease of use we had by using Heroku.
  • 10.
    A SUCCESSFUL PRODUCTISONE THAT MEETS YOURCUSTOMERS WHERE THEYALREADY ARE.— Probably some important dudeTEXTFrom an infrastructure perspective, our customers are the engineers who write the code that powers Aaptiv.For any business, a successful product is one that meets your customers where they already are.Engineers live in editors and GitHub, not the AWS console so it was important to make sure that the tool we built did too.
  • 11.
    PUTTING THE “PRODUCT”IN “INFRASTRUCTURE AS A PRODUCT”ENGINEERS WRITE CODE▸ Define the run-time environment in the same repo▸ Documented▸ Audit history▸ Follows deployment process▸ Changes are peer-reviewed▸ The “right stuff” happens by default- There were a lot of things we liked about this approach:- The runtime configuration could be defined in the same repo as the code it supported ->- The runtime configuration is documented ->- Which makes it audit-able ->- And it could follow the code through its deployment process of dev —> staging —> prod ->- It can be peer-reviewed like any other code change in GitHub prior to going to production ->- And most importantly: software engineers typically don’t know a lot about security groups, IAM profiles, VPCs, load balancers, etc… so if there’s a requirement like“services need to be load balanced across at least two availability zones in the same region”, that should happen by default
  • 12.
    MANY PATHS LEADTO THE SAME DESTINATIONHOW ARE YOU GONNA BUILD IT?▸ Kubernetes▸ EC2 instances▸ Elastic Container Service (ECS)When it came time to start building it, we had some different options:
- Kubernetes would have been a valid option. - Didn’t have a lot of K8 experience in-house and didn’t know what we didn’t know- Could have used EC2 instances behind load balancers.- Felt like overkill, these were nodejs apps.- EC2 instances are slow to bring online when auto-scaling (or they live for a long time because)- ECS was new but sounded like a nice compromise between the two: you get to run your apps in docker containers in a K8-like environment without having to managethe K8 itself.- This resonated with us because of the extremely small size of the infrastructure team and lack of desire to grow it bigger
  • 13.
    THE INITIAL PATHECSON EC2▸ Running ECS on our own EC2 instances▸ Jenkins for deployments▸ (👋 to all the Jenkins-haters! Teach me what’s new!)▸ Using AWS API to build/define ECS services▸ then…So we started building out an ECS cluster using our own ec2 instances. ->We used Jenkins for deploying everything and building our docker images- which I’ll show you in just a minute.BTW: Shout out to all the Jenkins haters out there- I’d love to hear what you’re using instead and how it’s working for you.Our Jenkins jobs were all built using the Jenkins pipeline. ->This keeps our Jenkins configuration as code, and allows us to track history, audit changes, and peer review code changes.For deploying to ECS, we used the AWS API. ->And then…
  • 14.
    FARGATEEC2AWS announced Fargate,which caused us to stop and pause.At this point, we’d done a lot of framework building, test deployments, and architecting the system but we hadn’t actually deployed any production facing services yet.Fargate looked pretty promising because it removed one more layer of maintenance and headache for us: the ec2 instances.Normally- I’m the last dude on Earth to jump on a new product.It’s not that I don’t like them, it’s just that they come with bugs and my customers aren’t giving me money to find and fix bugs in someone else’s code.The other consideration is that AWS loves to release early and release often which usually means they launch with a product that does about 90% of what you think itwas going to do, leaving you to implement the rest. As a builder, I love this approach because you throw it out there and see what people do with it- then build what theyneed. It’s a true MVP. As a consumer, I hate it.
  • 15.
    THE INITIAL PATHPATH 0.2ECS ON EC2 FARGATE▸ Running ECS on our own EC2 instances Fargate▸ Jenkins for deployments▸ (👋 to all the Jenkins-haters! Teach me what’s new!)👈▸ Using AWS API CloudFormation to build/define ECSservices▸ then…We gave Fargate a shotWe’re still using Jenkins for deployments (and I still want to hear what you’re using) ->But we realized that there was a lot, I mean a lot of code to support the different API calls, check for constraints and respond properly, order the API calls in the rightsequence… ->During this time, we realized we were reinventing CloudFormation. So instead of trying to reinvent CloudFormation, we decided to just use CloudFormation which turnedinto a library we now call soa-templates. ->And then…We should shift gears here real quick because I’ve given you a lot of background information and I want to show you what the finished (well, semi-finished product) lookslike. I think that will provide the context to make the rest of this talk meaningful as we dive deeper into how this works.
  • 16.
    It all startswhen our engineers are ready to push their changes through the deployment cycle and ultimately to production.Inside their repo- they’ll have a Jenkinsfile-
  • 17.
    @Library('aaptivPipelineLib') _nodePipeline(releaseBranch: 'master',releaseFamily:'1.3.x',scheme: 'internal',containerCounts: ['dev': ['minimumCount': 1,'desiredCount': 1,'maximumCount': 4],'staging': ['minimumCount': 1,'desiredCount': 1,'maximumCount': 4],'prod': ['minimumCount': 1,'desiredCount': 1,'maximumCount': 4],],scaleOutThreshold: '1.0',scaleInThreshold: '0.7',cpu: 256,memory: 512,containerPort: 3000,healthCheckPath: '/health',healthCheckIntervalSeconds: 60,healthCheckTimeoutSeconds: 30,healthyThresholdCount: 2,unhealthyThresholdCount: 10,agentLabel: 'nodejs'It looks like this- it’s a declarative file that defines what the runtime architecture should look like for dev, staging, and production.TL;DR - explain all the fields.Well… most of them.We have global GitHub web hooks set up to respond to opening and closing pull requests. So when they open a pull request, Jenkins receives a web hook notification.
  • 18.
    Since this isa pull request, we create a fully functioning deployed environment that can be used for testing, validation, or having a conversation about the proposed codechanges.Additional commits to the branch for this PR triggers a rebuild of the environment so this service always reflects the latest code in the pull request.Here’s how that works:
  • 19.
    Jenkins starts executingthe pipeline script.You can pass in a parameter for which agent you want it to run on, then we move into the stages.The first stage is “configure build” where we do a bunch of housekeeping work. This is where we determine if this is a PR branch or a deploy to staging or production, weget the repo name, the githash and some other stuff that we’re going to need later.Next we run the test suite for your code. If the tests fail, the deployment stops here.From there, we build the docker image for your application and push it to the ECR repo for ECS. There is a docker repo in ECS for every code repo we have, whichmakes it easy to find docker images for a specific project if necessary.Then we get to the real meat of the build:We’re going to deploy the service.This is where we grab the soa-template library from it’s repo-
  • 20.
    checkout([$class: 'GitSCM',branches: [[name:'*/master']],doGenerateSubmoduleConfigurations: false,extensions: [],submoduleCfg: [],userRemoteConfigs: []])And you can specify the branch you want from that repo. So for the engineers, they always want master because that’s where our production version is.But for those of us who work on it, it’s a way for us to create a branch to iterate and test on before making changes to master and potentially affecting production.The soa-template contains our CloudFormation template
  • 21.
    myapp-pr-99.aaptiv.comAnd that CFtemplate is going to create an ELB, the listeners, and the target group for this service. ->Everything that needs to happen that the engineers don’t know need to happen, are done by default by this template.That includes things like- spanning multiple availability zones- Using the right security groups- Preventing public access unless necessaryThen it creates the ECS task definition which includes the docker image from the ECR repository we built in the previous step.and then the ECS service. ->Along with the ECS service we create autoscaling alerts to increase and decrease the number of running tasks based on CPU, memory, or latency. ->All of the logs from their application are sent to CloudWatch Logs, where they get transported to Splunk. All logs in Splunk are indexed by environment, allowing us tosee and trace events through every service in a single location. ->Finally, we create a Route53 DNS entry so engineers always have a DNS entry of project name - pr - pr number to access their dynamic environment.
  • 22.
    From here oneof two things can happen: ->First: if the clock strikes midnight, all running dev services are deleted. We do this because no one is working at midnight, so it helps us control costs. The next day if youneed the environment brought back up- you can just replay the build in Jenkins and it gets recreated. ->The second possibility is the PR is approved and merged into the main branch. When that happens…
  • 23.
    TEXTMERGE TO MASTER▸PR branch environment deleted▸ Deploy to staging▸ Same as PR deploy▸ Creates release tagThe PR branch is deleted, destroying all resources that were created as part of that environment.The code is deployed to staging and follows the same build steps used for the PR branch.Once the ECS service deploy happens, the staging environment is running the new code and if that happens successfully:
  • 24.
    TAGGED WITH SEMVER(-ISH)AND GIT HASHWe cut a pre-release tag in GitHub. It sort of follows sem-ver, meaning the patch level is incremented by 1 to indicate a new build. The major and minor version numbersare set in the Jenkinsfile, allowing the team responsible for the code to indicate major and minor builds.We also append the githash to the release number. The docker image that was built as part of this release is also tagged with that githash, allowing us to identify theexact image being used if needed.
  • 25.
    TEXTRELEASE TO PRODUCTION▸Build with parameters▸ Specify tag to release▸ Docker image is not built▸ GitHub tag markedReleasing from staging to production is done via a parameterized build.Specify the environment and tag to release and the pipeline runs again, this time for production.One difference though is that the docker image isn’t built- the image we created during the staging build is used, and we’re able to identify it because we tagged it withthe git commit hash.After the deploy completes successfully, the GitHub release tag is changed from being a “Pre-release” to “Latest Release”
  • 26.
    SHOW AND TELLUNDERTHE HOODNow that you’ve got an overview of how this works from the customer’s perspective (i.e. the engineering teams), let’s take a look under the hood at some of the majorcomponents.
  • 27.
    GITHUBWEBHOOKSJENKINS GITHUBPLUGINIt startswhen an engineer pushes code to GitHub.We have a global web hook that we’ve configured to send a notification to our Jenkins server for specific events.The payload of the web hook tells Jenkins both the repository and action that fired the event.
  • 28.
    The only thingthat needs to be configured on Jenkins is a project for that GitHub repo, and specify that the build is controlled by a Jenkinsfile.
  • 29.
    @Library('aaptivPipelineLib') _nodePipeline(releaseBranch: 'master',releaseFamily:'1.3.x',scheme: 'internal',containerCounts: ['dev': ['minimumCount': 1,'desiredCount': 1,'maximumCount': 4],'staging': ['minimumCount': 1,'desiredCount': 1,'maximumCount': 4],'prod': ['minimumCount': 1,'desiredCount': 1,'maximumCount': 4],],scaleOutThreshold: '1.0',scaleInThreshold: '0.7',cpu: 256,memory: 512,containerPort: 3000,healthCheckPath: '/health',healthCheckIntervalSeconds: 60,healthCheckTimeoutSeconds: 30,healthyThresholdCount: 2,unhealthyThresholdCount: 10,agentLabel: 'nodejs'Earlier, you saw what this Jenkinsfile looks like- it’s fairly declarative and specifies what the runtime configuration for the service should look like.The thing I’d like to draw your attention to is this: We import a Groovy library called aaptivPipelineLib.In that library, we call a function called nodePipeline. This executes the build steps necessary for a node.js project. We currently have support for python, and limitedsupport for Lambda functions and Java projects.
  • 30.
    The aaptivPipelineLib existsbecause in our Jenkins server, we defined it as a Global Pipeline Library. That makes it available to all projects in our Jenkins server. Definingit is done by giving it a name, and specifying where the source code should come from- in our case it comes from a GitHub repo.
  • 31.
@Library('aaptivPipelineLib') _

nodePipeline(
    releaseBranch: 'master',
    releaseFamily: '1.3.x',
    scheme: 'internal',
    containerCounts: [
        'dev':     ['minimumCount': 1, 'desiredCount': 1, 'maximumCount': 4],
        'staging': ['minimumCount': 1, 'desiredCount': 1, 'maximumCount': 4],
        'prod':    ['minimumCount': 1, 'desiredCount': 1, 'maximumCount': 4],
    ],
    scaleOutThreshold: '1.0',
    scaleInThreshold: '0.7',
    cpu: 256,
    memory: 512,
    containerPort: 3000,
    healthCheckPath: '/health',
    healthCheckIntervalSeconds: 60,
    healthCheckTimeoutSeconds: 30,
    healthyThresholdCount: 2,
    unhealthyThresholdCount: 10,
    agentLabel: 'nodejs'
)

If we take a look at the aaptivPipelineLib project, you can see this Groovy file, nodePipeline.groovy. So in the Jenkinsfile for our project, when we call the nodePipeline function, it's actually calling this Groovy file within the library.
parameters {
    choice(name: "environment",
           choices: "staging\nprod\ndev",
           description: "Which environment should I deploy?")
    string(name: "tag",
           defaultValue: "",
           description: "Which git-tag will be deployed?")
}

Let's dig a little deeper into what that does. The first thing it does is define some parameters. These are the same parameters you saw in Jenkins when we specified the release tag and environment to release to production.
agent {
    label args.agentLabel
}

Next we have the agent specification, which allows the engineers to specify which build servers to run on. We currently have separate build servers for Node.js, Python, and Java projects.
stage('Configure build') {
    steps {
        script {
            slackChannel = args.slackChannel ?: 'jenkins-deployments'
            environment = 'dev'
            if (BRANCH_NAME == args.releaseBranch) {
                if (params.environment) {
                    environment = params.environment
                } else if (params.tag == null || params.tag.isEmpty()) {
                    environment = 'staging'
                }
            }
            if (args.proxyImage) {
                proxyRepoUri = getEcrRepo(args.proxyImage.projectName)
                proxyImageUrl = "${proxyRepoUri}:${gitHash}"
            }
            runTests = true
            if (params.tag) {
                runTests = false
            }
            runBuild = true
            if (params.tag || !(BRANCH_NAME == args.releaseBranch || BRANCH_NAME.startsWith("PR-"))) {
                runBuild = false
            }
            runDeploy = true
            if (!(BRANCH_NAME == args.releaseBranch || BRANCH_NAME.startsWith("PR-"))) {
                runDeploy = false
            }
            runTagRelease = false
            if (BRANCH_NAME == args.releaseBranch) {
                runTagRelease = true
            }
            if (args.customSecurityGroups) {
                customSecurityGroupId = args.customSecurityGroups[environment]
            }
        }
    }
}

Then we start running through the defined stages for the build. The first one is "Configure build", and I'm going to gloss over this because it just sets up some conditions for the following stages: mainly determining whether we are building a dev, staging, or production build, and setting some global variables that are used by the following stages.
stage('Run tests') {
    when { expression { return runTests } }
    steps {
        npmTest("test", projectName)
    }
}

The next stage is "Run tests", which calls another Groovy function: npmTest.
def call(def env, def project, def nodeVersion = "default", reporter = "true") {
    // Get a random, free port on the Jenkins build node (so we can run multiple Skyfit tests in parallel).
    def randomPort = getAvailablePort()
    // Runs npm tests
    sh """
        source ~/.bash_profile
        aaptivsecrets env_export --env ${env} ${project} --outfile env.properties
        source ${WORKSPACE}/env.properties
        nvm use ${nodeVersion}
        npm install
        PORT=$randomPort npm test
        if [ "${reporter}" == "true" ]; then
            node_modules/.bin/nyc report --reporter=cobertura --dir coverage
        fi
    """
    cobertura autoUpdateHealth: false, autoUpdateStability: false,
        coberturaReportFile: 'coverage/cobertura-coverage.xml',
        conditionalCoverageTargets: '70, 0, 0', failUnhealthy: false, failUnstable: false,
        lineCoverageTargets: '80, 0, 0', maxNumberOfBuilds: 0,
        methodCoverageTargets: '80, 0, 0', onlyStable: false,
        sourceEncoding: 'ASCII', zoomCoverageChart: false
}

If we take a look at that file, there are a couple of noteworthy things going on here. First, it's just executing some shell commands. Next, we have nvm installed, which allows the engineers to specify which version of Node.js their project uses. And then we run the npm test command. That's a pretty widely accepted convention for running tests in a Node.js project, so it makes it easy for us to run the tests without having to know the implementation details of testing. The remainder of the function publishes the test results and code coverage back to Jenkins, where they're displayed visually in the project.
def call(Map args) {
    if (params.tag == null || params.tag.isEmpty()) {
        println("No tag set, so build the image")
        def projectName = args.projectName
        def repoUri = args.repoUri
        def validChars = "[^\\w^-^.^_]"
        def cleanBranchName = "${BRANCH_NAME}".replaceAll(validChars, "")
        def gitHash = args.gitHash
        def buildPath = args.buildPath
        if (buildPath == null || buildPath.isEmpty()) {
            buildPath = "."
        }
        echo "Building ${projectName}"
        sh "source ~/.bash_profile"
        def dockerLogin = sh(script: '/usr/bin/aws ecr get-login --no-include-email --region us-east-1', returnStdout: true).trim()
        sh "${dockerLogin}"
        echo "Building Docker Image"
        sh "docker build -t ${projectName} ${buildPath}"
        sh "docker tag ${projectName} ${repoUri}:${cleanBranchName}"
        sh "docker tag ${projectName} ${repoUri}:${cleanBranchName}.${BUILD_NUMBER}"
        sh "docker tag ${projectName} ${repoUri}:${gitHash}"
        sh "echo 'Pushing branch ${cleanBranchName} build ${BUILD_NUMBER} to ECR'"
        sh "docker push ${repoUri}:${cleanBranchName}"
        sh "docker push ${repoUri}:${cleanBranchName}.${BUILD_NUMBER}"
        sh "docker push ${repoUri}:${gitHash}"
        sh "docker rmi ${repoUri}:${cleanBranchName}"
        sh "docker rmi ${repoUri}:${cleanBranchName}.${BUILD_NUMBER}"
        sh "docker rmi ${repoUri}:${gitHash}"
    }
}

Build Image is very similar: it calls a function called buildImage that contains this code. The main tasks performed here are building and tagging the image, then pushing it up to the AWS ECR repository. Which sounds redundant because it is, but I'm not sure how else to refer to it…
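As a sketch of the tagging scheme above (a hypothetical helper, not the pipeline's actual code): every image gets pushed under three tags, the cleaned branch name, the branch name plus build number, and the git hash that the release job uses later.

```shell
# Hypothetical helper showing the three tags produced for each build.
image_tags() {
  repo="$1"; branch="$2"; build="$3"; hash="$4"
  # Strip characters that aren't valid in a docker tag, like the
  # pipeline's replaceAll(validChars, "") call.
  clean=$(printf '%s' "$branch" | tr -cd 'A-Za-z0-9._-')
  printf '%s\n' "${repo}:${clean}" "${repo}:${clean}.${build}" "${repo}:${hash}"
}

image_tags 1234.dkr.ecr.us-east-1.amazonaws.com/api "PR-42" 17 abc123
```

The branch tag is mutable (it moves on every build), while the git-hash tag uniquely identifies the exact commit that was deployed.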
Which brings us to Deploy Service, and it's really the meat of the deployment.
templateParamsJson = templateParamsJson + """
    {"ParameterKey": "ImageUrl", "ParameterValue": "${imageUrl}"},
    {"ParameterKey": "DdApiKey", "ParameterValue": "${DD_API_KEY}"},
    {"ParameterKey": "BranchName", "ParameterValue": "${cleanBranch}"}
"""
println(templateParamsJson)
def templateBody = libraryResource "com/aaptiv/autoscaling_loadbalancer.yaml"
writeFile(file: "template${BUILD_TAG}.yaml", text: templateBody, encoding: "UTF-8")

The first thing we do is write out all of our parameters to create the JSON file that we provide to CloudFormation to define the stack.
def getStackStatus(stackName) {
    def result = ""
    try {
        result = sh(script: """aws cloudformation describe-stacks --stack-name ${stackName} --query 'Stacks[0].StackStatus'""", returnStdout: true).trim().replace("\"", "")
        println("Stack status is: ${result}")
    } catch (ex) {
        println("Stack does not exist")
    }
    return result
}

Then we check to see if the stack exists. If it doesn't, we need to create it; this is commonly the case for PR branches. If the stack does exist, we need to update it to deploy the requested changes.
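The create-or-update decision can be sketched in plain shell (a simplified sketch, not the pipeline's actual code; the stack name is a placeholder):

```shell
# Simplified sketch of the create-or-update decision. The `aws` calls are
# real CLI commands; the stack name passed in is a placeholder.
stack_exists() {
  aws cloudformation describe-stacks --stack-name "$1" >/dev/null 2>&1
}

deploy_stack() {
  if stack_exists "$1"; then
    echo "update"   # existing stack: apply a change set
  else
    echo "create"   # new stack, e.g. a PR branch's first build
  fi
}
```

In the real pipeline, the "update" path goes through the change-set flow rather than a blind update, so the changes can be inspected before they're executed.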
def getStackChanges(stackVar, gitHash) {
    def changesetVar = stackVar.replace("ClientRequestToken", "ClientToken")
    writeFile(file: "stack_${BUILD_TAG}_changeset.json", text: changesetVar, encoding: "UTF-8")
    def changeSetId = sh(script: """aws cloudformation create-change-set --template-body file://"template${BUILD_TAG}.yaml" --change-set-name hash-${gitHash}${BUILD_NUMBER} --change-set-type UPDATE --cli-input-json file://"stack_${BUILD_TAG}_changeset.json" --query 'Id' """, returnStdout: true).trim()
    try {
        println("waiting for change set creation to complete")
        sh(script: """aws cloudformation wait change-set-create-complete --change-set-name ${changeSetId}""")
    } catch (ex) {
        println("trying to wait for changeset creation threw an error, perhaps no changes")
    }
    def changes = sh(script: """aws cloudformation describe-change-set --change-set-name ${changeSetId} --query 'Changes' """, returnStdout: true).trim()
    println("changes for change set ${changeSetId}: ${changes}")
    return changes
}

In cases where the stack already exists, we create a change set that describes the changes to be applied to the stack. All of our CloudFormation commands are run using the AWS CLI. We then use the CloudFormation wait command to wait for the change set to complete, wrapped in a try/catch block so that we're able to identify builds that complete successfully.
Mappings:
  SubnetByScheme:
    'internet-facing':
      AZ1SubnetId: subnet-xxxxxxxx
      AZ2SubnetId: subnet-xxxxxxxx
      AZ3SubnetId: subnet-xxxxxxxx
      AZ4SubnetId: subnet-xxxxxxxx
    'internal':
      AZ1SubnetId: subnet-xxxxxxxx
      AZ2SubnetId: subnet-xxxxxxxx
      AZ3SubnetId: subnet-xxxxxxxx
      AZ4SubnetId: subnet-xxxxxxxx
  SecurityGroupByEnvironment:
    'dev':
      'base': 'sg-xxxxxxxx'
      'external': 'sg-xxxxxxxx'
      'crossEnvAccess': 'sg-xxxxxxxx'
    'staging':
      'base': 'sg-xxxxxxxx'
      'external': 'sg-xxxxxxxx'
      'crossEnvAccess': 'bogusIdThatShouldNeverGetUsed'
    'prod':
      'base': 'sg-xxxxxxxx'
      'external': 'sg-xxxxxxxx'
      'crossEnvAccess': 'bogusIdThatShouldNeverGetUsed'
  SplunkForwarderARN:
    'dev':
      'name': 'arn:aws:lambda:us-east-1:1234567890:function:splunk-dev'
    'staging':
      'name': 'arn:aws:lambda:us-east-1:1234567890:function:splunk-staging'
    'prod':
      'name': 'arn:aws:lambda:us-east-1:1234567890:function:splunk-prod'

Our CloudFormation template starts out by defining some mappings. This is where we define things like the subnets and security groups for each environment, to ensure that running tasks are distributed across multiple availability zones and use the correct security groups. This means that the engineers deploying the code don't have to have this knowledge, but can still deploy according to the security and availability constraints we've defined.
Resources:
  # Once we decide on how we do environments, this will need to change to a mapping, rather than an if
  ServiceDNSName:
    Type: "AWS::Route53::RecordSet"
    Properties:
      HostedZoneId: !If [EnvironmentIsProd, 'xxxxxxxx', 'xxxxxxxx']
      Name: !If
        - EnvironmentIsProd
        - !Join ['', [!Ref ProjectName, '.', 'aaptiv.com', '.']]
        - !If
          - EnvironmentIsStaging
          - !Join ['', [!Ref ProjectName, ".", 'aapdev.com', '.']] # Staging goes directly to aapdev
          - !Join ['', [!Ref ProjectName, '-', !Ref BranchName, ".", 'aapdev.com', '.']] # Other branches get prefixes
      TTL: '300'
      Type: 'CNAME'
      ResourceRecords:
        - !GetAtt LoadBalancer.DNSName

In the Resources section of our CloudFormation template, we define the DNS name for the deployed service. Again, this makes it easier for developers to find the URL for their deployed project, because it always follows the same naming convention.
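The naming convention the template implements can be summarized with a small sketch (a hypothetical helper mirroring the !If logic above, not part of the pipeline):

```shell
# Prod services live under aaptiv.com; staging under aapdev.com; any other
# branch gets a branch-prefixed aapdev.com name.
service_hostname() {
  project="$1"; environment="$2"; branch="$3"
  case "$environment" in
    prod)    echo "${project}.aaptiv.com" ;;
    staging) echo "${project}.aapdev.com" ;;
    *)       echo "${project}-${branch}.aapdev.com" ;;
  esac
}

service_hostname workouts prod master   # -> workouts.aaptiv.com
service_hostname workouts dev PR-42     # -> workouts-PR-42.aapdev.com
```

Because the hostname is a pure function of project, environment, and branch, an engineer never has to go hunting for a deployed service's URL.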
LoadBalancer:
  Type: AWS::ElasticLoadBalancingV2::LoadBalancer
  Properties:
    Scheme: !If [InternetFacing, 'internet-facing', 'internal']
    IpAddressType: ipv4
    Tags:
      - Key: BranchName
        Value: !Ref BranchName
      - Key: Name
        Value: !Join
          - '-'
          - - !Ref Environment
            - !Ref ProjectName
            - !Ref BranchName
      - Key: Service
        Value: !Ref ProjectName
      - Key: Env
        Value: !Ref Environment
      - Key: Role
        Value: "Load Balancer"
      - Key: Team
        Value: !Ref Team

We define the load balancer, and by using a lookup we can correctly provision it as either internet-facing or internal. One additional thing we do here is apply tags for the load balancer name, service, environment, role, and team. We use these for cost allocation, allowing us to break down our operating costs by each of these tags and control our expenses.
LoadBalancerListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Properties:
    DefaultActions:
      - TargetGroupArn: !Ref 'TargetGroup'
        Type: 'forward'
    LoadBalancerArn: !Ref 'LoadBalancer'
    Port: !If [InternetFacing, 443, 80]
    Protocol: !If [InternetFacing, 'HTTPS', 'HTTP']
    Certificates:
      - CertificateArn: !If
          - InternetFacing
          - !If
            - EnvironmentIsProd
            - 'arn:aws:acm:us-east-1:1234567890:certificate/xxxxxxxx'
            - 'arn:aws:acm:us-east-1:1234567890:certificate/xxxxxxxx'
          - !Ref "AWS::NoValue"

LoadBalancerRedirectListener:
  Type: AWS::ElasticLoadBalancingV2::Listener
  Condition: InternetFacing
  Properties:
    DefaultActions:
      - Type: 'redirect'
        RedirectConfig:
          Port: '443'
          Protocol: 'HTTPS'
          StatusCode: 'HTTP_301'
    LoadBalancerArn: !Ref 'LoadBalancer'
    Port: 80
    Protocol: 'HTTP'

The load balancer has to have listeners, so we define those as well. For internet-facing load balancers, we set up SSL, configure the certificate, and create an automatic HTTP 301 redirect to HTTPS for any traffic received over HTTP.
TaskDefinition:
  Type: 'AWS::ECS::TaskDefinition'
  Properties:
    Family: !Join ['-', [!Ref Environment, !Ref ProjectName, !Ref BranchName]]
    Cpu: !Ref CPU
    Memory: !Ref Memory
    RequiresCompatibilities:
      - FARGATE
    Volumes:
      - Name: "aaptiv_logs"
    ContainerDefinitions:
      - Name: !Join ['-', [!Ref Environment, !Ref ProjectName, !Ref BranchName]]
        Cpu: !Ref CPU
        Memory: !Ref Memory
        Image: !Ref ImageUrl
        Essential: true
        Environment:
          - Name: ENV
            Value: !Ref Environment
          - Name: BRANCH_NAME
            Value: !Ref BranchName
          - Name: PROJECT_NAME
            Value: !Ref ProjectName
          - Name: NODE_ENV
            Value: "production"
          - Name: BRANCH_OVERRIDE
            Value: !If [EnvironmentIsDev, !Ref BranchOverride, ""]
        LogConfiguration:
          LogDriver: "awslogs"
          Options:
            "awslogs-group": !Join
              - '-'
              - - !Ref Environment
                - !Ref ProjectName
                - !Ref BranchName
            "awslogs-region": "us-east-1"
            "awslogs-stream-prefix": !Ref ProjectName
        PortMappings:
          - ContainerPort: !Ref ContainerPort
            HostPort: !Ref ContainerPort
            Protocol: 'TCP'

Then we create the ECS task definition. The task definition is the description of the ECS environment for this service. It includes the definition of the docker containers you want to run, the memory and CPU requirements, and whether your task runs on EC2 or Fargate. This is largely just a variable-substitution exercise, setting the parameters for the task based on the values supplied by the Jenkinsfile and the environment. Remember that we got these values into the CloudFormation template by writing them out to a JSON file in the Jenkins stage, then supplying that JSON file as a CLI argument when we called the CloudFormation command. Things like memory, CPU, and the exposed port are specified by the Jenkinsfile; environment, branch, and project name are calculated by the pipeline library.
Service:
  Type: 'AWS::ECS::Service'
  DependsOn: LoadBalancerRule
  Properties:
    ServiceName: !Join ['-', [!Ref Environment, !Ref ProjectName, !Ref BranchName]]
    Cluster: !Ref ClusterName
    LaunchType: FARGATE
    DeploymentConfiguration:
      MaximumPercent: 200
      MinimumHealthyPercent: 50
    DesiredCount: !If [EnvironmentIsDev, 1, !Ref DesiredCount]
    HealthCheckGracePeriodSeconds: !Ref HealthCheckGracePeriodSeconds
    NetworkConfiguration:
      AwsvpcConfiguration:
        AssignPublicIp: DISABLED
        SecurityGroups:
          - !FindInMap ["SecurityGroupByEnvironment", !Ref Environment, 'base'] # Lets it talk to its own env
          - !If # Extra security group only added for dev access to staging
            - RequiresCrossEnvAccess
            - !FindInMap ["SecurityGroupByEnvironment", !Ref Environment, 'crossEnvAccess']
            - !Ref "AWS::NoValue"
          - !If
            - UseCustomSecurityGroup
            - !Ref CustomSecurityGroupId
            - !Ref "AWS::NoValue"
        Subnets: # Services should always be on the DMZ, regardless of whether the load balancer is internet facing
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ1SubnetId']
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ2SubnetId']
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ3SubnetId']
          - !FindInMap ["SubnetByScheme", 'internal', 'AZ4SubnetId']
    TaskDefinition: !Ref TaskDefinition

And then we define the service. In ECS, a service is a running instance of a task definition. One of the key things we do here is set the desired count, that is, the number of running tasks the service should have. For production environments, this defaults to a minimum of 3 to ensure there aren't single-point-of-failure services. In dev, we always set it to 1. We also define the network configuration for the service here, with most of it being determined by the environment: dev, staging, or prod.
CLOUDFORMATION RESOURCES
▸ Load balancer target group
▸ CloudWatch alarms for scale in/out
▸ Kinesis Firehose delivery
▸ CloudWatch subscription filters for logs

And this continues for the remaining stack resources, but it's largely the same process: define the resource and populate the configuration values based on parameters specified in the Jenkinsfile or the environment. The resources include the load balancer target group, CloudWatch alarms for scaling, and Kinesis Firehose delivery subscriptions for logs.
post {
    always {
        notifySlack(currentBuild.currentResult, slackChannel, environment)
        notifyInflux()
    }
    success {
        script {
            if (deployServiceOutput && deploymentId) {
                def deploymentUrl = "http://${deployServiceOutput.hostname}"
                githubHelper.updateDeployment(…)
            }
        }
    }
    failure {
        script {
            if (deploymentId) {
                githubHelper.updateDeployment(…)
            }
        }
    }
    unstable {
        script {
            if (deploymentId) {
                githubHelper.updateDeployment(…)
            }
        }
    }
    cleanup {
        deleteDir()
    }
}

At the end of our build, we now know whether it was successful or not, so:
- We post a notification in Slack
- We send the build metrics to InfluxDB
- We update GitHub with the deployment URL and build status
- Finally, we clean up our working directory to free up space on the build servers
COMPLETING THE MIGRATION FROM HEROKU TO AWS
MOVING DAY
▸ Deploy to AWS
▸ Change URL in mobile clients
▸ 🍻 🍻 🍻 🍻

So with all of that built, we could finally migrate from Heroku to AWS. Building it took a long time, but we felt like it was the right thing to do. One of the important considerations for me was to ensure the engineers had the same user experience or better after the migration, so building that user experience took time. Once it was done, though, we deployed the services running in Heroku to AWS using the new pipeline, and the mobile engineers shipped a new version of our app pointing at the URLs for the new service locations.
NOT EVERYONE UPDATES
▸ Some users never update
▸ Shutting down Heroku would break the app for them
▸ Not supporting 2 versions of API
▸ Deploy nginx on Heroku
▸ Someday, we'll get to take it down

There was one remaining step. Some users don't update their app. This meant we couldn't shut down Heroku entirely, and we weren't willing to support two versions of our API. Fortunately, with Heroku you can deploy docker images as well as code. So we deployed an nginx container that simply redirected all the traffic it received to our new API URL. This had the effect of sending all traffic to our new API, even for the users who didn't update. Eventually, we'll be able to turn off Heroku completely, either because everyone has upgraded or because we release a breaking change incompatible with the older clients.
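A minimal sketch of what such a redirect container's configuration might look like (this is illustrative, not our actual config: the hostname is a placeholder, and on Heroku the listen port is injected at runtime via the $PORT environment variable, so a real image templates it in at startup):

```nginx
server {
    # Heroku injects the port at runtime; 8080 here is a placeholder.
    listen 8080;

    # Send every request to the new API, preserving the path and query string.
    location / {
        return 301 https://api.example.com$request_uri;
    }
}
```

A 301 like this keeps the old clients working with zero application code, since mobile HTTP clients follow redirects by default.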
SIX MONTHS LATER
▸ 1 monolith -> 9 microservices
▸ Services behind VPC
▸ Improved scalability
▸ Improved security
▸ Improved agility

It's been about six months since we made the cutover. Since then, we've successfully broken our single monolith API into about 9 different microservices. All of the services are protected behind the VPC, including the databases they talk to, which is something we didn't have with Heroku. We are much more scalable, and this comes both from separating our services out, allowing them to scale independently, and from allowing each service to iterate and grow on that team's schedule. I already mentioned the security improvements from moving everything behind the VPC, but we gained additional security by setting sensible defaults for security groups and public access in the pipeline, removing that burden from the engineers. And we're much more agile: engineers can bring environments up and down through a simple GitHub pull request.
DECEMBER 2016 / DECEMBER 2017 / DECEMBER 2019

Just to provide some perspective: here was an average day in December 2016, ranging from 50 rpm and peaking somewhere just over 200 rpm. In December 2017, we were between 700 and 1,400 rpm. And last month we ranged from just under 2,000 rpm and peaked just over 13,000 rpm. FWIW, I have no idea what happened to December 2018.
One final metric on our agility: this is the number of builds per day in Jenkins. You can see it ranges from a low of around 40-ish on December 30th to almost 150 on 12/20 and 1/8. Now, most of these are PR builds, but I think it speaks pretty well to the confidence in, and usefulness of, our pipeline.
The last slide I want to show you is on the costs associated with this project. We did our initial cutover in May. As we increased our utilization of AWS, you can see the blue line representing increased AWS costs, and the red line showing decreasing costs in Heroku as we turned things off. The really interesting line to me is the yellow one: between May and August, we added several new microservices to AWS. The cost of those is reflected in the blue line. The yellow line shows what those same services would have cost us in Heroku had we chosen to build them there instead. The difference between the yellow line and the blue line is roughly $10k/month, which adds up to some nice savings really quickly.
WHERE DO WE GO FROM HERE?
‣ Pull requests
‣ Datadog metrics -> InfluxDB
‣ AWS Service Tags

So the next steps for us haven't really been decided yet, but I think two areas that will be on our radar shortly are Datadog metrics and AWS service tags.
DATADOG METRICS

We use Datadog for a lot of our infrastructure metrics, including ECS. The really interesting thing we've stumbled upon, due to the way ECS and Datadog work, is that you can get metrics for your running tasks, but once a task has exited (either intentionally or due to a crash), getting the metrics becomes difficult. You saw in the pipeline that we're sending build stats to InfluxDB, and we use it in a lot of other places as well. I think it might make sense for us to have a container agent that sends CPU and memory metrics to InfluxDB too, if doing so gives us more insight into, and persistence of, our metrics.
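As a sketch of what such an agent might emit (the measurement and tag names here are made up; InfluxDB's line protocol is `measurement,tag_set field_set`):

```shell
# Hypothetical metric formatter: builds an InfluxDB line-protocol point for a
# task's memory usage. A real agent would POST this to InfluxDB's /write API,
# e.g. curl -XPOST "http://influxdb:8086/write?db=metrics" --data-binary "...".
task_metric() {
  service="$1"; env="$2"; mem_bytes="$3"
  printf 'ecs_task,service=%s,env=%s memory_bytes=%s\n' "$service" "$env" "$mem_bytes"
}

task_metric api prod 268435456
# -> ecs_task,service=api,env=prod memory_bytes=268435456
```

Because the point is pushed out of the container while it's running, the data survives even after the task exits or crashes.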
AWS TAGGING

Another area I'd like to see us pursue is additional AWS tags. AWS recently added the ability to tag ECS services, which would allow us to eliminate some of these "No TagKey" costs in our AWS bill. Unfortunately, you have to change to a new resource ID format to get that feature, which means deleting and recreating some resources, like our production API. So we need to evaluate what that would look like, and what options we'd have to do it without taking a downtime hit.
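For reference, once a service is on the new (long) ARN format, tagging it is a single CLI call. Here's a sketch that composes the command as a dry run (the ARN and tag values are placeholders):

```shell
# Compose the `aws ecs tag-resource` command; echoed as a dry run so the
# sketch is side-effect free. Pipe to `sh` (with credentials) to apply it.
tag_service_cmd() {
  printf 'aws ecs tag-resource --resource-arn %s --tags key=Service,value=%s key=Env,value=%s\n' "$1" "$2" "$3"
}

tag_service_cmd "arn:aws:ecs:us-east-1:123456789012:service/prod/api" api prod
```

The same tags the pipeline already applies to load balancers (Service, Env, Team) could then flow into cost allocation for the ECS services themselves.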
WRAPPING IT UP
▸ Moved to AWS for security, scalability, agility, and costs
▸ Built a custom pipeline for ECS Fargate using Jenkins Pipelines and CloudFormation
▸ Feature requirements primarily driven by creating a great user experience for the engineers
▸ Pipeline modularized to take advantage of shared components