Tom Leach and Travis Thieman of GameChanger talk about their experiences migrating their build and deploy pipeline from being heavily based on Chef to one based around Docker.
This presentation is split into two main sections. The first covers the motivations for why GameChanger, as a fast-growing startup, identified a need to replace its existing Chef-based deploy model with one that reduces deploy-time risk and allows its engineering team to scale.
The second section is a high-level walkthrough of the new GameChanger deploy pipeline based around Docker.
5. • Scorekeeping
• 150+ Stats
• Live Gamestream
• Team management
• 12TB (10 histories of pro sports)
• 10 MongoDB shards
• 100-400 app servers
• 50K games/day (10K concurrent)
• 3000 w/s, 30,000 r/s
GameChanger’s market is amateur sports. Whereas ESPN caters to a handful of top professional teams in the country, GameChanger provides free tools to the millions of amateur sports teams around the world.
6. ELIMINATING DEPLOY-TIME RISK
This graph shows the number of requests/second received by one of our services over the last week. The area under the graph is broken down by host. You can see that we are scaling our hosts up and down in response to demand.
At GC we have an extremely spiky traffic profile, so autoscaling is important to control costs. It’s therefore very important not only to deploy new application code to existing servers but also to be able to build new servers very reliably, with minimal risk.
7. [Diagram: a central Chef Server distributing configuration to many App Servers, each pulling on a fixed interval]
To illustrate the risks associated with a traditional Configuration Management approach to building servers, let’s look at the typical Chef architecture.
There is a CM server which hosts the current valid configuration data for the cluster (and by configuration data we also mean setup scripts etc.).
The developer is responsible for pushing new configuration to the CM server, and then all app servers periodically pull and execute the latest scripts (a schematic sketch of this pull loop follows the risk list below).
Risks:
- CM server is a SPOF. Chef is painful to scale out.
- CM server needs to be scaled to support max conceivable cluster size (or we have problems when we need it most)
- Thundering herd
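To make the pull model concrete, here is a minimal schematic sketch of what each app server is effectively doing. This is not Chef's actual client; the endpoint, script names, and 30-minute interval are assumptions for illustration.

```python
import subprocess
import time
import urllib.request

CM_SERVER = "https://cm.internal.example/config.tar.gz"  # hypothetical CM server endpoint
INTERVAL = 30 * 60  # assumed polling interval, in seconds

while True:
    # Every node pulls the latest configuration bundle from the central CM server...
    with urllib.request.urlopen(CM_SERVER) as resp:
        with open("/tmp/config.tar.gz", "wb") as f:
            f.write(resp.read())
    # ...then unpacks and executes whatever setup scripts it finds there.
    subprocess.run(["tar", "xzf", "/tmp/config.tar.gz", "-C", "/etc/converge"], check=True)
    subprocess.run(["/etc/converge/run.sh"], check=True)
    # With every node on the same fixed schedule, the CM server sees the whole
    # fleet arrive at roughly the same time (the thundering herd above), and it
    # is a single point of failure for building new servers.
    time.sleep(INTERVAL)
```

If the CM server is down or slow at the moment a node converges, that node is built wrong or not at all, which is exactly the deploy-time risk we want to remove.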
8. github.com/miketheman/knife-role-spaghetti
This is a visualization of GC’s role/recipe dependencies in Chef before we moved to Docker.
Risks:
- Spaghetti-like dependencies are impossible to reason about (what happens if I upgrade node.js?)
- Dependencies are indirect and not explicit
- Testing is expensive and time consuming. Devs are disincentivized from testing.
- Coupling issues not discovered until deploy time -> can take down your cluster.
- Rollback can be painful
Deploy-time dependencies on multiple external repositories are also a big risk:
- Build AMIs (complex, heavy, time-consuming, does not allow us to iterate fast enough)
- Host your own mirror for services like PyPI. But then who owns the maintenance of that mirror?
10. HOW DOES DOCKER ELIMINATE THESE RISKS?
• Assets are baked into an immutable image at build time
• No deploy-time dependencies on 3rd party repos
• Docker registry is simple and easy to scale
• Dependencies simple, explicit and direct
• Rollback is trivial
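As a rough illustration of the last two points: if every build is pushed to the registry as an immutable image tagged with its Git hash, then deploying and rolling back are the same cheap operation. A minimal sketch, where the registry name, app name, and tags are hypothetical:

```python
import subprocess

REGISTRY = "registry.example.com"  # hypothetical private Docker registry


def deploy(app: str, git_sha: str) -> None:
    """Run a specific immutable image; nothing is resolved from 3rd-party repos at deploy time."""
    image = f"{REGISTRY}/{app}:{git_sha}"
    subprocess.run(["docker", "pull", image], check=True)
    subprocess.run(["docker", "rm", "-f", app], check=False)   # stop the old version if present
    subprocess.run(["docker", "run", "-d", "--name", app, image], check=True)


deploy("myapp", "4f2a9c1")  # roll forward to the new build
deploy("myapp", "b7d03ee")  # rollback: just point at the previous known-good tag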
11. SCALING ENGINEERING
A less obvious problem with traditional CM approaches is how they inhibit the scaling of engineering. Let’s illustrate with an example…
12. [Diagram: a single Feature Team owning the Application]
Like many companies, GC’s product started out as a small Python app developed by a couple of people. At this point we only have a few users, so we can run on a couple of servers and deployment is a simple manual step.
13. [Diagram: two Feature Teams working on the Application]
As we build more features our application gets bigger and we hire more people to help build and maintain those features. We’re still doing some form of manual deployment at this point, and though it’s starting to become a bottleneck, we’re still prioritizing feature development.
14. [Diagram: three Feature Teams working on a Monolithic Application]
We grow further and our application accumulates more and more responsibilities. The need to coordinate test + build + deploy necessitates an Ops team to own this problem.
15. [Diagram: three Feature Teams on the Monolithic Application, with an Ops Team owning Deployment (Test + Build + Deploy)]
Following a more “DevOps” mantra, these responsibilities form more of a continuum: devs care about getting their code to prod, ops care about what the code does, and both cooperate.
Deploying a monolithic app in this way actually works pretty well. The tech stack is fairly static, and it forms a shared context which minimizes the friction between the dev and ops teams.
The problem for GC was that this monolithic architecture scaled poorly for us:
- Poor ownership boundaries
- Quality of shared components suffered
- Introducing new languages is difficult to sell
- Different features have different CAP requirements
- Operational problems derived from indirect coupling
16. [Diagram: three Feature Teams, each owning a set of microservices (μ), with a single Ops Team owning Deployment (Test + Build + Deploy) for all of them]
Solution: teams own collections of independently scalable microservices with clear ownership boundaries.
But this poses a problem for our previous deployment approach:
- Suddenly Ops needs to know how to deploy an ever-growing list of technologies
- Information friction between Dev and Ops is high as the context is dynamic
- Deployment using something like Chef becomes more and more complex
- As feature teams are added, Ops becomes a bottleneck and the relationship risks becoming adversarial
17. Conway’s Law
“Any organization that designs a system … will inevitably produce a design whose structure is a copy of the organization's communication structure.”
–Melvin Conway, 1968
A collection of teams that design a system will inevitably produce a design which evolves from the minimum amount of out-of-band communication required between those teams.
18. [Diagram: three combined Feature + Ops Teams, each owning its own microservices (μ) and its own Deployment]
In the face of the need for high-traffic, high-complexity communication to get software deployed, teams will be motivated to compartmentalize the way they approach deployment. This is much better, as the contextual footprint for each mini-Ops team is manageable.
But there is still a problem here. We risk duplicating effort across teams on “core” deployment activities.
19. CORE DEPLOYMENT TASKS
• Log rotation
• User account creation, sudoers, SSH keys
• Continuous Integration
• Metrics
• DNS
• Monitoring & alerting
• ulimits
• Tool installation
• …
All of these are important. Doing them well requires that they be owned, continuously improved, and maintained as a first-class system asset. On feature teams they will not be treated this way; we’ll duplicate effort building several half-formed implementations of these things.
20. [Diagram: three Feature Teams, each owning its microservices (μ) and its own Build, on top of a “Core” Deployment Pipeline owned by the Ops Team]
We still needed an Ops team to own the core parts of deployment, but we needed the interface between the feature teams and the Ops team to have low information friction and not require the Ops team to understand n different tech stacks.
We could have tried to use Chef to do this by making each team own its own roles, but you end up running into problems around shared dependencies, global state and indirect coupling.
Docker provides a neat abstraction which allows these responsibilities to be separated clearly and scalably.
21. HOW DOES DOCKER ALLOW US TO SCALE ENGINEERING?
1. Development team has complete control over what they deploy
2. Core deployment can still be owned by a dedicated team as a first-class concern
3. Small shared context needed for cross-team communication
(1) allows us to scale teams out linearly without creating a centralized bottleneck; (2) eliminates wasted duplicate effort and the effort of maintaining a substandard system; (3) eliminates wasted effort communicating complex requirements out of band.
23. Test Build Deploy
We’re going to run through the test-build-deploy pipeline at GameChanger. We’re using a separate service for each of those, so let’s introduce the cast of characters.
28. Test Build Deploy
We’re going to go through what it takes to wire up an application to work with our pipeline…
29. Test Build Deploy
…and while we’re doing it, we’re going to highlight the ways in which Docker helps us achieve the goals that Tom was talking about earlier.
30. Python + Postgres: A Simple Application
Let’s consider a simple Python application that works with a Postgres database. It has a bunch of unit tests, including unit tests that require connecting to an actual Postgres instance to run.
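By way of illustration only (this is not GameChanger's code), such a test might look like the sketch below, where the test expects its environment to point it at a reachable Postgres instance; POSTGRES_DSN is a hypothetical variable name.

```python
import os

import psycopg2  # assumes the Postgres driver is one of the app's pinned dependencies


def test_can_roundtrip_a_game_row():
    # CI (or a developer's machine) supplies the location of a real Postgres.
    conn = psycopg2.connect(
        os.environ.get("POSTGRES_DSN", "postgresql://postgres@localhost/test")
    )
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS games (id serial PRIMARY KEY, name text)")
        cur.execute("INSERT INTO games (name) VALUES (%s) RETURNING id", ("opening day",))
        (game_id,) = cur.fetchone()
        cur.execute("SELECT name FROM games WHERE id = %s", (game_id,))
        assert cur.fetchone() == ("opening day",)
    conn.close()
```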
31. Test
Drone, the Test Runner
• Removes some dependency setup concerns from application dev
• Tests are fully isolated, coupling is minimized
• Fast and parallelizable; multiple teams can work on a single app without slowing each other down
Drone, as we mentioned before, is what we use to run tests. What are the benefits we get from using Drone?
36. Test
[Diagram: isolated containers running on the Host OS of a test server]
• Each test run is fully isolated using containers
• Parallel testing becomes trivial
• Fast testing of PRs reduces the likelihood of breaking the build
37. Build
Jenkins, the Build Server
• Receives Git hash and static dependencies from Drone
• Builds Docker images, pushes to our private Docker registry
39. Build
• Tested application code (a specific Git hash)
• Same library versions used to test that code, e.g. pip freeze, npm shrinkwrap
• All system libraries, drivers, etc.
What exactly are we putting into our images?
Note that our image will *not* contain service dependencies like the Postgres we want to run against. We have a few options for how to connect to a database at runtime.
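One common option, shown here as a sketch rather than GameChanger's exact mechanism, is to bake only the code and its driver into the image and inject the database location as an environment variable when the container starts, so the same image can run against any Postgres.

```python
import os

import psycopg2

# The image contains the application and its driver, but not Postgres itself.
# Where the database lives is decided when the container is started, e.g.:
#   docker run -e DATABASE_URL=postgresql://app@db.internal:5432/prod myapp:4f2a9c1
# (DATABASE_URL is a common convention, not something Docker defines.)


def get_db_connection():
    return psycopg2.connect(os.environ["DATABASE_URL"])
```

Container links or a service-discovery layer are other ways to wire this up; whichever you pick, the binding happens at run time rather than being baked into the image.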
So this is how we make the build work with Docker. Why is this actually better than the traditional model? Well…
41. [Diagram: the build path from Drone through Jenkins to the Registry, with the PyPI dependency crossed out]
…what if we put a bullet in PyPI and are no longer able to get our library dependencies?
Before we answer that, what happened in the old world? We’d deploy new code, our servers would all try to pull from PyPI, fail, and freak out. Is our site down? Are we up but with incorrect or partial dependencies? It’s not great. What happens with Docker?
42. [Diagram: the same build path with PyPI unavailable; the failure now happens at the Jenkins image build]
With Docker, that risk is moved from deploy time to build time. Jenkins will try to pull from PyPI and fail. It won’t be able to push a new image with your updated code. This is usually a good thing! You can have confidence that all the images you *do* have will have all their dependencies and be fully working images.
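A rough sketch of why the failure moves: a build step along these lines (the names are hypothetical) exits before anything is pushed if the dependency install inside the Dockerfile can't reach PyPI, so the registry keeps serving the last good image.

```python
import subprocess
import sys

REGISTRY = "registry.example.com"  # hypothetical private registry


def build_and_push(app: str, git_sha: str) -> None:
    image = f"{REGISTRY}/{app}:{git_sha}"
    # `docker build` runs the Dockerfile, including its dependency-install step.
    # If PyPI is unreachable, this exits non-zero and we never reach `docker push`,
    # so no partial or broken image ever lands in the registry.
    subprocess.run(["docker", "build", "-t", image, "."], check=True)
    subprocess.run(["docker", "push", image], check=True)


if __name__ == "__main__":
    try:
        build_and_push("myapp", sys.argv[1])
    except subprocess.CalledProcessError:
        sys.exit("build failed; existing images in the registry are untouched")
```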
43. Deploy
Bagel, the Deploy Service
• Mostly a thin API on top of our Docker registry
• Also owns triggering deploys across our infrastructure
Bagel gives us a way to coordinate the images in our Docker registry with their corresponding Git tags, the dependencies that were baked into the images, etc.
I think just go into a quick demo here, show some cool dependency diffs and PR messages or something.
46. • All our machines run identical OS-level images
• Images and runtime config specified via YAML
• Deploy mechanism on each machine reconciles spec and containers currently running
• Similar to Docker Compose
What happens when we hit the Deploy button in Bagel?
All our boxes run off the same machine image (AMI). A YAML file specifying which of our apps should be deployed to that box is all that distinguishes it. Our pretty-dumb deploy scripts (triggered by Bagel) handle matching the running state of that box’s containers to what’s in this YAML file and the current deployed versions according to Bagel.
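A heavily simplified sketch of that reconciliation step is below. The YAML layout, file path, and registry name are invented for illustration, and the real scripts also consult Bagel for the currently deployed versions; this just shows the shape of the idea.

```python
import subprocess

import yaml  # PyYAML

SPEC_PATH = "/etc/deploy/apps.yml"  # hypothetical per-box spec, e.g.:
# myapp:
#   image: registry.example.com/myapp:4f2a9c1
# worker:
#   image: registry.example.com/worker:9a31c77


def running_containers() -> set:
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())


def reconcile() -> None:
    with open(SPEC_PATH) as f:
        spec = yaml.safe_load(f) or {}

    # Stop anything running on this box that the spec no longer mentions.
    for name in running_containers() - set(spec):
        subprocess.run(["docker", "rm", "-f", name], check=True)

    # (Re)start everything the spec asks for at the specified image tag.
    for name, cfg in spec.items():
        subprocess.run(["docker", "rm", "-f", name], check=False)
        subprocess.run(["docker", "run", "-d", "--name", name, cfg["image"]], check=True)


if __name__ == "__main__":
    reconcile()
```

A real version would compare the image each container is already running against the spec and leave matching containers alone instead of restarting everything.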
47. Test Build Deploy
That’s our deploy pipeline. Using Docker, we’ve seen significant gains in simplicity and developer productivity across our test, build, and deploy stages. Our feature teams can release new services with ease, and our Ops team has been phased out of existence. Our engineers are now free to focus on problems that benefit our customers and our business.
Before I go, just a few closing thoughts on Docker as someone who’s spent a bit of time with it…