Tom Leach and Travis Thieman of GameChanger talk about their experiences migrating their build and deploy pipeline from being heavily based on Chef to one based around Docker.
This presentation is split into two main sections. The first covers the motivations for why GameChanger, as a fast-growing startup, identified a need to replace its existing Chef-based deploy model with one that reduces deploy-time risk and allows its engineering team to scale.
The second section is a high-level walkthrough of the new GameChanger deploy pipeline based around Docker.
5. • Scorekeeping
• 150+ Stats
• Live Gamestream
• Team management
• 12TB (10 histories of pro sports)
• 10 MongoDB shards
• 100-400 app servers
• 50K games/day (10K concurrent)
• 3000 w/s, 30,000 r/s
GameChanger’s market is amateur sports. Whereas ESPN caters to a handful of top professional teams in the country, GameChanger provides free tools to the millions of amateur sports teams around the world.
6. ELIMINATING DEPLOY-TIME RISK
This graph shows the number of requests/second received by one of our services over the last week. The area under the graph is broken down by host. You can see that we are scaling our hosts up and down in response to demand.
At GC we have an extremely spiky traffic profile, so autoscaling is important to control costs. It’s therefore very important not only to deploy new application code to existing servers but also to be able to build new servers very reliably, with minimal risk.
7. [Diagram: a central Chef Server distributing configuration to many App Servers, each pulling on a fixed interval]
To illustrate the risks associated with a traditional Configuration Management approach to building servers, let’s look at the typical Chef architecture.
There is a CM server which hosts the current valid configuration data for the cluster (and by configuration data we also mean setup scripts etc.).
The developer is responsible for pushing new configuration to the CM server, and then all app servers periodically pull and execute the latest scripts (a schematic sketch of this pull loop follows the risk list below).
Risks:
- CM server is a SPOF. Chef is painful to scale out.
- CM server needs to be scaled to support max conceivable cluster size (or we have problems when we need it most)
- Thundering herd
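To make the pull model concrete, here is a minimal schematic sketch of what each app server is effectively doing. This is not Chef's actual client; the endpoint, script names, and 30-minute interval are assumptions for illustration.

```python
import subprocess
import time
import urllib.request

CM_SERVER = "https://cm.internal.example/config.tar.gz"  # hypothetical CM server endpoint
INTERVAL = 30 * 60  # assumed polling interval, in seconds

while True:
    # Every node pulls the latest configuration bundle from the central CM server...
    with urllib.request.urlopen(CM_SERVER) as resp:
        with open("/tmp/config.tar.gz", "wb") as f:
            f.write(resp.read())
    # ...then unpacks and executes whatever setup scripts it finds there.
    subprocess.run(["tar", "xzf", "/tmp/config.tar.gz", "-C", "/etc/converge"], check=True)
    subprocess.run(["/etc/converge/run.sh"], check=True)
    # With every node on the same fixed schedule, the CM server sees the whole
    # fleet arrive at roughly the same time (the thundering herd above), and it
    # is a single point of failure for building new servers.
    time.sleep(INTERVAL)
```

If the CM server is down or slow at the moment a node converges, that node is built wrong or not at all, which is exactly the deploy-time risk we want to remove.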
8. github.com/miketheman/knife-role-spaghetti
This is a visualization of GC’s role/recipe dependencies in Chef before we moved to Docker.
Risks:
- Spaghetti-like dependencies are impossible to reason about (what happens if I upgrade node.js?)
- Dependencies are indirect and not explicit
- Testing is expensive and time consuming. Devs are disincentivized from testing.
- Coupling issues not discovered until deploy time -> can take down your cluster.
- Rollback can be painful
Deploy-time dependencies on multiple external repositories are also a big risk:
- Build AMIs (complex, heavy, time-consuming, does not allow us to iterate fast enough)
- Host your own mirror for services like PyPI. But then who owns the maintenance of that mirror?
10. HOW DOES DOCKER ELIMINATE THESE RISKS?
• Assets are baked into an immutable image at build time
• No deploy-time dependencies on 3rd party repos
• Docker registry is simple and easy to scale
• Dependencies simple, explicit and direct
• Rollback is trivial
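As a rough illustration of the last two points: if every build is pushed to the registry as an immutable image tagged with its Git hash, then deploying and rolling back are the same cheap operation. A minimal sketch, where the registry name, app name, and tags are hypothetical:

```python
import subprocess

REGISTRY = "registry.example.com"  # hypothetical private Docker registry


def deploy(app: str, git_sha: str) -> None:
    """Run a specific immutable image; nothing is resolved from 3rd-party repos at deploy time."""
    image = f"{REGISTRY}/{app}:{git_sha}"
    subprocess.run(["docker", "pull", image], check=True)
    subprocess.run(["docker", "rm", "-f", app], check=False)   # stop the old version if present
    subprocess.run(["docker", "run", "-d", "--name", app, image], check=True)


deploy("myapp", "4f2a9c1")  # roll forward to the new build
deploy("myapp", "b7d03ee")  # rollback: just point at the previous known-good tag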
11. SCALING ENGINEERING
A less obvious problem with traditional CM approaches is how they inhibit the scaling of engineering. Let’s illustrate with an example…
12. [Diagram: a single Feature Team owning the Application]
Like many companies, GC’s product started out as a small Python app developed by a couple of people. At this point we only have a few users, so we can run on a couple of servers and deployment is a simple manual step.
13. [Diagram: two Feature Teams working on the Application]
As we build more features our application gets bigger and we hire more people to help build and maintain those features. We’re still doing some form of manual deployment at this point, and though it’s starting to become a bottleneck, we’re still prioritizing feature development.
14. [Diagram: three Feature Teams working on a Monolithic Application]
We grow further and our application accumulates more and more responsibilities. The need to coordinate test + build + deploy necessitates an Ops team to own this problem.
15. [Diagram: three Feature Teams on the Monolithic Application, with an Ops Team owning Deployment (Test + Build + Deploy)]
Following a more “DevOps” mantra, these responsibilities form more of a continuum: devs care about getting their code to prod, ops care about what the code does, and both cooperate.
Deploying a monolithic app in this way actually works pretty well. The tech stack is fairly static, and it forms a shared context which minimizes the friction between the dev and ops teams.
The problem for GC was that this monolithic architecture scaled poorly for us:
- Poor ownership boundaries
- Quality of shared components suffered
- Introducing new languages is difficult to sell
- Different features have different CAP requirements
- Operational problems derived from indirect coupling
16. [Diagram: three Feature Teams, each owning a set of microservices (μ), with a single Ops Team owning Deployment (Test + Build + Deploy) for all of them]
Solution: teams own collections of independently scalable microservices with clear ownership boundaries.
But this poses a problem for our previous deployment approach:
- Suddenly Ops needs to know how to deploy an ever-growing list of technologies
- Information friction between Dev and Ops is high as the context is dynamic
- Deployment using something like Chef becomes more and more complex
- As feature teams are added, Ops becomes a bottleneck and the relationship risks becoming adversarial
17. Conway’s Law
“Any organization that designs a system … will inevitably produce a design whose structure is a copy of the organization's communication structure.”
–Melvin Conway, 1968
A collection of teams that design a system will inevitably produce a design which evolves from the minimum amount of out-of-band communication required between those teams.
18. [Diagram: three combined Feature + Ops Teams, each owning its own microservices (μ) and its own Deployment]
In the face of the need for high-traffic, high-complexity communication to get software deployed, teams will be motivated to compartmentalize the way they approach deployment. This is much better, as the contextual footprint for each mini-Ops team is manageable.
But there is still a problem here. We risk duplicating effort across teams on “core” deployment activities.
19. CORE DEPLOYMENT TASKS
• Log rotation
• User account creation, sudoers, SSH keys
• Continuous Integration
• Metrics
• DNS
• Monitoring & alerting
• ulimits
• Tool installation
• …
All of these are important. Doing them well requires that they be owned, continuously improved, and maintained as a first-class system asset. On feature teams they will not be treated this way; we’ll duplicate effort building several half-formed implementations of these things.
20. [Diagram: three Feature Teams, each owning its microservices (μ) and its own Build, on top of a “Core” Deployment Pipeline owned by the Ops Team]
We still needed an Ops team to own the core parts of deployment, but we needed the interface between the feature teams and the Ops team to have low information friction and not require the Ops team to understand n different tech stacks.
We could have tried to use Chef to do this by making each team own its own roles, but you end up running into problems around shared dependencies, global state and indirect coupling.
Docker provides a neat abstraction which allows these responsibilities to be separated clearly and scalably.
21. HOW DOES DOCKER ALLOW US TO SCALE ENGINEERING?
1. Development team has complete control over what they deploy
2. Core deployment can still be owned by a dedicated team as a first-class concern
3. Small shared context needed for cross-team communication
(1) allows us to scale teams out linearly without creating a centralized bottleneck; (2) eliminates wasted duplicate effort and the effort of maintaining a substandard system; (3) eliminates wasted effort communicating complex requirements out of band.
23. Test Build Deploy
We’re going to run through the test-build-deploy pipeline at GameChanger. We’re using a separate service for each of those, so let’s introduce the cast of characters.
28. Test Build Deploy
We’re going to go through what it takes to wire up an application to work with our pipeline…
29. Test Build Deploy
…and while we’re doing it, we’re going to highlight the ways in which Docker helps us achieve the goals that Tom was talking about earlier.
30. Python + Postgres: A Simple Application
Let’s consider a simple Python application that works with a Postgres database. It has a bunch of unit tests, including unit tests that require connecting to an actual Postgres instance to run.
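By way of illustration only (this is not GameChanger's code), such a test might look like the sketch below, where the test expects its environment to point it at a reachable Postgres instance; POSTGRES_DSN is a hypothetical variable name.

```python
import os

import psycopg2  # assumes the Postgres driver is one of the app's pinned dependencies


def test_can_roundtrip_a_game_row():
    # CI (or a developer's machine) supplies the location of a real Postgres.
    conn = psycopg2.connect(
        os.environ.get("POSTGRES_DSN", "postgresql://postgres@localhost/test")
    )
    with conn, conn.cursor() as cur:
        cur.execute("CREATE TABLE IF NOT EXISTS games (id serial PRIMARY KEY, name text)")
        cur.execute("INSERT INTO games (name) VALUES (%s) RETURNING id", ("opening day",))
        (game_id,) = cur.fetchone()
        cur.execute("SELECT name FROM games WHERE id = %s", (game_id,))
        assert cur.fetchone() == ("opening day",)
    conn.close()
```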
31. Test
Drone, the Test Runner
• Removes some dependency setup concerns from application dev
• Tests are fully isolated, coupling is minimized
• Fast and parallelizable; multiple teams can work on a single app without slowing each other down
Drone, as we mentioned before, is what we use to run tests. What are the benefits we get from using Drone?
36. Test
[Diagram: isolated containers running on the Host OS of a test server]
• Each test run is fully isolated using containers
• Parallel testing becomes trivial
• Fast testing of PRs reduces the likelihood of breaking the build
37. Build
Jenkins, the Build Server
• Receives Git hash and static dependencies from Drone
• Builds Docker images, pushes to our private Docker registry
39. Build
• Tested application code (a specific Git hash)
• Same library versions used to test that code, e.g. pip freeze, npm shrinkwrap
• All system libraries, drivers, etc.
What exactly are we putting into our images?
Note that our image will *not* contain service dependencies like the Postgres we want to run against. We have a few options for how to connect to a database at runtime.
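One common option, shown here as a sketch rather than GameChanger's exact mechanism, is to bake only the code and its driver into the image and inject the database location as an environment variable when the container starts, so the same image can run against any Postgres.

```python
import os

import psycopg2

# The image contains the application and its driver, but not Postgres itself.
# Where the database lives is decided when the container is started, e.g.:
#   docker run -e DATABASE_URL=postgresql://app@db.internal:5432/prod myapp:4f2a9c1
# (DATABASE_URL is a common convention, not something Docker defines.)


def get_db_connection():
    return psycopg2.connect(os.environ["DATABASE_URL"])
```

Container links or a service-discovery layer are other ways to wire this up; whichever you pick, the binding happens at run time rather than being baked into the image.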
So this is how we make the build work with Docker. Why is this actually better than the traditional model? Well…
41. [Diagram: the build path from Drone through Jenkins to the Registry, with the PyPI dependency crossed out]
…what if we put a bullet in PyPI and are no longer able to get our library dependencies?
Before we answer that, what happened in the old world? We’d deploy new code, our servers would all try to pull from PyPI, fail, and freak out. Is our site down? Are we up but with incorrect or partial dependencies? It’s not great. What happens with Docker?
42. [Diagram: the same build path with PyPI unavailable; the failure now happens at the Jenkins image build]
With Docker, that risk is moved from deploy time to build time. Jenkins will try to pull from PyPI and fail. It won’t be able to push a new image with your updated code. This is usually a good thing! You can have confidence that all the images you *do* have will have all their dependencies and be fully working images.
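A rough sketch of why the failure moves: a build step along these lines (the names are hypothetical) exits before anything is pushed if the dependency install inside the Dockerfile can't reach PyPI, so the registry keeps serving the last good image.

```python
import subprocess
import sys

REGISTRY = "registry.example.com"  # hypothetical private registry


def build_and_push(app: str, git_sha: str) -> None:
    image = f"{REGISTRY}/{app}:{git_sha}"
    # `docker build` runs the Dockerfile, including its dependency-install step.
    # If PyPI is unreachable, this exits non-zero and we never reach `docker push`,
    # so no partial or broken image ever lands in the registry.
    subprocess.run(["docker", "build", "-t", image, "."], check=True)
    subprocess.run(["docker", "push", image], check=True)


if __name__ == "__main__":
    try:
        build_and_push("myapp", sys.argv[1])
    except subprocess.CalledProcessError:
        sys.exit("build failed; existing images in the registry are untouched")
```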
43. Deploy
Bagel, the Deploy Service
• Mostly a thin API on top of our Docker registry
• Also owns triggering deploys across our infrastructure
Bagel gives us a way to coordinate the images in our Docker registry with their corresponding Git tags, the dependencies that were baked into the images, etc.
I think just go into a quick demo here, show some cool dependency diffs and PR messages or something.
46. • All our machines run identical OS-level images
• Images and runtime config specified via YAML
• Deploy mechanism on each machine reconciles spec and containers currently running
• Similar to Docker Compose
What happens when we hit the Deploy button in Bagel?
All our boxes run off the same machine image (AMI). A YAML file specifying which of our apps should be deployed to that box is all that distinguishes it. Our pretty-dumb deploy scripts (triggered by Bagel) handle matching the running state of that box’s containers to what’s in this YAML file and the current deployed versions according to Bagel.
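A heavily simplified sketch of that reconciliation step is below. The YAML layout, file path, and registry name are invented for illustration, and the real scripts also consult Bagel for the currently deployed versions; this just shows the shape of the idea.

```python
import subprocess

import yaml  # PyYAML

SPEC_PATH = "/etc/deploy/apps.yml"  # hypothetical per-box spec, e.g.:
# myapp:
#   image: registry.example.com/myapp:4f2a9c1
# worker:
#   image: registry.example.com/worker:9a31c77


def running_containers() -> set:
    out = subprocess.run(
        ["docker", "ps", "--format", "{{.Names}}"],
        capture_output=True, text=True, check=True,
    )
    return set(out.stdout.split())


def reconcile() -> None:
    with open(SPEC_PATH) as f:
        spec = yaml.safe_load(f) or {}

    # Stop anything running on this box that the spec no longer mentions.
    for name in running_containers() - set(spec):
        subprocess.run(["docker", "rm", "-f", name], check=True)

    # (Re)start everything the spec asks for at the specified image tag.
    for name, cfg in spec.items():
        subprocess.run(["docker", "rm", "-f", name], check=False)
        subprocess.run(["docker", "run", "-d", "--name", name, cfg["image"]], check=True)


if __name__ == "__main__":
    reconcile()
```

A real version would compare the image each container is already running against the spec and leave matching containers alone instead of restarting everything.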
47. Test Build Deploy
That’s our deploy pipeline. Using Docker, we’ve seen significant gains in simplicity and developer productivity across our test, build, and deploy stages. Our feature teams can release new services with ease, and our Ops team has been phased out of existence. Our engineers are now free to focus on problems that benefit our customers and our business.
Before I go, just a few closing thoughts on Docker as someone who’s spent a bit of time with it…