Do you understand how quorum, consensus, leader election, and different scheduling algorithms can impact your running application? Could you explain these concepts to the rest of your team? Come learn about the algorithms that power all modern container orchestration platforms, and walk away with actionable steps to keep your highly available services highly available.
Everything You Thought You Already Knew About Orchestration
1. Everything You Thought You Already Knew About Orchestration
Laura Frank, Director of Engineering, Codeship
3. Agenda
Managing Distributed State with Raft
• Quorum 101
• Leader Election
• Log Replication
• Service Scheduling
• Failure Recovery
…plus bonus debugging tips!
4. The Big Problem(s)
What are tools like Swarm and Kubernetes trying to do? They’re trying to get a collection of nodes to behave like a single node.
• How does the system maintain state?
• How does work get scheduled?
7. Quorum
The minimum number of votes that a consensus group needs in order to be allowed to perform an operation.
Without quorum, your system can’t do work.
8. Math!
Managers   Quorum   Fault Tolerance
1          1        0
2          2        0
3          2        1
4          3        1
5          3        2
6          4        2
7          4        3
Quorum = (N/2) + 1, rounded down. In simpler terms, it means a majority.
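The table above can be reproduced in a couple of lines. A minimal sketch (illustrative only, not Swarm code):

```python
def quorum(n: int) -> int:
    # A majority of n managers: floor(n / 2) + 1
    return n // 2 + 1

def fault_tolerance(n: int) -> int:
    # How many managers can fail while quorum still holds
    return n - quorum(n)

for n in range(1, 8):
    print(n, quorum(n), fault_tolerance(n))
```

Note that going from 3 to 4 managers raises the quorum but not the fault tolerance, which is why odd-sized manager groups are the norm.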
10. Having two managers instead of one actually doubles your chances of losing quorum: with two managers, quorum is 2, so the failure of either manager halts the system.
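To see why, assume each manager fails independently with some small probability p (an illustrative model, not from the talk). With one manager, quorum is lost only when that manager fails; with two, quorum is 2, so losing either one is fatal:

```python
p = 0.01  # assumed probability that any one manager is down (illustrative)

# One manager: quorum (1) is lost exactly when that manager fails
p_lose_one = p

# Two managers: quorum is 2, so losing *either* manager loses quorum
p_lose_two = 1 - (1 - p) ** 2  # 0.0199, roughly 2p for small p

print(p_lose_one, p_lose_two)
```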
11. Quorum With Multiple Regions
Pay attention to datacenter topology when placing managers.
Manager Nodes   Distribution across 3 Regions
3               1-1-1
5               1-2-2
7               3-2-2
9               3-3-3
This magically works with Docker for AWS.
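Why those distributions? Each one keeps quorum alive even if an entire region goes dark. A quick check of the table’s layouts (sketch, assuming whole-region failure):

```python
def survives_region_loss(dist):
    """True if quorum still holds after losing any single region."""
    n = sum(dist)
    quorum = n // 2 + 1
    return all(n - region >= quorum for region in dist)

for dist in ([1, 1, 1], [1, 2, 2], [3, 2, 2], [3, 3, 3]):
    print(dist, survives_region_loss(dist))  # all True
```

A lopsided layout like 5-1-1 fails this check: losing the big region leaves only 2 of 7 managers, well short of quorum.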
15. Orchestration systems typically use a key/value store backed by a consensus algorithm. In a lot of cases, that algorithm is Raft!
Raft is used everywhere… …that etcd is used.
17. In most cases, you don’t want to run work on your manager nodes. Participating in a Raft consensus group is work, too.
Make your manager nodes unavailable for tasks:
docker node update --availability drain <NODE>
*I will run work on managers for educational purposes
21. The log is the source of truth for your application.
22. In the context of distributed computing (and this talk), a log is an append-only, time-based record of data.
(diagram: entries 2, 10, 30, 25, 5, 12, with the first entry at the left and new entries appended at the right)
This log is for computers, not humans.
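An append-only log in miniature (an illustrative sketch, not any real implementation):

```python
class Log:
    """Append-only, ordered record: entries are only ever added at the end."""

    def __init__(self):
        self._entries = []

    def append(self, value):
        self._entries.append(value)
        return len(self._entries) - 1  # the index doubles as a logical timestamp

    def entries(self):
        return list(self._entries)  # read-only view; nothing is ever rewritten

log = Log()
for v in (2, 10, 30, 25, 5, 12):
    log.append(v)
print(log.entries())  # [2, 10, 30, 25, 5, 12]
```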
23. In simple systems, the log is pretty straightforward.
(diagram: a client sends 12 to a server, which appends it to the log 2, 10, 30, 25, 5, 12)
24. In a manager group, that log entry can only “become truth” once it is confirmed by the majority of followers (quorum!)
(diagram: a client sends 12 to the manager leader, which replicates it to the manager followers)
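That commit rule can be sketched as: the leader appends locally, replicates to the followers, and only counts the entry as committed once acks (including its own) reach quorum. A toy model of the idea, not Raft’s actual RPC flow:

```python
class Follower:
    def __init__(self):
        self.log = []

    def append_entries(self, entry):
        self.log.append(entry)
        return True  # ack back to the leader

class Leader:
    def __init__(self, followers):
        self.log = []
        self.followers = followers
        self.committed = 0  # number of entries that have "become truth"

    def propose(self, entry):
        self.log.append(entry)
        acks = 1  # the leader counts itself
        for f in self.followers:
            if f.append_entries(entry):
                acks += 1
        cluster_size = len(self.followers) + 1
        if acks >= cluster_size // 2 + 1:  # quorum!
            self.committed = len(self.log)
            return True
        return False

leader = Leader([Follower(), Follower()])
print(leader.propose(12))  # True: all 3 managers hold the entry
print(leader.committed)    # 1
```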
31. Scheduling constraints
Restrict services to specific nodes, e.g. by architecture, security level, or node type:
docker service create --constraint 'node.labels.type==web' my-app
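Conceptually, a hard constraint is just a filter over candidate nodes: only nodes whose labels match can receive the task. A simplified sketch (the node structure here is assumed, not Swarm’s internal API):

```python
nodes = [
    {"name": "node-1", "labels": {"type": "web"}},
    {"name": "node-2", "labels": {"type": "db"}},
    {"name": "node-3", "labels": {"type": "web"}},
]

def satisfies(node, key, value):
    # Mirrors --constraint 'node.labels.type==web'
    return node["labels"].get(key) == value

candidates = [n["name"] for n in nodes if satisfies(n, "type", "web")]
print(candidates)  # ['node-1', 'node-3']
```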
32. New in 17.04.0-ce
Topology-aware scheduling!!1!
Implements a spread strategy over nodes that belong to a certain category.
Unlike --constraint, this is a “soft” preference:
--placement-pref 'spread=node.labels.dc'
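A spread preference groups nodes by the label’s value and favors the group carrying the fewest tasks, so placements stay balanced across e.g. datacenters. A rough sketch of the idea (not Swarm’s scheduler):

```python
from collections import defaultdict

def pick_node(node_labels, tasks_per_node):
    """Soft spread: prefer the label group with the fewest running tasks."""
    groups = defaultdict(list)
    for node, label_value in node_labels.items():
        groups[label_value].append(node)

    def group_load(label_value):
        return sum(tasks_per_node.get(n, 0) for n in groups[label_value])

    best_group = min(groups, key=group_load)
    # Inside the chosen group, pick the least-loaded node
    return min(groups[best_group], key=lambda n: tasks_per_node.get(n, 0))

node_labels = {"a1": "dc1", "a2": "dc1", "b1": "dc2"}
tasks = {"a1": 2, "a2": 1, "b1": 1}
print(pick_node(node_labels, tasks))  # b1: dc2 carries fewer tasks overall
```

Because it is only a preference, a real scheduler would still place the task somewhere if the favored group had no eligible nodes.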
34. Swarm will not rebalance healthy tasks when a new node comes online.
35. Debugging Tip
Add a manager to your Swarm running with --availability drain and in Engine debug mode.
38. Regain quorum
• Bring the downed nodes back online (derp)
• On a healthy manager, run docker swarm init --force-new-cluster. This will create a new cluster with one healthy manager.
• You need to promote new managers.
40. Restore from a backup in 5 easy steps!
• Bring up a new manager and stop Docker
• sudo rm -rf /var/lib/docker/swarm
• Copy the backup to /var/lib/docker/swarm
• Start Docker
• docker swarm init (--force-new-cluster)
41. But wait, there’s a bug… or a feature
• In general, users shouldn’t be allowed to modify the IP addresses of nodes
• Restoring from a backup == old IP address for node1
• The workaround is to use elastic IPs with the ability to reassign them