Cluster Schedulers
Pietro Michiardi
Eurecom
Pietro Michiardi (Eurecom) Cluster Schedulers 1 / 129
Introduction
Introduction and
Preliminaries
Pietro Michiardi (Eurecom) Cluster Schedulers 2 / 129
Introduction
Overview of the Lecture
General overview of cluster scheduling principles
General objectives
A taxonomy
Current architectures
In depth presentation of three representative examples
Yarn
Mesos
Borg (Kubernetes)
Pietro Michiardi (Eurecom) Cluster Schedulers 3 / 129
Introduction Cluster Scheduling Principles
Cluster Scheduling Principles
Pietro Michiardi (Eurecom) Cluster Schedulers 4 / 129
Introduction Cluster Scheduling Principles
Objectives
Large-scale clusters are expensive, so it is important to use
them well
Cluster utilization and efficiency are key indicators for good
resource management and scheduling decisions
Translates directly to cost arguments: better scheduling → smaller
clusters
Multiplexing to the rescue
Multiple, heterogeneous mixes of applications run concurrently
→ The scheduling problem is a challenge
Scalability bottlenecks
Cluster and workload sizes keep growing
Scheduling complexity roughly proportional to cluster size
→ Schedulers must be scalable
Pietro Michiardi (Eurecom) Cluster Schedulers 5 / 129
Introduction Cluster Scheduling Principles
Current Scheduler Architectures
Monolithic: use a centralized scheduling and resource
management algorithm for all jobs
Difficult to add new scheduling policies
Do not scale well to large cluster sizes
Two-level: single resource manager that grants resources to
independent “framework schedulers”
Flexibility in accommodating multiple application frameworks
Resource management and locking are conservative, which can
hurt cluster utilization and performance
Pietro Michiardi (Eurecom) Cluster Schedulers 6 / 129
Introduction Cluster Scheduling Principles
Typical workloads to support
Cluster scheduler must support heterogeneous workloads
and clusters
Clusters are made of several generations of machines
Workloads evolve in time, and can be made of a variety of
applications
Rough categorization of job types
Batch jobs: e.g., MapReduce computations
Service jobs: e.g., end-user facing web service
Knowing your workload is fundamental!
Next, some examples from real cluster traces
The rationale: measurements can inform scheduling design
Pietro Michiardi (Eurecom) Cluster Schedulers 7 / 129
Introduction Cluster Scheduling Principles
Real cluster trace: workload characteristics
Solid lines: batch jobs, dashed lines: service jobs
Pietro Michiardi (Eurecom) Cluster Schedulers 8 / 129
Introduction Cluster Scheduling Principles
Real cluster trace: workload characteristics
Solid lines: batch jobs, dashed lines: service jobs
Pietro Michiardi (Eurecom) Cluster Schedulers 9 / 129
Introduction Cluster Scheduling Principles
Short Taxonomy of Scheduling Design Issues
Scheduling Work Partitioning: how to distribute work across
frameworks
Workload-oblivious load-balancing
Workload partitioning and specialized schedulers
Hybrid
Resource choice: which cluster resources are available to
concurrent frameworks
All resources available
A subset of cluster resources is granted (or offered)
NOTE: preemption primitives help scheduling flexibility at the cost
of potentially wasting work
Pietro Michiardi (Eurecom) Cluster Schedulers 10 / 129
Introduction Cluster Scheduling Principles
Short Taxonomy of Scheduling Design Issues
Interference: what to do when multiple frameworks attempt to
use the same resources
Pessimistic concurrency control: make sure to avoid conflicts, by
partitioning resources across frameworks
Optimistic concurrency control: hope for the best, otherwise detect
and undo conflicting claims
Allocation Granularity: task “scheduling” policies
Atomic, all-or-nothing gang scheduling: e.g. MPI
Incremental placement, hoarding: e.g. MapReduce
Cluster-wide behavior: some requirements need global view
Fairness across frameworks
Global notion of priority
Pietro Michiardi (Eurecom) Cluster Schedulers 11 / 129
Introduction Cluster Scheduling Principles
Summary of design knobs
Comparison of cluster scheduling approaches
Pietro Michiardi (Eurecom) Cluster Schedulers 12 / 129
Introduction Cluster Scheduling Principles
Architecture Details
High-level summary of scheduling objectives
Minimize the job queueing time, or more generally, the system
response time (queueing + service times)
Subject to
Priorities among jobs
Per-job constraints
Failure tolerance
Scalability
Scheduling architectures
Monolithic schedulers
Statically partitioned schedulers
Two-level schedulers
Shared-state schedulers (cf. Omega paper)
Pietro Michiardi (Eurecom) Cluster Schedulers 13 / 129
Introduction Cluster Scheduling Principles
Monolithic Schedulers
Single centralized instance
Typical of HPC setting
Implements all policies in a single code base
Applies the same scheduling algorithm to all incoming jobs
Alternative designs
Support multiple code paths for different jobs
Each path implements a different scheduling logic
→ Difficult to implement and maintain
Pietro Michiardi (Eurecom) Cluster Schedulers 14 / 129
Introduction Cluster Scheduling Principles
Statically Partitioned Schedulers
Standard “Cloud-computing” approach
Underlying assumption: each framework has complete control over
a set of resources
Depend on statically partitioned, dedicated resources
Examples: Hadoop 1.0, Quincy scheduler
Problems with static partitioning
Fragmentation of resources
Sub-optimal cluster utilization
Pietro Michiardi (Eurecom) Cluster Schedulers 15 / 129
Introduction Cluster Scheduling Principles
Two-level Schedulers
Obviates the problems of static partitioning
Dynamic allocation of resources to concurrent frameworks
Use a “logically centralized” coordinator to decide how many
resources to grant
A first example: Mesos
Centralized resource allocator, with dynamic cluster partitioning
Available resources are offered to competing frameworks
Avoids interference by exclusive offers
Frameworks lock resources by accepting offers → pessimistic
concurrency control
No global cluster state available to frameworks
Another tricky example: YARN
Centralized resource allocator (RM), with per-job framework master
(AM)
AM only provides job management services, not proper scheduling
→ YARN is closer to a monolithic architecture
Pietro Michiardi (Eurecom) Cluster Schedulers 16 / 129
Introduction Cluster Scheduling Principles
Representative Cluster
Schedulers
Pietro Michiardi (Eurecom) Cluster Schedulers 17 / 129
YARN
YARN
Pietro Michiardi (Eurecom) Cluster Schedulers 18 / 129
YARN Introduction and Motivations
Introduction and Motivations
Pietro Michiardi (Eurecom) Cluster Schedulers 19 / 129
YARN Introduction and Motivations
Hadoop 1.0: Focus on Batch applications
Built for batch applications
Supports only MapReduce applications
Different silos for each usage pattern
Pietro Michiardi (Eurecom) Cluster Schedulers 20 / 129
YARN Introduction and Motivations
Hadoop 1.0: Architecture (reloaded)
JobTracker
Manages cluster resources
Performs Job scheduling
Performs Task scheduling
TaskTracker
Per machine agent
Manages Task execution
Pietro Michiardi (Eurecom) Cluster Schedulers 21 / 129
YARN Introduction and Motivations
Hadoop 1.0: Limitations
Only supports MapReduce, no other paradigms
Everything needs to be cast to MapReduce
Iterative applications are slow
Scalability issues
Max cluster size: roughly 4,000 nodes
Max concurrent tasks: roughly 40,000 tasks
Availability
System failures destroy running and queued jobs
Resource utilization
Hard, static partitioning of resources into Map and Reduce slots
Non-optimal resource utilization
Pietro Michiardi (Eurecom) Cluster Schedulers 22 / 129
YARN Introduction and Motivations
Next Generation Hadoop
Pietro Michiardi (Eurecom) Cluster Schedulers 23 / 129
YARN Introduction and Motivations
The YARN ecosystem
Store all data in one place
Avoids costly duplication
Interact with data in multiple ways
Not only in batch mode, with the rigid MapReduce model
More predictable performance
Advanced scheduling mechanisms
Pietro Michiardi (Eurecom) Cluster Schedulers 24 / 129
YARN Introduction and Motivations
Key Improvements in YARN (1)
Support for multiple applications
Separate generic resource brokering from application logic
Define protocols/libraries and provide a framework for custom
application development
Share same Hadoop Cluster across applications
Improved cluster utilization
Generic resource container model replaces fixed Map/Reduce slots
Container allocations based on locality and memory
Sharing the cluster among multiple applications
Improved scalability
Remove complex app logic from resource management
State-machine based, loosely coupled design using message passing
Compact scheduling protocol
Pietro Michiardi (Eurecom) Cluster Schedulers 25 / 129
YARN Introduction and Motivations
Key Improvements in YARN (2)
Application Agility
Using Protocol Buffers for RPC gives wire compatibility
Map Reduce becomes an application in user space
Multiple versions of an app can co-exist, enabling experimentation
Easier upgrade of framework and applications
A data operating system: shared services
Common services included in a pluggable framework
Distributed file sharing service
Remote data read service
Log Aggregation Service
Pietro Michiardi (Eurecom) Cluster Schedulers 26 / 129
YARN YARN Architecture Overview
YARN Architecture Overview
Pietro Michiardi (Eurecom) Cluster Schedulers 27 / 129
YARN YARN Architecture Overview
YARN: Architecture Overview
Pietro Michiardi (Eurecom) Cluster Schedulers 28 / 129
YARN YARN Architecture Overview
YARN: Design Decisions
No static resource partitioning
There are no more slots
Nodes have resources, which are allocated to applications when
requested
Separate resource management from application logic
Cluster-wide resource allocation and management
Per-application master component
Multiple applications → multiple masters
Pietro Michiardi (Eurecom) Cluster Schedulers 29 / 129
YARN YARN Architecture Overview
YARN Daemons
Resource Manager (RM)
Runs on master node
Global resource manager and scheduler
Arbitrates system resources between competing applications
Node Manager (NM)
Runs on slave nodes
Communicates with RM
Reports utilization
Resource containers
Created by the RM upon request
Allocate a certain amount of resources on slave nodes
Applications run in one or more containers
Application Master (AM)
One per application, application specific1
Requests more containers to execute application tasks
Runs in a container
1 Every new application requires a new AM to be designed and implemented!
Pietro Michiardi (Eurecom) Cluster Schedulers 30 / 129
YARN YARN Architecture Overview
YARN: Example with 2 Applications
Pietro Michiardi (Eurecom) Cluster Schedulers 31 / 129
YARN YARN Core Components
YARN Core Components
Pietro Michiardi (Eurecom) Cluster Schedulers 32 / 129
YARN YARN Core Components
YARN Schedulers (1)
Schedulers are a pluggable component of the RM
In addition to existing ones, advanced scheduling is supported
Current supported schedulers
The Capacity scheduler
The Fair scheduler
Dominant Resource Fairness
What’s different w.r.t. Hadoop 1.0?
Support any YARN application, not just MapReduce
No more slots, tasks are scheduled based on resources
Some terminology change
Pietro Michiardi (Eurecom) Cluster Schedulers 33 / 129
YARN YARN Core Components
YARN Schedulers (2)
Hierarchical queues
Queues can contain sub-queues
Sub-queues share the resources assigned to their parent queue
Pietro Michiardi (Eurecom) Cluster Schedulers 34 / 129
YARN YARN Core Components
YARN Resource Manager: Overview
Pietro Michiardi (Eurecom) Cluster Schedulers 35 / 129
YARN YARN Core Components
YARN Resource Manager: Operations
Node Management
Tracks heartbeats from NMs
Container Management
Handles AM requests for new containers
De-allocates containers when they expire or application finishes
AM Management
Creates a container for every new AM, and tracks its health
Security Management
Kerberos integration
Pietro Michiardi (Eurecom) Cluster Schedulers 36 / 129
YARN YARN Core Components
YARN Node Manager: Overview
Pietro Michiardi (Eurecom) Cluster Schedulers 37 / 129
YARN YARN Core Components
YARN Node Manager: Operations
Manages communications with the RM
Registers, monitors and communicates node resources
Sends heartbeats and container status
Manages processes in containers
Launches AMs on request from the RM
Launches application processes on request from the AMs
Monitors resource usage
Kills processes and containers
Provides logging services
Log aggregation and roll over to HDFS
Pietro Michiardi (Eurecom) Cluster Schedulers 38 / 129
YARN YARN Core Components
YARN Resource Request
Resource Request
Resource name: hostname, rackname, *
Priority: within the same application, not across apps
Resource requirements: memory, CPU, and more to come...
Number of containers
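To make the shape of such a request concrete, here is a minimal Python sketch of the fields listed above; the class name and values are illustrative, not YARN's actual API.

```python
from dataclasses import dataclass

@dataclass
class ResourceRequest:
    """Illustrative YARN-style resource request (not the actual YARN API)."""
    resource_name: str    # hostname, rack name, or "*" for any location
    priority: int         # meaningful only within the same application
    memory_mb: int        # resource requirements: memory ...
    vcores: int           # ... and CPU (more dimensions may come)
    num_containers: int   # how many containers with this profile

# An AM asking for 10 one-core / 2 GB containers anywhere in the cluster:
request = ResourceRequest(resource_name="*", priority=1,
                          memory_mb=2048, vcores=1, num_containers=10)
```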
Pietro Michiardi (Eurecom) Cluster Schedulers 39 / 129
YARN YARN Core Components
YARN Containers
Container Launch Context
Container ID
Commands to start application task(s)
Environment configuration
Local resources: application/task binary, HDFS files
Pietro Michiardi (Eurecom) Cluster Schedulers 40 / 129
YARN YARN Core Components
YARN Fault Tolerance
Container failure
AM re-attempts containers that complete with exceptions or fail
Applications with too many failed containers are considered failed
AM failure
If the application or the AM fails, the RM will re-attempt the whole application
Optional strategy: job recovery
If false, all containers are re-scheduled
If true, uses state to find which containers succeeded and which
failed, to re-schedule only failed ones
Pietro Michiardi (Eurecom) Cluster Schedulers 41 / 129
YARN YARN Core Components
YARN Fault Tolerance
NM failure
If NM stops sending heartbeats, RM removes it from active node list
Containers on the failed node are re-scheduled
AMs running on the failed node are re-submitted completely
RM failure
No application can be run if RM is down
Can work in active-passive mode (just like the NN of HDFS)
Pietro Michiardi (Eurecom) Cluster Schedulers 42 / 129
YARN YARN Core Components
YARN Shuffle Service
The Shuffle mechanism is now an auxiliary service
Runs in the NM JVM as a persistent service
Pietro Michiardi (Eurecom) Cluster Schedulers 43 / 129
YARN YARN Application Example
YARN Application Example
Pietro Michiardi (Eurecom) Cluster Schedulers 44 / 129
YARN YARN Application Example
YARN WordCount execution
Pietro Michiardi (Eurecom) Cluster Schedulers 45 / 129
MESOS
Mesos
Pietro Michiardi (Eurecom) Cluster Schedulers 55 / 129
MESOS Introduction
Introduction
Pietro Michiardi (Eurecom) Cluster Schedulers 56 / 129
MESOS Introduction
Introduction and Motivations
Clusters of commodity servers have become a major computing platform
Modern Internet Services
Data-intensive applications
New frameworks developed to “program the cluster”
Hadoop MapReduce, Apache Spark, Microsoft Dryad
Pregel, Storm, ...
and many more
No one-size-fits-all framework
Pick the right frameworks for the application
Run multiple frameworks at the same time
→ Multiplexing cluster resources among frameworks
Improves cluster utilization
Allows sharing of data without the need to replicate it
Pietro Michiardi (Eurecom) Cluster Schedulers 57 / 129
MESOS Introduction
Common Solutions to Share a Cluster
Common practice to achieve cluster sharing
Static partitioning
Traditional virtualization
Problems of current approaches
Mismatch between allocation granularities
No mechanism to allocate resources to short-lived tasks
→ Underlying hypothesis for Mesos
Cluster frameworks operate with short tasks
Cluster resources free up quickly
This makes it possible to achieve data locality
Pietro Michiardi (Eurecom) Cluster Schedulers 58 / 129
MESOS Introduction
Mesos Design Objectives
Mesos: a thin resource sharing layer enabling fine-grained sharing
across diverse frameworks
Challenges
Each supported framework has different scheduling needs
Scalability is crucial (10,000+ nodes)
Fault-tolerance and high availability
Would a centralized approach work?
Input: framework requirements, instantaneous resource availability,
organization policies
Output: global schedule for all tasks of all jobs of all frameworks
Pietro Michiardi (Eurecom) Cluster Schedulers 59 / 129
MESOS Introduction
Mesos Key Design Principles
Centralized approach does not work
Complexity
Scalability and resilience
Moving framework-specific scheduling to a centralized scheduler
requires expensive refactoring
A decentralized approach
Based on the abstraction of a resource offer
Mesos decides how many resources to offer to a framework
The framework decides which resources to accept and which tasks
to run on them
Pietro Michiardi (Eurecom) Cluster Schedulers 60 / 129
MESOS Target Environment
Target Workloads
Pietro Michiardi (Eurecom) Cluster Schedulers 61 / 129
MESOS Target Environment
Target Environment
Typical workloads in “Data Warehouse” systems
Heterogeneous MapReduce jobs, production and ad-hoc queries
Large scale machine learning
SQL-like queries
Pietro Michiardi (Eurecom) Cluster Schedulers 62 / 129
MESOS Architecture
Architecture
Pietro Michiardi (Eurecom) Cluster Schedulers 63 / 129
MESOS Architecture
Design Philosophy
Data center operating system
Scalable and resilient core exposing low-level interfaces
High-level libraries for common functionalities
Minimal interface to support resource sharing
Mesos manages cluster resources
Frameworks control task scheduling and execution
Benefits of two-level approach
Frameworks are independent and can support diverse scheduling
requirements
Mesos is kept simple, minimizing the rate of change to the system
Pietro Michiardi (Eurecom) Cluster Schedulers 64 / 129
MESOS Architecture
Architecture Overview
Pietro Michiardi (Eurecom) Cluster Schedulers 65 / 129
MESOS Architecture
Architecture Overview
The Mesos Master
Uses Resource Offers to implement fine-grained sharing
Collects resource utilization from slaves
Resource offer: list of free resources on multiple slaves
First-level Scheduling
Master decides how many resources to offer a framework
Implements a cluster-wide allocation policy:
Fair Sharing
Priority based
Pietro Michiardi (Eurecom) Cluster Schedulers 66 / 129
MESOS Architecture
Architecture Overview
Mesos Frameworks
Framework scheduler
Registers to the master
Selects which offer to accept
Describes the tasks to launch on accepted resources
Framework executor
Launched on Mesos slaves executing on accepted resources
Takes care of running the framework’s tasks
Second-level Scheduling
One framework scheduler per application
Framework decides how to execute a job and its tasks
NOTE: The actual task execution is requested by the master
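A minimal sketch of the second-level decision a framework scheduler makes when offers arrive; the offer/task fields and the launch/decline calls are hypothetical stand-ins, not the real Mesos API.

```python
def on_resource_offers(offers, pending_tasks, driver):
    """Hypothetical framework-scheduler callback (not the real Mesos API)."""
    for offer in offers:
        to_launch = []
        cpus, mem = offer.cpus, offer.mem
        # Greedily pick pending tasks that fit into this single offer.
        for task in list(pending_tasks):
            if task.cpus <= cpus and task.mem <= mem:
                to_launch.append(task)
                pending_tasks.remove(task)
                cpus -= task.cpus
                mem -= task.mem
        if to_launch:
            driver.launch_tasks(offer.id, to_launch)  # accept: resources are locked
        else:
            driver.decline_offer(offer.id)            # reject: offer returns to Mesos
```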
Pietro Michiardi (Eurecom) Cluster Schedulers 67 / 129
MESOS Architecture
Resource Offer Example
Pietro Michiardi (Eurecom) Cluster Schedulers 68 / 129
MESOS Architecture
Consequences of the Mesos Architecture
Mesos makes no assumptions on Framework requirements
This is unlike other approaches, which require the cluster
scheduler to understand application constraints
This does not mean that users are exempt from expressing their
applications' constraints
Rejecting offers
It is the framework that decides to reject a resource offer that does
not satisfy application constraints
Frameworks can wait for offers to satisfy constraints
Arbitrary and complex resource constraints
Delegate logic and control to individual frameworks
Mesos also implements filters to optimize resource offers
Pietro Michiardi (Eurecom) Cluster Schedulers 69 / 129
MESOS Architecture
Resource Allocation
Pluggable allocation module
Max-min Fairness
Strict priority
Fundamental assumption
Tasks are short
→ Mesos only reallocates resources when tasks finish
Example
Assume a Framework’s share is 10% of the cluster
It needs to wait 10% of the mean task length to receive its share
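For reference, a small sketch of (weighted) max-min fair allocation over a single resource, the kind of policy a pluggable allocation module could implement; the function name and the single-resource simplification are assumptions.

```python
def max_min_fair(capacity, demands, weights=None):
    """Weighted max-min fair shares of a single resource (illustrative sketch)."""
    weights = weights or [1.0] * len(demands)
    shares = [0.0] * len(demands)
    active = set(range(len(demands)))
    remaining = capacity
    while active and remaining > 1e-9:
        total_w = sum(weights[i] for i in active)
        for i in list(active):
            # Offer each active framework its weighted slice of what is left,
            # capped by its remaining demand.
            shares[i] += min(demands[i] - shares[i],
                             remaining * weights[i] / total_w)
        remaining = capacity - sum(shares)
        active = {i for i in active if demands[i] - shares[i] > 1e-9}
    return shares

# e.g. capacity 10, demands [2, 6, 8] -> [2.0, 4.0, 4.0]
```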
Pietro Michiardi (Eurecom) Cluster Schedulers 70 / 129
MESOS Architecture
Resource Revocation
Short vs. long lived tasks
Some jobs (e.g. streaming) may have long tasks
In this case, Mesos can Kill running tasks
Preemption primitives
Require knowledge about potential resource usage by frameworks
Killing might be wasteful, although not critical (e.g. MapReduce)
Some applications (e.g. MPI) might be harmed
Guaranteed allocation
Minimum set of resources granted to a framework
If below guaranteed allocation → never kill tasks
If above guaranteed allocation → kill any tasks
Pietro Michiardi (Eurecom) Cluster Schedulers 71 / 129
MESOS Architecture
Performance Isolation
Isolation between executors on the same slave
Achieved through low-level OS primitives
Pluggable isolation modules to support a variety of OS
Currently supported mechanisms
Limit CPU, memory, network and I/O bandwidth of a process tree
Linux Containers and Solaris Projects
Advantages and limitations
Better isolation than the previous process-based approach
Fine grained isolation is not yet fully functional
Pietro Michiardi (Eurecom) Cluster Schedulers 72 / 129
MESOS Architecture
Mesos Scalability
Filter mechanism
Short-circuit the rejection process, avoids unnecessary
communication
Filter type 1: restrict which slave machines to use
Filter type 2: check resource availability on slaves
Incentives to speed-up the resource offer mechanism
Mesos counts offers to a framework toward its allocation
Frameworks have to answer and/or filter as quickly as possible
Rescinding offers
Mesos can decide to invalidate an offer to a framework
This avoids blocking and misbehavior
Pietro Michiardi (Eurecom) Cluster Schedulers 73 / 129
MESOS Architecture
Mesos Fault Tolerance
Master designed with Soft State
List of active slaves
List of registered frameworks
List of running tasks
Multiple masters in a hot-standby mode
Leader election through Zookeeper
Upon failure detection new master is elected
Slaves and executors help populate the new master's state
Helping frameworks to tolerate failure
Master sends “health reports” to framework schedulers
Master allows multiple schedulers for a single framework
Pietro Michiardi (Eurecom) Cluster Schedulers 74 / 129
MESOS Mesos Behavior
System behavior: a very rough Mesos “model”
Pietro Michiardi (Eurecom) Cluster Schedulers 75 / 129
MESOS Mesos Behavior
Overview: Mesos in a nutshell
Ideal workloads for Mesos
Elastic frameworks, supporting scaling up and down seamlessly
Task durations are homogeneous (and short)
No strict preference over cluster nodes
Frameworks with cluster node preferences
Assume frameworks prefer different (and possibly disjoint) nodes
Mesos can emulate a centralized scheduler
Cluster and Framework wide fair resource sharing
Heterogeneous task durations
Mesos can handle coexisting short and long lived tasks
Performance degradation is acceptable
Pietro Michiardi (Eurecom) Cluster Schedulers 76 / 129
MESOS Mesos Behavior
Definitions
Workload characterization
Elasticity: elastic workloads can use resources as soon as they are
acquired, and release them as soon as tasks finish; in contrast,
rigid frameworks (e.g. MPI) can only start a job when all resources
have been acquired, and do not work well with scaling
Task runtime distribution: both homogeneous and not
Resource characterization
Mandatory: resource that a framework must acquire to work.
Assumption: mandatory resources < guaranteed share
Preferred: resources that a framework should acquire to achieve
better performance, but are not necessary for the job to work
Pietro Michiardi (Eurecom) Cluster Schedulers 77 / 129
MESOS Mesos Behavior
Performance Metrics
Performance metrics
Framework ramp-up time: time it takes a new framework to achieve
its fair share
Job completion time: time it takes a job to complete. Assume one
job per framework
System utilization: total cluster resource utilization, with focus on
CPU and memory
Pietro Michiardi (Eurecom) Cluster Schedulers 78 / 129
MESOS Mesos Behavior
Homogeneous Tasks
Cluster with n slots and a framework f entitled to k slots
Task runtime distribution: uniform and exponential
Mean task duration T
Total job computation time: βkT
→ If f has k slots, the job completes in time βT
Pietro Michiardi (Eurecom) Cluster Schedulers 79 / 129
MESOS Mesos Behavior
Placement preferences
Consider two cases:
There exists a configuration satisfying all frameworks' constraints
The system will eventually converge to the optimal allocation,
within at most one mean task duration T
No such allocation exists, e.g. demand is larger than supply
Lottery Scheduling to achieve a weighted fair allocation
Mesos offers a slot to framework i with probability s_i / (Σ_{j=1..m} s_j),
where s_i is framework i's intended allocation, and m is the total
number of frameworks registered with Mesos
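A tiny sketch of the lottery step described above: a free slot is offered to one framework, drawn with probability proportional to its intended allocation (pure illustration, not Mesos code).

```python
import random

def lottery_offer(intended_allocations):
    """Pick the framework to offer a free slot to, with probability s_i / sum_j s_j."""
    frameworks = list(intended_allocations)
    weights = [intended_allocations[f] for f in frameworks]
    return random.choices(frameworks, weights=weights, k=1)[0]

# e.g. {"spark": 30, "mr": 10}: spark receives ~75% of the offers on average
```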
Pietro Michiardi (Eurecom) Cluster Schedulers 80 / 129
MESOS Mesos Behavior
Heterogeneous Tasks
Assumptions
Workloads with tasks that are either long or short
Mean duration of long task is longer than short ones
Worst case scenario
All nodes required by a “short job” are filled with long tasks, which
means it has to wait for a long time
How likely is the worst case?
Assume φ < 1, where φ is the fraction of long tasks
Assume a cluster with S available slots per node
→ The probability for a node to be filled entirely with long tasks is φ^S
S = 8 and φ = 0.5 gives a 0.4% chance
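The back-of-the-envelope number above follows directly from φ^S:

```python
phi, S = 0.5, 8
print(f"P(node filled with long tasks) = {phi ** S:.4f}")  # 0.0039, i.e. ~0.4%
```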
Pietro Michiardi (Eurecom) Cluster Schedulers 81 / 129
MESOS Mesos Behavior
Limitations of Distributed Scheduling
Fragmentation
Causes under-utilization of system resources
A distributed collection of frameworks might not achieve the same
“packing” quality as a centralized scheduler
→ This is mitigated by having clusters of “big” nodes (many CPUs,
many cores) running “small” tasks
Starvation
Large jobs may wait indefinitely for slots to become free
Small tasks from small jobs might monopolize the cluster
→ This is mitigated by a minimum offer size mechanism
Pietro Michiardi (Eurecom) Cluster Schedulers 82 / 129
MESOS Mesos Performance
Experimental Mesos Performance Evaluation
Pietro Michiardi (Eurecom) Cluster Schedulers 83 / 129
MESOS Mesos Performance
Resource Allocation
Pietro Michiardi (Eurecom) Cluster Schedulers 84 / 129
MESOS Mesos Performance
Resource Utilization
Pietro Michiardi (Eurecom) Cluster Schedulers 85 / 129
MESOS Mesos Performance
Data Locality
Pietro Michiardi (Eurecom) Cluster Schedulers 86 / 129
BORG
Borg
Pietro Michiardi (Eurecom) Cluster Schedulers 87 / 129
BORG Introduction
Introduction
Pietro Michiardi (Eurecom) Cluster Schedulers 88 / 129
BORG Introduction
Introduction and Objectives
Hide the details of resource management
Let users instead focus on application development
Operate applications with high reliability and availability
Tolerate failures within a datacenter and across datacenters
Run heterogeneous workloads and scale across thousands
of machines
Pietro Michiardi (Eurecom) Cluster Schedulers 89 / 129
BORG User perspective
The User Perspective
Pietro Michiardi (Eurecom) Cluster Schedulers 90 / 129
BORG User perspective
Terminology
Users develop applications called jobs
Jobs consist of one or more tasks
All tasks run the same binary
Each job runs in a set of machines managed as a unit, called
a Borg Cell
Pietro Michiardi (Eurecom) Cluster Schedulers 91 / 129
BORG User perspective
Workloads
Two main categories supported
Long-running services: jobs that should “never” go down, and
handle short-lived, latency-sensitive requests
Batch jobs: delay-tolerant jobs that can take from few seconds to
few days to complete
Storage services: these are long-running services like above, that
are used to store data
Workload composition in a cell is dynamic
It varies depending on the tenants using the cell
It varies with time: diurnal usage pattern for end-user-facing jobs,
irregular pattern for batch jobs
Examples
High-priority, production jobs → long-running services
Low-priority, non-production jobs → batch jobs
In a typical Borg Cell
Prod jobs: ~70% of CPU allocation, representing ~60% of CPU usage
Prod jobs: ~55% of memory allocation, representing ~85% of memory usage
Pietro Michiardi (Eurecom) Cluster Schedulers 92 / 129
BORG User perspective
Clusters and Cells
Borg Cluster: a set of machines connected by a
high-performance datacenter-scale network fabric
The machines in a Borg Cell all belong to a single cluster
A cluster lives inside a datacenter building
A collection of buildings makes up a site
Borg Machines: physical servers dedicated to execute Borg
applications
They are generally highly heterogeneous in terms of resources
They may expose a public IP address
They may expose advanced features, like SSD or GPGPU
Examples
A typical cluster usually hosts one large cell and a few small-scale
test cells
The median cell size is about 10k machines
Borg uses those machines to schedule application tasks, install
their binaries and dependencies, monitor their health, and restart
them if they fail
Pietro Michiardi (Eurecom) Cluster Schedulers 93 / 129
BORG User perspective
Jobs and Tasks
Job Definition
Name, owner and number of tasks
Constraints to force tasks to run on machines with particular attributes
Constraints can be hard or soft (i.e., preferences)
Each task maps to a set of UNIX processes running in a container
on a Borg machine in a Borg Cell
Task Definition
Task index within their parent job
Resource requirements
Generally, all tasks have the same definition
Task requests can take any value along each resource dimension: there are no
fixed-size slots or buckets
Pietro Michiardi (Eurecom) Cluster Schedulers 94 / 129
BORG User perspective
Jobs and Tasks
The Borg Configuration Language
Declarative language to specify jobs and tasks
Lambda functions to allow calculations
Some application descriptions can be over 1k lines of code
User interacting with live jobs
This is achieved mainly using RPC
Users can update the specification of tasks, while their parent job is
running
Updates are non-atomic and executed in a rolling fashion
Task updates “side-effects”
Always require restarts: e.g., pushing a new binary
Might require migration: e.g., change in specification
Never require restarts nor migrations: e.g., change in priority
Pietro Michiardi (Eurecom) Cluster Schedulers 95 / 129
BORG User perspective
Jobs State Diagram
Pietro Michiardi (Eurecom) Cluster Schedulers 96 / 129
BORG User perspective
Resource Allocations
The Borg “Alloc”
Reserved set of resources on an individual machine
Can be used to execute one or more tasks, which share the alloc's
resources
Resources remain assigned whether or not they are used
Typical use of Borg Allocs
Set resources aside for future tasks
Retain resources between stopping and starting tasks
Consolidate (gather) tasks from different jobs on the same machine
Alloc Sets
Group of allocs on different machines
Once an alloc set has been created, one or more jobs can be
submitted
Pietro Michiardi (Eurecom) Cluster Schedulers 97 / 129
BORG User perspective
Priority, Quota and Admission Control
Mechanisms to deal with resource demand and offer
What to do when more work shows up than can be accommodated?
Note: this is not scheduling, it is more admission control
Job priority
Non-overlapping priority bands for different uses
This essentially means users must “manually” cluster their
applications according to such bands
→ Tasks from high-priority jobs can preempt low-priority tasks
Cascade preemption is avoided by disabling it for same-band jobs
Job/User quotas
Used to decide which job to admit for scheduling
Expressed as a vector of resource quantities
Pricing
Underlying mechanism to regulate user behavior
Aligns user incentives to better resource utilization
Discourages over-buying by over-selling quotas at lower priority
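A minimal sketch of the band-aware preemption rule described above: tasks may preempt strictly lower-band tasks, never tasks in their own band. The band names and values here are illustrative, not Borg's actual bands.

```python
# Illustrative priority bands (low to high); the real Borg bands differ.
BANDS = {"best_effort": 0, "batch": 1, "prod": 2, "monitoring": 3}

def can_preempt(preemptor_band, victim_band):
    """Higher-band tasks may evict lower-band ones; same-band preemption is
    disabled to avoid cascading preemptions."""
    return BANDS[preemptor_band] > BANDS[victim_band]

assert can_preempt("prod", "batch")
assert not can_preempt("prod", "prod")  # same band: no cascade
```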
Pietro Michiardi (Eurecom) Cluster Schedulers 98 / 129
BORG User perspective
Naming Services
Borg Name Service
A mechanism to assign a name to tasks
Task name = Cell name, job name and task number
Uses the Chubby coordination service
Writes task names into it
Writes also health information and status
Used by Borg RPC mechanism to establish communication
endpoints
DNS service inherits from BNS
Example: the 50th task in job “jfoo” owned by user “ubar” in a Borg
Cell called “cc”
50.jfoo.ubar.cc.borg.google.com
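A trivial helper showing how the example name above is assembled from task index, job, user, and cell (purely illustrative):

```python
def bns_name(task_index, job, user, cell, domain="borg.google.com"):
    """Build a BNS-style DNS name, e.g. 50.jfoo.ubar.cc.borg.google.com."""
    return f"{task_index}.{job}.{user}.{cell}.{domain}"

assert bns_name(50, "jfoo", "ubar", "cc") == "50.jfoo.ubar.cc.borg.google.com"
```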
Pietro Michiardi (Eurecom) Cluster Schedulers 99 / 129
BORG User perspective
Monitoring Services
Every task in Borg has a built-in HTTP server
Provides health information
Provides performance metrics
Borg SIGMA
Monitoring UI Service
State of jobs, of cells
Drill-down to task level
Why pending?
“Debugging” service
Helps users find job specifications that can be scheduled more
easily
Billing services
Use monitoring information to compute usage-based charging
Help users debug their jobs
Used for capacity planning
Pietro Michiardi (Eurecom) Cluster Schedulers 100 / 129
BORG Borg Architecture
Borg Architecture
Pietro Michiardi (Eurecom) Cluster Schedulers 101 / 129
BORG Borg Architecture
Architecture Overview
Pietro Michiardi (Eurecom) Cluster Schedulers 102 / 129
BORG Borg Architecture
Architecture Overview
Architecture components
A set of physical machines
A logically centralized controller, the Borgmaster
An agent process running on all machines, the Borglet
Pietro Michiardi (Eurecom) Cluster Schedulers 103 / 129
BORG Borg Architecture
The Borgmaster: Components
The Borgmaster
One per Borg Cell
Orchestrates cell resources
Components
The Borgmaster process
The scheduler process
The Borgmaster process
Handles client RPCs that either mutate state or look up state
Manages the state machines for all Borg “objects” (machines,
tasks, allocs, etc...)
Communicates with all Borglets in the cell
Provides a Web-based UI
Pietro Michiardi (Eurecom) Cluster Schedulers 104 / 129
BORG Borg Architecture
The Borgmaster: Reliability
Borgmaster reliability achieved through replication
Single logical process, replicated 5 times in a cell
Single master elected using Paxos when starting a cell, or upon
failure of the current master
Master serves as Paxos leader and cell state mutator
Borgmaster replicas
Maintain a fresh in-memory copy of the cell state
Persist their state to a distributed Paxos-based store
Help build the most up-to-date cell state when a new master is
elected
Pietro Michiardi (Eurecom) Cluster Schedulers 105 / 129
BORG Borg Architecture
The Borgmaster: State
Borgmaster checkpoints its state
Time-based and event-based mechanism
State include everything related to a cell
Checkpoint utilization
Restore the state to a functional one, e.g. before a failure or a bug
Studying a faulty state and fixing it by hand
Build a persistent log of events for future queries
Use it for offline simulations
Pietro Michiardi (Eurecom) Cluster Schedulers 106 / 129
BORG Borg Architecture
The Fauxmaster
A high-fidelity simulator
It reads checkpoint files
Full-fledged Borgmaster code
Stubbed-out interfaces to Borglets
Fauxmaster operation
Accepts RPCs to make state machine changes
Connects to simulated Borglets that replay real interactions from
checkpoint files
Fauxmaster benefits
Help users debug their application
Capacity planning, e.g. “How many new jobs of this type would fit in
the cell?”
Perform sanity checks for cell configurations, e.g. “Will this new
configuration evict any important jobs?”
Pietro Michiardi (Eurecom) Cluster Schedulers 107 / 129
BORG Borg Architecture
Scheduling
Queue based mechanism
Newly submitted jobs (and their tasks) are stored in the Paxos store
(for reliability) and added to the pending queue
The scheduler process
Operates at the task level, not the job level
Asynchronously scans the pending queue
Assigns tasks to machines that satisfy constraints and that have
enough resources
Pending task selection
Scanning proceeds from high to low priority tasks
Within the same priority class, scheduling uses a round-robin
mechanism
→ Ensures fairness
→ Avoids head-of-line blocking behind large jobs
Pietro Michiardi (Eurecom) Cluster Schedulers 108 / 129
BORG Borg Architecture
Scheduling Algorithm
The scheduling algorithm has two main processes
Feasibility checking: find a set of machines that
Meet tasks’ constraints
Have enough available resources, including those that are currently
assigned to low-priority tasks that can be evicted
Scoring: among the set returned by the previous process, rank
such machines to
Minimize the number and priority of preempted tasks
Prefer machines that already have a local copy of the task's binaries and dependencies
Spread tasks across failure and power domains
Pack and spread tasks, mixing high and low priority ones on the
same machine to allow high-priority tasks to eventually expand
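The two phases can be summarized in a few lines of pseudo-Python; the machine/task attributes and the score function are placeholders for Borg's much richer logic.

```python
def schedule(task, machines, score):
    """Feasibility checking then scoring (illustrative sketch of the two phases)."""
    # 1. Feasibility: keep machines that meet the task's constraints and have
    #    enough resources, counting resources held by evictable lower-priority tasks.
    feasible = [m for m in machines
                if m.satisfies(task.constraints) and m.available_for(task)]
    if not feasible:
        return None  # no machine fits: the task stays pending
    # 2. Scoring: rank the feasible machines and pick the best one.
    return max(feasible, key=lambda m: score(m, task))
```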
Pietro Michiardi (Eurecom) Cluster Schedulers 109 / 129
BORG Borg Architecture
More on the scoring mechanism
Worst-fit scoring: spreading tasks
Single cost value across heterogeneous resources
Minimize the change in cost when placing a new task
→ Leaves headroom for load spikes
→ But leads to fragmentation
Best-fit scoring: “waterfilling” algorithm
Tries to fill machines as tightly as possible
→ Leaves empty machines that can be used to place large tasks
→ But difficult to deal with load spikes as the headroom left in each
machine depends highly on load estimation
Hybrid
Tries to reduce the amount of stranded resources
Performs better than best-fit
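Two toy scoring functions illustrate the difference on a single resource dimension: worst-fit prefers the machine with the most capacity left after placement (spreading), best-fit the machine left as full as possible (packing). Real Borg scoring combines many more signals.

```python
def worst_fit_score(machine, task):
    """Prefer machines with the most free capacity left after placement (spreads load)."""
    return machine.free_cpu - task.cpu

def best_fit_score(machine, task):
    """Prefer machines left as full as possible after placement (packs tightly)."""
    return -(machine.free_cpu - task.cpu)
```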
Pietro Michiardi (Eurecom) Cluster Schedulers 110 / 129
BORG Borg Architecture
Task startup latency
Task startup latency is a very important metric to optimize
Time from job submission to a task running
Highly variable
E.g.: at Google, median was about 25s
Techniques to reduce latency
The main culprit for high latency is binary and package installations
Idea: place tasks on machines that already have dependencies
installed
Packages and binaries distributed using a BitTorrent-like protocol
Pietro Michiardi (Eurecom) Cluster Schedulers 111 / 129
BORG Borg Architecture
The Borglet
The Borglet
Borg agent present on every machine in a cell
Starts and stops tasks
Restarts failed tasks
Manages machine resources interacting with the OS
Maintains and rolls over debug logs
Reports the state of the machine to the Borgmaster
Interaction with the Borgmaster
Pull-based mechanism: heartbeat-like messages every few
seconds
→ The Borgmaster performs flow and rate control to avoid message storms
Borglet continues operation even if communication to Borgmaster is
interrupted
A failed Borglet is blacklisted and all tasks are rescheduled
Pietro Michiardi (Eurecom) Cluster Schedulers 112 / 129
BORG Borg Architecture
Borglet to Borgmaster communication
How to handle control message overhead?
Many Borgmaster replicas receive state updates
Many Borglets communicate concurrently
The link shard mechanism
Each borgmaster replica communicates with a subset of the cell
Borglets
Partitioning is computed at each leader election
Borglets report their full state, but the link shard mechanism aggregates
and diffs this information
→ Differential state updates reduce the load on the elected master
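The aggregation idea can be sketched as a simple state diff: each link shard compares a Borglet's full report against the previous one and forwards only what changed (illustrative only).

```python
def diff_state(previous, current):
    """Return only the entries that changed since the last Borglet report."""
    return {key: value for key, value in current.items()
            if previous.get(key) != value}

# Only 'task-42' is forwarded to the elected Borgmaster:
delta = diff_state({"task-41": "RUNNING", "task-42": "RUNNING"},
                   {"task-41": "RUNNING", "task-42": "FAILED"})
```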
Pietro Michiardi (Eurecom) Cluster Schedulers 113 / 129
BORG Borg Architecture
Scalability
A typical Borgmaster resource requirements
Manages 1000s of machines in a cell
Arrival rates of 10,000 tasks per minute
10+ cores, 50+ GB of RAM
Decentralized design
Scheduler process separate from Borgmaster process
One scheduler per Borgmaster replica
Scheduling is somewhat decentralized
State change communicated from replicas to elected Borgmaster,
that finalizes the state update
Additional techniques to achieve scalability
Score caching
Equivalence class
Relaxed randomization
Pietro Michiardi (Eurecom) Cluster Schedulers 114 / 129
BORG Borg Behavior
Borg Behavior: Experimental Perspective
Also, additional details on how Borg works...
Pietro Michiardi (Eurecom) Cluster Schedulers 115 / 129
BORG Borg Behavior
Availability
In large-scale systems, failures are the norm, not the exception
Everything can fail, and both Borg and its running applications must
deal with this
Baseline techniques to achieve high availability
Replication
Storing persistent state in a distributed file system
Checkpointing
Additional techniques
Automatic rescheduling of failed tasks
Mitigating correlated failures
Rate limitation
Avoid duplicate computation
Admission control to avoid overload
Minimize external dependencies for task binaries
Pietro Michiardi (Eurecom) Cluster Schedulers 116 / 129
BORG Borg Behavior
System Utilization
The primary goal of a cluster scheduler is to achieve high
utilization
Machines, network fabric, power, cooling ... represent a significant
financial investment
Increasing utilization by a few percent can save millions!
A sophisticated metric: Cell Compaction
Replaces the typical “average utilization” metric
Provides a fair, consistent way to compare scheduling policies
Translates directly into cost/benefit result
Computed as follows:
Given a workload at a point in time (so this is not a trace-driven
simulation)
Enter a loop of workload packing
At each iteration, remove physical machines from the cell and re-pack
the workload
Exit the loop when the workload no longer fits the cell
Use the Fauxmaster to produce experimental results
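The compaction loop itself is simple; the sketch below assumes a fits(workload, n) oracle, which in practice would be a full Fauxmaster packing run.

```python
def compacted_cell_size(workload, cell_size, fits):
    """Smallest number of machines into which the workload can still be packed.
    `fits(workload, n)` stands in for a full Fauxmaster re-packing run."""
    n = cell_size
    while n > 0 and fits(workload, n - 1):
        n -= 1  # remove one machine and try to re-pack the workload
    return n
```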
Pietro Michiardi (Eurecom) Cluster Schedulers 117 / 129
BORG Borg Behavior
System Utilization: Compaction
This graph shows how much smaller cells would be if we applied
compaction to them
Pietro Michiardi (Eurecom) Cluster Schedulers 118 / 129
BORG Borg Behavior
System Utilization: Cell Sharing
Fundamental question: to share or not to share?
Many current systems apply static partitioning: one cluster
dedicated only to prod jobs, one cluster for non-prod jobs
Benefits from sharing
Borg can reclaim resources reserved by “anxious” prod jobs
Pietro Michiardi (Eurecom) Cluster Schedulers 119 / 129
BORG Borg Behavior
System Utilization: Cell Sizing
Fundamental question: large or small cells?
Large cells to accommodate large jobs
Large cells also avoid fragmentation
Pietro Michiardi (Eurecom) Cluster Schedulers 120 / 129
BORG Borg Behavior
System Utilization: Fine-grained Resource Requests
Borg users specify Job requirements in terms of resources
For CPU: this is done in milli-core
For RAM and disk: this is done in bytes
Pietro Michiardi (Eurecom) Cluster Schedulers 121 / 129
BORG Borg Behavior
System Utilization: Fine-grained Resource Requests
Would fixed-size containers (or slots) be good?
No! It would require more machines in a cell!
Pietro Michiardi (Eurecom) Cluster Schedulers 122 / 129
BORG Borg Behavior
System Utilization: Resource Reclamation
Borg users specify resource limits for their jobs
Used to perform admission control
Used for feasibility checking (i.e., find sets of suitable machines)
Borg users have the tendency to “over provision” for their
jobs
Some tasks occasionally need to use all of their resources
Most tasks never use all of their resources
Resource reclamation
Borg builds estimates of resource usage: these are called
resource reservations
Borgmaster receives usage updates from Borglets
Prod jobs are treated differently: they do not rely on reclaimed
resources, Borgmaster only uses resource limits
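A rough sketch of the reservation idea: the estimate follows observed usage plus a safety margin and decays slowly, so the reclaimable amount is the limit minus the reservation. The constants and the update rule are assumptions, not Borg's actual estimator.

```python
def update_reservation(current_reservation, observed_usage, limit,
                       margin=1.15, decay=0.9):
    """Illustrative usage-based reservation estimate (not Borg's real algorithm)."""
    target = min(limit, observed_usage * margin)  # usage plus a safety margin
    # Rise quickly if usage exceeds the reservation, decay slowly otherwise.
    if target > current_reservation:
        return target
    return max(target, current_reservation * decay)

# Resources that can be reclaimed for non-prod work: limit - reservation
```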
Pietro Michiardi (Eurecom) Cluster Schedulers 123 / 129
BORG Borg Behavior
System Utilization: Resource Reclamation
Resource reclamation is quite effective. A CDF of the additional
machines that would be needed if it was disabled.
Pietro Michiardi (Eurecom) Cluster Schedulers 124 / 129
BORG Borg Behavior
System Utilization: Resource Reclamation
Resource estimation is successful at identifying unused
resources. Most tasks use much less than their limit, although a
few use more CPU than requested.
Pietro Michiardi (Eurecom) Cluster Schedulers 125 / 129
BORG Borg Behavior
Isolation
Sharing and multi-tenancy are beneficial, but...
Tasks may interfere with each other
Need a good mechanism to prevent interference (both in terms of
security and performance)
Performance isolation
All tasks run in Linux cgroup-based containers
Borglets operate on the OS to control container resources
A control-loop assigns resources based on predicted future usage
or on memory pressure
Additional techniques
Application classes: latency-sensitive, vs. batch
Resource classes: compressible and non-compressible
Tuning of the underlying OS, especially the OS scheduler
Pietro Michiardi (Eurecom) Cluster Schedulers 126 / 129
BORG Lessons Learned
Lessons learned from building and operating Borg
And what has been included in Kubernetes, the open source version of
Borg...
Pietro Michiardi (Eurecom) Cluster Schedulers 127 / 129
BORG Lessons Learned
Lessons Learned: the Bad
The “job” abstraction is too simplistic
Multi-job services cannot be easily managed, nor addressed
Kubernetes uses scheduling units called pods (i.e., the Borg allocs)
and labels (key/value pairs describing objects)
Addressing services is critical
One IP address per machine implies managing ports as a resource,
which complicates task management
Kubernetes uses Linux namespaces, such that each pod has its
own IP address
Power or casual users?
Borg is geared toward power users: e.g., BCL has about 230
parameters!
Build automation tools, and template settings from historical
executions
Pietro Michiardi (Eurecom) Cluster Schedulers 128 / 129
BORG Lessons Learned
Lessons Learned: the Good
Allocs are useful
A resource envelope for one or more containers co-scheduled on the
same machine, which can share resources
Cluster management is more than task management
Naming and load balancing are first-class citizens
Introspection is vital
Clearly true for debugging
Key for capacity planning and monitoring
The master is the kernel of a distributed system
Monolithic designs do not work well
Cooperation of micro-services that use a common low-level API to
process requests and manipulate state
Pietro Michiardi (Eurecom) Cluster Schedulers 129 / 129

POWER SYSTEMS-1 Complete notes examples
 
young call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Serviceyoung call girls in Green Park🔝 9953056974 🔝 escort Service
young call girls in Green Park🔝 9953056974 🔝 escort Service
 
IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024IVE Industry Focused Event - Defence Sector 2024
IVE Industry Focused Event - Defence Sector 2024
 
Electronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdfElectronically Controlled suspensions system .pdf
Electronically Controlled suspensions system .pdf
 
US Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of ActionUS Department of Education FAFSA Week of Action
US Department of Education FAFSA Week of Action
 
Transport layer issues and challenges - Guide
Transport layer issues and challenges - GuideTransport layer issues and challenges - Guide
Transport layer issues and challenges - Guide
 
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdfCCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
CCS355 Neural Networks & Deep Learning Unit 1 PDF notes with Question bank .pdf
 
Vishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documentsVishratwadi & Ghorpadi Bridge Tender documents
Vishratwadi & Ghorpadi Bridge Tender documents
 
Work Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvvWork Experience-Dalton Park.pptxfvvvvvvv
Work Experience-Dalton Park.pptxfvvvvvvv
 
Introduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECHIntroduction to Machine Learning Unit-3 for II MECH
Introduction to Machine Learning Unit-3 for II MECH
 
8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter8251 universal synchronous asynchronous receiver transmitter
8251 universal synchronous asynchronous receiver transmitter
 

Cluster Schedulers

  • 11. Introduction Cluster Scheduling Principles Short Taxonomy of Scheduling Design Issues Interference: what to do when multiple frameworks attempt to use the same resources Pessimistic concurrency control: make sure to avoid conflicts, by partitioning resources across frameworks Optimistic concurrency control: hope for the best, otherwise detect and undo conflicting claims Allocation Granularity: task “scheduling” policies Atomic, all-or-nothing gang scheduling: e.g. MPI Incremental placement, hoarding: e.g. MapReduce Cluster-wide behavior: some requirements need a global view Fairness across frameworks Global notion of priority Pietro Michiardi (Eurecom) Cluster Schedulers 11 / 129
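The distinction between optimistic and pessimistic concurrency control can be made concrete with a small sketch. The following is a toy model (hypothetical names, not taken from any real scheduler): an optimistic scheduler works on a snapshot of the shared cluster state and commits its placement only if nothing changed in the meantime; otherwise the conflict is detected and the attempt is retried. Pessimistic control, by contrast, avoids conflicts up front by giving each framework exclusive (partitioned or offered) resources.

    # Toy model of optimistic concurrency control over shared cluster state.
    # All names (ClusterState, try_commit, ...) are illustrative, not a real API.
    class ClusterState:
        def __init__(self, free):
            self.free = dict(free)      # machine -> free CPU cores
            self.version = 0            # bumped on every successful commit

        def try_commit(self, claims, seen_version):
            # Commit only if nobody changed the state since the snapshot
            # was taken, and the claimed resources still fit.
            if seen_version != self.version:
                return False            # conflict detected: caller must retry
            if any(self.free[m] < cpu for m, cpu in claims.items()):
                return False
            for m, cpu in claims.items():
                self.free[m] -= cpu
            self.version += 1
            return True

    def schedule_optimistically(state, task_cpu, retries=3):
        for _ in range(retries):
            snapshot_version = state.version
            candidates = [m for m, cpu in state.free.items() if cpu >= task_cpu]
            if not candidates:
                return None
            if state.try_commit({candidates[0]: task_cpu}, snapshot_version):
                return candidates[0]
        return None                     # gave up after repeated conflicts

    state = ClusterState({"m1": 4, "m2": 8})
    print(schedule_optimistically(state, task_cpu=2))   # e.g. 'm1'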
  • 12. Introduction Cluster Scheduling Principles Summary of design knobs Comparison of cluster scheduling approaches Pietro Michiardi (Eurecom) Cluster Schedulers 12 / 129
  • 13. Introduction Cluster Scheduling Principles Architecture Details High-level summary of scheduling objectives Minimize the job queueing time, or more generally, the system response time (queueing + service times) Subject to Priorities among jobs Per-job constraints Failure tolerance Scalability Scheduling architectures Monolithic schedulers Statically partitioned schedulers Two-level schedulers Shared-state schedulers (cf. Omega paper) Pietro Michiardi (Eurecom) Cluster Schedulers 13 / 129
  • 14. Introduction Cluster Scheduling Principles Monolithic Schedulers Single centralized instance Typical of HPC setting Implements all policies in a single code base Applies the same scheduling algorithm to all incoming jobs Alternative designs Support multiple code paths for different jobs Each path implements a different scheduling logic → Difficult to implement and maintain Pietro Michiardi (Eurecom) Cluster Schedulers 14 / 129
  • 15. Introduction Cluster Scheduling Principles Statically Partitioned Schedulers Standard “Cloud-computing” approach Underlying assumption: each framework has complete control over a set of resources Depend on statically partitioned, dedicated resources Examples: Hadoop 1.0, Quincy scheduler Problems with static partitioning Fragmentation of resources Sub-optimal cluster utilization Pietro Michiardi (Eurecom) Cluster Schedulers 15 / 129
  • 16. Introduction Cluster Scheduling Principles Two-level Schedulers Obviates the problems of static partitioning Dynamic allocation of resources to concurrent frameworks Use a “logically centralized” coordinator to decide how many resources to grant A first example: Mesos Centralized resource allocator, with dynamic cluster partitioning Available resources are offered to competing frameworks Avoids interference by exclusive offers Frameworks lock resources by accepting offers → pessimistic concurrency control No global cluster state available to frameworks Another tricky example: YARN Centralized resource allocator (RM), with per-job framework master (AM) AM only provides job management services, not proper scheduling → YARN is closer to a monolithic architecture Pietro Michiardi (Eurecom) Cluster Schedulers 16 / 129
  • 17. Introduction Cluster Scheduling Principles Representative Cluster Schedulers Pietro Michiardi (Eurecom) Cluster Schedulers 17 / 129
  • 18. YARN YARN Pietro Michiardi (Eurecom) Cluster Schedulers 18 / 129
  • 19. YARN Introduction and Motivations Introduction and Motivations Pietro Michiardi (Eurecom) Cluster Schedulers 19 / 129
  • 20. YARN Introduction and Motivations Hadoop 1.0: Focus on Batch applications Built for batch applications Supports only MapReduce applications Different silos for each usage pattern Pietro Michiardi (Eurecom) Cluster Schedulers 20 / 129
  • 21. YARN Introduction and Motivations Hadoop 1.0: Architecture (reloaded) JobTracker Manages cluster resources Performs Job scheduling Performs Task scheduling TaskTracker Per machine agent Manages Task execution Pietro Michiardi (Eurecom) Cluster Schedulers 21 / 129
  • 22. YARN Introduction and Motivations Hadoop 1.0: Limitations Only supports MapReduce, no other paradigms Everything needs to be cast to MapReduce Iterative applications are slow Scalability issues Max cluster size roughly 4,000 nodes Max concurrent tasks, roughly 40,000 tasks Availability System failures destroy running and queued jobs Resource utilization Hard, static partitioning of resources in Map or Reduce slots Non-optimal resource utilization Pietro Michiardi (Eurecom) Cluster Schedulers 22 / 129
  • 23. YARN Introduction and Motivations Next Generation Hadoop Pietro Michiardi (Eurecom) Cluster Schedulers 23 / 129
  • 24. YARN Introduction and Motivations The YARN ecosystem Store all data in one place Avoids costly duplication Interact with data in multiple ways Not only in batch mode, with the rigid MapReduce model More predictable performance Advanced scheduling mechanisms Pietro Michiardi (Eurecom) Cluster Schedulers 24 / 129
  • 25. YARN Introduction and Motivations Key Improvements in YARN (1) Support for multiple applications Separate generic resource brokering from application logic Define protocols/libraries and provide a framework for custom application development Share the same Hadoop cluster across applications Improved cluster utilization Generic resource container model replaces fixed Map/Reduce slots Container allocations based on locality and memory Sharing the cluster among multiple applications Improved scalability Remove complex app logic from resource management State machine, message-passing based, loosely coupled design Compact scheduling protocol Pietro Michiardi (Eurecom) Cluster Schedulers 25 / 129
  • 26. YARN Introduction and Motivations Key Improvements in YARN (2) Application Agility Using Protocol Buffers for RPC gives wire compatibility MapReduce becomes an application in user space Multiple versions of an app can co-exist, enabling experimentation Easier upgrades of frameworks and applications A data operating system: shared services Common services included in a pluggable framework Distributed file sharing service Remote data read service Log Aggregation Service Pietro Michiardi (Eurecom) Cluster Schedulers 26 / 129
  • 27. YARN YARN Architecture Overview YARN Architecture Overview Pietro Michiardi (Eurecom) Cluster Schedulers 27 / 129
  • 28. YARN YARN Architecture Overview YARN: Architecture Overview Pietro Michiardi (Eurecom) Cluster Schedulers 28 / 129
  • 29. YARN YARN Architecture Overview YARN: Design Decisions No static resource partitioning There are no more slots Nodes have resources, which are allocated to applications when requested Separate resource management from application logic Cluster-wide resource allocation and management Per-application master component Multiple applications → multiple masters Pietro Michiardi (Eurecom) Cluster Schedulers 29 / 129
  • 30. YARN YARN Architecture Overview YARN Daemons Resource Manager (RM) Runs on master node Global resource manager and scheduler Arbitrates system resources between competing applications Node Manager (NM) Runs on slave nodes Communicates with RM Reports utilization Resource containers Created by the RM upon request Allocate a certain amount of resources on slave nodes Applications run in one or more containers Application Master (AM) One per application, application specific (note: every new application requires a new AM to be designed and implemented!) Requests more containers to execute application tasks Runs in a container Pietro Michiardi (Eurecom) Cluster Schedulers 30 / 129
  • 31. YARN YARN Architecture Overview YARN: Example with 2 Applications Pietro Michiardi (Eurecom) Cluster Schedulers 31 / 129
  • 32. YARN YARN Core Components YARN Core Components Pietro Michiardi (Eurecom) Cluster Schedulers 32 / 129
  • 33. YARN YARN Core Components YARN Schedulers (1) Schedulers are a pluggable component of the RM In addition to existing ones, advanced scheduling is supported Current supported schedulers The Capacity scheduler The Fair scheduler Dominant Resource Fairness What’s different w.r.t. Hadoop 1.0? Support any YARN application, not just MapReduce No more slots, tasks are scheduled based on resources Some terminology change Pietro Michiardi (Eurecom) Cluster Schedulers 33 / 129
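To illustrate how Dominant Resource Fairness differs from single-resource fair sharing, the sketch below (hypothetical code, not the YARN implementation) repeatedly grants the next task to the user whose dominant share, i.e. the maximum of its per-resource shares, is currently smallest.

    # Minimal Dominant Resource Fairness (DRF) sketch; not YARN's scheduler code.
    def drf_allocate(capacity, demands, rounds=100):
        # capacity: {"cpu": ..., "mem": ...}; demands: user -> per-task demand
        used = {u: {r: 0.0 for r in capacity} for u in demands}
        consumed = {r: 0.0 for r in capacity}
        allocations = {u: 0 for u in demands}
        for _ in range(rounds):
            def dominant_share(u):
                # Dominant share = max over resources of used/capacity.
                return max(used[u][r] / capacity[r] for r in capacity)
            # Only users whose next task still fits in the cluster.
            feasible = [u for u in demands
                        if all(consumed[r] + demands[u][r] <= capacity[r]
                               for r in capacity)]
            if not feasible:
                break
            u = min(feasible, key=dominant_share)
            for r in capacity:
                used[u][r] += demands[u][r]
                consumed[r] += demands[u][r]
            allocations[u] += 1
        return allocations

    # 9 CPUs and 18 GB shared by a memory-heavy and a CPU-heavy user:
    print(drf_allocate({"cpu": 9, "mem": 18},
                       {"A": {"cpu": 1, "mem": 4}, "B": {"cpu": 3, "mem": 1}}))
    # -> {'A': 3, 'B': 2}: both end up with a dominant share of 2/3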
  • 34. YARN YARN Core Components YARN Schedulers (2) Hierarchical queues Queues can contain sub-queues Sub-queues share resources assigned to queues Pietro Michiardi (Eurecom) Cluster Schedulers 34 / 129
  • 35. YARN YARN Core Components YARN Resource Manager: Overview Pietro Michiardi (Eurecom) Cluster Schedulers 35 / 129
  • 36. YARN YARN Core Components YARN Resource Manager: Operations Node Management Tracks heartbeats from NMs Container Management Handles AM requests for new containers De-allocates containers when they expire or the application finishes AM Management Creates a container for every new AM, and tracks its health Security Management Kerberos integration Pietro Michiardi (Eurecom) Cluster Schedulers 36 / 129
  • 37. YARN YARN Core Components YARN Node Manager: Overview Pietro Michiardi (Eurecom) Cluster Schedulers 37 / 129
  • 38. YARN YARN Core Components YARN Node Manager: Operations Manages communications with the RM Registers, monitors and communicates node resources Sends heartbeats and container status Manages processes in containers Launches AMs on request from the RM Launches application processes on request from the AMs Monitors resource usage Kills processes and containers Provides logging services Log aggregation and roll over to HDFS Pietro Michiardi (Eurecom) Cluster Schedulers 38 / 129
  • 39. YARN YARN Core Components YARN Resource Request Resource Request Resource name: hostname, rackname, * Priority: within the same application, not across apps Resource requirements: memory, CPU, and more to come... Number of containers Pietro Michiardi (Eurecom) Cluster Schedulers 39 / 129
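To make the shape of such a request concrete, here is a small hypothetical Python rendering of the fields listed above; the real ResourceRequest lives in YARN's Java/Protocol Buffers API, so treat this purely as an illustration.

    # Illustrative only: not the actual YARN Java/protobuf ResourceRequest.
    from dataclasses import dataclass

    @dataclass
    class ResourceRequest:
        resource_name: str       # hostname, rack name, or "*" for "anywhere"
        priority: int            # meaningful within one application only
        memory_mb: int           # resource requirements: memory ...
        vcores: int              # ... and CPU (more dimensions may be added)
        num_containers: int      # how many containers with this shape

    # An AM might ask for 10 containers anywhere plus 2 on a (hypothetical) host.
    requests = [
        ResourceRequest("*", priority=1, memory_mb=2048, vcores=1,
                        num_containers=10),
        ResourceRequest("node17.rack3", priority=0, memory_mb=4096, vcores=2,
                        num_containers=2),
    ]
    print(requests[0])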
  • 40. YARN YARN Core Components YARN Containers Container Launch Context Container ID Commands to start application task(s) Environment configuration Local resources: application/task binary, HDFS files Pietro Michiardi (Eurecom) Cluster Schedulers 40 / 129
  • 41. YARN YARN Core Components YARN Fault Tolerance Container failure AM re-attempts containers that complete with exceptions or fail Applications with too many failed containers are considered failed AM failure If application or AM fail, the RM will re-attempt the whole application Optional strategy: job recovery If false, all containers are re-scheduled If true, uses state to find which containers succeeded and which failed, to re-schedule only failed ones Pietro Michiardi (Eurecom) Cluster Schedulers 41 / 129
  • 42. YARN YARN Core Components YARN Fault Tolerance NM failure If a NM stops sending heartbeats, the RM removes it from the active node list Containers on the failed node are re-scheduled AMs on the failed node are re-submitted completely RM failure No application can be run if the RM is down Can work in active-passive mode (just like the NN of HDFS) Pietro Michiardi (Eurecom) Cluster Schedulers 42 / 129
  • 43. YARN YARN Core Components YARN Shuffle Service The Shuffle mechanism is now an auxiliary service Runs in the NM JVM as a persistent service Pietro Michiardi (Eurecom) Cluster Schedulers 43 / 129
  • 44. YARN YARN Application Example YARN Application Example Pietro Michiardi (Eurecom) Cluster Schedulers 44 / 129
  • 45. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 45 / 129
  • 46. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 46 / 129
  • 47. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 47 / 129
  • 48. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 48 / 129
  • 49. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 49 / 129
  • 50. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 50 / 129
  • 51. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 51 / 129
  • 52. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 52 / 129
  • 53. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 53 / 129
  • 54. YARN YARN Application Example YARN WordCount execution Pietro Michiardi (Eurecom) Cluster Schedulers 54 / 129
  • 55. MESOS Mesos Pietro Michiardi (Eurecom) Cluster Schedulers 55 / 129
  • 56. MESOS Introduction Introduction Pietro Michiardi (Eurecom) Cluster Schedulers 56 / 129
  • 57. MESOS Introduction Introduction and Motivations Clusters of commodity servers have become the major computing platform Modern Internet Services Data-intensive applications New frameworks developed to “program the cluster” Hadoop MapReduce, Apache Spark, Microsoft Dryad Pregel, Storm, ... and many more No one size fits all Pick the right frameworks for the application Run multiple frameworks at the same time → Multiplexing cluster resources among frameworks Improves cluster utilization Allows sharing of data without the need to replicate it Pietro Michiardi (Eurecom) Cluster Schedulers 57 / 129
  • 58. MESOS Introduction Common Solutions to Share a Cluster Common practice to achieve cluster sharing Static partitioning Traditional virtualization Problems of current approaches Mismatch between allocation granularities No mechanism to allocate resources to short-lived tasks → Underlying hypothesis for Mesos Cluster frameworks operate with short tasks Cluster resources free up quickly This makes it possible to achieve data locality Pietro Michiardi (Eurecom) Cluster Schedulers 58 / 129
  • 59. MESOS Introduction Mesos Design Objectives Mesos: a thin resource sharing layer enabling fine-grained sharing across diverse frameworks Challenges Each supported framework has different scheduling needs Scalability is crucial (10,000+ nodes) Fault-tolerance and high availability Would a centralized approach work? Input: framework requirements, instantaneous resource availability, organization policies Output: global schedule for all tasks of all jobs of all frameworks Pietro Michiardi (Eurecom) Cluster Schedulers 59 / 129
  • 60. MESOS Introduction Mesos Key Design Principles Centralized approach does not work Complexity Scalability and resilience Moving framework-specific scheduling to a centralized scheduler requires expensive refactoring A decentralized approach Based on the abstraction of a resource offer Mesos decides how many resources to offer to a framework The framework decides which resources to accept and which tasks to run on them Pietro Michiardi (Eurecom) Cluster Schedulers 60 / 129
  • 61. MESOS Target Environment Target Workloads Pietro Michiardi (Eurecom) Cluster Schedulers 61 / 129
  • 62. MESOS Target Environment Target Environment Typical workloads in “Data Warehouse” systems Heterogeneous MapReduce jobs, production and ad-hoc queries Large scale machine learning SQL-like queries Pietro Michiardi (Eurecom) Cluster Schedulers 62 / 129
  • 63. MESOS Architecture Architecture Pietro Michiardi (Eurecom) Cluster Schedulers 63 / 129
  • 64. MESOS Architecture Design Philosophy Data center operating system Scalable and resilient core exposing low-level interfaces High-level libraries for common functionalities Minimal interface to support resource sharing Mesos manages cluster resources Frameworks control task scheduling and execution Benefits of two-level approach Frameworks are independent and can support diverse scheduling requirements Mesos is kept simple, minimizing the rate of change to the system Pietro Michiardi (Eurecom) Cluster Schedulers 64 / 129
  • 65. MESOS Architecture Architecture Overview Pietro Michiardi (Eurecom) Cluster Schedulers 65 / 129
  • 66. MESOS Architecture Architecture Overview The Mesos Master Uses Resource Offers to implement fine-grained sharing Collects resource utilization from slaves Resource offer: list of free resources on multiple slaves First-level Scheduling Master decides how many resources to offer a framework Implements a cluster-wide allocation policy: Fair Sharing Priority based Pietro Michiardi (Eurecom) Cluster Schedulers 66 / 129
  • 67. MESOS Architecture Architecture Overview Mesos Frameworks Framework scheduler Registers to the master Selects which offer to accept Describes the tasks to launch on accepted resources Framework executor Launched on Mesos slaves executing on accepted resources Takes care of running the framework’s tasks Second-level Scheduling One framework scheduler per application Framework decides how to execute a job and its tasks NOTE: The actual task execution is requested by the master Pietro Michiardi (Eurecom) Cluster Schedulers 67 / 129
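The framework side of this interaction can be sketched as a callback object: Mesos hands the scheduler a list of offers, and the scheduler launches tasks on the offers it accepts and declines the rest. The snippet below is only schematic, loosely modeled on the shape of the Mesos scheduler callbacks; class and method names are hypothetical, not the real bindings.

    # Schematic second-level scheduler: accept offers that fit, decline the rest.
    class WordCountScheduler:
        def __init__(self, task_cpu, tasks_to_run):
            self.task_cpu = task_cpu
            self.remaining = tasks_to_run

        def resource_offers(self, driver, offers):
            for offer in offers:        # offer: free resources on one slave
                if self.remaining > 0 and offer["cpus"] >= self.task_cpu:
                    task = {"slave_id": offer["slave_id"],
                            "cpus": self.task_cpu,
                            "command": "wordcount --shard %d" % self.remaining}
                    driver.launch_tasks(offer["id"], [task])
                    self.remaining -= 1
                else:
                    driver.decline_offer(offer["id"])   # constraints not met

    class FakeDriver:                   # stand-in for the Mesos driver
        def launch_tasks(self, offer_id, tasks):
            print("launch on", offer_id, tasks)
        def decline_offer(self, offer_id):
            print("decline", offer_id)

    sched = WordCountScheduler(task_cpu=2, tasks_to_run=1)
    sched.resource_offers(FakeDriver(),
                          [{"id": "o1", "slave_id": "s1", "cpus": 4}])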
  • 68. MESOS Architecture Resource Offer Example Pietro Michiardi (Eurecom) Cluster Schedulers 68 / 129
  • 69. MESOS Architecture Consequences of the Mesos Architecture Mesos makes no assumptions about framework requirements This is unlike other approaches, which require the cluster scheduler to understand application constraints This does not mean that users are not required to express their applications’ constraints Rejecting offers It is the framework that decides to reject a resource offer that does not satisfy application constraints Frameworks can wait for offers to satisfy constraints Arbitrary and complex resource constraints Delegate logic and control to individual frameworks Mesos also implements filters to optimize resource offers Pietro Michiardi (Eurecom) Cluster Schedulers 69 / 129
  • 70. MESOS Architecture Resource Allocation Pluggable allocation module Max-min Fairness Strict priority Fundamental assumption Tasks are short → Mesos only reallocates resources when tasks finish Example Assume a Framework’s share is 10% of the cluster It needs to wait 10% of the mean task length to receive its share Pietro Michiardi (Eurecom) Cluster Schedulers 70 / 129
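Max-min fairness on a single resource can be computed with the classic progressive-filling idea: repeatedly give every framework whose demand is not yet satisfied an equal share of what is left. This is an illustrative sketch, not Mesos' allocation module.

    # Max-min fair allocation of one resource (e.g. CPUs) among frameworks.
    def max_min_fair(capacity, demands):
        alloc = {f: 0.0 for f in demands}
        unsatisfied = set(demands)
        remaining = float(capacity)
        while unsatisfied and remaining > 1e-9:
            share = remaining / len(unsatisfied)
            for f in list(unsatisfied):
                give = min(share, demands[f] - alloc[f])
                alloc[f] += give
                remaining -= give
                if alloc[f] >= demands[f] - 1e-9:
                    unsatisfied.discard(f)
        return alloc

    # Demands of 2, 3 and 10 units on a 12-unit cluster -> 2, 3 and 7.
    print(max_min_fair(12, {"A": 2, "B": 3, "C": 10}))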
  • 71. MESOS Architecture Resource Revocation Short vs. long lived tasks Some jobs (e.g. streaming) may have long tasks In this case, Mesos can Kill running tasks Preemption primitives Require knowledge about potential resource usage by frameworks Killing might be wasteful, although not critical (e.g. MapReduce) Some applications (e.g. MPI) might be harmed Guaranteed allocation Minimum set of resources granted to a framework If below guaranteed allocation → never kill tasks If above guaranteed allocation → kill any tasks Pietro Michiardi (Eurecom) Cluster Schedulers 71 / 129
  • 72. MESOS Architecture Performance Isolation Isolation between executors on the same slave Achieved through low-level OS primitives Pluggable isolation modules to support a variety of OSes Currently supported mechanisms Limit CPU, memory, network and I/O bandwidth of a process tree Linux Containers and Solaris Projects Advantages and limitations Better isolation than the previous process-based approach Fine-grained isolation is not yet fully functional Pietro Michiardi (Eurecom) Cluster Schedulers 72 / 129
  • 73. MESOS Architecture Mesos Scalability Filter mechanism Short-circuit the rejection process, avoids unnecessary communication Filter type 1: restrict which slave machines to use Filter type 2: check resource availability on slaves Incentives to speed-up the resource offer mechanism Mesos counts offers to a framework toward its allocation Frameworks have to answer and/or filter as quickly as possible Rescinding offers Mesos can decide to invalidate an offer to a framework This avoids blocking and misbehavior Pietro Michiardi (Eurecom) Cluster Schedulers 73 / 129
  • 74. MESOS Architecture Mesos Fault Tolerance Master designed with Soft State List of active slaves List of registered frameworks List of running tasks Multiple masters in a hot-standby mode Leader election through Zookeeper Upon failure detection new master is elected Slaves and executors help populating the new master’s state Helping frameworks to tolerate failure Master sends “health reports” to framework schedulers Master allows multiple schedulers for a single framework Pietro Michiardi (Eurecom) Cluster Schedulers 74 / 129
  • 75. MESOS Mesos Behavior System behavior: a very rough Mesos “model” Pietro Michiardi (Eurecom) Cluster Schedulers 75 / 129
  • 76. MESOS Mesos Behavior Overview: Mesos in a nutshell Ideal workloads for Mesos Elastic frameworks, supporting scaling up and down seamlessly Task durations are homogeneous (and short) No strict preference over cluster nodes Frameworks with cluster node preferences Assume frameworks prefer different (and possibly disjoint) nodes Mesos can emulate a centralized scheduler Cluster and Framework wide fair resource sharing Heterogeneous task durations Mesos can handle coexisting short and long lived tasks Performance degradation is acceptable Pietro Michiardi (Eurecom) Cluster Schedulers 76 / 129
  • 77. MESOS Mesos Behavior Definitions Workload characterization Elasticity: elastic workloads can use resources as soon as they are acquired, and release them as soon as tasks finish; in contrast, rigid frameworks (e.g. MPI) can only start a job when all resources have been acquired, and do not work well with scaling Task runtime distribution: both homogeneous and not Resource characterization Mandatory: resource that a framework must acquire to work. Assumption: mandatory resources < guaranteed share Preferred: resources that a framework should acquire to achieve better performance, but are not necessary for the job to work Pietro Michiardi (Eurecom) Cluster Schedulers 77 / 129
  • 78. MESOS Mesos Behavior Performance Metrics Performance metrics Framework ramp-up time: time it takes a new framework to achieve its fair share Job completion time: time it takes a job to complete. Assume one job per framework System utilization: total cluster resource utilization, with focus on CPU and memory Pietro Michiardi (Eurecom) Cluster Schedulers 78 / 129
  • 79. MESOS Mesos Behavior Homogeneous Tasks Cluster with n slots and a framework f entitled to k slots Task runtime distribution: uniform and exponential Mean task duration T Total job work: βkT → if f is granted its k slots, the job completes in time βT Pietro Michiardi (Eurecom) Cluster Schedulers 79 / 129
  • 80. MESOS Mesos Behavior Placement preferences Consider two cases: There exists a configuration satisfying all frameworks’ constraints The system will eventually converge to the optimal allocation, within at most one mean task duration T No such allocation exists, e.g. demand is larger than supply Lottery Scheduling to achieve a weighted fair allocation Mesos offers a slot to framework i with probability s_i / (s_1 + ... + s_m), where s_i is framework i’s intended allocation, and m is the total number of frameworks registered to Mesos Pietro Michiardi (Eurecom) Cluster Schedulers 80 / 129
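The weighted lottery above fits in a few lines: each registered framework receives the next free slot with probability proportional to its intended allocation s_i. Hypothetical sketch, not Mesos code.

    import random

    # Offer the next free slot to framework i with probability s_i / sum_j s_j.
    def pick_framework(intended_allocations):
        frameworks = list(intended_allocations)
        weights = [intended_allocations[f] for f in frameworks]
        return random.choices(frameworks, weights=weights, k=1)[0]

    shares = {"spark": 0.5, "mpi": 0.3, "batch": 0.2}
    counts = {f: 0 for f in shares}
    for _ in range(10000):
        counts[pick_framework(shares)] += 1
    print(counts)   # roughly 5000 / 3000 / 2000 offers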
  • 81. MESOS Mesos Behavior Heterogeneous Tasks Assumptions Workloads with tasks that are either long or short Mean duration of long tasks is longer than that of short ones Worst case scenario All nodes required by a “short job” are filled with long tasks, which means it has to wait for a long time How likely is the worst case? Assume φ < 1, where φ is the fraction of long tasks Assume a cluster with S available slots per node → Probability for a node to be entirely filled with long tasks is φ^S S = 8 and φ = 0.5 give roughly a 0.4% chance (0.5^8 ≈ 0.0039) Pietro Michiardi (Eurecom) Cluster Schedulers 81 / 129
  • 82. MESOS Mesos Behavior Limitations of Distributed Scheduling Fragmentation Causes under-utilization of system resources Distributed collection of frameworks might not achieve the same “packing” quality as a centralized scheduler → This is mitigated by having clusters of “big” nodes (many CPUs, many cores) running “small” tasks Starvation Large jobs may wait indefinitely for slots to become free Small tasks from small jobs might monopolize the cluster → This is mitigated by a minimum offer size mechanism Pietro Michiardi (Eurecom) Cluster Schedulers 82 / 129
  • 83. MESOS Mesos Performance Experimental Mesos Performance Evaluation Pietro Michiardi (Eurecom) Cluster Schedulers 83 / 129
  • 84. MESOS Mesos Performance Resource Allocation Pietro Michiardi (Eurecom) Cluster Schedulers 84 / 129
  • 85. MESOS Mesos Performance Resource Utilization Pietro Michiardi (Eurecom) Cluster Schedulers 85 / 129
  • 86. MESOS Mesos Performance Data Locality Pietro Michiardi (Eurecom) Cluster Schedulers 86 / 129
  • 87. BORG Borg Pietro Michiardi (Eurecom) Cluster Schedulers 87 / 129
  • 88. BORG Introduction Introduction Pietro Michiardi (Eurecom) Cluster Schedulers 88 / 129
  • 89. BORG Introduction Introduction and Objectives Hide the details of resource management Let users instead focus on application development Operate applications with high reliability and availability Tolerate failures within a datacenter and across datacenters Run heterogeneous workloads and scale across thousands of machines Pietro Michiardi (Eurecom) Cluster Schedulers 89 / 129
  • 90. BORG User perspective The User Perspective Pietro Michiardi (Eurecom) Cluster Schedulers 90 / 129
  • 91. BORG User perspective Terminology Users develop applications called jobs A job consists of one or more tasks All tasks run the same binary Each job runs in a set of machines managed as a unit, called a Borg Cell Pietro Michiardi (Eurecom) Cluster Schedulers 91 / 129
  • 92. BORG User perspective Workloads Two main categories supported Long-running services: jobs that should “never” go down, and handle short-lived, latency-sensitive requests Batch jobs: delay-tolerant jobs that can take from a few seconds to a few days to complete Storage services: these are long-running services like above, that are used to store data Workload composition in a cell is dynamic It varies depending on the tenants using the cell It varies with time: diurnal usage pattern for end-user-facing jobs, irregular pattern for batch jobs Examples High-priority, production jobs → long-running services Low-priority, non-production jobs → batch jobs In a typical Borg Cell Prod jobs: about 70% of the CPU allocation, representing about 60% of CPU usage; about 55% of the memory allocation, representing about 85% of memory usage Non-prod jobs consume most of the remaining resources Pietro Michiardi (Eurecom) Cluster Schedulers 92 / 129
  • 93. BORG User perspective Clusters and Cells Borg Cluster: a set of machines connected by a high-performance datacenter-scale network fabric The machines in a Borg Cell all belong to a single cluster A cluster lives inside a datacenter building A collection of buildings makes up a Site Borg Machines: physical servers dedicated to executing Borg applications They are generally highly heterogeneous in terms of resources They may expose a public IP address They may expose advanced features, like SSD or GPGPU Examples A typical cluster usually hosts one large cell and a few small-scale test cells The median cell size is about 10k machines Borg uses those machines to schedule application tasks, install their binaries and dependencies, monitor their health, and restart them if they fail Pietro Michiardi (Eurecom) Cluster Schedulers 93 / 129
  • 94. BORG User perspective Jobs and Tasks Job Definition Name, owner and number of tasks Constraints to force tasks to run on machines with particular attributes Constraints can be hard or soft (i.e., preferences) Each task maps to a set of UNIX processes running in a container on a Borg machine in a Borg Cell Task Definition Task index within its parent job Resource requirements Generally, all tasks have the same definition Task resource requirements can take arbitrary values along each resource dimension: there are no fixed-size slots or buckets Pietro Michiardi (Eurecom) Cluster Schedulers 94 / 129
  • 95. BORG User perspective Jobs and Tasks The Borg Configuration Language Declarative language to specify jobs and tasks Lambda functions to allow calculations Some application descriptions can be over 1k lines of code User interacting with live jobs This is achieved mainly using RPC Users can update the specification of tasks, while their parent job is running Updates are non-atomic, executed in a rolling-fashion Task updates “side-effects” Always require restarts: e.g., pushing a new binary Might require migration: e.g., change in specification Never require restarts nor migrations: e.g., change in priority Pietro Michiardi (Eurecom) Cluster Schedulers 95 / 129
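BCL itself is a declarative, GCL-derived language; as a stand-in, the hypothetical Python structure below shows the kind of information a job specification carries (owner, priority band, constraints, per-task resources, replica count). Field names and values are illustrative, not actual BCL syntax.

    # Hypothetical job specification, mirroring the fields a BCL job declares.
    hello_world_job = {
        "name": "hello_world",
        "owner": "ubar",
        "priority": "production",              # priority band
        "constraints": {"arch": "x86_64"},     # hard or soft machine attributes
        "task": {
            "binary": "hello_world_webserver",
            "args": ["--port=%port%"],
            "resources": {"cpu": 0.1, "ram_mb": 100, "disk_mb": 100},
        },
        "replicas": 10000,                     # number of identical tasks
    }

    def validate(job):
        # Minimal sanity check before submission (illustrative only).
        assert job["replicas"] > 0 and job["task"]["resources"]["cpu"] > 0
        return job

    validate(hello_world_job)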
  • 96. BORG User perspective Jobs State Diagram Pietro Michiardi (Eurecom) Cluster Schedulers 96 / 129
  • 97. BORG User perspective Resource Allocations The Borg “Alloc” Reserved set of resources on an individual machine Can be used to execute one or more tasks, that equally share resources Resources remain assigned whether or not they are used Typical use of Borg Allocs Set resources aside for future tasks Retain resources between stopping and starting tasks Consolidate (gather) tasks from different jobs on the same machine Alloc Sets Group of allocs on different machines Once an alloc set has been created, one or more jobs can be submitted Pietro Michiardi (Eurecom) Cluster Schedulers 97 / 129
  • 98. BORG User perspective Priority, Quota and Admission Control Mechanisms to deal with resource demand and offer What to do when more work shows up than can be accommodated? Note: this is not scheduling, it is more admission control Job priority Non-overlapping priority bands for different uses This essentially means users must “manually” cluster their applications according to such bands → Tasks from high-priority jobs can preempt low-priority tasks Cascade preemption is avoided by disabling it for same-band jobs Job/User quotas Used to decide which job to admit for scheduling Expressed as a vector of resource quantities Pricing Underlying mechanism to regulate user behavior Aligns user incentives to better resource utilization Discourages over-buying by over-selling quotas at lower priority Pietro Michiardi (Eurecom) Cluster Schedulers 98 / 129
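Quota-based admission control boils down to a vector comparison: a job is admitted only if its aggregate demand fits within the remaining quota of its user at the requested priority. A minimal, hypothetical sketch (not Borg code):

    # Admission control against a per-user, per-priority quota vector.
    def admit(job, quota, used):
        # Job demand = per-task resources multiplied by the number of tasks.
        demand = {r: v * job["tasks"] for r, v in job["per_task"].items()}
        fits = all(used.get(r, 0) + demand[r] <= quota.get(r, 0) for r in demand)
        if fits:
            for r, v in demand.items():
                used[r] = used.get(r, 0) + v
        return fits

    quota = {"cpu": 1000, "ram_gb": 4000}      # quota at some priority band
    used = {"cpu": 960, "ram_gb": 1000}
    job = {"tasks": 50, "per_task": {"cpu": 1, "ram_gb": 4}}
    print(admit(job, quota, used))             # False: would exceed the CPU quota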
  • 99. BORG User perspective Naming Services Borg Name Service A mechanism to assign a name to tasks Task name = Cell name, job name and task number Uses the Chubby coordination service Writes task names into it Writes also health information and status Used by Borg RPC mechanism to establish communication endpoints DNS service inherits from BNS Example: the 50th task in job “jfoo” owned by user “ubar” in a Borg Cell called “cc” 50.jfoo.ubar.cc.borg.google.com Pietro Michiardi (Eurecom) Cluster Schedulers 99 / 129
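Following this naming scheme, building a task's BNS-derived DNS name is mechanical; the helper below simply reproduces the example given above (hypothetical code).

    # Build the DNS name of a task from cell, user, job and task index.
    def bns_dns_name(cell, user, job, task_index, domain="borg.google.com"):
        return "%d.%s.%s.%s.%s" % (task_index, job, user, cell, domain)

    # The 50th task of job "jfoo" owned by "ubar" in cell "cc":
    print(bns_dns_name("cc", "ubar", "jfoo", 50))
    # -> 50.jfoo.ubar.cc.borg.google.com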
  • 100. BORG User perspective Monitoring Services Every task in Borg has a built-in HTTP server Provides health information Provides performance metrics Borg SIGMA Monitoring UI Service State of jobs, of cells Drill-down to task level Why pending? “Debugging” service Helps users with finding job specifications that can be easily scheduled Billing services Use monitoring information to compute usage-based charging Help users debug their jobs Used for capacity planning Pietro Michiardi (Eurecom) Cluster Schedulers 100 / 129
  • 101. BORG Borg Architecture Borg Architecture Pietro Michiardi (Eurecom) Cluster Schedulers 101 / 129
  • 102. BORG Borg Architecture Architecture Overview Pietro Michiardi (Eurecom) Cluster Schedulers 102 / 129
  • 103. BORG Borg Architecture Architecture Overview Architecture components A set of physical machines A logically centralized controller, the Borgmaster An agent process running on all machines, the Borglet Pietro Michiardi (Eurecom) Cluster Schedulers 103 / 129
  • 104. BORG Borg Architecture The Borgmaster: Components The Borgmaster One per Borg Cell Orchestrates cell resources Components The Borgmaster process The scheduler process The Borgmaster process Handles client RPCs that either mutate state or look up state Manages the state machines for all Borg “objects” (machines, tasks, allocs, etc.) Communicates with all Borglets in the cell Provides a Web-based UI Pietro Michiardi (Eurecom) Cluster Schedulers 104 / 129
  • 105. BORG Borg Architecture The Borgmaster: Reliability Borgmaster reliability achieved through replication Single logical process, replicated 5 times in a cell Single master elected using Paxos when starting a cell, or upon failure of the current master Master serves as Paxos leader and cell state mutator Borgmaster replicas Maintain an in-memory fresh copy of the cell state Persist their state to a distributed Paxos-based store Help building the most up-to-date cell state when a new master is elected Pietro Michiardi (Eurecom) Cluster Schedulers 105 / 129
  • 106. BORG Borg Architecture The Borgmaster: State Borgmaster checkpoints its state Time-based and event-based mechanism State include everything related to a cell Checkpoint utilization Restore the state to a functional one, e.g. before a failure or a bug Studying a faulty state and fixing it by hand Build a persistent log of events for future queries Use it for offline simulations Pietro Michiardi (Eurecom) Cluster Schedulers 106 / 129
  • 107. BORG Borg Architecture The Fauxmaster A high-fidelity simulator It reads checkpoint files Full-fledged Borgmaster code Stubbed-out interfaces to Borglets Fauxmaster operation Accepts RPCs to make state machine changes Connects to simulated Borglets that replay real interactions from checkpoint files Fauxmaster benefits Help users debug their application Capacity planning, e.g. “How many new jobs of this type would fit in the cell?” Perform sanity checks for cell configurations, e.g. “Will this new configuration evict any important jobs?” Pietro Michiardi (Eurecom) Cluster Schedulers 107 / 129
  • 108. BORG Borg Architecture Scheduling Queue-based mechanism Newly submitted jobs (and their tasks) are stored in the Paxos store (for reliability) and put in the pending queue The scheduler process Operates at the task level, not the job level Asynchronously scans the pending queue Assigns tasks to machines that satisfy constraints and that have enough resources Pending task selection Scanning proceeds from high to low priority tasks Within the same priority class, scheduling uses a round-robin mechanism → Ensures fairness → Avoids head-of-line blocking behind large jobs Pietro Michiardi (Eurecom) Cluster Schedulers 108 / 129
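The scan order just described, strictly by priority and round-robin within a priority band, can be expressed compactly; the sketch below is illustrative, not Borg's scheduler.

    from collections import defaultdict, deque

    # Yield pending tasks from highest to lowest priority, round-robin across
    # jobs within a band so a large job cannot cause head-of-line blocking.
    def scan_order(pending):
        bands = defaultdict(lambda: defaultdict(deque))  # priority -> job -> tasks
        for task in pending:
            bands[task["priority"]][task["job"]].append(task)
        for priority in sorted(bands, reverse=True):
            jobs = deque(bands[priority].values())
            while jobs:
                queue = jobs.popleft()
                yield queue.popleft()
                if queue:
                    jobs.append(queue)                   # round-robin across jobs

    pending = [{"job": "big", "priority": 9, "id": i} for i in range(3)] + \
              [{"job": "tiny", "priority": 9, "id": 0},
               {"job": "batch", "priority": 2, "id": 0}]
    print([(t["job"], t["id"]) for t in scan_order(pending)])
    # -> big/0, tiny/0, big/1, big/2, then batch/0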
  • 109. BORG Borg Architecture Scheduling Algorithm The scheduling algorithm has two main processes Feasibility checking: find a set of machines that Meet tasks’ constraints Have enough available resources, including those that are currently assigned to low-priority tasks that can be evicted Scoring: among the set returned by the previous process, rank such machines to Minimize the number and priority of preempted tasks Prefer machines with a local copy of tasks binaries and dependencies Spread tasks across failure and power domains Pack and spread tasks, mixing high and low priority ones on the same machine to allow high-priority tasks to eventually expand Pietro Michiardi (Eurecom) Cluster Schedulers 109 / 129
  • 110. BORG Borg Architecture More on the scoring mechanism Worst-fit scoring: spreading tasks Single cost value across heterogeneous resources Minimize the change in cost when placing a new task → Leaves headroom for load spikes → But leads to fragmentation Best-fit scoring: “waterfilling” algorithm Tries to fill machines as tightly as possible → Leaves empty machines that can be used to place large tasks → But difficult to deal with load spikes as the headroom left in each machine depends highly on load estimation Hybrid Tries to reduce the amount of stranded resources Performs better than best-fit Pietro Michiardi (Eurecom) Cluster Schedulers 110 / 129
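The contrast between worst-fit and best-fit scoring is easiest to see on a single resource dimension: worst-fit picks the machine that will be left with the most headroom, best-fit the one left with the least. A toy comparison (hypothetical code):

    # Compare worst-fit and best-fit placement on one resource dimension.
    def place(free_cpus, task_cpu, policy):
        candidates = {m: cpu - task_cpu for m, cpu in free_cpus.items()
                      if cpu >= task_cpu}               # feasibility check
        if not candidates:
            return None
        if policy == "worst-fit":                       # spread: keep headroom
            return max(candidates, key=candidates.get)
        return min(candidates, key=candidates.get)      # best-fit: pack tightly

    free = {"m1": 10, "m2": 3, "m3": 6}
    print(place(free, 2, "worst-fit"))   # m1: leaves the most room for load spikes
    print(place(free, 2, "best-fit"))    # m2: fills machines as tightly as possible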
  • 111. BORG Borg Architecture Task startup latency Task startup latency is a very important metric to optimize Time from job submission to a task running Highly variable E.g.: at Google, median was about 25s Techniques to reduce latency The main culprit for high latency is binary and package installations Idea: place tasks on machines that already have dependencies installed Packages and binaries distributed using a BitTorrent-like protocol Pietro Michiardi (Eurecom) Cluster Schedulers 111 / 129
  • 112. BORG Borg Architecture The Borglet The Borglet Borg agent present on every machine in a cell Starts and stops tasks Restarts failed tasks Manages machine resources by interacting with the OS Maintains and rolls over debug logs Reports the state of the machine to the Borgmaster Interaction with the Borgmaster Pull-based mechanism: heartbeat-like messages every few seconds → The Borgmaster performs flow and rate control to avoid message storms The Borglet continues operation even if communication to the Borgmaster is interrupted A failed Borglet is blacklisted and all its tasks are rescheduled Pietro Michiardi (Eurecom) Cluster Schedulers 112 / 129
  • 113. BORG Borg Architecture Borglet to Borgmaster communication How to handle control message overhead? Many Borgmaster replicas receive state updates Many Borglets communicate concurrently The link shard mechanism Each Borgmaster replica communicates with a subset of the cell Borglets Partitioning is computed at each leader election Borglets report full state, but the link shard mechanism aggregates state information → Differential state updates, to reduce the load at the master Pietro Michiardi (Eurecom) Cluster Schedulers 113 / 129
  • 114. BORG Borg Architecture Scalability Typical Borgmaster resource requirements Manages 1000s of machines in a cell Arrival rates of 10,000 tasks per minute 10+ cores, 50+ GB of RAM Decentralized design Scheduler process separate from Borgmaster process One scheduler per Borgmaster replica Scheduling is somewhat decentralized State changes are communicated from replicas to the elected Borgmaster, which finalizes the state update Additional techniques to achieve scalability Score caching Equivalence classes Relaxed randomization Pietro Michiardi (Eurecom) Cluster Schedulers 114 / 129
  • 115. BORG Borg Behavior Borg Behavior: Experimental Perspective Also, additional details on how Borg works... Pietro Michiardi (Eurecom) Cluster Schedulers 115 / 129
  • 116. BORG Borg Behavior Availability In large-scale systems, failures are the norm, not the exception Everything can fail, and both Borg and its running applications must deal with this Baseline techniques to achieve high availability Replication Storing persistent state in a distributed file system Checkpointing Additional techniques Automatic rescheduling of failed tasks Mitigating correlated failures Rate limiting Avoid duplicate computation Admission control to avoid overload Minimize external dependencies for task binaries Pietro Michiardi (Eurecom) Cluster Schedulers 116 / 129
  • 117. BORG Borg Behavior System Utilization The primary goal of a cluster scheduler is to achieve high utilization Machines, network fabric, power, cooling ... represent a significant financial investment Increasing utilization by a few percent can save millions! A sophisticated metric: Cell Compaction Replaces the typical “average utilization” metric Provides a fair, consistent way to compare scheduling policies Translates directly into a cost/benefit result Computed as follows: Given a workload at a point in time (so this is not a trace-driven simulation) Enter a loop of workload packing At each iteration, remove physical machines from the cell Exit the loop when the workload no longer fits in the cell Use the Fauxmaster to produce experimental results Pietro Michiardi (Eurecom) Cluster Schedulers 117 / 129
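The compaction loop can be summarized as: keep removing machines and repacking the workload, and report the smallest cell in which everything still fits. The sketch below is a heavily simplified, single-resource illustration; the real evaluation replays checkpointed cell state through the Fauxmaster.

    # Toy cell compaction: shrink the cell until the workload no longer fits.
    def fits(machine_count, machine_cpu, tasks_cpu):
        # First-fit-decreasing bin packing as a stand-in for the real scheduler.
        free = [machine_cpu] * machine_count
        for t in sorted(tasks_cpu, reverse=True):
            for i, f in enumerate(free):
                if f >= t:
                    free[i] -= t
                    break
            else:
                return False
        return True

    def compacted_cell_size(initial_machines, machine_cpu, tasks_cpu):
        n = initial_machines
        while n > 0 and fits(n - 1, machine_cpu, tasks_cpu):
            n -= 1            # keep removing machines while the workload fits
        return n

    tasks = [4, 4, 3, 3, 2, 2, 2, 1, 1, 1, 1]   # CPU demands of running tasks
    print(compacted_cell_size(10, machine_cpu=8, tasks_cpu=tasks))   # -> 3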
  • 118. BORG Borg Behavior System Utilization: Compaction This graph shows how much smaller cells would be if we applied compaction to them Pietro Michiardi (Eurecom) Cluster Schedulers 118 / 129
  • 119. BORG Borg Behavior System Utilization: Cell Sharing Fundamental question: to share or not to share? Many current systems apply static partitioning: one cluster dedicated only to prod jobs, one cluster for non-prod jobs Benefits from sharing Borg can reclaim resources reserved by “anxious” prod jobs Pietro Michiardi (Eurecom) Cluster Schedulers 119 / 129
  • 120. BORG Borg Behavior System Utilization: Cell Sizing Fundamental question: large or small cells? Large cells to accommodate large jobs Large cells also avoid fragmentation Pietro Michiardi (Eurecom) Cluster Schedulers 120 / 129
  • 121. BORG Borg Behavior System Utilization: Fine-grained Resource Requests Borg users specify Job requirements in terms of resources For CPU: this is done in milli-core For RAM and disk: this is done in bytes Pietro Michiardi (Eurecom) Cluster Schedulers 121 / 129
  • 122. BORG Borg Behavior System Utilization: Fine-grained Resource Requests Would fixed size containers (or slot) be good? No! It would require more machines in a cell! Pietro Michiardi (Eurecom) Cluster Schedulers 122 / 129
  • 123. BORG Borg Behavior System Utilization: Resource Reclamation Borg users specify resource limits for their jobs Used to perform admission control Used for feasibility checking (i.e., find sets of suitable machines) Borg users tend to “over-provision” for their jobs Some tasks occasionally need to use all their resources Most tasks never use all the resources they request Resource reclamation Borg builds estimates of resource usage: these are called resource reservations The Borgmaster receives usage updates from Borglets Prod jobs are treated differently: they do not rely on reclaimed resources, and the Borgmaster only uses their resource limits Pietro Michiardi (Eurecom) Cluster Schedulers 123 / 129
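Reclamation can be pictured as keeping, for every task, a reservation that tracks observed usage plus a safety margin, and offering the gap between the user-specified limit and that reservation to lower-priority work. The sketch below is a hypothetical illustration, not Borg's estimator.

    # Reclaim the gap between a task's requested limit and its estimated reservation.
    def reservation(usage_samples, safety_margin=1.25):
        # e.g. the peak of recent usage, inflated by a safety margin
        return max(usage_samples) * safety_margin

    def reclaimable(limit, usage_samples):
        return max(0.0, limit - reservation(usage_samples))

    # A task asked for 4 CPUs but has been using at most about 1.2:
    samples = [0.8, 1.0, 1.2, 0.9]
    print(reclaimable(limit=4.0, usage_samples=samples))   # -> 2.5 CPUs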
  • 124. BORG Borg Behavior System Utilization: Resource Reclamation Resource reclamation is quite effective. A CDF of the additional machines that would be needed if it was disabled. Pietro Michiardi (Eurecom) Cluster Schedulers 124 / 129
  • 125. BORG Borg Behavior System Utilization: Resource Reclamation Resource estimation is successful at identifying unused resources. Most tasks use much less than their limit, although a few use more CPU than requested. Pietro Michiardi (Eurecom) Cluster Schedulers 125 / 129
  • 126. BORG Borg Behavior Isolation Sharing and multi-tenancy are beneficial, but... Tasks may interfere with one another Need a good mechanism to prevent interference (both in terms of security and performance) Performance isolation All tasks run in Linux cgroup-based containers Borglets operate on the OS to control container resources A control loop assigns resources based on predicted future usage or on memory pressure Additional techniques Application classes: latency-sensitive vs. batch Resource classes: compressible and non-compressible Tuning of the underlying OS, especially the OS scheduler Pietro Michiardi (Eurecom) Cluster Schedulers 126 / 129
  • 127. BORG Lessons Learned Lessons learned from building and operating Borg And what has been included in Kubernetes, the open source version of Borg... Pietro Michiardi (Eurecom) Cluster Schedulers 127 / 129
  • 128. BORG Lessons Learned Lessons Learned: the Bad The “job” abstraction is too simplistic Multi-job services cannot be easily managed, nor addressed Kubernetes uses scheduling units called pods (i.e., the Borg allocs) and labels (key/value pairs describing objects) Addressing services is critical One IP address per machine implies managing ports as a resource, which complicates task configuration Kubernetes uses Linux namespaces, so that each pod has its own IP address Power or casual users? Borg is geared toward power users: e.g., BCL has about 230 parameters! Build automation tools, and template settings from historical executions Pietro Michiardi (Eurecom) Cluster Schedulers 128 / 129
  • 129. BORG Lessons Learned Lessons Learned: the Good Allocs are useful Resource envelope for one or more containers co-scheduled on the same machine, which can share resources Cluster management is more than task management Naming and load balancing are first-class citizens Introspection is vital Clearly true for debugging Key for capacity planning and monitoring The master is the kernel of a distributed system Monolithic designs do not work well Cooperation of micro-services that use a common low-level API to process requests and manipulate state Pietro Michiardi (Eurecom) Cluster Schedulers 129 / 129