Airflow Clustering
and High Availability
By: Robert Sanders
Agenda
• Airflow Daemons
• Single Node Deployment
• Cluster Deployment
• Scaling
• Worker Nodes
• Master Nodes
• Limitations
• Airflow Scheduler Failover Controller
• Failover Controller Procedure
Airflow Daemons
• Web Server
  • Daemon that runs the Airflow Webserver
  • 1 to many gunicorn processes to accept and process requests in parallel
  • Allows you to track job progress, run jobs and more
• Scheduler
  • Periodically runs (every X seconds) to determine if a DAG or Task needs to be run based on the DAG schedule
  • Pushes messages to the Queuing Service to be executed
• Worker
  • Daemon that runs if you’re using the CeleryExecutor (as opposed to the SequentialExecutor or LocalExecutor)
  • 1 to many dedicated celeryd processes which execute functions
  • Pulls messages from the Queuing Service to determine what functions to execute
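The hand-off between the Scheduler and the Workers can be pictured with a minimal sketch. It does not use Airflow or Celery; the in-memory `task_queue` and the function names `scheduler_tick` and `worker_poll` are hypothetical stand-ins for the Queuing Service and the two daemons.

```python
from collections import deque

# Hypothetical stand-in for the Queuing Service (a real Celery
# deployment would use a message broker instead).
task_queue = deque()


def scheduler_tick(due_tasks):
    """Scheduler side: push one message per task that is due to run."""
    for task_id in due_tasks:
        task_queue.append(task_id)


def worker_poll(max_messages):
    """Worker side: pull up to max_messages from the queue and 'execute' them."""
    executed = []
    while task_queue and len(executed) < max_messages:
        executed.append(task_queue.popleft())
    return executed


# One scheduler pass finds three due tasks; a worker with two free
# processes drains the queue over two polls.
scheduler_tick(["extract", "transform", "load"])
first_batch = worker_poll(max_messages=2)
second_batch = worker_poll(max_messages=2)
```

The Scheduler never talks to a Worker directly in this picture; both sides only share the queue, which is why Workers can be added without changing the Scheduler.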
Single Node Deployment
Cluster Deployment
Why set up a Cluster Deployment?
• Distributes heavy processes across many machines for better use of resources
• Provides a more highly available Airflow environment
• If you have many Workflows with many Tasks, your executors may not be able to get to all the messages in the queue. Adding more executors would fix this issue.
Scaling Workers
• Horizontally
  • Add more machines to the cluster
  • No need to register the machines with the master. You just need to start up the Airflow Worker daemon on the new machine.
• Vertically
  • Increase the number of executors (celeryd processes) per node and restart the workers
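As a rough illustration of the two scaling directions, the arithmetic below counts how many tasks the cluster can run at once. It assumes each celeryd process runs one task at a time; the function name `total_task_slots` and the node counts are made up for the example.

```python
def total_task_slots(processes_per_node):
    """Sum the celeryd processes across all worker nodes.

    Assumption: each process executes one task at a time, so the sum
    is the number of tasks the cluster can run in parallel.
    """
    return sum(processes_per_node)


baseline = total_task_slots([4, 4])       # two nodes, four processes each
horizontal = total_task_slots([4, 4, 4])  # add a third machine
vertical = total_task_slots([8, 8])       # raise processes per node instead
```

Vertical scaling is limited by the CPU and memory of each machine, while horizontal scaling only requires starting a worker on another machine.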
Scaling Master
Limitations
• There can only be one Scheduler running at a time
  • If you have multiple Scheduler processes running, there's a possibility that multiple instances of a single task will be scheduled to run.
• If the Scheduler Daemon or the machine running it goes down, then no jobs will get scheduled
Airflow Scheduler Failover Controller
• Dedicated Daemon that runs with Airflow on the Master Nodes
• Ensures that there is always one and only one Scheduler running on the Master Nodes at a time
• Developed internally and open sourced
  • https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
• High Level Steps
  • Polls (every X seconds) to check if the Scheduler is running
  • If the Scheduler isn’t running, restart the Scheduler
  • If it still doesn’t start up, then try starting it up on the other Master Nodes
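The high level steps can be sketched as a single poll cycle. This is not the code from the repository above; `ensure_scheduler`, the `is_running` and `start` callbacks, and the node names are hypothetical, and a real controller would check and start processes on remote machines rather than update an in-memory set.

```python
from typing import Callable, List, Optional


def ensure_scheduler(
    nodes: List[str],
    current: str,
    is_running: Callable[[str], bool],
    start: Callable[[str], None],
) -> Optional[str]:
    """Run one poll cycle; return the node hosting the Scheduler, or None."""
    if is_running(current):
        return current
    # Try a restart on the current node first, then the other master nodes.
    candidates = [current] + [n for n in nodes if n != current]
    for node in candidates:
        start(node)
        if is_running(node):
            return node
    return None


# Fake cluster for illustration: the Scheduler cannot start on master1.
running = set()
broken = {"master1"}


def fake_is_running(node):
    return node in running


def fake_start(node):
    if node not in broken:
        running.add(node)


scheduler_node = ensure_scheduler(
    ["master1", "master2"], "master1", fake_is_running, fake_start
)
```

In a deployment this cycle would be repeated every X seconds, with the returned node remembered as the place to check on the next poll.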
Failover Controller Diagram
Start Up Scenario
Failover Controller Process (Start Up)
• On startup, the Failover Controller processes on Master Node 1 and Master Node 2 both start out in STANDBY.
• The first one to enter data into the Metastore is elected as the active controller. In the diagram, the controller on Master Node 1 becomes active and the one on Master Node 2 stays in standby.
• The active Failover Controller checks to see if the Scheduler is running, but it isn’t.
• The Failover Controller starts up the Scheduler.
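The "first one to enter data into the Metastore" election can be illustrated with a uniqueness constraint: whichever controller inserts the row first wins, and every later insert fails. The sketch below uses an in-memory SQLite database as a stand-in for the Metastore; the table name `active_controller`, its columns, and the node names are assumptions for the example, not the controller's actual schema.

```python
import sqlite3

# In-memory SQLite database standing in for the Airflow Metastore.
conn = sqlite3.connect(":memory:")
# Hypothetical table: a single row keyed by a fixed lock name.
conn.execute(
    "CREATE TABLE active_controller (lock_name TEXT PRIMARY KEY, node TEXT)"
)


def try_become_active(node: str) -> bool:
    """Return True if this node's insert wins the election."""
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute(
                "INSERT INTO active_controller VALUES ('failover', ?)", (node,)
            )
        return True
    except sqlite3.IntegrityError:
        # The row already exists, so another controller is active.
        return False


# master1 attempts the insert first and wins; master2 stays in standby.
states = {
    node: ("active" if try_become_active(node) else "standby")
    for node in ["master1", "master2"]
}
```

Relying on the database constraint means the controllers do not need to talk to each other to agree on who is active.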
Scheduler Failure Scenario
Failover Controller Process (Process Failure)
• The Scheduler process has died. The Failover Controller on Master Node 1 is active and the one on Master Node 2 is in standby.
• The Failover Controller restarts the Scheduler.
Scheduler Failure and Failed Restart Scenario
Failover Controller Process (Process Failure 2)
• The Scheduler process has died. The Failover Controller on Master Node 1 is active and the one on Master Node 2 is in standby.
• The Failover Controller tries to restart the Scheduler, but it's still not running.
• The Failover Controller tries to restart the Scheduler on a different node.
• The Failover Controller succeeds in restarting the Scheduler and the cluster is back to normal.
Node Failure Scenario
Failover Controller Process (Node Failure)
• Everything is running as expected: the Failover Controller on Master Node 1 is active and the one on Master Node 2 is in standby.
• Master Node 1 dies and all the processes running on it are gone.
• The Failover Controller on Master Node 2 becomes active because the one running on Master Node 1 has stopped sending a heartbeat.
• The newly active Failover Controller tries to check in with and restart the Scheduler on the node the Metadata says it's running on, and fails.
• The Failover Controller then starts the Scheduler on another node and it succeeds.
• When Master Node 1 is brought back, the old Failover Controller goes into STANDBY state.
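The takeover step depends on noticing that the active controller has stopped sending a heartbeat. A minimal sketch of that check follows; the function name `should_take_over`, the 30-second timeout, and the timestamps are assumptions for illustration and are not taken from the controller's configuration.

```python
def should_take_over(last_heartbeat: float, now: float, timeout: float) -> bool:
    """A standby controller takes over once the active one's heartbeat is stale."""
    return (now - last_heartbeat) > timeout


# Hypothetical values: the active controller last reported at t=100 seconds.
TIMEOUT_SECONDS = 30.0
still_alive = should_take_over(last_heartbeat=100.0, now=110.0, timeout=TIMEOUT_SECONDS)
presumed_dead = should_take_over(last_heartbeat=100.0, now=140.0, timeout=TIMEOUT_SECONDS)
```

A shorter timeout makes failover faster but raises the chance of a standby taking over while the active controller is only briefly delayed.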
Q&A

Contenu connexe

Tendances

Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...HostedbyConfluent
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internalsnarsiman
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesRed Hat Developers
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkBo Yang
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the DisruptorTrisha Gee
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkFlink Forward
 
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioTHE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioDevOpsDays Tel Aviv
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With PrometheusKnoldus Inc.
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy Docker, Inc.
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaDatabricks
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for PrometheusMitsuhiro Tanda
 
Meetup OpenTelemetry Intro
Meetup OpenTelemetry IntroMeetup OpenTelemetry Intro
Meetup OpenTelemetry IntroDimitrisFinas1
 
High Availability Storage (susecon2016)
High Availability Storage (susecon2016)High Availability Storage (susecon2016)
High Availability Storage (susecon2016)Roger Zhou 周志强
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3DataWorks Summit
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit
 
Juraci Paixão Kröhling - All you need to know about OpenTelemetry
Juraci Paixão Kröhling - All you need to know about OpenTelemetryJuraci Paixão Kröhling - All you need to know about OpenTelemetry
Juraci Paixão Kröhling - All you need to know about OpenTelemetryJuliano Costa
 

Tendances (20)

Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...
Apache Kafka’s Transactions in the Wild! Developing an exactly-once KafkaSink...
 
Cassandra internals
Cassandra internalsCassandra internals
Cassandra internals
 
Exploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on KubernetesExploring the power of OpenTelemetry on Kubernetes
Exploring the power of OpenTelemetry on Kubernetes
 
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in SparkSpark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
Spark Shuffle Deep Dive (Explained In Depth) - How Shuffle Works in Spark
 
Introduction to the Disruptor
Introduction to the DisruptorIntroduction to the Disruptor
Introduction to the Disruptor
 
Evening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in FlinkEvening out the uneven: dealing with skew in Flink
Evening out the uneven: dealing with skew in Flink
 
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.ioTHE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
THE STATE OF OPENTELEMETRY, DOTAN HOROVITS, Logz.io
 
Airflow 101
Airflow 101Airflow 101
Airflow 101
 
Monitoring With Prometheus
Monitoring With PrometheusMonitoring With Prometheus
Monitoring With Prometheus
 
Prometheus design and philosophy
Prometheus design and philosophy   Prometheus design and philosophy
Prometheus design and philosophy
 
Apache flink
Apache flinkApache flink
Apache flink
 
Grafana 7.0
Grafana 7.0Grafana 7.0
Grafana 7.0
 
Simplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks DeltaSimplifying Change Data Capture using Databricks Delta
Simplifying Change Data Capture using Databricks Delta
 
Grafana optimization for Prometheus
Grafana optimization for PrometheusGrafana optimization for Prometheus
Grafana optimization for Prometheus
 
Meetup OpenTelemetry Intro
Meetup OpenTelemetry IntroMeetup OpenTelemetry Intro
Meetup OpenTelemetry Intro
 
Apache Spark Architecture
Apache Spark ArchitectureApache Spark Architecture
Apache Spark Architecture
 
High Availability Storage (susecon2016)
High Availability Storage (susecon2016)High Availability Storage (susecon2016)
High Availability Storage (susecon2016)
 
Iceberg: a fast table format for S3
Iceberg: a fast table format for S3Iceberg: a fast table format for S3
Iceberg: a fast table format for S3
 
Flexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache FlinkFlexible and Real-Time Stream Processing with Apache Flink
Flexible and Real-Time Stream Processing with Apache Flink
 
Juraci Paixão Kröhling - All you need to know about OpenTelemetry
Juraci Paixão Kröhling - All you need to know about OpenTelemetryJuraci Paixão Kröhling - All you need to know about OpenTelemetry
Juraci Paixão Kröhling - All you need to know about OpenTelemetry
 

Similaire à Airflow Clustering and High Availability

Docker Swarm for Beginner
Docker Swarm for BeginnerDocker Swarm for Beginner
Docker Swarm for BeginnerShahzad Masud
 
Oracle real application clusters system tests with demo
Oracle real application clusters system tests with demoOracle real application clusters system tests with demo
Oracle real application clusters system tests with demoAjith Narayanan
 
Fyber - airflow best practices in production
Fyber - airflow best practices in productionFyber - airflow best practices in production
Fyber - airflow best practices in productionItai Yaffe
 
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)STePINForum
 
Heart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelHeart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelDocker, Inc.
 
Bots on guard of sdlc
Bots on guard of sdlcBots on guard of sdlc
Bots on guard of sdlcAlexey Tokar
 
M|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for YouM|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for YouMariaDB plc
 
An introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAn introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAjith Narayanan
 
Container Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeContainer Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeDocker, Inc.
 
Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3kognate
 
Container orchestration from theory to practice
Container orchestration from theory to practiceContainer orchestration from theory to practice
Container orchestration from theory to practiceDocker, Inc.
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Jimmy Lai
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoopclairvoyantllc
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACKristofferson A
 
ScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency InjectionScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency Injection7mind
 
Weblogic 101 for dba
Weblogic  101 for dbaWeblogic  101 for dba
Weblogic 101 for dbaOsama Mustafa
 
Orchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresOrchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresDocker, Inc.
 

Similaire à Airflow Clustering and High Availability (20)

Docker Swarm for Beginner
Docker Swarm for BeginnerDocker Swarm for Beginner
Docker Swarm for Beginner
 
Oracle real application clusters system tests with demo
Oracle real application clusters system tests with demoOracle real application clusters system tests with demo
Oracle real application clusters system tests with demo
 
Fyber - airflow best practices in production
Fyber - airflow best practices in productionFyber - airflow best practices in production
Fyber - airflow best practices in production
 
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
Docker–Grid (A On demand and Scalable dockerized selenium grid architecture)
 
Heart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object ModelHeart of the SwarmKit: Store, Topology & Object Model
Heart of the SwarmKit: Store, Topology & Object Model
 
Bots on guard of sdlc
Bots on guard of sdlcBots on guard of sdlc
Bots on guard of sdlc
 
M|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for YouM|18 Choosing the Right High Availability Strategy for You
M|18 Choosing the Right High Availability Strategy for You
 
An introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methodsAn introduction to_rac_system_test_planning_methods
An introduction to_rac_system_test_planning_methods
 
Fail over fail_back
Fail over fail_backFail over fail_back
Fail over fail_back
 
Container Orchestration from Theory to Practice
Container Orchestration from Theory to PracticeContainer Orchestration from Theory to Practice
Container Orchestration from Theory to Practice
 
Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3Server(less) Swift at SwiftCloudWorkshop 3
Server(less) Swift at SwiftCloudWorkshop 3
 
Container orchestration from theory to practice
Container orchestration from theory to practiceContainer orchestration from theory to practice
Container orchestration from theory to practice
 
Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...Distributed system coordination by zookeeper and introduction to kazoo python...
Distributed system coordination by zookeeper and introduction to kazoo python...
 
Running Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on HadoopRunning Airflow Workflows as ETL Processes on Hadoop
Running Airflow Workflows as ETL Processes on Hadoop
 
RubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngineRubyKaigi 2014: ServerEngine
RubyKaigi 2014: ServerEngine
 
Hadoop fault-tolerance
Hadoop fault-toleranceHadoop fault-tolerance
Hadoop fault-tolerance
 
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RACPerformance Scenario: Diagnosing and resolving sudden slow down on two node RAC
Performance Scenario: Diagnosing and resolving sudden slow down on two node RAC
 
ScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency InjectionScalaUA - distage: Staged Dependency Injection
ScalaUA - distage: Staged Dependency Injection
 
Weblogic 101 for dba
Weblogic  101 for dbaWeblogic  101 for dba
Weblogic 101 for dba
 
Orchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failuresOrchestrating Linux Containers while tolerating failures
Orchestrating Linux Containers while tolerating failures
 

Plus de Robert Sanders

Migrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudRobert Sanders
 
Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Robert Sanders
 
Productionalizing spark streaming applications
Productionalizing spark streaming applicationsProductionalizing spark streaming applications
Productionalizing spark streaming applicationsRobert Sanders
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in ProductionRobert Sanders
 
Databricks Community Cloud Overview
Databricks Community Cloud OverviewDatabricks Community Cloud Overview
Databricks Community Cloud OverviewRobert Sanders
 

Plus de Robert Sanders (6)

Migrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the CloudMigrating Big Data Workloads to the Cloud
Migrating Big Data Workloads to the Cloud
 
Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...Delivering digital transformation and business impact with io t, machine lear...
Delivering digital transformation and business impact with io t, machine lear...
 
Productionalizing spark streaming applications
Productionalizing spark streaming applicationsProductionalizing spark streaming applications
Productionalizing spark streaming applications
 
Apache Airflow in Production
Apache Airflow in ProductionApache Airflow in Production
Apache Airflow in Production
 
Databricks Community Cloud Overview
Databricks Community Cloud OverviewDatabricks Community Cloud Overview
Databricks Community Cloud Overview
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 

Dernier

MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based projectAnoyGreter
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...OnePlan Solutions
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfAlina Yurenko
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odishasmiwainfosol
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineeringssuserb3a23b
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024StefanoLambiase
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Developmentvyaparkranti
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Matt Ray
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationBradBedford3
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...OnePlan Solutions
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Andreas Granig
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfFerryKemperman
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...Technogeeks
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsChristian Birchler
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWave PLM
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsSafe Software
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作qr0udbr0
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Angel Borroy López
 

Dernier (20)

MYjobs Presentation Django-based project
MYjobs Presentation Django-based projectMYjobs Presentation Django-based project
MYjobs Presentation Django-based project
 
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
Tech Tuesday - Mastering Time Management Unlock the Power of OnePlan's Timesh...
 
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdfGOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
GOING AOT WITH GRAALVM – DEVOXX GREECE.pdf
 
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company OdishaBalasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
Balasore Best It Company|| Top 10 IT Company || Balasore Software company Odisha
 
Software Coding for software engineering
Software Coding for software engineeringSoftware Coding for software engineering
Software Coding for software engineering
 
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
Dealing with Cultural Dispersion — Stefano Lambiase — ICSE-SEIS 2024
 
VK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web DevelopmentVK Business Profile - provides IT solutions and Web Development
VK Business Profile - provides IT solutions and Web Development
 
Odoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting ServiceOdoo Development Company in India | Devintelle Consulting Service
Odoo Development Company in India | Devintelle Consulting Service
 
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
Open Source Summit NA 2024: Open Source Cloud Costs - OpenCost's Impact on En...
 
How to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion ApplicationHow to submit a standout Adobe Champion Application
How to submit a standout Adobe Champion Application
 
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
Maximizing Efficiency and Profitability with OnePlan’s Professional Service A...
 
Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024Automate your Kamailio Test Calls - Kamailio World 2024
Automate your Kamailio Test Calls - Kamailio World 2024
 
Introduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdfIntroduction Computer Science - Software Design.pdf
Introduction Computer Science - Software Design.pdf
 
What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...What is Advanced Excel and what are some best practices for designing and cre...
What is Advanced Excel and what are some best practices for designing and cre...
 
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving CarsSensoDat: Simulation-based Sensor Dataset of Self-driving Cars
SensoDat: Simulation-based Sensor Dataset of Self-driving Cars
 
What is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need ItWhat is Fashion PLM and Why Do You Need It
What is Fashion PLM and Why Do You Need It
 
Powering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data StreamsPowering Real-Time Decisions with Continuous Data Streams
Powering Real-Time Decisions with Continuous Data Streams
 
英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作英国UN学位证,北安普顿大学毕业证书1:1制作
英国UN学位证,北安普顿大学毕业证书1:1制作
 
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
Alfresco TTL#157 - Troubleshooting Made Easy: Deciphering Alfresco mTLS Confi...
 
2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva2.pdf Ejercicios de programación competitiva
2.pdf Ejercicios de programación competitiva
 

Airflow Clustering and High Availability

  • 1. Airflow Clustering and High Availability. By: Robert Sanders
  • 2. Agenda
      • Airflow Daemons
      • Single Node Deployment
      • Cluster Deployment
      • Scaling (Worker Nodes, Master Nodes)
      • Limitations
      • Airflow Scheduler Failover Controller
      • Failover Controller Procedure
  • 3. Airflow Daemons
      • Web Server
          • Daemon that runs the Airflow webserver
          • One or more gunicorn processes accept and process requests in parallel
          • Lets you track job progress, run jobs and more
      • Scheduler
          • Runs periodically (every X seconds) to determine whether a DAG or Task needs to be run, based on the DAG schedule
          • Pushes messages to the Queuing Service to be executed
      • Worker
          • Daemon that runs only if you're using the CeleryExecutor (as opposed to the SequentialExecutor or LocalExecutor)
          • One or more dedicated celeryd processes that execute functions
          • Pulls messages from the Queuing Service to determine which functions to execute
  • 6. Why set up a Cluster Deployment?
      • Distributes heavy processes across many machines for better use of resources
      • Gives a more highly available Airflow environment
      • If you have many Workflows with many Tasks, your executors may not be able to get to all the messages in the queue. Adding more executors fixes this.
  • 7. Scaling Workers
      • Horizontally
          • Add more machines to the cluster
          • No need to register the machines with the master. You just need to start the Airflow Worker daemon on the new machine.
      • Vertically
          • Increase the number of executors (celeryd processes) per node and restart the workers
  • 9. Limitations
      • Only one Scheduler can be running at a time
          • If multiple Scheduler processes are running, multiple instances of a single task may be scheduled to run
      • If the Scheduler daemon, or the machine it runs on, goes down, no jobs get scheduled
  • 10. Airflow Scheduler Failover Controller
      • Dedicated daemon that runs alongside Airflow on the Master Nodes
      • Ensures that one and only one Scheduler is running on the Master Nodes at a time
      • Developed internally and open sourced
          • https://github.com/teamclairvoyant/airflow-scheduler-failover-controller
      • High Level Steps
          • Polls (every X seconds) to check whether the Scheduler is running
          • If the Scheduler isn't running, restarts it
          • If it still doesn't start up, tries starting it on the other Master Nodes
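
The High Level Steps on slide 10 can be sketched as a single poll cycle. This is a minimal illustration, not the project's actual code: `ensure_scheduler_running`, `FakeCluster` and the node names are hypothetical, and the in-memory `FakeCluster` stands in for whatever remote checks and start commands a real controller would issue.

```python
from typing import Callable, Optional, Sequence


def ensure_scheduler_running(
    nodes: Sequence[str],
    current_node: str,
    is_running: Callable[[str], bool],
    start: Callable[[str], None],
) -> Optional[str]:
    """Run one poll cycle; return the node hosting the Scheduler, or None."""
    # Step 1: check whether the Scheduler is running where we expect it.
    if is_running(current_node):
        return current_node
    # Step 2: try a restart on the same node first.
    # Step 3: if that fails, try each of the other master nodes in turn.
    candidates = [current_node] + [n for n in nodes if n != current_node]
    for node in candidates:
        start(node)
        if is_running(node):
            return node
    return None  # nothing could be started; a real daemon would alert here


class FakeCluster:
    """In-memory stand-in for remote process checks (hypothetical)."""

    def __init__(self, can_start):
        self.can_start = dict(can_start)  # node -> does a start attempt succeed?
        self.running = set()

    def is_running(self, node: str) -> bool:
        return node in self.running

    def start(self, node: str) -> None:
        if self.can_start.get(node, False):
            self.running.add(node)


# The Scheduler cannot be started on master1, so it ends up on master2.
cluster = FakeCluster({"master1": False, "master2": True})
print(ensure_scheduler_running(
    ["master1", "master2"], "master1", cluster.is_running, cluster.start
))  # prints "master2"
```

A long-running daemon would presumably call this in a loop, sleeping for the poll interval between cycles and remembering the returned node as the new `current_node`.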
  • 13. Failover Controller Process (Start Up)
      • Diagram: Master Node 1, Failover Controller (standby); Master Node 2, Failover Controller (standby)
      • On startup, the processes start out in STANDBY
  • 14. Failover Controller Process (Start Up)
      • Diagram: Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The first one to enter data into the Metastore is elected as the active controller
  • 15. Failover Controller Process (Start Up)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Failover Controller checks to see if the Scheduler is running, but it isn't
  • 16. Failover Controller Process (Start Up)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Failover Controller starts up the Scheduler
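
The "first one to enter data into the Metastore" election on slide 14 can be sketched with a single-row table. The slides do not show the controller's real schema, so the `failover_state` table, its columns and `try_become_active` are hypothetical, and an in-memory SQLite database stands in for the Airflow Metastore.

```python
import sqlite3


def try_become_active(conn: sqlite3.Connection, node: str) -> bool:
    """Return True if `node` holds the single 'active controller' row."""
    # The CHECK constraint allows only id = 1, so the table holds at most one row.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS failover_state ("
        "id INTEGER PRIMARY KEY CHECK (id = 1), "
        "active_node TEXT NOT NULL)"
    )
    # INSERT OR IGNORE does nothing if the row already exists,
    # so only the first writer's value is stored.
    conn.execute(
        "INSERT OR IGNORE INTO failover_state (id, active_node) VALUES (1, ?)",
        (node,),
    )
    conn.commit()
    (active,) = conn.execute(
        "SELECT active_node FROM failover_state WHERE id = 1"
    ).fetchone()
    return active == node


conn = sqlite3.connect(":memory:")
print(try_become_active(conn, "master1"))  # True: first writer becomes active
print(try_become_active(conn, "master2"))  # False: stays in standby
```

Relying on the database's uniqueness guarantee means two controllers starting at the same moment cannot both see themselves as active.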
  • 18. Failover Controller Process (Process Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Scheduler process has died
  • 19. Failover Controller Process (Process Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Failover Controller restarts the Scheduler
  • 21. Failover Controller Process (Process Failure 2)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Scheduler process has died
  • 22. Failover Controller Process (Process Failure 2)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Failover Controller tries to restart the Scheduler, but it's still not running
  • 23. Failover Controller Process (Process Failure 2)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Failover Controller tries to restart the Scheduler on a different node
  • 24. Failover Controller Process (Process Failure 2)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • The Failover Controller succeeds in restarting the Scheduler and the cluster is back to normal
  • 26. Failover Controller Process (Node Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (active); Master Node 2, Failover Controller (standby)
      • Everything is running as expected
  • 27. Failover Controller Process (Node Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (dead); Master Node 2, Failover Controller (standby)
      • Master Node 1 dies and all the processes running on it are gone
  • 28. Failover Controller Process (Node Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (dead); Master Node 2, Failover Controller (active)
      • The Failover Controller on Master Node 2 becomes active because the one on Master Node 1 has stopped sending a heartbeat
  • 29. Failover Controller Process (Node Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (dead); Master Node 2, Failover Controller (active)
      • The newly active Failover Controller tries to check in with, and restart, the Scheduler on the node where the Metastore says it is running, and fails
  • 30. Failover Controller Process (Node Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (dead); Master Node 2, Failover Controller (active)
      • The Failover Controller then starts it on another node and it succeeds
  • 31. Failover Controller Process (Node Failure)
      • Diagram: Scheduler; Master Node 1, Failover Controller (standby); Master Node 2, Failover Controller (active)
      • When Master Node 1 is brought back, the old Failover Controller goes into STANDBY state
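
The heartbeat-driven takeover in the Node Failure slides (28 and 31) can be sketched as a small state function. This is an assumption-laden illustration: `ActiveRecord`, `next_state` and the timeout value are hypothetical, and the record that would live in the Metastore is passed around in memory here.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass
class ActiveRecord:
    node: str              # controller currently recorded as active
    last_heartbeat: float  # time of its last heartbeat, in seconds


def next_state(
    me: str, record: Optional[ActiveRecord], now: float, timeout: float
) -> Tuple[str, ActiveRecord]:
    """Return (state, record) for controller `me` after one poll."""
    if record is None or record.node == me:
        # No active controller yet, or we are it: write a fresh heartbeat.
        return "ACTIVE", ActiveRecord(me, now)
    if now - record.last_heartbeat > timeout:
        # The active controller's heartbeat is stale: take over.
        return "ACTIVE", ActiveRecord(me, now)
    # Another controller is active and healthy: remain in standby.
    return "STANDBY", record


# master1 was active and last sent a heartbeat at t=0, then its node died.
record = ActiveRecord("master1", 0.0)
state, record = next_state("master2", record, now=5.0, timeout=10.0)
print(state)  # STANDBY: heartbeat is only 5 seconds old
state, record = next_state("master2", record, now=20.0, timeout=10.0)
print(state)  # ACTIVE: heartbeat is stale, master2 takes over
state, record = next_state("master1", record, now=25.0, timeout=10.0)
print(state)  # STANDBY: master1 returns and finds master2 active
```

With several standby controllers, the read and the takeover write would need to happen atomically in the Metastore so that only one of them wins.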