Ensuring Performance in a Fast-Paced Environment 
Martin Spier 
Performance Engineering @ Netflix 
@spiermar 
mspier@netflix.com 
Performance & Capacity 2014 by CMG
Martin Spier 
● Performance Engineer @ Netflix 
● Previously @ Expedia and Dell 
● Performance 
o Architecture, Tuning and Profiling 
o Testing and Frameworks 
o Tool Development 
● Blog @ http://overloaded.io 
● Twitter @spiermar
● World's leading Internet television network 
● ⅓ of all traffic heading into American homes at 
peak hours 
● > 50 million members 
● > 40 countries 
● > 1 billion hours of TV shows and movies per 
month 
● 100s of different client devices
Agenda 
● How Netflix Works 
o Culture, Development Model, High-level 
Architecture, Platform 
● Ensuring Performance 
o Auto-Scaling, Squeeze Tests, Simian Army, Hystrix, 
Redundancy, Canary Analysis, Performance Test 
Framework, Large Scale Tests
Freedom and Responsibility 
● Culture deck* is TRUE 
o 9M+ views 
● Minimal process 
● Context over control 
● Root access to everything 
● No approvals required 
● Only Senior Engineers 
* http://www.slideshare.net/reed2001/culture-1798664
Independent Development Teams 
● Highly aligned, loosely coupled 
● Free to define release cycles 
● Free to choose any methodology 
● But it’s an agile environment 
● And there is a “paved road”
Development Agility 
● Continuous innovation cycle 
● Shorter development cycles 
● Automate everything! 
● Self-service deployments 
● A/B Tests 
● Failure cost close to zero 
● Lower time to market 
● Innovation > Risk
Architecture 
● Scalable and Resilient 
● Micro-services 
● Stateless 
● Assume Failure 
● Backwards Compatible 
● Service Discovery
Zuul & Dynamic Routing 
● Zuul, the front door for all requests from devices and 
websites to the backend of the Netflix streaming 
application 
● Dynamic Routing 
● Monitoring 
● Resiliency and Security 
● Region and AZ Failure 
* https://github.com/Netflix/zuul
Cloud 
● Amazon’s AWS 
● Multi-region Active/Active 
● Ephemeral Instances 
● Auto-Scaling 
● Netflix OSS (https://github.com/Netflix)
Performance Engineering 
● Not a part of any development team 
● Not a shared service 
● Improve and maintain performance and reliability 
through consultation 
● Provide self-service performance analysis utilities 
● Disseminate performance best practices 
● And we’re hiring!
What about Performance?
Auto-Scaling 
● 5-6x Intraday 
● Auto-Scaling Groups (ASGs) 
● Reactive Auto-Scaling 
● Predictive Auto-Scaling (Scryer)
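As a rough illustration of the reactive case, the sketch below sizes a group from the currently observed request rate. It is a simplified assumption, not how Netflix's auto-scaling policies are defined: the function name, the per-instance capacity figure, and the group bounds are all hypothetical.

```python
import math


def desired_instances(current_rps: float, rps_per_instance: float,
                      min_size: int, max_size: int) -> int:
    """Instances needed for the observed load, clamped to the group bounds.

    rps_per_instance is a hypothetical capacity figure, the kind of
    number a squeeze test is meant to produce.
    """
    needed = math.ceil(current_rps / rps_per_instance)
    return max(min_size, min(max_size, needed))


# A 6x intraday swing in traffic translates into a 6x swing in group size.
trough = desired_instances(2_000, 500, min_size=3, max_size=100)
peak = desired_instances(12_000, 500, min_size=3, max_size=100)
```

A predictive approach would feed a forecast request rate into the same calculation instead of the current one.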
Squeeze Tests 
● Stress Test, with Production Load 
● Steering Production Traffic 
● Understand the Upper Limits of Capacity 
● Adjust Auto-Scaling Policies 
● Automated Squeeze Tests
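The idea behind a squeeze test can be sketched as a loop that steers progressively more traffic to one instance until a latency limit is crossed. This is a minimal sketch under stated assumptions: `measure_latency_ms`, the step size, and the latency limit are hypothetical, and a toy latency model stands in for a real instance.

```python
from typing import Callable


def squeeze(measure_latency_ms: Callable[[float], float],
            start_rps: float, step_rps: float,
            latency_limit_ms: float, max_rps: float) -> float:
    """Return the highest request rate that stayed within the latency limit.

    measure_latency_ms is a hypothetical hook: it steers the given rate of
    traffic to the instance under test and reports the observed latency.
    """
    sustained = 0.0
    rps = start_rps
    while rps <= max_rps:
        if measure_latency_ms(rps) > latency_limit_ms:
            break  # upper limit found; stop pushing more traffic
        sustained = rps
        rps += step_rps
    return sustained


# Toy latency model (assumption for the demo): latency grows sharply
# as load approaches 1000 req/s.
def toy_latency(rps: float) -> float:
    return 20.0 / (1.0 - min(rps, 999.0) / 1000.0)


limit = squeeze(toy_latency, start_rps=100, step_rps=100,
                latency_limit_ms=150.0, max_rps=2000)
```

The resulting figure is the per-instance upper limit that an auto-scaling policy can then be adjusted against.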
Red/Black Pushes 
● New builds are rolled out as new 
Auto-Scaling Groups (ASGs) 
● Elastic Load Balancers (ELBs) 
control the traffic going to each 
ASG 
● Fast and simple rollback if issues 
are found 
● Canary Clusters are used to test 
builds before a full rollout
Monitoring: Atlas 
● Humongous, 1.2 billion distinct time 
series 
● Integrated with all systems, production 
and test 
● 1 minute resolution, quick roll ups 
● 12-month persistence 
● API and querying UI 
● System and Application Level 
● Servo (github.com/Netflix/servo) 
● Custom dashboards
Vector 
● 1 second Resolution 
● No Persistence 
● Leverages Performance Co-Pilot (PCP) 
● System-level Metrics 
● Java Metrics (parfait) 
● ElasticSearch, Cassandra 
● Flame Graphs (Brendan Gregg)
Mogul 
● ASG and Instance Level 
● Resource Demand; 
● Performance 
Characteristics; 
● And Downstream 
Dependencies.
Slalom 
● Cluster Level 
● High-level Demand Flow 
● Cross-application Request 
Tracing 
● Downstream and Upstream 
Demand
Canary Release 
“Canary release is a technique to reduce the risk 
of introducing a new software version in 
production by slowly rolling out the change to a 
small subset of users before rolling it out to the 
entire infrastructure and making it available to 
everybody.”
Automatic Canary Analysis (ACA) 
Exactly what the name implies. An automated 
way of analyzing a canary release.
ACA: Use Case 
● You are a service owner and have finished 
implementing a new feature into your application. 
● You want to determine if the new build, v1.1, is 
performing analogously to the existing build. 
● The new build is deployed automatically to a canary 
cluster 
● A small percentage of production traffic is steered to the 
canary cluster 
● After a short period of time, canary analysis 
is triggered
Automated Canary Analysis 
● For a given set of metrics, ACA will compare 
samples from baseline and canary; 
● Determine if they are analogous; 
● Identify any metrics that deviate from the 
baseline; 
● And generate a score that indicates the overall 
similarity of the canary.
Automated Canary Analysis 
● The score will be associated 
with a Go/No-Go decision; 
● And the new build will be 
rolled out (or not) to the rest 
of the production 
environment. 
● No workload definitions 
● No synthetic load
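The comparison and scoring steps described for ACA can be sketched as follows. The slides do not say which statistical test ACA applies, so this sketch assumes a simple rule: a metric is analogous when the canary median is within a relative tolerance of the baseline median. The metric names, the tolerance, and the Go/No-Go cut-off are all hypothetical.

```python
from statistics import median
from typing import Dict, List, Tuple


def canary_score(baseline: Dict[str, List[float]],
                 canary: Dict[str, List[float]],
                 tolerance: float = 0.10) -> Tuple[float, List[str]]:
    """Score a canary from 0 to 100 and list the metrics that deviate.

    The median-within-tolerance rule is an assumption made for this
    sketch, not the method ACA uses.
    """
    deviating = []
    for name, base_samples in baseline.items():
        base = median(base_samples)
        cand = median(canary[name])
        # Relative difference; guard against a zero baseline.
        diff = abs(cand - base) / abs(base) if base else abs(cand)
        if diff > tolerance:
            deviating.append(name)
    score = 100.0 * (len(baseline) - len(deviating)) / len(baseline)
    return score, deviating


baseline = {
    "latency_ms": [100, 102, 98, 101],
    "cpu_pct": [40, 42, 41, 39],
    "error_rate": [0.010, 0.011, 0.009, 0.010],
    "gc_pause_ms": [12, 11, 13, 12],
}
canary = {
    "latency_ms": [103, 101, 104, 102],
    "cpu_pct": [55, 57, 54, 56],  # clearly higher than the baseline
    "error_rate": [0.010, 0.010, 0.011, 0.009],
    "gc_pause_ms": [12, 12, 13, 11],
}

score, deviating = canary_score(baseline, canary)
decision = "GO" if score >= 90.0 else "NO-GO"  # hypothetical cut-off
```

Because both sample sets come from clusters serving the same production traffic, no workload model is needed for the comparison.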
What about pre-production 
Performance 
Testing? 
When is it appropriate?
Not always! 
Sometimes it doesn't make sense to run 
performance tests.
Remember the short release cycles? 
With the short time span between production builds, 
pre-production tests don’t warn us much sooner. 
(And there’s ACA)
So when? 
When it brings value. Not just because it is 
part of a process.
When? Use Cases 
● New Services 
● Large Code Refactoring 
● Architecture Changes 
● Workload Changes 
● Proof of Concept 
● Initial Cluster Sizing 
● Instance Type Migration
Use Cases, cont. 
● Troubleshooting 
● Tuning 
● Teams that release less frequently 
o Intermediary Builds 
● Base Components (Paved Road) 
o Amazon Cloud Images (AMIs) 
o Platform 
o Common Libraries
Who? 
● Push “tests” to development teams 
● Development understands the product; they 
developed it 
● Performance Engineering knows the tools 
and techniques (so we help!) 
● Easier to scale the effort!
How? Environment 
● Free to create any environment configuration 
● Integration stack 
● Full production-like or scaled-down environment 
● Hybrid model 
o Performance + integration stack 
● Production testing
How? Test Framework 
● Built around JMeter
How? Test Framework 
● Runs on Amazon’s EC2 
● Leverages Jenkins for orchestration
How? Analysis 
● In-house developed web analysis tool and API 
● Results persisted on Amazon’s S3 and RDS
How? Analysis 
● Automated analysis built-in (thresholds) 
● Customized alerts 
● Interface with monitoring tools
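Threshold-based analysis of a test run might look like the sketch below. The metric names and limits are hypothetical, and the in-house tool's actual interface is not described in the slides.

```python
from typing import Dict, List


def check_thresholds(results: Dict[str, float],
                     limits: Dict[str, float]) -> List[str]:
    """Return one message per metric that exceeds its upper limit.

    A metric with a limit but no recorded value is reported too, so a
    broken collector does not silently pass the test.
    """
    violations = []
    for metric, limit in limits.items():
        value = results.get(metric)
        if value is None:
            violations.append(f"{metric}: no data")
        elif value > limit:
            violations.append(f"{metric}: {value} > {limit}")
    return violations


results = {"p99_latency_ms": 240.0, "error_rate_pct": 0.2}
limits = {"p99_latency_ms": 200.0, "error_rate_pct": 1.0, "cpu_pct": 80.0}
violations = check_thresholds(results, limits)
```

A non-empty list is what would drive a customized alert or fail the orchestrating job.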
Large Scale Tests 
● > 100k req/s 
● > 100 load generators 
● High Throughput Components 
o In-Memory Caches 
● Component scaling 
● Full production tests
Large Scale Tests: Problems 
● Your test client is likely the first bottleneck 
● Components are (often) not designed to 
scale 
o Great performance per node; 
o But they don’t scale horizontally. 
o Controller, data feeder, load generator*, result 
collection, result analysis, monitoring 
* often the exception
Large Scale Tests: Single Controller 
● Single controller, multiple load generators 
● Controller also serves as data feeder 
● Controller collects all results synchronously 
● Controller aggregates monitoring data 
● Batch and async might alleviate the problem 
● Analysis of large result sets is heavy (think 
percentiles)
Large Scale Tests: Distributed Model 
● Data Feeding and Load Generation 
o No Controller 
o Independent Load Generators 
● Data Collection and Monitoring 
o Decentralized Monitoring Platform 
● Data Analysis 
o Aggregation at node level 
o Hive/Pig 
o ElasticSearch
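Aggregation at node level can be sketched with fixed-width latency histograms: each load generator reduces its raw samples to bucket counts, and the counts merge by simple addition, so percentiles no longer require shipping every result to one place. The bucket width and sample values are arbitrary choices for this sketch, and the percentile returned is an upper-bound estimate, not an exact value.

```python
from collections import Counter
from typing import Iterable

BUCKET_MS = 10  # histogram bucket width; an arbitrary choice for this sketch


def node_histogram(latencies_ms: Iterable[float]) -> Counter:
    """Aggregate raw latencies on a load generator into bucket counts."""
    return Counter(int(v // BUCKET_MS) for v in latencies_ms)


def merge(histograms: Iterable[Counter]) -> Counter:
    """Combine per-node histograms; only counts need to leave each node."""
    total = Counter()
    for h in histograms:
        total.update(h)
    return total


def percentile(hist: Counter, pct: float) -> float:
    """Upper bound (ms) of the bucket containing the requested percentile."""
    target = pct * sum(hist.values()) / 100.0
    seen = 0
    for bucket in sorted(hist):
        seen += hist[bucket]
        if seen >= target:
            return (bucket + 1) * BUCKET_MS
    raise ValueError("empty histogram")


node_a = node_histogram([12, 15, 18, 22, 95])
node_b = node_histogram([11, 14, 31, 33, 250])
merged = merge([node_a, node_b])
p50 = percentile(merged, 50)
p90 = percentile(merged, 90)
```

The trade-off is resolution: the estimate is only as precise as the bucket width, in exchange for a result set whose size does not grow with request volume.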
Takeaways 
● Canary analysis 
● Testing only when it brings VALUE 
● Leveraging cloud for tests 
● Automated test analysis 
● Pushing execution to development teams 
● Open source tools
Martin Spier 
mspier@netflix.com 
@spiermar 
http://overloaded.io/
References 
● parfait (https://code.google.com/p/parfait/) 
● servo (https://github.com/Netflix/servo) 
● hystrix (https://github.com/Netflix/Hystrix) 
● culture deck (http://www.slideshare.net/reed2001/culture-1798664) 
● zuul (https://github.com/Netflix/zuul) 
● scryer (http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html)
Backup Slides
Simian Army 
● Ensures cloud handles failures 
through regular testing 
● The Monkeys 
o Chaos Monkey: Resiliency 
o Latency: Artificial Delays 
o Conformity: Best-practices 
o Janitor: Unused Instances 
o Doctor: Health checks 
o Security: Security Violations 
o Chaos Gorilla: AZ Failure 
o Chaos Kong: Region Failure
Hystrix 
“... is a latency and fault 
tolerance library designed to 
isolate points of access to 
remote systems ...” 
● Stop cascading failures. 
● Fallbacks and graceful degradation 
● Fail fast and rapid recovery 
● Thread and semaphore isolation with 
circuit breakers 
● Real-time monitoring and 
configuration changes 
* https://github.com/Netflix/Hystrix
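The fail-fast and fallback behaviour listed above can be illustrated with a minimal circuit breaker. This is a generic sketch of the pattern in Python, not Hystrix's API (Hystrix is a Java library); the class, its parameters, and the demo functions are hypothetical.

```python
import time
from typing import Callable, Optional


class CircuitBreaker:
    """Open after N consecutive failures, then allow one trial call
    once `reset_after_s` seconds have passed."""

    def __init__(self, max_failures: int = 3, reset_after_s: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.max_failures = max_failures
        self.reset_after_s = reset_after_s
        self.clock = clock
        self.failures = 0
        self.opened_at: Optional[float] = None

    def call(self, func: Callable[[], str], fallback: Callable[[], str]) -> str:
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after_s:
                return fallback()  # fail fast; do not touch the remote system
            self.opened_at = None  # half-open: let one trial call through
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = self.clock()
            return fallback()  # graceful degradation
        self.failures = 0
        return result


calls = []


def flaky() -> str:
    calls.append(1)  # record each real attempt against the dependency
    raise ConnectionError("remote system unavailable")


def cached() -> str:
    return "cached response"


breaker = CircuitBreaker(max_failures=2, reset_after_s=60.0)
responses = [breaker.call(flaky, cached) for _ in range(5)]
```

In the demo only the first two calls reach the failing dependency; the remaining three are answered by the fallback without attempting the remote call, which is what stops a failure from cascading.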
Real-time Analytics Platform (RTA) 
● ACA runs on top of RTA 
● Compute Engines 
o OpenCPU (R) 
o OpenPY (Python) 
● Data Sources 
o Real-time Monitoring Systems 
o Big Data Platforms 
● Reporting, Scheduling, Persistence
Slow Performance Regression 
● Deviation => “acceptable” regression 
● Small performance regressions might sneak in 
● Short release cycle = many releases 
● Many releases = cumulative regression
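The compounding effect is easy to quantify. The figures below are illustrative assumptions, not numbers from the talk: each release is taken to regress by 2%, a deviation small enough to pass as "acceptable" when compared only against the previous build.

```python
def cumulative_regression(per_release: float, releases: int) -> float:
    """Total slowdown after `releases` builds that each regress by `per_release`."""
    return (1.0 + per_release) ** releases - 1.0


# 20 releases, each 2% slower than the one before it.
total = cumulative_regression(0.02, 20)
print(f"cumulative regression: {total:.1%}")
```

Twenty such releases compound to a slowdown of roughly 49%, even though no single release deviated by more than 2% from its predecessor.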
Testing Lower Level Components 
● Base AMIs 
o OS (Linux), tools and agents 
● Common Application Platform 
● Common Libraries 
● Reference Application 
o Leverages a common architecture (front, middle, 
data, memcache, jar clients, Hystrix) 
o Implements functions that stress 
specific resources (cpu, service, db)

What's New in Teams Calling, Meetings and Devices March 2024
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 

Ensuring Performance in a Fast-Paced Environment (CMG 2014)

  • 1. Ensuring Performance in a Fast-Paced Environment Martin Spier Performance Engineering @ Netflix @spiermar mspier@netflix.com Performance & Capacity 2014 by CMG
  • 2. Martin Spier ● Performance Engineer @ Netflix ● Previously @ Expedia and Dell ● Performance o Architecture, Tuning and Profiling o Testing and Frameworks o Tool Development ● Blog @ http://overloaded.io ● Twitter @spiermar
  • 3. ● World's leading Internet television network ● ⅓ of all traffic heading into American homes at peak hours ● > 50 million members ● > 40 countries ● > 1 billion hours of TV shows and movies per month ● > 100s of different client devices
  • 4. Agenda ● How Netflix Works o Culture, Development Model, High-level Architecture, Platform ● Ensuring Performance o Auto-Scaling, Squeeze Tests, Simian Army, Hystrix, Redundancy, Canary Analysis, Performance Test Framework, Large Scale Tests
  • 5. Freedom and Responsibility ● Culture deck* is TRUE o 9M+ views ● Minimal process ● Context over control ● Root access to everything ● No approvals required ● Only Senior Engineers * http://www.slideshare.net/reed2001/culture-1798664
  • 6. Independent Development Teams ● Highly aligned, loosely coupled ● Free to define release cycles ● Free to choose any methodology ● But it’s an agile environment ● And there is a “paved road”
  • 7. Development Agility ● Continuous innovation cycle ● Shorter development cycles ● Automate everything! ● Self-service deployments ● A/B Tests ● Failure cost close to zero ● Lower time to market ● Innovation > Risk
  • 8.
  • 9. Architecture ● Scalable and Resilient ● Micro-services ● Stateless ● Assume Failure ● Backwards Compatible ● Service Discovery
  • 10. Zuul & Dynamic Routing ● Zuul, the front door for all requests from devices and websites to the backend of the Netflix streaming application ● Dynamic Routing ● Monitoring ● Resiliency and Security ● Region and AZ Failure * https://github.com/Netflix/zuul
  • 11. Cloud ● Amazon’s AWS ● Multi-region Active/Active ● Ephemeral Instances ● Auto-Scaling ● Netflix OSS (https://github.com/Netflix)
  • 12. Performance Engineering ● Not a part of any development team ● Not a shared service ● Through consultation improve and maintain the performance and reliability ● Provide self-service performance analysis utilities ● Disseminate performance best practices ● And we’re hiring!
  • 14. Auto-Scaling ● 5-6x Intraday ● Auto-Scaling Groups (ASGs) ● Reactive Auto-Scaling ● Predictive Auto-Scaling (Scryer)
  • 15. Squeeze Tests ● Stress Test, with Production Load ● Steering Production Traffic ● Understand the Upper Limits of Capacity ● Adjust Auto-Scaling Policies ● Automated Squeeze Tests
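One way to read the link between squeeze tests and auto-scaling policies: the squeeze test yields a measured upper limit per instance, and the policy keeps each instance below some fraction of that limit. The following is a minimal sketch under that assumption. The function name, the `target_utilization` headroom factor, and the formula itself are hypothetical illustrations, not the policy described in the talk.

```python
import math


def desired_instances(current_rps, max_rps_per_instance,
                      target_utilization=0.5, min_instances=1):
    """Estimate how many instances are needed for the current load.

    max_rps_per_instance is the per-instance ceiling found by a squeeze test.
    target_utilization is an assumed headroom factor (fraction of that ceiling).
    """
    if max_rps_per_instance <= 0:
        raise ValueError("max_rps_per_instance must be positive")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in (0, 1]")
    usable_rps = max_rps_per_instance * target_utilization
    needed = math.ceil(current_rps / usable_rps)
    return max(min_instances, needed)


# Hypothetical numbers: an intraday swing from 2,000 to 12,000 req/s (6x).
trough = desired_instances(2000, max_rps_per_instance=1000)
peak = desired_instances(12000, max_rps_per_instance=1000)
print(trough, peak)
```

If a later squeeze test shows a different per-instance ceiling, only `max_rps_per_instance` changes, which is one reason to automate the test.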
  • 16. Red/Black Pushes ● New builds are rolled out as new Auto-Scaling Groups (ASGs) ● Elastic Load Balancers (ELBs) control the traffic going to each ASG ● Fast and simple rollback if issues are found ● Canary Clusters are used to test builds before a full rollout
  • 17. Monitoring: Atlas ● Humongous, 1.2 billion distinct time series ● Integrated to all systems, production and test ● 1 minute resolution, quick roll ups ● 12-month persistence ● API and querying UI ● System and Application Level ● Servo (github.com/Netflix/servo) ● Custom dashboards
  • 18. Vector ● 1 second Resolution ● No Persistence ● Leverages Performance Co-Pilot (PCP) ● System-level Metrics ● Java Metrics (parfait) ● ElasticSearch, Cassandra ● Flame Graphs (Brendan Gregg)
  • 19. Mogul ● ASG and Instance Level ● Resource Demand; ● Performance Characteristics; ● And Downstream Dependencies.
  • 20. Slalom ● Cluster Level ● High-level Demand Flow ● Cross-application Request Tracing ● Downstream and Upstream Demand
  • 21. Canary Release “Canary release is a technique to reduce the risk of introducing a new software version in production by slowly rolling out the change to a small subset of users before rolling it out to the entire infrastructure and making it available to everybody.”
  • 22. Automatic Canary Analysis (ACA) Exactly what the name implies. An automated way of analyzing a canary release.
  • 23. ACA: Use Case ● You are a service owner and have finished implementing a new feature in your application. ● You want to determine if the new build, v1.1, is performing analogously to the existing build. ● The new build is deployed automatically to a canary cluster ● A small percentage of production traffic is steered to the canary cluster ● After a short period of time, canary analysis is triggered
  • 24. Automated Canary Analysis ● For a given set of metrics, ACA will compare samples from baseline and canary; ● Determine if they are analogous; ● Identify any metrics that deviate from the baseline; ● And generate a score that indicates the overall similarity of the canary.
  • 25. Automated Canary Analysis ● The score will be associated with a Go/No-Go decision; ● And the new build will be rolled out (or not) to the rest of the production environment. ● No workload definitions ● No synthetic load
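The ACA steps in the slides above (compare baseline and canary samples per metric, flag deviations, produce a score, map it to Go/No-Go) can be sketched roughly as follows. The slides do not say which statistical comparison ACA uses, so this sketch assumes a simple relative difference of means. The metric names, the 10% tolerance, and the pass score are all hypothetical.

```python
from statistics import mean


def metric_is_analogous(baseline, canary, tolerance=0.10):
    """True if the canary mean is within `tolerance` (relative) of the baseline mean."""
    base = mean(baseline)
    if base == 0:
        return mean(canary) == 0
    return abs(mean(canary) - base) / abs(base) <= tolerance


def canary_score(baseline_metrics, canary_metrics, tolerance=0.10):
    """Return (score in percent, names of metrics that deviate from the baseline)."""
    if not baseline_metrics:
        raise ValueError("no metrics to compare")
    deviating = [
        name for name, samples in baseline_metrics.items()
        if not metric_is_analogous(samples, canary_metrics[name], tolerance)
    ]
    total = len(baseline_metrics)
    score = 100.0 * (total - len(deviating)) / total
    return score, deviating


def decide(score, pass_score=75.0):
    """Map the similarity score to a Go/No-Go decision (threshold is assumed)."""
    return "GO" if score >= pass_score else "NO-GO"


# Hypothetical samples collected from the baseline and canary clusters.
baseline = {"latency_ms": [100, 102, 98], "error_rate": [0.01, 0.01, 0.01],
            "cpu_pct": [40, 42, 41], "gc_ms": [5, 5, 5]}
canary = {"latency_ms": [101, 103, 99], "error_rate": [0.01, 0.01, 0.01],
          "cpu_pct": [55, 57, 56], "gc_ms": [5, 5, 5]}

score, deviating = canary_score(baseline, canary)
print(score, deviating, decide(score))
```

Both clusters receive real production traffic, so no workload definition or synthetic load is needed; the comparison only needs the two sets of samples.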
  • 26. What about pre-production Performance Testing? When is it appropriate?
  • 27. Not always! Sometimes it doesn't make sense to run performance tests.
  • 28. Remember the short release cycles? With the short time span between production builds, pre-production tests don’t warn us much sooner. (And there’s ACA)
  • 29. So when? When it brings value. Not just because it is part of a process.
  • 30. When? Use Cases ● New Services ● Large Code Refactoring ● Architecture Changes ● Workload Changes ● Proof of Concept ● Initial Cluster Sizing ● Instance Type Migration
  • 31. Use Cases, cont. ● Troubleshooting ● Tuning ● Teams that release less frequently o Intermediary Builds ● Base Components (Paved Road) o Amazon Cloud Images (AMIs) o Platform o Common Libraries
  • 32. Who? ● Push “tests” to development teams ● Development understands the product, they developed it ● Performance Engineering knows the tools and techniques (so we help!) ● Easier to scale the effort!
  • 33. How? Environment ● Free to create any environment configuration ● Integration stack ● Full production-like or scaled-down environment ● Hybrid model o Performance + integration stack ● Production testing
  • 34. How? Test Framework ● Built around JMeter
  • 35. How? Test Framework ● Runs on Amazon’s EC2 ● Leverages Jenkins for orchestration
  • 36. How? Analysis ● In-house developed web analysis tool and API ● Results persisted on Amazon’s S3 and RDS
  • 37. How? Analysis ● Automated analysis built-in (thresholds) ● Customized alerts ● Interface with monitoring tools
  • 38.
  • 39. Large Scale Tests ● > 100k req/s ● > 100 load generators ● High Throughput Components o In-Memory Caches ● Component scaling ● Full production tests
  • 40. Large Scale Tests: Problems ● Your test client is likely the first bottleneck ● Components are (often) not designed to scale o Great performance per node; o But they don’t scale horizontally. o Controller, data feeder, load generator*, result collection, result analysis, monitoring * often the exception
  • 41. Large Scale Tests: Single Controller ● Single controller, multiple load generators ● Controller also serves as data feeder ● Controller collects all results synchronously ● Controller aggregates monitoring data ● Batch and async might alleviate the problem ● Analysis of large result sets is heavy (think percentiles)
  • 42. Large Scale Tests: Distributed Model ● Data Feeding and Load Generation o No Controller o Independent Load Generators ● Data Collection and Monitoring o Decentralized Monitoring Platform ● Data Analysis o Aggregation at node level o Hive/Pig o ElasticSearch
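"Aggregation at node level" can be illustrated with mergeable latency histograms: each load generator keeps bucket counts instead of raw samples, and the analysis step only sums the counts before estimating percentiles. This is a minimal sketch of that general technique, not the framework described in the talk. The bucket boundaries and function names are hypothetical.

```python
from bisect import bisect_left
from collections import Counter

# Hypothetical bucket upper bounds in milliseconds; anything above the last
# bound falls into an open-ended overflow bucket.
BUCKET_BOUNDS_MS = [1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]


def bucket_index(latency_ms):
    """Index of the first bucket whose upper bound is >= latency_ms."""
    return bisect_left(BUCKET_BOUNDS_MS, latency_ms)


def node_histogram(latencies_ms):
    """Built on each load generator: counts per bucket, not raw samples."""
    return Counter(bucket_index(x) for x in latencies_ms)


def merge(histograms):
    """Merging is a sum of counts, so it is cheap and order-independent."""
    total = Counter()
    for hist in histograms:
        total.update(hist)
    return total


def percentile_upper_bound(hist, pct):
    """Upper bound (ms) of the bucket that contains the pct-th percentile."""
    n = sum(hist.values())
    if n == 0:
        raise ValueError("empty histogram")
    rank = pct / 100.0 * n
    seen = 0
    for idx in sorted(hist):
        seen += hist[idx]
        if seen >= rank:
            if idx < len(BUCKET_BOUNDS_MS):
                return BUCKET_BOUNDS_MS[idx]
            return float("inf")
    return float("inf")


# Two hypothetical load generators report their own histograms.
node_a = node_histogram([3, 4, 8, 9])
node_b = node_histogram([15, 40, 45, 300])
merged = merge([node_a, node_b])
print(percentile_upper_bound(merged, 50), percentile_upper_bound(merged, 99))
```

The trade-off is precision: a percentile is only known to within its bucket, but the data shipped per node stays constant in size regardless of request volume.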
  • 43. Takeaways ● Canary analysis ● Testing only when it brings VALUE ● Leveraging cloud for tests ● Automated test analysis ● Pushing execution to development teams ● Open source tools
  • 44. Martin Spier mspier@netflix.com @spiermar http://overloaded.io/
  • 45. References ● parfait (https://code.google.com/p/parfait/) ● servo (https://github.com/Netflix/servo) ● hystrix (https://github.com/Netflix/Hystrix) ● culture deck (http://www.slideshare.net/reed2001/culture-1798664) ● zuul (https://github.com/Netflix/zuul) ● scryer (http://techblog.netflix.com/2013/11/scryer-netflixs-predictive-auto-scaling.html)
  • 47. Simian Army ● Ensures cloud handles failures through regular testing ● The Monkeys o Chaos Monkey: Resiliency o Latency: Artificial Delays o Conformity: Best-practices o Janitor: Unused Instances o Doctor: Health checks o Security: Security Violations o Chaos Gorilla: AZ Failure o Chaos Kong: Region Failure
  • 48. “... is a latency and fault tolerance library designed to isolate points of access to remote systems ...” ● Stop cascading failures. ● Fallbacks and graceful degradation ● Fail fast and rapid recovery ● Thread and semaphore isolation with circuit breakers ● Real-time monitoring and configuration changes * https://github.com/Netflix/Hystrix
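The circuit-breaker and fallback ideas listed on the slide above can be sketched in a few lines. Hystrix itself is a Java library; the class below is a hypothetical Python illustration of the general pattern and does not reproduce the Hystrix API, its thread or semaphore isolation, or its metrics. The threshold and timeout values are assumptions.

```python
import time


class CircuitBreaker:
    """Minimal circuit breaker: opens after repeated failures, fails fast while
    open, and allows a single trial call once the reset timeout has elapsed."""

    def __init__(self, failure_threshold=3, reset_timeout_s=5.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout_s = reset_timeout_s
        self.clock = clock  # injectable so the behavior can be tested
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func, fallback):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout_s:
                return fallback()  # fail fast, do not touch the remote system
            # Timeout elapsed: fall through and allow one trial call.
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open (or re-open) the circuit
            return fallback()
        # Success closes the circuit and resets the failure count.
        self.failures = 0
        self.opened_at = None
        return result


# Usage with a fake clock and a hypothetical failing dependency.
now = [0.0]
attempts = []


def failing_dependency():
    attempts.append(1)
    raise RuntimeError("dependency down")


breaker = CircuitBreaker(failure_threshold=2, reset_timeout_s=10,
                         clock=lambda: now[0])
first = breaker.call(failing_dependency, lambda: "fallback")
second = breaker.call(failing_dependency, lambda: "fallback")
third = breaker.call(failing_dependency, lambda: "fallback")  # short-circuited
now[0] = 11.0
recovered = breaker.call(lambda: "ok", lambda: "fallback")
print(first, second, third, len(attempts), recovered)
```

Returning the fallback while the circuit is open is what stops a slow or failing dependency from tying up callers and cascading upstream.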
  • 49. Real-time Analytics Platform (RTA) ● ACA runs on top of RTA ● Compute Engines o OpenCPU (R) o OpenPY (Python) ● Data Sources o Real-time Monitoring Systems o Big Data Platforms ● Reporting, Scheduling, Persistence
  • 50. Slow Performance Regression ● Deviation => “acceptable” regression ● Small performance regressions might sneak in ● Short release cycle = many releases ● Many releases = cumulative regression
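The compounding effect behind this slide is simple arithmetic: if every release passes canary analysis while being slightly slower than its predecessor, the slowdowns multiply. A minimal sketch, assuming a constant per-release regression; the 1% figure and the release count are hypothetical.

```python
def cumulative_regression(per_release_regression, releases):
    """Compound slowdown factor after `releases` releases, each slower than the
    previous one by `per_release_regression` (a fraction, e.g. 0.01 for 1%)."""
    return (1.0 + per_release_regression) ** releases


# Hypothetical: each release is 1% slower, well inside an "acceptable" deviation.
factor = cumulative_regression(0.01, 50)
print(f"{(factor - 1) * 100:.1f}% slower than the original baseline")
```

With these assumed numbers the result is roughly 64% slower than the original build, even though no single comparison against the previous build would have flagged a problem. Comparing against a fixed, older baseline in addition to the previous build is one way to surface this kind of drift.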
  • 52. Testing Lower Level Components ● Base AMIs o OS (Linux), tools and agents ● Common Application Platform ● Common Libraries ● Reference Application o Leverages a common architecture (front, middle, data, memcache, jar clients, Hystrix) o Implements functions that stress specific resources (cpu, service, db)