SlideShare une entreprise Scribd logo
1  sur  65
Resilience and Security @ Scale – Lessons Learned
Jason Chan - chan@netflix.com
Netflix, Inc.


 “Netflix is the world’s leading Internet television
  network with more than 33 million members in
   40 countries enjoying more than one billion
   hours of TV shows and movies per month,
            including original series . . .”

Source: http://ir.netflix.com
Me
 Director of Engineering @ Netflix
 Responsible for:
   Cloud app, product, infrastructure, ops security
 Previously:
   Led security team @ VMware
   Earlier, primarily security consulting at @stake, iSEC Partners
Netflix in the Cloud – Why?
Availability and the Move to Streaming
“Undifferentiated Heavy Lifting”
Netflix Culture




“may well be the most important document ever to come out of the Valley.”
                    Sheryl Sandberg, Facebook COO
Scale and Usage Curve
Netflix is now ~99% in the cloud
On the way to the cloud . . . (architecture)
On the way to the cloud . . . (organization)




                              (or NoOps, depending on definitions)
Some As-Is #s
  33m+ subscribers
  10,000s of systems
  100s of engineers, apps
  ~250 test deployments/day **
  ~70 production deployments/day **




    ** Sample based on one week‟s activities
Common Approaches to Reslience
Common Controls to Promote Resilience
 Architectural committees       Designed to standardize on
 Change approval boards          design patterns, vendors, etc.
 Centralized deployments        Problems for Netflix:
                                    Freedom and Responsibility
 Vendor-specific, component-
                                     Culture
  level HA
                                    Highly aligned and loosely
 Standards and checklists           coupled
                                    Innovation cycles
Common Controls to Promote Resilience
 Architectural committees       Designed to control and de-
 Change approval boards          risk change
 Centralized deployments        Focus on artifacts, test and
                                  rollback plans
 Vendor-specific, component-
  level HA                       Problems for Netflix:
                                    Freedom and Responsibility
 Standards and checklists
                                     Culture
                                    Highly aligned and loosely
                                     coupled
                                    Innovation cycles
Common Controls to Promote Resilience
 Architectural committees       Separate Ops team deploys at
 Change approval boards          a pre-ordained time (e.g.
                                  weekly, monthly)
 Centralized deployments
                                 Problems for Netflix:
 Vendor-specific, component-
                                    Freedom and Responsibility
  level HA
                                     Culture
 Standards and checklists          Highly aligned and loosely
                                     coupled
                                    Innovation cycles
Common Controls to Promote Resilience
 Architectural committees       High reliance on vendor
 Change approval boards          solutions to provide HA and
                                  resilience
 Centralized deployments
                                 Problems for Netflix:
 Vendor-specific, component-
                                    Traditional data center oriented
  level HA
                                     systems do not translate well
 Standards and checklists           to the cloud
                                    Heavy use of open source
Common Controls to Promote Resilience
 Architectural committees       Designed for repeatable
 Change approval boards          execution
 Centralized deployments        Problems for Netflix:
                                    Not suitable for load-based
 Vendor-specific, component-
                                     scaling and heavy automation
  level HA
                                    Reliance on humans
 Standards and checklists
Approaches to Resilience @ Netflix
What does the business value?
 Customer experience                  Remember these guys?
 Innovation and agility
 In other words:
    Stability and availability for
     customer experience
    Rapid development and
     change to continually improve
     product and outpace
     competition
 Not that different from anyone
  else
Overall Approach
 Understand and solve for relevant failure modes
 Rely on automation and tools instead of committees for
  evaluating architecture and changes
 Make deployment easy and standardized
Cloud Application Failure Modes and Effects
Failure Mode         Probability    Current Mitigation
App Failure          High           Automated fallback response
AWS Region Failure Low              Wait for recovery
AWS Zone Failure     Medium         Continue running in 2 of 3 zones
Datacenter Failure   Medium         Continue migrating to cloud
Data Store Failure   Low            Restore from S3
S3 Failure           Low            Restore from remote archive


   Risk-based approach given likely failures
   Tackle high-probability events first
Simian Army
Goals of Simian Army




“Each system has to be able to succeed, no matter what, even all on its own.
We‟re designing each distributed system to expect and tolerate failure from
other systems on which it depends.”

http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
Chaos Monkey
 “By frequently causing failures, we force our services to
  be built in a way that is more resilient.”
 Terminates cluster nodes during business hours
 Rejects “If it ain‟t broke, don‟t fix it”
 Goals:
    Simulate random hardware failures, human error at small scale
    Identify weaknesses
    No service impact
Chaos Gorilla
 Chaos Monkey‟s bigger brother
 Standard deployment pattern is to distribute
  load/systems/data across three data centers (AZs)
 What happens if one is lost?
 Goals:
   Simulate data center loss, hardware/service failures at larger
    scale
   Identify weaknesses, dependencies, etc.
   Minimal service impact
Latency Monkey
 Distributed systems have many upstream/downstream
  connections
 How fault-tolerant are systems to dependency
  failure/slowdown?
 Goals:
   Simulate latencies and error codes, see how a service responds
   Survivable services regardless of dependencies
Conformity Monkey
 Without architecture review, how do you ensure designs
  leverage known successful patterns?
 Conformity Monkey provides automated analysis for
  pattern adherence
 Goals:
   Evaluate deployment modes (data center distribution)
   Evaluate health checks, discoverability, versions of key libraries
   Help ensure service has best chance of successful operation
Non-Simian Approaches
 Org model
   Engineers write, deploy, support code
 Culture
   De-centralized with as few processes and rules as possible
   Lots of local autonomy
   “If you‟re not failing, you‟re not trying hard enough”
   Peer pressure
 Productive and transparent incident reviews
AppSec Challenges
Lots of Good Advice
  BSIMM
  Microsoft SDL
  SAFECode
But, what works?




  Forrester Consulting, 12/10
Especially, given phenomena such as DevOps,
cloud, agile, and the unique characteristics of an
                   organization?
Deploying Code at Netflix
A common graph @ Netflix
                               Weekend afternoon ramp-up
 Lots of watching in prime time                          Not as much in early morning




             Old way - pay and provision for peak, 24/7/365

   Multiply this pattern across the dozens of apps that comprise the
                        Netflix streaming service
Solution: Load-Based Autoscaling
Autoscaling
 Goals:
   # of systems matches load requirements
   Load per server is constant
   Happens without intervention (the „auto‟ in autoscaling)
 Results:
   Clusters continuously add & remove nodes
   New nodes must mirror existing
Every change requires a new cluster push
(not an incremental change to existing systems)
Deploying code must be easy
           (it is)
Netflix Deployment Pipeline


                 RPM with
                app-specific                   VM template
                    bits                      ready to launch


                   YUM                             AMI




Perforce/Git                      Bakery                            ASG
Code change                    Base image +                      Cluster config
Config change                     RPM                           Running systems
Operational Impact
 No changes to running systems
 No systems mgmt infrastructure (Puppet, Chef, etc.)
 Fewer logins to prod
 No snowflakes
 Trivial “rollback”
Security Impact
 Need to think differently on:
    Vulnerability management
    Patch management
    User activity monitoring
    File integrity monitoring
    Forensic investigations
Architecture, organization, deployment
            are all different.
         What about security?
We‟ve adapted too.
Some principles we‟ve found useful.
Cloud Application Security: What We Emphasize
Points of Emphasis
 Integrate                  Two contexts:
                               1. Integration with your
 Make the right way easy         engineering ecosystem
 Self-service, with           2. Integration of your security
  exceptions                      controls
                             Organization
 Trust, but verify
                             SCM, build and release
                             Monitoring and alerting




                                                                 47
Integration: Base AMI Testing
 Base AMI – VM/instance template used for all cloud systems
      Average instance age = ~24 days (one-time sample)

 The base AMI is managed like other packages, via P4, Jenkins, etc.
 We watch the SCM directory & kick off testing when it changes
 Launch an instance of the AMI, perform vuln scan and other checks

                                                    SCAN COMPLETED ALERT

                                                    Site name: AMI1

                                                    Stopped by: N/A

                                                    Total Scan Time: 4 minutes 46 seconds

                                                    Critical Vulnerabilities: 5
                                                    Severe Vulnerabilities:   4
                                                    Moderate Vulnerabilities: 4
Integration: Control Packaging and Installation

  From the RPM spec file of a webserver:
 Requires:   ossec cloudpassage nflx-base-harden hyperguard-enforcer



 Pulls in the following RPMs:
    HIDS agent
    Config assessment/firewall agent
    Host hardening package
    WAF
Integration: Timeline (Chronos)
 What IP addresses have been blacklisted by the WAF in
  the last few weeks?
 GET /api/v1/event?timelines=type:blacklist&start=20130125000000000

 Which security groups have changed today?
 GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
Points of Emphasis
 Integrate                  Developers are lazy

 Make the right way easy
 Self-service, with
  exceptions
 Trust, but verify
Making it Easy: Cryptex
 Crypto: DDIY (“Don‟t Do It Yourself”)
 Many uses of crypto in web/distributed systems:
   Encrypt/decrypt (cookies, data, etc.)
   Sign/verify (URLs, data, etc.)
 Netflix also uses heavily for device activation, DRM
  playback, etc.
Making it Easy: Cryptex
 Multi-layer crypto system (HSM basis, scale out layer)
   Easy to use
   Key management handled transparently
   Access control and auditable operations
Making it Easy: Cloud-Based SSO
 In the AWS cloud, access to data center services is
  problematic
   Examples: AD, LDAP, DNS
 But, many cloud-based systems require authN, authZ
   Examples: Dashboards, admin UIs
 Asking developers to securely handle/accept credentials
  is also problematic
Making it Easy: Cloud-Based SSO
 Solution: Leverage OneLogin SaaS SSO (SAML) used
  by IT for enterprise apps (e.g. Workday, Google Apps)
 Uses Active Directory credentials
 Provides a single & centralized login page
    Developers don‟t accept username & password directly
 Built filter for our base server to make SSO/authN trivial
Points of Emphasis
 Integrate                  Self-service is perhaps the
                              most transformative cloud
 Make the right way easy     characteristic
 Self-service, with         Failing to adopt this for security
  exceptions                  controls will lead to friction
 Trust, but verify
Self-Service: Security Groups
 Asgard cloud orchestration tool allows developers to
  configure their own firewall rules
 Limited to same AWS account, no IP-based rules
Points of Emphasis
 Integrate                  Culture precludes traditional
                              “command and control”
 Make the right way easy     approach
 Self-service, with         Organizational desire for agile,
  exceptions                  DevOps, CI/CD blur traditional
                              security engagement
 Trust, but verify           touchpoints
Trust but Verify: Security Monkey
 Cloud APIs make verification       Includes:
  and analysis of configuration         Certificate checking
  and running state simpler             Firewall analysis
 Security Monkey created as            IAM entity analysis
  the framework for this analysis       Limit warnings
                                        Resource policy analysis
Trust but Verify: Security Monkey




                   From: Security Monkey
                   Date: Wed, 24 Oct 2012 17:08:18 +0000
                   To: Security Alerts
                   Subject: prod Changes Detected


                          Table of Contents:
                              Security Groups

                                      Changed Security Group


                                          <sgname> (eu-west-1 / prod)
                                           <#Security Group/<sgname> (eu-west-1 / prod)>
Trust but Verify: Exploit Monkey
  AWS Autoscaling group is unit of deployment, so
   changes signal a good time to rerun dynamic scans

 On 10/23/12 12:35 PM, Exploit Monkey wrote:

 I noticed that testapp-live has changed current ASG name from testapp-
 live-v001 to testapp-live-v002.

 I'm starting a vulnerability scan against test app from these
 private/public IPs:
 10.29.24.174
Takeaways
  Netflix runs a large, dynamic service in AWS

  Newer concepts like cloud & DevOps need an
   updated approach to resilience and security

  Specific context can help jumpstart a pragmatic
   and effective security program
Netflix References
 http://netflix.github.com
 http://techblog.netflix.com
 http://slideshare.net/netflix
Other References
 http://www.webpronews.com/netflix-outage-angers-customers-2008-
  08
 http://www.pcmag.com/article2/0,2817,2395372,00.asp
 http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php
 http://bsimm.com/online/
 http://www.microsoft.com/en-
  us/download/confirmation.aspx?id=29884
 http://www.slideshare.net/reed2001/culture-1798664
 http://techcrunch.com/2013/01/31/read-what-facebooks-sandberg-
  calls-maybe-the-most-important-document-ever-to-come-out-of-the-
  valley/
 http://www.gauntlt.org
Questions?




             chan@netflix.com

Contenu connexe

Tendances

Security at the Speed of Software Development
Security at the Speed of Software DevelopmentSecurity at the Speed of Software Development
Security at the Speed of Software DevelopmentDevOps.com
 
Overcoming Security Challenges in DevOps
Overcoming Security Challenges in DevOpsOvercoming Security Challenges in DevOps
Overcoming Security Challenges in DevOpsAlert Logic
 
Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...
Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...
Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...Jules Pierre-Louis
 
Securing Systems at Cloud Scale with DevSecOps
Securing Systems at Cloud Scale with DevSecOpsSecuring Systems at Cloud Scale with DevSecOps
Securing Systems at Cloud Scale with DevSecOpsAmazon Web Services
 
Successfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the CloudSuccessfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the CloudAmazon Web Services
 
Maturing your organization from DevOps to DevSecOps
Maturing your organization from DevOps to DevSecOpsMaturing your organization from DevOps to DevSecOps
Maturing your organization from DevOps to DevSecOpsAmazon Web Services
 
Managing Quality of Service for Containerized Microservice Applications
Managing Quality of Service for Containerized Microservice ApplicationsManaging Quality of Service for Containerized Microservice Applications
Managing Quality of Service for Containerized Microservice ApplicationsJules Pierre-Louis
 
DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015Shannon Lietz
 
SecDevOps 2.0 - Managing Your Robot Army
SecDevOps 2.0 - Managing Your Robot ArmySecDevOps 2.0 - Managing Your Robot Army
SecDevOps 2.0 - Managing Your Robot Armyconjur_inc
 
SecDevOps: The New Black of IT
SecDevOps: The New Black of ITSecDevOps: The New Black of IT
SecDevOps: The New Black of ITCloudPassage
 
OWASP AppSec EU - SecDevOps, a view from the trenches - Abhay Bhargav
OWASP AppSec EU - SecDevOps, a view from the trenches - Abhay BhargavOWASP AppSec EU - SecDevOps, a view from the trenches - Abhay Bhargav
OWASP AppSec EU - SecDevOps, a view from the trenches - Abhay BhargavAbhay Bhargav
 
DevOps In Azure: Deliver Value With Automation
DevOps In Azure: Deliver Value With AutomationDevOps In Azure: Deliver Value With Automation
DevOps In Azure: Deliver Value With AutomationUtkarsh Pandey
 
ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015Shannon Lietz
 
Integrating DevOps and Security
Integrating DevOps and SecurityIntegrating DevOps and Security
Integrating DevOps and SecurityStijn Muylle
 
A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016
A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016
A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016Shannon Lietz
 
Cloud Security Essentials 2.0 at RSA
Cloud Security Essentials 2.0 at RSACloud Security Essentials 2.0 at RSA
Cloud Security Essentials 2.0 at RSAShannon Lietz
 
DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012
DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012
DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012Nick Galbreath
 
DevSecOps: Minimizing Risk, Improving Security
DevSecOps: Minimizing Risk, Improving SecurityDevSecOps: Minimizing Risk, Improving Security
DevSecOps: Minimizing Risk, Improving SecurityFranklin Mosley
 

Tendances (20)

Security at the Speed of Software Development
Security at the Speed of Software DevelopmentSecurity at the Speed of Software Development
Security at the Speed of Software Development
 
Overcoming Security Challenges in DevOps
Overcoming Security Challenges in DevOpsOvercoming Security Challenges in DevOps
Overcoming Security Challenges in DevOps
 
Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...
Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...
Microservice Monitoring and Quality Management for Modern Apps and Infrastruc...
 
Securing Systems at Cloud Scale with DevSecOps
Securing Systems at Cloud Scale with DevSecOpsSecuring Systems at Cloud Scale with DevSecOps
Securing Systems at Cloud Scale with DevSecOps
 
Successfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the CloudSuccessfully Implementing DEV-SEC-OPS in the Cloud
Successfully Implementing DEV-SEC-OPS in the Cloud
 
Maturing your organization from DevOps to DevSecOps
Maturing your organization from DevOps to DevSecOpsMaturing your organization from DevOps to DevSecOps
Maturing your organization from DevOps to DevSecOps
 
Managing Quality of Service for Containerized Microservice Applications
Managing Quality of Service for Containerized Microservice ApplicationsManaging Quality of Service for Containerized Microservice Applications
Managing Quality of Service for Containerized Microservice Applications
 
DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015DevSecCon KeyNote London 2015
DevSecCon KeyNote London 2015
 
SecDevOps 2.0 - Managing Your Robot Army
SecDevOps 2.0 - Managing Your Robot ArmySecDevOps 2.0 - Managing Your Robot Army
SecDevOps 2.0 - Managing Your Robot Army
 
SecDevOps: The New Black of IT
SecDevOps: The New Black of ITSecDevOps: The New Black of IT
SecDevOps: The New Black of IT
 
Implementing DevSecOps
Implementing DevSecOpsImplementing DevSecOps
Implementing DevSecOps
 
OWASP AppSec EU - SecDevOps, a view from the trenches - Abhay Bhargav
OWASP AppSec EU - SecDevOps, a view from the trenches - Abhay BhargavOWASP AppSec EU - SecDevOps, a view from the trenches - Abhay Bhargav
OWASP AppSec EU - SecDevOps, a view from the trenches - Abhay Bhargav
 
DevOps In Azure: Deliver Value With Automation
DevOps In Azure: Deliver Value With AutomationDevOps In Azure: Deliver Value With Automation
DevOps In Azure: Deliver Value With Automation
 
ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015ISACA Ireland Keynote 2015
ISACA Ireland Keynote 2015
 
Integrating DevOps and Security
Integrating DevOps and SecurityIntegrating DevOps and Security
Integrating DevOps and Security
 
A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016
A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016
A Throwaway Deck for Cloud Security Essentials 2.0 delivered at RSA 2016
 
Cloud Security Essentials 2.0 at RSA
Cloud Security Essentials 2.0 at RSACloud Security Essentials 2.0 at RSA
Cloud Security Essentials 2.0 at RSA
 
DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012
DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012
DevOpsSec: Appling DevOps Principles to Security, DevOpsDays Austin 2012
 
DevSecOps OWASP
DevSecOps OWASPDevSecOps OWASP
DevSecOps OWASP
 
DevSecOps: Minimizing Risk, Improving Security
DevSecOps: Minimizing Risk, Improving SecurityDevSecOps: Minimizing Risk, Improving Security
DevSecOps: Minimizing Risk, Improving Security
 

En vedette

Cloud risk and business continuity v21
Cloud risk and business continuity v21Cloud risk and business continuity v21
Cloud risk and business continuity v21Jorge Sebastiao
 
2013 michael coates-javaone
2013 michael coates-javaone2013 michael coates-javaone
2013 michael coates-javaoneMichael Coates
 
Security at Scale - Lessons from Six Months at Yahoo
Security at Scale - Lessons from Six Months at YahooSecurity at Scale - Lessons from Six Months at Yahoo
Security at Scale - Lessons from Six Months at YahooAlex Stamos
 
Practical Cloud Security
Practical Cloud SecurityPractical Cloud Security
Practical Cloud SecurityJason Chan
 
Virtualization: Security and IT Audit Perspectives
Virtualization: Security and IT Audit PerspectivesVirtualization: Security and IT Audit Perspectives
Virtualization: Security and IT Audit PerspectivesJason Chan
 
Cloud Application Security: Lessons Learned
Cloud Application Security: Lessons LearnedCloud Application Security: Lessons Learned
Cloud Application Security: Lessons LearnedJason Chan
 
Practical Security Automation
Practical Security AutomationPractical Security Automation
Practical Security AutomationJason Chan
 
Amazon Web Services Security
Amazon Web Services SecurityAmazon Web Services Security
Amazon Web Services SecurityJason Chan
 
AWS Security: A Practitioner's Perspective
AWS Security: A Practitioner's PerspectiveAWS Security: A Practitioner's Perspective
AWS Security: A Practitioner's PerspectiveJason Chan
 
The Psychology of Security Automation
The Psychology of Security AutomationThe Psychology of Security Automation
The Psychology of Security AutomationJason Chan
 
Defending Netflix from Abuse
Defending Netflix from AbuseDefending Netflix from Abuse
Defending Netflix from AbuseJason Chan
 
AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...
AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...
AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...Amazon Web Services
 
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...Amazon Web Services
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconAdrian Cockcroft
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud ArchitectureAdrian Cockcroft
 

En vedette (15)

Cloud risk and business continuity v21
Cloud risk and business continuity v21Cloud risk and business continuity v21
Cloud risk and business continuity v21
 
2013 michael coates-javaone
2013 michael coates-javaone2013 michael coates-javaone
2013 michael coates-javaone
 
Security at Scale - Lessons from Six Months at Yahoo
Security at Scale - Lessons from Six Months at YahooSecurity at Scale - Lessons from Six Months at Yahoo
Security at Scale - Lessons from Six Months at Yahoo
 
Practical Cloud Security
Practical Cloud SecurityPractical Cloud Security
Practical Cloud Security
 
Virtualization: Security and IT Audit Perspectives
Virtualization: Security and IT Audit PerspectivesVirtualization: Security and IT Audit Perspectives
Virtualization: Security and IT Audit Perspectives
 
Cloud Application Security: Lessons Learned
Cloud Application Security: Lessons LearnedCloud Application Security: Lessons Learned
Cloud Application Security: Lessons Learned
 
Practical Security Automation
Practical Security AutomationPractical Security Automation
Practical Security Automation
 
Amazon Web Services Security
Amazon Web Services SecurityAmazon Web Services Security
Amazon Web Services Security
 
AWS Security: A Practitioner's Perspective
AWS Security: A Practitioner's PerspectiveAWS Security: A Practitioner's Perspective
AWS Security: A Practitioner's Perspective
 
The Psychology of Security Automation
The Psychology of Security AutomationThe Psychology of Security Automation
The Psychology of Security Automation
 
Defending Netflix from Abuse
Defending Netflix from AbuseDefending Netflix from Abuse
Defending Netflix from Abuse
 
AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...
AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...
AWS re:Invent 2016: Microservices, Macro Security Needs: How Nike Uses a Mult...
 
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
AWS re:Invent 2016: Security Automation: Spend Less Time Securing Your Applic...
 
Netflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at GlueconNetflix Architecture Tutorial at Gluecon
Netflix Architecture Tutorial at Gluecon
 
Netflix Global Cloud Architecture
Netflix Global Cloud ArchitectureNetflix Global Cloud Architecture
Netflix Global Cloud Architecture
 

Similaire à Resilience and Security @ Scale: Lessons Learned

Fast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWSFast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWSAmazon Web Services
 
Enabling multicloud in the enterprise with DevSecOps
Enabling multicloud in the enterprise with DevSecOpsEnabling multicloud in the enterprise with DevSecOps
Enabling multicloud in the enterprise with DevSecOpsJosh Boyd
 
Devtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFVDevtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFVAlex Henthorn-Iwane
 
Continuous Delivery series: How to automate your infrastructure toolchain
Continuous Delivery series: How to automate your infrastructure toolchainContinuous Delivery series: How to automate your infrastructure toolchain
Continuous Delivery series: How to automate your infrastructure toolchainSerena Software
 
Implementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architectureImplementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architectureDavide Veronese
 
Deploying more technology to shift from agility to anti-fragility
Deploying more technology to shift from agility to anti-fragilityDeploying more technology to shift from agility to anti-fragility
Deploying more technology to shift from agility to anti-fragilitySpyros Lambrinidis
 
Dev ops for mainframe innovate session 2402
Dev ops for mainframe innovate session 2402Dev ops for mainframe innovate session 2402
Dev ops for mainframe innovate session 2402Rosalind Radcliffe
 
Agile and continuous delivery – How IBM Watson Workspace is built
Agile and continuous delivery – How IBM Watson Workspace is builtAgile and continuous delivery – How IBM Watson Workspace is built
Agile and continuous delivery – How IBM Watson Workspace is builtVincent Burckhardt
 
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)Alex Henthorn-Iwane
 
XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...
XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...
XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...XebiaLabs
 
Enterprise DevOps: Scaling Build, Deploy, Test, Release
Enterprise DevOps: Scaling Build, Deploy, Test, ReleaseEnterprise DevOps: Scaling Build, Deploy, Test, Release
Enterprise DevOps: Scaling Build, Deploy, Test, ReleaseIBM UrbanCode Products
 
ClearScale: Continuous Automation with Docker on AWS
ClearScale: Continuous Automation with Docker on AWSClearScale: Continuous Automation with Docker on AWS
ClearScale: Continuous Automation with Docker on AWSAmazon Web Services
 
Dev ops developer (session 3)
Dev ops developer (session 3)Dev ops developer (session 3)
Dev ops developer (session 3)MSDEVMTL
 
CSC AWS re:Invent Enterprise DevOps session
CSC AWS re:Invent Enterprise DevOps sessionCSC AWS re:Invent Enterprise DevOps session
CSC AWS re:Invent Enterprise DevOps sessionTom Laszewski
 
(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...
(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...
(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...Amazon Web Services
 
Scrum Portugal Meeting 1 Lisbon - ALM
Scrum Portugal Meeting 1 Lisbon - ALMScrum Portugal Meeting 1 Lisbon - ALM
Scrum Portugal Meeting 1 Lisbon - ALMMarco Silva
 
Application Lifecycle Management (ALM), by Marco Silva
Application Lifecycle Management (ALM), by Marco SilvaApplication Lifecycle Management (ALM), by Marco Silva
Application Lifecycle Management (ALM), by Marco SilvaAgile Connect®
 
Infrastructure as Code with Chef
Infrastructure as Code with ChefInfrastructure as Code with Chef
Infrastructure as Code with ChefSarah Hynes Cheney
 
DevOps and Build Automation
DevOps and Build AutomationDevOps and Build Automation
DevOps and Build AutomationHeiswayi Nrird
 

Similaire à Resilience and Security @ Scale: Lessons Learned (20)

Fast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWSFast, Secure Deployments with Docker on AWS
Fast, Secure Deployments with Docker on AWS
 
Enabling multicloud in the enterprise with DevSecOps
Enabling multicloud in the enterprise with DevSecOpsEnabling multicloud in the enterprise with DevSecOps
Enabling multicloud in the enterprise with DevSecOps
 
Devtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFVDevtest Orchestration for SDN & NFV
Devtest Orchestration for SDN & NFV
 
Continuous Delivery series: How to automate your infrastructure toolchain
Continuous Delivery series: How to automate your infrastructure toolchainContinuous Delivery series: How to automate your infrastructure toolchain
Continuous Delivery series: How to automate your infrastructure toolchain
 
Implementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architectureImplementing dev ops to face a two speed it architecture
Implementing dev ops to face a two speed it architecture
 
Deploying more technology to shift from agility to anti-fragility
Deploying more technology to shift from agility to anti-fragilityDeploying more technology to shift from agility to anti-fragility
Deploying more technology to shift from agility to anti-fragility
 
Dev ops for mainframe innovate session 2402
Dev ops for mainframe innovate session 2402Dev ops for mainframe innovate session 2402
Dev ops for mainframe innovate session 2402
 
Agile and continuous delivery – How IBM Watson Workspace is built
Agile and continuous delivery – How IBM Watson Workspace is builtAgile and continuous delivery – How IBM Watson Workspace is built
Agile and continuous delivery – How IBM Watson Workspace is built
 
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
The Carrier DevOps Trend (Presented to Okinawa Open Days Conference)
 
XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...
XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...
XebiaLabs, CloudBees, Puppet Labs Webinar Slides - IT Automation for the Mode...
 
Enterprise DevOps: Scaling Build, Deploy, Test, Release
Enterprise DevOps: Scaling Build, Deploy, Test, ReleaseEnterprise DevOps: Scaling Build, Deploy, Test, Release
Enterprise DevOps: Scaling Build, Deploy, Test, Release
 
ClearScale: Continuous Automation with Docker on AWS
ClearScale: Continuous Automation with Docker on AWSClearScale: Continuous Automation with Docker on AWS
ClearScale: Continuous Automation with Docker on AWS
 
Dev ops developer (session 3)
Dev ops developer (session 3)Dev ops developer (session 3)
Dev ops developer (session 3)
 
CSC AWS re:Invent Enterprise DevOps session
CSC AWS re:Invent Enterprise DevOps sessionCSC AWS re:Invent Enterprise DevOps session
CSC AWS re:Invent Enterprise DevOps session
 
What is DevOps?
What is DevOps?What is DevOps?
What is DevOps?
 
(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...
(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...
(ENT210) Accelerating Business Innovation with DevOps on AWS | AWS re:Invent ...
 
Scrum Portugal Meeting 1 Lisbon - ALM
Scrum Portugal Meeting 1 Lisbon - ALMScrum Portugal Meeting 1 Lisbon - ALM
Scrum Portugal Meeting 1 Lisbon - ALM
 
Application Lifecycle Management (ALM), by Marco Silva
Application Lifecycle Management (ALM), by Marco SilvaApplication Lifecycle Management (ALM), by Marco Silva
Application Lifecycle Management (ALM), by Marco Silva
 
Infrastructure as Code with Chef
Infrastructure as Code with ChefInfrastructure as Code with Chef
Infrastructure as Code with Chef
 
DevOps and Build Automation
DevOps and Build AutomationDevOps and Build Automation
DevOps and Build Automation
 

Dernier

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentMahmoud Rabie
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesKari Kakkonen
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integrationmarketing932765
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsRavi Sanghani
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...Karmanjay Verma
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialJoão Esperancinha
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentPim van der Noll
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfROWELL MARQUINA
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI AgeCprime
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPathCommunity
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Nikki Chapple
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkPixlogix Infotech
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024TopCSSGallery
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Kaya Weers
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesBernd Ruecker
 

Dernier (20)

Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
Digital Tools & AI in Career Development
Digital Tools & AI in Career DevelopmentDigital Tools & AI in Career Development
Digital Tools & AI in Career Development
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Testing tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examplesTesting tools and AI - ideas what to try with some tool examples
Testing tools and AI - ideas what to try with some tool examples
 
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS:  6 Ways to Automate Your Data IntegrationBridging Between CAD & GIS:  6 Ways to Automate Your Data Integration
Bridging Between CAD & GIS: 6 Ways to Automate Your Data Integration
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
 
Potential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and InsightsPotential of AI (Generative AI) in Business: Learnings and Insights
Potential of AI (Generative AI) in Business: Learnings and Insights
 
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...React JS; all concepts. Contains React Features, JSX, functional & Class comp...
React JS; all concepts. Contains React Features, JSX, functional & Class comp...
 
Kuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorialKuma Meshes Part I - The basics - A tutorial
Kuma Meshes Part I - The basics - A tutorial
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native developmentEmixa Mendix Meetup 11 April 2024 about Mendix Native development
Emixa Mendix Meetup 11 April 2024 about Mendix Native development
 
QMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdfQMMS Lesson 2 - Using MS Excel Formula.pdf
QMMS Lesson 2 - Using MS Excel Formula.pdf
 
A Framework for Development in the AI Age
A Framework for Development in the AI AgeA Framework for Development in the AI Age
A Framework for Development in the AI Age
 
UiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to HeroUiPath Community: Communication Mining from Zero to Hero
UiPath Community: Communication Mining from Zero to Hero
 
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
Microsoft 365 Copilot: How to boost your productivity with AI – Part two: Dat...
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
React Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App FrameworkReact Native vs Ionic - The Best Mobile App Framework
React Native vs Ionic - The Best Mobile App Framework
 
Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024Top 10 Hubspot Development Companies in 2024
Top 10 Hubspot Development Companies in 2024
 
Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)Design pattern talk by Kaya Weers - 2024 (v2)
Design pattern talk by Kaya Weers - 2024 (v2)
 
QCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architecturesQCon London: Mastering long-running processes in modern architectures
QCon London: Mastering long-running processes in modern architectures
 

Resilience and Security @ Scale: Lessons Learned

  • 1. Resilience and Security @ Scale – Lessons Learned Jason Chan - chan@netflix.com
  • 2. Netflix, Inc. “Netflix is the world’s leading Internet television network with more than 33 million members in 40 countries enjoying more than one billion hours of TV shows and movies per month, including original series . . .” Source: http://ir.netflix.com
  • 3. Me  Director of Engineering @ Netflix  Responsible for:  Cloud app, product, infrastructure, ops security  Previously:  Led security team @ VMware  Earlier, primarily security consulting at @stake, iSEC Partners
  • 4. Netflix in the Cloud – Why?
  • 5. Availability and the Move to Streaming
  • 7. Netflix Culture “may well be the most important document ever to come out of the Valley.” Sheryl Sandberg, Facebook COO
  • 9. Netflix is now ~99% in the cloud
  • 10. On the way to the cloud . . . (architecture)
  • 11. On the way to the cloud . . . (organization) (or NoOps, depending on definitions)
  • 12. Some As-Is #s  33m+ subscribers  10,000s of systems  100s of engineers, apps  ~250 test deployments/day **  ~70 production deployments/day ** ** Sample based on one week‟s activities
  • 13. Common Approaches to Reslience
  • 14. Common Controls to Promote Resilience  Architectural committees  Designed to standardize on  Change approval boards design patterns, vendors, etc.  Centralized deployments  Problems for Netflix:  Freedom and Responsibility  Vendor-specific, component- Culture level HA  Highly aligned and loosely  Standards and checklists coupled  Innovation cycles
  • 15. Common Controls to Promote Resilience  Architectural committees  Designed to control and de-  Change approval boards risk change  Centralized deployments  Focus on artifacts, test and rollback plans  Vendor-specific, component- level HA  Problems for Netflix:  Freedom and Responsibility  Standards and checklists Culture  Highly aligned and loosely coupled  Innovation cycles
  • 16. Common Controls to Promote Resilience  Architectural committees  Separate Ops team deploys at  Change approval boards a pre-ordained time (e.g. weekly, monthly)  Centralized deployments  Problems for Netflix:  Vendor-specific, component-  Freedom and Responsibility level HA Culture  Standards and checklists  Highly aligned and loosely coupled  Innovation cycles
  • 17. Common Controls to Promote Resilience  Architectural committees  High reliance on vendor  Change approval boards solutions to provide HA and resilience  Centralized deployments  Problems for Netflix:  Vendor-specific, component-  Traditional data center oriented level HA systems do not translate well  Standards and checklists to the cloud  Heavy use of open source
  • 18. Common Controls to Promote Resilience  Architectural committees  Designed for repeatable  Change approval boards execution  Centralized deployments  Problems for Netflix:  Not suitable for load-based  Vendor-specific, component- scaling and heavy automation level HA  Reliance on humans  Standards and checklists
  • 20. What does the business value?  Customer experience  Remember these guys?  Innovation and agility  In other words:  Stability and availability for customer experience  Rapid development and change to continually improve product and outpace competition  Not that different from anyone else
  • 21. Overall Approach  Understand and solve for relevant failure modes  Rely on automation and tools instead of committees for evaluating architecture and changes  Make deployment easy and standardized
  • 22. Cloud Application Failure Modes and Effects Failure Mode Probability Current Mitigation App Failure High Automated fallback response AWS Region Failure Low Wait for recovery AWS Zone Failure Medium Continue running in 2 of 3 zones Datacenter Failure Medium Continue migrating to cloud Data Store Failure Low Restore from S3 S3 Failure Low Restore from remote archive  Risk-based approach given likely failures  Tackle high-probability events first
  • 24. Goals of Simian Army “Each system has to be able to succeed, no matter what, even all on its own. We‟re designing each distributed system to expect and tolerate failure from other systems on which it depends.” http://techblog.netflix.com/2010/12/5-lessons-weve-learned-using-aws.html
  • 25.
  • 26. Chaos Monkey  “By frequently causing failures, we force our services to be built in a way that is more resilient.”  Terminates cluster nodes during business hours  Rejects “If it ain‟t broke, don‟t fix it”  Goals:  Simulate random hardware failures, human error at small scale  Identify weaknesses  No service impact
  • 27. Chaos Gorilla  Chaos Monkey‟s bigger brother  Standard deployment pattern is to distribute load/systems/data across three data centers (AZs)  What happens if one is lost?  Goals:  Simulate data center loss, hardware/service failures at larger scale  Identify weaknesses, dependencies, etc.  Minimal service impact
  • 28. Latency Monkey  Distributed systems have many upstream/downstream connections  How fault-tolerant are systems to dependency failure/slowdown?  Goals:  Simulate latencies and error codes, see how a service responds  Survivable services regardless of dependencies
  • 29. Conformity Monkey  Without architecture review, how do you ensure designs leverage known successful patterns?  Conformity Monkey provides automated analysis for pattern adherence  Goals:  Evaluate deployment modes (data center distribution)  Evaluate health checks, discoverability, versions of key libraries  Help ensure service has best chance of successful operation
  • 30. Non-Simian Approaches  Org model  Engineers write, deploy, support code  Culture  De-centralized with as few processes and rules as possible  Lots of local autonomy  “If you‟re not failing, you‟re not trying hard enough”  Peer pressure  Productive and transparent incident reviews
  • 32. Lots of Good Advice  BSIMM  Microsoft SDL  SAFECode
  • 33. But, what works? Forrester Consulting, 12/10
  • 34. Especially, given phenomena such as DevOps, cloud, agile, and the unique characteristics of an organization?
  • 35. Deploying Code at Netflix
  • 36. A common graph @ Netflix Weekend afternoon ramp-up Lots of watching in prime time Not as much in early morning Old way - pay and provision for peak, 24/7/365 Multiply this pattern across the dozens of apps that comprise the Netflix streaming service
  • 38. Autoscaling  Goals:  # of systems matches load requirements  Load per server is constant  Happens without intervention (the „auto‟ in autoscaling)  Results:  Clusters continuously add & remove nodes  New nodes must mirror existing
  • 39. Every change requires a new cluster push (not an incremental change to existing systems)
  • 40. Deploying code must be easy (it is)
  • 41. Netflix Deployment Pipeline RPM with app-specific VM template bits ready to launch YUM AMI Perforce/Git Bakery ASG Code change Base image + Cluster config Config change RPM Running systems
  • 42. Operational Impact  No changes to running systems  No systems mgmt infrastructure (Puppet, Chef, etc.)  Fewer logins to prod  No snowflakes  Trivial “rollback”
  • 43. Security Impact  Need to think differently on:  Vulnerability management  Patch management  User activity monitoring  File integrity monitoring  Forensic investigations
  • 44. Architecture, organization, deployment are all different. What about security?
  • 45. We‟ve adapted too. Some principles we‟ve found useful.
  • 46. Cloud Application Security: What We Emphasize
  • 47. Points of Emphasis  Integrate  Two contexts: 1. Integration with your  Make the right way easy engineering ecosystem  Self-service, with 2. Integration of your security exceptions controls  Organization  Trust, but verify  SCM, build and release  Monitoring and alerting 47
  • 48. Integration: Base AMI Testing  Base AMI – VM/instance template used for all cloud systems  Average instance age = ~24 days (one-time sample)  The base AMI is managed like other packages, via P4, Jenkins, etc.  We watch the SCM directory & kick off testing when it changes  Launch an instance of the AMI, perform vuln scan and other checks SCAN COMPLETED ALERT Site name: AMI1 Stopped by: N/A Total Scan Time: 4 minutes 46 seconds Critical Vulnerabilities: 5 Severe Vulnerabilities: 4 Moderate Vulnerabilities: 4
  • 49. Integration: Control Packaging and Installation  From the RPM spec file of a webserver: Requires: ossec cloudpassage nflx-base-harden hyperguard-enforcer  Pulls in the following RPMs:  HIDS agent  Config assessment/firewall agent  Host hardening package  WAF
  • 50. Integration: Timeline (Chronos)  What IP addresses have been blacklisted by the WAF in the last few weeks?  GET /api/v1/event?timelines=type:blacklist&start=20130125000000000  Which security groups have changed today?  GET /api/v1/event?timelines=type:securitygroup&start=20130206000000000
  • 51. Points of Emphasis  Integrate  Developers are lazy  Make the right way easy  Self-service, with exceptions  Trust, but verify
  • 52. Making it Easy: Cryptex  Crypto: DDIY (“Don‟t Do It Yourself”)  Many uses of crypto in web/distributed systems:  Encrypt/decrypt (cookies, data, etc.)  Sign/verify (URLs, data, etc.)  Netflix also uses heavily for device activation, DRM playback, etc.
  • 53. Making it Easy: Cryptex  Multi-layer crypto system (HSM basis, scale out layer)  Easy to use  Key management handled transparently  Access control and auditable operations
  • 54. Making it Easy: Cloud-Based SSO  In the AWS cloud, access to data center services is problematic  Examples: AD, LDAP, DNS  But, many cloud-based systems require authN, authZ  Examples: Dashboards, admin UIs  Asking developers to securely handle/accept credentials is also problematic
  • 55. Making it Easy: Cloud-Based SSO  Solution: Leverage OneLogin SaaS SSO (SAML) used by IT for enterprise apps (e.g. Workday, Google Apps)  Uses Active Directory credentials  Provides a single & centralized login page  Developers don‟t accept username & password directly  Built filter for our base server to make SSO/authN trivial
  • 56. Points of Emphasis  Integrate  Self-service is perhaps the most transformative cloud  Make the right way easy characteristic  Self-service, with  Failing to adopt this for security exceptions controls will lead to friction  Trust, but verify
  • 57. Self-Service: Security Groups  Asgard cloud orchestration tool allows developers to configure their own firewall rules  Limited to same AWS account, no IP-based rules
  • 58. Points of Emphasis  Integrate  Culture precludes traditional “command and control”  Make the right way easy approach  Self-service, with  Organizational desire for agile, exceptions DevOps, CI/CD blur traditional security engagement  Trust, but verify touchpoints
  • 59. Trust but Verify: Security Monkey  Cloud APIs make verification  Includes: and analysis of configuration  Certificate checking and running state simpler  Firewall analysis  Security Monkey created as  IAM entity analysis the framework for this analysis  Limit warnings  Resource policy analysis
  • 60. Trust but Verify: Security Monkey From: Security Monkey Date: Wed, 24 Oct 2012 17:08:18 +0000 To: Security Alerts Subject: prod Changes Detected Table of Contents: Security Groups Changed Security Group <sgname> (eu-west-1 / prod) <#Security Group/<sgname> (eu-west-1 / prod)>
  • 61. Trust but Verify: Exploit Monkey  AWS Autoscaling group is unit of deployment, so changes signal a good time to rerun dynamic scans On 10/23/12 12:35 PM, Exploit Monkey wrote: I noticed that testapp-live has changed current ASG name from testapp- live-v001 to testapp-live-v002. I'm starting a vulnerability scan against test app from these private/public IPs: 10.29.24.174
  • 62. Takeaways  Netflix runs a large, dynamic service in AWS  Newer concepts like cloud & DevOps need an updated approach to resilience and security  Specific context can help jumpstart a pragmatic and effective security program
  • 63. Netflix References  http://netflix.github.com  http://techblog.netflix.com  http://slideshare.net/netflix
  • 64. Other References  http://www.webpronews.com/netflix-outage-angers-customers-2008- 08  http://www.pcmag.com/article2/0,2817,2395372,00.asp  http://www.readwriteweb.com/archives/etech_amazon_cto_aws.php  http://bsimm.com/online/  http://www.microsoft.com/en- us/download/confirmation.aspx?id=29884  http://www.slideshare.net/reed2001/culture-1798664  http://techcrunch.com/2013/01/31/read-what-facebooks-sandberg- calls-maybe-the-most-important-document-ever-to-come-out-of-the- valley/  http://www.gauntlt.org
  • 65. Questions? chan@netflix.com