This tutorial follows the same basic flow as the keynote but goes into much more detail, and it was run as an interactive discussion rather than a presentation. See part 2 for more specific detail and links to other presentations.
1. Introduction:
Building Using The NetflixOSS
Architecture
May 2013
Adrian Cockcroft
@adrianco #netflixcloud @NetflixOSS
http://www.linkedin.com/in/adriancockcroft
2. Presentation vs. Tutorial
• Presentation
– Short duration, focused subject
– One presenter to many anonymous audience
– A few questions at the end
• Tutorial
– Time to explore in and around the subject
– Tutor gets to know the audience
– Discussion, rat-holes, “bring out your dead”
3. Introduction – Who are you?
Netflix Open Source Cloud Prize
Cloud Native – More details
NetflixOSS – Cloud Native On-Ramp
4. Adrian Cockcroft
• Director, Architecture for Cloud Systems, Netflix Inc.
– Previously Director for Personalization Platform
• Distinguished Availability Engineer, eBay Inc. 2004-7
– Founding member of eBay Research Labs
• Distinguished Engineer, Sun Microsystems Inc. 1988-2004
– 2003-4 Chief Architect High Performance Technical Computing
– 2001 Author: Capacity Planning for Web Services
– 1999 Author: Resource Management
– 1995 & 1998 Author: Sun Performance and Tuning
– 1996 Japanese Edition of Sun Performance and Tuning
• SPARC & Solarisパフォーマンスチューニング (サンソフトプレスシリーズ)
• More
– Twitter @adrianco – Blog http://perfcap.blogspot.com
– Presentations at http://www.slideshare.net/adrianco
5. Attendee Introductions
• Who are you, where do you work
• Why are you here today, what do you need
• “Bring out your dead”
– Do you have a specific problem or question?
– One sentence elevator pitch
• What instrument do you play?
13. Judges
• Aino Corry – Program Chair for QCon/GOTO
• Martin Fowler – Chief Scientist Thoughtworks
• Simon Wardley – Strategist
• Yury Izrailevsky – VP Cloud Netflix
• Werner Vogels – CTO Amazon
• Joe Weinman – SVP Telx, Author “Cloudonomics”
14. What are Judges Looking For?
Eligible, Apache 2.0 licensed
NetflixOSS project pull requests
Original and useful contribution to NetflixOSS
Good code quality and structure
Documentation on how to build and run it
Code that successfully builds and passes a test suite
Evidence that code is in use by other projects, or is running in production
A large number of watchers, stars and forks on github
15. What do you win?
One winner in each of the 10 categories
Ticket and expenses to attend AWS Re:Invent 2013 in Las Vegas
A Trophy
16. How do you enter?
Get a (free) github account
Fork github.com/netflix/cloud-prize
Send us your email address
Describe and build your entry
Twitter #cloudprize
17. Entrants, Netflix Engineering, Six Judges, Winners
Nominations: Conforms to Rules, Working Code, Community Traction, Categories
Registration Opened March 13 (Github)
Apache Licensed Contributions (Github)
Close Entries September 15 (Github)
Award Ceremony Dinner, November, AWS Re:Invent
Ten Prize Categories: $10K cash, $5K AWS, AWS Re:Invent Tickets, Trophy
22. Netflix Member Web Site Home Page
Personalization Driven – How Does It Work?
23. How Netflix Streaming Works
Customer Device (PC, PS3, TV…)
Web Site or Discovery API
User Data
Personalization
Streaming API
DRM
QoS Logging
OpenConnect CDN Boxes
CDN Management and Steering
Content Encoding
Consumer Electronics
AWS Cloud Services
CDN Edge Locations
24. Real Web Server Dependencies Flow
(Netflix Home page business transaction as seen by AppDynamics)
Start Here
memcached
Cassandra
Web service
S3 bucket
Personalization movie group choosers
(for US, Canada and Latam)
Each icon is three to a few hundred instances across three AWS zones
25. New Cloud Native Patterns
Micro-services and Chaos engines
Highly available systems composed
from ephemeral components
Open Source is the default
28. Netflix vs. Amazon Prime
• Do retailers competing with Amazon use AWS?
– Yes, lots of them, Netflix is no different
• Does Prime have a platform advantage?
– No, because Netflix gets to run on AWS
• Does Netflix take Amazon Prime seriously?
– Yes, but so far Prime isn’t impacting our business
29. Amazon Video 1.31%
Streaming Bandwidth: 18x Prime (Nov 2012), 25x Prime (March 2013)
Mean Bandwidth +39% 6mo
30. The Google Cloud Question
Why doesn’t Netflix use Google Cloud as well as AWS?
31. Google Cloud – Wait and See
Pros
• Cloud Native
• Huge scale for internal apps
• Exposing internal services
• Nice clean API model
• Starting a price war
• Fast for what it does
• Rapid start & minute billing
Cons
• In beta until last week
• No big customers yet
• Missing many key features
• Different arch model
• Missing billing options
• No SSD or huge instances
• Zone maintenance windows
But: Anyone interested is welcome to port NetflixOSS components to Google Cloud
32. Cloud Wars: Price and Performance
AWS vs. GCS War
What Changed: Everyone using AWS or GCS gets the price cuts and performance improvements, as they happen. No need to switch vendor.
Private Cloud
No Change: Locked in for three years.
34. Fitting Into Public Scale
Public, Grey Area, Private
1,000 Instances, 100,000 Instances
Startups, Netflix, Facebook
35. How big is Public?
AWS upper bound estimate based on the number of public IP Addresses
Every provisioned instance gets a public IP by default
AWS Maximum Possible Instance Count 3.7 Million
Growth >10x in Three Years, >2x Per Annum
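As a quick check on the two growth figures above: a tenfold increase over three years implies a compound annual factor of 10^(1/3), roughly 2.15, which is consistent with the ">2x Per Annum" claim. A minimal sketch of that arithmetic follows; the function name is illustrative only.

```python
def annual_growth_factor(total_factor: float, years: float) -> float:
    """Compound annual growth factor implied by a total growth factor over `years`."""
    return total_factor ** (1.0 / years)


# 10x over three years works out to a little over 2x per year.
factor = annual_growth_factor(10.0, 3.0)
print(round(factor, 2))  # 2.15
```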
43. Netflix Outages
• Running very fast with scissors
– Mostly self inflicted – bugs, mistakes from pace of change
– Some caused by AWS bugs and mistakes
• Incident Life-cycle Management by Platform Team
– No runbooks, no operational changes by the SREs
– Tools to identify what broke and call the right developer
• Next step is multi-region
– Investigating and building in stages during 2013
– Could have prevented some of our 2012 outages
44. Managing Multi-Region Availability
Region (shown twice): Regional Load Balancers; Cassandra Replicas in Zone A, Zone B, Zone C
DNS providers: UltraDNS, DynECT DNS, AWS Route53
Denominator – manage traffic via multiple DNS providers
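The idea of managing traffic through several DNS providers at once can be sketched as a thin common interface with one adapter per vendor. The sketch below is an assumption-laden illustration, not Denominator's actual API: `DnsProvider`, `InMemoryProvider`, `steer_traffic`, the vendor labels and the `example.com` hostnames are all hypothetical, and the adapters store records in memory instead of calling a real DNS service.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional


class DnsProvider(ABC):
    """Hypothetical common interface over DNS vendors (not Denominator's API)."""

    @abstractmethod
    def set_cname(self, name: str, target: str) -> None:
        """Point `name` at `target`."""

    @abstractmethod
    def get_cname(self, name: str) -> Optional[str]:
        """Return the current target for `name`, if any."""


class InMemoryProvider(DnsProvider):
    """Stand-in for a real vendor client; keeps records in a dict."""

    def __init__(self, label: str) -> None:
        self.label = label
        self.records: Dict[str, str] = {}

    def set_cname(self, name: str, target: str) -> None:
        self.records[name] = target

    def get_cname(self, name: str) -> Optional[str]:
        return self.records.get(name)


def steer_traffic(providers: List[DnsProvider], name: str, target: str) -> None:
    """Apply the same record change to every provider so they stay in agreement."""
    for provider in providers:
        provider.set_cname(name, target)


providers = [InMemoryProvider("vendor-a"), InMemoryProvider("vendor-b"), InMemoryProvider("vendor-c")]
steer_traffic(providers, "api.example.com", "lb.region-1.example.com")
# Simulate moving traffic away from a failed region.
steer_traffic(providers, "api.example.com", "lb.region-2.example.com")
print([p.get_cname("api.example.com") for p in providers])
```

Keeping the vendor-specific code behind one interface is what lets a single failover action be applied consistently, whichever provider a resolver happens to use.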
45. Cloud Native Big Data
Size the cluster to the data
Size the cluster to the questions
Never wait for space or answers
46. Netflix Dataoven
Data Warehouse: Over 2 Petabytes
Data Pipelines: Ursula, Aegisthus
From cloud Services: ~100 Billion Events/day
From C*: Terabytes of Dimension data
Hadoop Clusters – AWS EMR: 1300 nodes, 800 nodes, Multiple 150 nodes Nightly
RDS Metadata
Gateways
Tools
47. Cloud Native Patterns
Master copies of data are cloud resident
Dynamically provisioned micro-services
Services are distributed and ephemeral
48. Cloud Native Architecture
Clients, Things
Autoscaled Micro Services (shown twice): JVMs, Memcached
Distributed Quorum NoSQL Datastores: Cassandra in Zone A, Zone B, Zone C
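One way to see why a quorum-based datastore spread over three zones tolerates the loss of a zone is to do the majority arithmetic. The sketch below assumes one replica per zone and majority-quorum reads and writes; the slide does not state the replication factor, so those numbers and the function names are illustrative assumptions.

```python
from typing import Dict


def quorum_size(replica_count: int) -> int:
    """Smallest majority of `replica_count` replicas."""
    return replica_count // 2 + 1


def survives_zone_loss(replicas_per_zone: Dict[str, int], lost_zone: str) -> bool:
    """True if a majority quorum is still reachable after losing one zone."""
    total = sum(replicas_per_zone.values())
    remaining = total - replicas_per_zone[lost_zone]
    return remaining >= quorum_size(total)


three_zones = {"A": 1, "B": 1, "C": 1}
two_zones = {"A": 1, "B": 1}
print(survives_zone_loss(three_zones, "B"))  # True: 2 of 3 replicas remain, quorum is 2
print(survives_zone_loss(two_zones, "B"))    # False: 1 of 2 replicas remain, quorum is 2
```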
50. How to get to Cloud Native?
Freedom and Responsibility for Developers
Decentralize and Automate Ops Activities
Integrate DevOps into the Business Organization
51. Four Transitions
• Management: Integrated Roles in a Single Organization
– Business, Development, Operations -> BusDevOps
• Developers: Denormalized Data – NoSQL
– Decentralized, scalable, available, polyglot
• Responsibility from Ops to Dev: Continuous Delivery
– Decentralized small daily production updates
• Responsibility from Ops to Dev: Agile Infrastructure - Cloud
– Hardware in minutes, provisioned directly by developers
52. Netflix BusDevOps Organization
Chief Product Officer
• VP Product Management – Directors Product
• VP UI Engineering – Directors Development – Developers + DevOps – UI Data Sources – AWS
• VP Discovery Engineering – Directors Development – Developers + DevOps – Discovery Data Sources – AWS
• VP Platform – Directors Platform – Developers + DevOps – Platform Data Sources – AWS
Code, independently updated continuous delivery
Denormalized, independently updated and scaled data
Cloud, independently updated and scaled infrastructure
55. Ephemeral Instances
• Largest services are autoscaled
• Average lifetime of an instance is 36 hours
Push
Autoscale Up
Autoscale Down
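The slide does not say how the scale-up and scale-down decisions are made. As a rough illustration only, a threshold-based rule of the kind commonly used with autoscaling groups could look like the sketch below; the metric (average CPU), the thresholds, the step size and the group limits are all assumptions, not Netflix's settings.

```python
def desired_capacity(
    current: int,
    avg_cpu_percent: float,
    scale_up_at: float = 60.0,
    scale_down_at: float = 30.0,
    step: int = 2,
    min_size: int = 3,
    max_size: int = 100,
) -> int:
    """Return the new instance count for a group under a simple threshold rule."""
    if avg_cpu_percent > scale_up_at:
        target = current + step
    elif avg_cpu_percent < scale_down_at:
        target = current - step
    else:
        target = current
    # Never leave the configured bounds of the group.
    return max(min_size, min(max_size, target))


print(desired_capacity(10, 75.0))  # 12: busy, add instances
print(desired_capacity(10, 20.0))  # 8: idle, remove instances
print(desired_capacity(3, 10.0))   # 3: already at the minimum size
```

Because instances are added and removed continuously under a rule like this, any single instance tends to be short-lived, which fits the ephemeral-instance point on the slide.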
56. A Cloud Native Open Source Platform
See netflix.github.com
57. Three Questions
Why is Netflix doing this?
How does it all fit together?
What is coming next?
58. Beware of Geeks Bearing Gifts: Strategies for an Increasingly Open Economy
Simon Wardley - Researcher at the Leading Edge Forum
59. How did Netflix get ahead?
Netflix BusDevOps Org
• Doing it since 2009
• SaaS Applications
• PaaS for agility
• Public IaaS for AWS features
• Big data in the cloud
• Integrating many APIs
• FOSS from github
• Renting hardware for 1hr
• Coding in Java/Groovy/Scala
Traditional IT Operations
• Taking their time
• Pilot private cloud projects
• Beta quality installations
• Small scale
• Integrating several vendors
• Paying big $ for software
• Paying big $ for consulting
• Buying hardware for 3yrs
• Hacking at scripts
60. Netflix Platform Evolution
Bleeding Edge
Innovation
Common
Pattern
Shared
Pattern
2009-2010 2011-2012 2013-2014
Netflix ended up several years ahead of the
industry, but it’s becoming commoditized now
61. Making it easy to follow
Exploring the wild west each time vs. laying down a shared route
62. Goals
• Establish our solutions as Best Practices / Standards
• Hire, Retain and Engage Top Engineers
• Build up Netflix Technology Brand
• Benefit from a shared ecosystem
66. NetflixOSS Services Scope
• AWS Account
– Asgard Console
– Archaius Config Service
– Cross region Priam C*
– Pytheas Dashboards
– Atlas Monitoring
– Genie, Lipstick Hadoop Services
– AWS Usage Cost Monitoring
• Multiple AWS Regions
– Eureka Registry
– Exhibitor ZK
– Edda History
– Simian Army
• 3 AWS Zones
– Application Clusters, Autoscale Groups, Instances
– Priam Cassandra Persistent Storage
– Evcache Memcached Ephemeral Storage
67. NetflixOSS Instance Libraries
• Initialization
– Baked AMI – Tomcat, Apache, your code
– Governator – Guice based dependency injection
– Archaius – dynamic configuration properties client
– Eureka – service registration client
• Service Requests
– Karyon – Base Server for inbound requests
– RxJava – Reactive pattern
– Hystrix/Turbine – dependencies and real-time status
– Ribbon – REST Client for outbound calls
• Data Access
– Astyanax – Cassandra client and pattern library
– Evcache – Zone aware Memcached client
– Curator – Zookeeper patterns
– Denominator – DNS routing abstraction
• Logging
– Blitz4j – non-blocking logging
– Servo – metrics export for autoscaling
– Atlas – high volume instrumentation
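Hystrix is known for the circuit breaker pattern it applies to calls between services. The sketch below illustrates that pattern in a simplified form and is not Hystrix's API: the class, its parameters and the fallback value are hypothetical, and a production breaker would also retry the dependency after a cooling-off period, which is omitted here.

```python
from typing import Any, Callable


class CircuitBreaker:
    """Stop calling a failing dependency after repeated errors and serve a fallback."""

    def __init__(self, call: Callable[..., Any], fallback: Callable[..., Any], max_failures: int = 3) -> None:
        self.call = call
        self.fallback = fallback
        self.max_failures = max_failures
        self.failures = 0  # consecutive failures seen so far

    @property
    def is_open(self) -> bool:
        return self.failures >= self.max_failures

    def execute(self, *args: Any) -> Any:
        if self.is_open:
            # Short-circuit: do not touch the dependency at all.
            return self.fallback(*args)
        try:
            result = self.call(*args)
        except Exception:
            self.failures += 1
            return self.fallback(*args)
        self.failures = 0  # a success resets the count
        return result


attempts = []


def broken_dependency(user_id: int) -> str:
    attempts.append(user_id)
    raise RuntimeError("dependency is down")


breaker = CircuitBreaker(broken_dependency, lambda user_id: "default recommendations", max_failures=2)
results = [breaker.execute(uid) for uid in (1, 2, 3)]
print(results, attempts)  # the third call never reaches the dependency
```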
68. NetflixOSS Testing and Automation
• Test Tools
– CassJmeter – Load testing for Cassandra
– Circus Monkey – Test account reservation rebalancing
• Maintenance
– Janitor Monkey – Cleans up unused resources
– Efficiency Monkey
– Doctor Monkey
– Howler Monkey – Complains about AWS limits
• Availability
– Chaos Monkey – Kills Instances
– Chaos Gorilla – Kills Availability Zones
– Chaos Kong – Kills Regions
– Latency Monkey – Latency and error injection
• Security
– Security Monkey – security group and S3 bucket permissions
– Conformity Monkey – architectural pattern warnings
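The "kills instances" idea behind Chaos Monkey can be sketched as picking a random instance from each group and handing it to a termination callback. This is a toy illustration under stated assumptions, not the Simian Army code: the group names, instance IDs and the `terminate` callback are hypothetical, and nothing here talks to AWS.

```python
import random
from typing import Callable, Dict, List


def pick_victims(groups: Dict[str, List[str]], rng: random.Random) -> Dict[str, str]:
    """Choose one random instance from every non-empty group."""
    return {group: rng.choice(instances) for group, instances in groups.items() if instances}


def unleash(groups: Dict[str, List[str]], terminate: Callable[[str], None], rng: random.Random) -> Dict[str, str]:
    """Terminate the chosen instances through the supplied callback."""
    victims = pick_victims(groups, rng)
    for instance_id in victims.values():
        terminate(instance_id)
    return victims


groups = {
    "api": ["i-api-1", "i-api-2", "i-api-3"],
    "web": ["i-web-1", "i-web-2"],
    "empty": [],
}
terminated: List[str] = []
victims = unleash(groups, terminated.append, random.Random(42))
print(victims)
```

Injecting the random source and the termination callback keeps the selection logic testable without touching real infrastructure.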
69. What’s Coming Next?
More Use Cases
More Features
• Better portability
• Higher availability
• Easier to deploy
• Contributions from end users
• Contributions from vendors
70. Vendor Driven Portability
Interest in using NetflixOSS for Enterprise Private Clouds
“It’s done when it runs Asgard”
Functionally complete
Demonstrated March
Release 3.3 in 2Q13
Some vendor interest
Needs AWS compatible Autoscaler
Some vendor interest
Many missing features
“Confused” AWS API strategy
72. Functionality and scale now, portability coming
Moving from parts to a platform in 2013
Netflix is fostering a cloud native ecosystem
Rapid Evolution - Low MTBIAMSH
(Mean Time Between Idea And Making Stuff Happen)
Hive – thin metadata layer on top of S3
• Used for ad-hoc analytics (Ursula for merge ETL)
• HiveQL gets compiled into set of MR jobs (1 -> many)
• Is a CLI – runs on the gateways, not like a relational DB server, or a service that the query gets shipped to
Pig – used for ETL (can create DAGs, workflows for Hadoop processes)
• Pig scripts also get compiled into MR jobs
Java – straight up Hadoop, not for the faint of heart. Some recommendation algorithms are in Hadoop.
Python/Java – UDFs
Applications such as Sting use the tools on some gateway to access all the various components
Next – focus on two key components: Data & Clusters
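To make the "HiveQL gets compiled into MR jobs" note concrete, here is a toy, in-memory version of the map, shuffle and reduce phases for a query roughly like `SELECT title, COUNT(*) FROM plays GROUP BY title`. The table, column names and sample rows are hypothetical, and real Hive query plans are considerably more involved than this single job.

```python
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple


def map_phase(rows: Iterable[Dict[str, str]]) -> Iterator[Tuple[str, int]]:
    """Emit a (key, 1) pair for every input row."""
    for row in rows:
        yield row["title"], 1


def shuffle(pairs: Iterable[Tuple[str, int]]) -> Dict[str, List[int]]:
    """Group the emitted values by key, as the framework does between phases."""
    grouped: Dict[str, List[int]] = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped: Dict[str, List[int]]) -> Dict[str, int]:
    """Sum the values for each key."""
    return {key: sum(values) for key, values in grouped.items()}


plays = [
    {"title": "Title A", "device": "PS3"},
    {"title": "Title B", "device": "TV"},
    {"title": "Title A", "device": "PC"},
]
counts = reduce_phase(shuffle(map_phase(plays)))
print(counts)  # {'Title A': 2, 'Title B': 1}
```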
When Netflix first moved to the cloud it was bleeding edge innovation: we figured stuff out and made stuff up from first principles. Over the last two years more large companies have moved to the cloud, and the principles, practices and patterns have become better understood and adopted. At this point there is intense interest in how Netflix runs in the cloud, and several forward-looking organizations are adopting our architectures and starting to use some of the code we have shared. Over the coming years, we want to make it easier for people to share the patterns we use.
The railroad made it possible for California to be developed quickly. By creating an easy-to-follow path, we can create a much bigger ecosystem around the Netflix platform.