Christian's part of the AWS re:Invent 2015 talk shared with Sajee Mathew - ARC304 - Designing for SaaS: Next Generation Software Delivery Models on AWS. Full video of the 60 minute presentation: https://www.youtube.com/watch?v=d16aUztH9hk&list=PLhr1KZpdzukdRxs_pGJm-qSy5LayL6W_Y
2. $ whoami
Co-Founder & CTO, Sumo Logic
Cloud-based Machine Data Analytics Service
Applications, Operations, Security
Chief Architect, ArcSight
Major SIEM player in the enterprise space
Log Management for security and compliance
3. From Data to Decisions
DEVOPS
Streamline continuous delivery
Monitor KPIs and metrics
Accelerate troubleshooting
IT INFRASTRUCTURE AND OPERATIONS
Monitor all workloads
Troubleshoot and increase uptime
Simplify, modernize, and save costs
COMPLIANCE AND SECURITY
Automate and demonstrate compliance
Audit all systems
Think beyond rules
Cloud Analytics Platform
4. Cloud Analytics Platform
From Data to Decisions
[Diagram labels: DEVOPS; IT INFRASTRUCTURE AND OPERATIONS; COMPLIANCE AND SECURITY; one COLLECTOR each in Customer A Cloud, Customer A Data Center, Customer B Data Center, and Customer B Cloud]
7. Why SaaS?
Because enterprise software sucks™
Too much pain for the customer
Time spent running the system is not spent using the system
Expensive once you are done adding hardware and people
Disastrous for the vendor
No control over the runtime, hard to diagnose problems
Kills innovation because each release lives forever
8. Why AWS?
We are developers, not data center people
AWS has turned the data center into an API
As developers, we understand reuse (libraries, OSs, …)
Today’s systems require reuse on a higher level
Do you really want to care for 4,000 machines? HA? DR?
9. Anti-monolithic
In previous gigs, we dealt with monolithic systems
With Sumo, we knew what we needed to build, no MVP required
Get data into the system, index it, provide query function
So we had a logical breakdown immediately
And we knew it had to scale…
…not just to the biggest customer, but to all customers!
12. Scale Today
50 TB of new incoming data per day
Double-digit PB of data under management
>2,000,000 queries/day
Thousands of instances in 4 regions globally
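To put the daily totals into per-second terms, here is a small back-of-the-envelope calculation. It only averages the numbers from the slide over a day; decimal units (1 TB = 10^12 bytes) are an assumption, and real traffic is of course not evenly spread.

```python
# Back-of-the-envelope averages derived from the slide's daily totals.
# Decimal units (1 TB = 10**12 bytes) are an assumption; the slide does not say.
SECONDS_PER_DAY = 24 * 60 * 60


def per_second(daily_total: float) -> float:
    """Average rate per second for a quantity given as a daily total."""
    return daily_total / SECONDS_PER_DAY


ingest_bytes_per_s = per_second(50 * 10**12)  # 50 TB of new data per day
queries_per_s = per_second(2_000_000)         # lower bound, the slide says ">2,000,000"

print(f"average ingest: {ingest_bytes_per_s / 10**6:.0f} MB/s")
print(f"average query rate: {queries_per_s:.1f} queries/s")
```

Under these assumptions the averages come out at roughly 579 MB/s of ingest and a little over 23 queries per second, sustained around the clock.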
19. What We Actually Did
Compose applications from layers of modules
Whole system is Scala on top of the JVM
One Maven POM per module, one main() per application
Initially one GitHub repository per module, today just one project
Right-sized AWS instance for each application cluster
Each application exposes a façade
Avro over HTTP, or Avro over HornetQ, or Avro over Kafka
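A rough sketch of the façade idea: callers talk to a typed interface and never see the wire format or the transport. Everything here is hypothetical (`SearchFacade`, `count`, `in_process_transport`), JSON stands in for Avro to keep the example free of dependencies, and Python is used for brevity even though the real system is Scala.

```python
import json
from typing import Callable

# A transport takes request bytes and returns response bytes.
Transport = Callable[[bytes], bytes]


class SearchFacade:
    """Hypothetical client-side facade; callers never see the wire format."""

    def __init__(self, transport: Transport) -> None:
        self._transport = transport

    def count(self, query: str) -> int:
        # JSON is a stand-in for Avro serialization in this sketch.
        request = json.dumps({"op": "count", "query": query}).encode("utf-8")
        response = self._transport(request)
        return json.loads(response.decode("utf-8"))["count"]


def in_process_transport(request: bytes) -> bytes:
    """Stand-in for HTTP, HornetQ, or Kafka: answers in-process with a fake count."""
    message = json.loads(request.decode("utf-8"))
    fake_count = len(message["query"].split())  # made-up result for the demo
    return json.dumps({"count": fake_count}).encode("utf-8")


facade = SearchFacade(in_process_transport)
result = facade.count("error timeout")
print(result)
```

Swapping the transport function is the only change needed to move the same façade from one carrier to another, which is the point of keeping serialization and transport behind it.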
22. Deployment wide services
Ingest, Search, Internal tools
[Diagram box labels: receiver hornetq-forge forge cqsplitter search cloud collector service api concierge stream katta glass, ganglia bill mix meta config zookeeper appvault org raw hornetq-inbound cocoa bloom filter analyticscsi cqmerger rework view autoview depman hornetq-internal hornetq-metadata nrt]
2 to the power of 5 services (“32”), 170+ modules
Don’t even ask about the # of dependencies
At least 3 of each – everything is a separately scalable cluster
23. Service Discovery
Loose coupling in the large…
A deployment is made up of many things
Some of these things need to talk to each other
Some of these things come and go
Don’t pass in a huge list of static dependencies
Start each application with one parameter
$ bin/receiver prod.service-registry.sumologic.com
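A minimal sketch of what the registry side of this could look like. This is an in-memory stand-in, not Sumo Logic's actual service registry; the class, its methods, and the addresses are all hypothetical. The point is only that instances register and deregister as they come and go, and clients look up peers by service name instead of receiving a static list.

```python
from dataclasses import dataclass, field


@dataclass
class ServiceRegistry:
    """In-memory stand-in for a service registry (hypothetical API)."""

    _instances: dict[str, set[str]] = field(default_factory=dict)

    def register(self, service: str, address: str) -> None:
        """Announce that an instance of `service` is reachable at `address`."""
        self._instances.setdefault(service, set()).add(address)

    def deregister(self, service: str, address: str) -> None:
        """Remove an instance that has gone away; unknown entries are ignored."""
        self._instances.get(service, set()).discard(address)

    def lookup(self, service: str) -> list[str]:
        """Return the currently known addresses for `service`, sorted."""
        return sorted(self._instances.get(service, set()))


registry = ServiceRegistry()
registry.register("search", "10.0.0.11:8080")
registry.register("search", "10.0.0.12:8080")
registry.deregister("search", "10.0.0.11:8080")  # this instance went away
current = registry.lookup("search")
print(current)
```

In a real deployment the registry would be a network service reachable at the single address passed on the command line, and each application would resolve everything else through it.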
24. Anti-singletenant
Multi-dimensional scaling is predicated on multitenancy
This is a data processing platform – cost matters!
Autoscaling single tenants is too fine-grained for us
Also, efficiency… one code line “master” in deployment
Customers aren’t pets, they are cattle
Yum yum yum…
FEATURE FLAGS!!!
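A small sketch of how per-tenant feature flags let a single deployed code line behave differently for different customers. The class, the flag name, and the tenant names are made up for illustration; the slides do not describe the actual mechanism.

```python
from dataclasses import dataclass, field


@dataclass
class FeatureFlags:
    """Hypothetical flag store: global defaults plus per-tenant overrides."""

    defaults: dict[str, bool] = field(default_factory=dict)
    overrides: dict[tuple[str, str], bool] = field(default_factory=dict)

    def set_for(self, tenant: str, flag: str, enabled: bool) -> None:
        """Override a flag for one tenant without touching anyone else."""
        self.overrides[(tenant, flag)] = enabled

    def is_enabled(self, tenant: str, flag: str) -> bool:
        """Tenant override wins, then the global default, then off."""
        return self.overrides.get((tenant, flag), self.defaults.get(flag, False))


flags = FeatureFlags(defaults={"new-query-planner": False})
flags.set_for("customer-a", "new-query-planner", True)

print(flags.is_enabled("customer-a", "new-query-planner"))
print(flags.is_enabled("customer-b", "new-query-planner"))
```

Every tenant runs the same build; the flag lookup at runtime decides who sees the new behavior.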
28. Just one typical Sumo Logic customer - 8x Variance!
Money flushed down the toilet
Load per tenant fluctuates wildly, but
aggregated system load just goes up slowly
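A toy calculation of why aggregation helps. The numbers are invented and idealized: eight tenants, each idling at 1 unit of load and spiking to 8 units in a different time slot, so no two spikes coincide. It compares the capacity needed when each tenant is provisioned for its own peak with the capacity needed for the pooled load.

```python
# Illustrative, made-up numbers: 8 tenants, 8 time slots.
TENANTS = 8
SLOTS = 8
BASELINE, SPIKE = 1, 8

# load[t][s]: tenant t idles at BASELINE and spikes to SPIKE in slot s == t.
load = [
    [SPIKE if slot == tenant else BASELINE for slot in range(SLOTS)]
    for tenant in range(TENANTS)
]


def peak_to_mean(series: list[int]) -> float:
    """Ratio of the highest value to the average value of a load series."""
    return max(series) / (sum(series) / len(series))


# Pooled load per slot across all tenants.
aggregate = [sum(load[t][s] for t in range(TENANTS)) for s in range(SLOTS)]

single_tenant_capacity = sum(max(series) for series in load)  # each tenant at its peak
shared_capacity = max(aggregate)                               # pooled peak

print(f"per-tenant peak/mean: {peak_to_mean(load[0]):.2f}")
print(f"aggregate peak/mean:  {peak_to_mean(aggregate):.2f}")
print(f"capacity, single-tenant: {single_tenant_capacity}, shared: {shared_capacity}")
```

In this idealized case each tenant's load swings by a factor of 8 while the pooled load is flat, so the shared system needs 15 units where per-tenant provisioning needs 64. Real tenants overlap somewhat, so the gap is smaller in practice.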
30. Anti-manual
We use Jenkins, of course
We still build system versions as cross-cuts and QA them
We are busy moving toward true continuous delivery
Application Groups for things that evolve together…
…and that can be deployed together
33. dsh: Another AWS Deployment Tool
Model-driven, describe desired state, run to make it so
High performance due to parallelization
Covers all layers of the stack – AWS, OS, Sumo Logic
Easy to use and extend, scriptable CLI
Developer-friendly, Scala-based, high-level APIs
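The "describe desired state, run to make it so" idea can be sketched as a diff between a model and reality. This is not dsh itself (which is Scala-based and covers AWS, OS, and application layers); the functions and the cluster names and counts below are hypothetical, and the sketch leaves out parallelization entirely.

```python
def plan(desired: dict[str, int], actual: dict[str, int]) -> list[tuple[str, str, int]]:
    """Return (action, cluster, count) steps that turn `actual` into `desired`."""
    steps = []
    for cluster in sorted(desired.keys() | actual.keys()):
        diff = desired.get(cluster, 0) - actual.get(cluster, 0)
        if diff > 0:
            steps.append(("launch", cluster, diff))
        elif diff < 0:
            steps.append(("terminate", cluster, -diff))
    return steps


def apply(steps: list[tuple[str, str, int]], actual: dict[str, int]) -> dict[str, int]:
    """Simulate executing the steps and return the resulting instance counts."""
    result = dict(actual)
    for action, cluster, count in steps:
        delta = count if action == "launch" else -count
        result[cluster] = result.get(cluster, 0) + delta
    # Clusters that ended up with zero instances no longer exist.
    return {cluster: n for cluster, n in result.items() if n > 0}


desired = {"receiver": 5, "search": 3}            # the model
actual = {"receiver": 3, "search": 4, "old": 1}   # what is running right now

steps = plan(desired, actual)
after = apply(steps, actual)
print(steps)
print(after)
```

Running `plan` again against the new state yields no steps, which is the property that makes this style safe to re-run.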
34. [Diagram: SaaS platform building blocks and AWS services]
Platform labels: Interaction, Application, Additional Applications, SaaS Application(s), Application Lifecycle Management, Business Services, Core Platform Services, Data Access Layer, Delivery, Authentication & Authorization, Metering, Monitoring, Ordering, Provisioning, Billing, Analytics, Resource Management
AWS services shown: EC2, Route53, S3, Glacier, CloudFront, DynamoDB, RDS, ElastiCache, RedShift, WorkSpaces, CloudWatch, CloudTrail, IAM, CodeDeploy, Beanstalk, CloudFormation, OpsWorks, SWF, EMR, Kinesis, SNS, Mobile Analytics, Cognito, Directory Service, CloudSearch, AppStream, SES, SQS, XCode, Data Pipeline
52. What Does the Future Hold?
Super happy to see Amazon EFS introduced
Borderline unnaturally excited about AWS KMS
Planning on using AWS Lambda as a “plugin system”
Implementing Mesos for new services
Very excited about Docker to enable better utilization
Our 3rd generation analytics platform helps customers gain instant insights into this growing pool of machine data within their complex environments. Proven machine learning analytics help customers gain deep visibility across DevOps, IT ops, and compliance and security environments.
For DevOps – our service empowers DevOps teams with a simple and scalable solution for monitoring KPIs and metrics across the entire stack to deliver quality software. With pattern recognition and transaction analytics, teams spend less time troubleshooting and more time developing code. Real-time dashboards allow them to quickly collaborate on root cause analysis of bugs and fix performance issues before they impact customers.
For IT Ops – Sumo Logic helps transform IT data into better customer experience and business decisions by extracting valuable information such as latencies, performance metrics, trends, and critical events tied to core systems and services. IT can monitor complex workloads and migrations for errors, warnings, performance, and availability across cloud and on-premises infrastructure stacks, and modernize their management stack with a SaaS solution designed for elastic scale with lower TCO.
Compliance and Security – our service helps organizations simplify and automate compliance and security monitoring across their entire stack with predictive analytics, pre-built searches, real-time dashboards, and pre-defined reports.
This is my personal experience from the last decade
It sucks for the customer and it sucks for the vendor
This is our experience from the ArcSight days
The system can’t just run on some gameboy sitting in the corner
Big servers are required, and in our case even a big Oracle database
We just gave the customer an “installation guide” and hoped for the best
Not having control over the execution environment puts the developer at a severe disadvantage
“Works on my machine” is the daily reality, but how do you debug something you can’t touch?
Too many degrees of freedom for the customer to make the wrong decision: OS choice vs available funds, storage setup and RAID levels, …
Everything you do becomes instant legacy
Every release you push to customers will slow down your future velocity
You will spend all your time backporting fixes to old versions
Because your big customers refuse to take the time, money and pain to upgrade
The result is that you become a maintenance organization
Why would you do this voluntarily?
http://microservices.io/articles/scalecube.html
I wish we had actually read that back then in detail
But our intuition got us pretty close
This actually worked out
We called it an “Internal SOA”
We forgot one thing though, but more about that later
This was extremely hotly debated internally
If one thing fails, the rest might continue to function
High cohesion, low coupling
With every order of magnitude of scale, something will break
You will not be able to predict what
You need to be able to just fix that, in a running system
With about 200 modules, code review is really hard when not in a single repo
We also use messaging heavily in the ingestion path
What they actually look like
So this is our version of the Microservices death star
Because scaling is hard and has latency
Why make it harder than it has to be?
Have you ever implemented a closed-loop controller?
The system itself as a whole scales slowly
Our customers behave in unforeseen ways
But they never do so at the same time
Customers are balanced within the system all the time
In the majority of cases we don’t have to spike-scale
Our system is not batch-based so latency really matters
http://s133.photobucket.com/user/Lurkerlake/media/cow3.png.html
Assume for a second that we had to provision for this customer at the peak…
Most of the time, there would be too many resources, driving up the cost of providing the service and hence the price
And it wouldn’t even be able to absorb the spike
First-level job builds “system” and deploys to NITE, which has the latest cross-cut
If that cross-cut passes, push to STAG, where QA will tear into it
Ultimately, push to LONG, where the rest of the company gets to see the latest that survives
If nothing bad happens in LONG, it goes to PROD, usually once per week
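The promotion flow through the four environments can be sketched as a simple gate per stage. The stage names come from the notes above; the `promote` function and the check callables are hypothetical stand-ins for the Jenkins jobs and QA sign-off, and the build label is made up.

```python
from typing import Callable

STAGES = ["NITE", "STAG", "LONG", "PROD"]


def promote(build: str, checks: dict[str, Callable[[str], bool]]) -> list[str]:
    """Deploy `build` stage by stage; stop after the first stage whose check fails.

    Returns the stages the build was deployed to. A stage without a check
    (PROD here) is treated as passing.
    """
    reached = []
    for stage in STAGES:
        reached.append(stage)
        passed = checks.get(stage, lambda _build: True)(build)
        if not passed:
            break
    return reached


all_green = {"NITE": lambda b: True, "STAG": lambda b: True, "LONG": lambda b: True}
qa_rejects = {"NITE": lambda b: True, "STAG": lambda b: False, "LONG": lambda b: True}

good_run = promote("build-123", all_green)
bad_run = promote("build-124", qa_rejects)
print(good_run)
print(bad_run)
```

A build that QA rejects in STAG never reaches LONG or PROD, which mirrors the "latest that survives" wording in the notes.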
*Need to make sure this list is accurate*
EFS – we thought for a long time that it would help to further decouple data from processing, but it looks too expensive right now ($0.30/GB/month)
Being able to allow customers to manage the encryption keys is a big deal for us
We managed to get PCI certified based on what we have built, but in an ideal future we would give customers control over the keys
There are points in our product where we would like customers to add functionality, such as charting and query operators. We are looking into Lambda to enable this safely
With all the microservices being their own clusters we are wasting resources that we pay for
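To give a feel for why the EFS price quoted in the notes looks expensive at this scale, here is a quick calculation. The 10 PB figure is an assumption (the smallest "double-digit PB"), decimal units are assumed, and nothing in the talk says all data would have lived on EFS.

```python
EFS_PRICE_PER_GB_MONTH = 0.30  # USD, as quoted in the notes
GB_PER_PB = 1_000_000          # decimal units, an assumption


def monthly_cost_usd(petabytes: float) -> float:
    """Storage cost per month for the given volume at the quoted EFS price."""
    return petabytes * GB_PER_PB * EFS_PRICE_PER_GB_MONTH


cost_10_pb = monthly_cost_usd(10)  # 10 PB is an assumed, illustrative volume
print(f"10 PB on EFS: ${cost_10_pb:,.0f} per month")
```

Under these assumptions, 10 PB works out to about $3 million per month for storage alone.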