Let's say you're a data scientist, and you've been asked to build infrastructure. Here I've distilled some best practices as an introduction for people who are new to DevOps.
2. Just Enough DevOps for Data Scientists
abida@salesforce.com
@anyabida1
Anya Bida, SRE at Salesforce
3. About Anya
Sr. Member of Technical Staff (SRE)
Salesforce Production Engineering
Salesforce Einstein Platform
Co-organizer SF Big Analytics
Spark Tuning
• Cheat-sheet
• Talks
Previously at Alpine Data, SRI
PhD Mayo Clinic, BS Johns Hopkins
@anyabida1
4. What I am going to talk about
What is DevOps
Salesforce Einstein Scales
Our goal
Top 10 tips
What’s next?
11. Tip 1: Plan for Failure
Take off that Data Scientist hat now.
Simple Dashboard with KPIs
• Request & error rates
• Longest response times - upper 95th & 99th percentile
• Capacity
• Events
Collect metrics from every machine.
Troubleshoot with all the metrics at your disposal.
Jos Boumans, Salesforce DMP - slides:
https://www.slideshare.net/jiboumans/how-to-measure-everything-a-million-metrics-per-second-with-minimal-developer-overhead
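A minimal sketch of how every machine might push dashboard metrics using the statsd plain-text UDP protocol (`metric:value|type`). The metric names, host, and port here are made-up examples, not our production values:

```python
import socket

def statsd_packet(metric, value, mtype="c"):
    """Format a metric in the plain-text statsd wire protocol."""
    return f"{metric}:{value}|{mtype}".encode()

def emit(metric, value, mtype="c", host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; occasional metric loss is acceptable by design."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.sendto(statsd_packet(metric, value, mtype), (host, port))
    sock.close()

# Counters feed request/error rates; timers feed the 95th/99th percentile panels.
emit("myservice.prod.us-east-1.host42.requests", 1, "c")
emit("myservice.prod.us-east-1.host42.response_ms", 87, "ms")
```

Because the transport is UDP, instrumented code never blocks or fails when the collector is down, which is what makes it cheap to run on every machine.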
15. Tip 2: Blue Green Deployments
Users are routed to either the Blue machine (old version) or the Green machine (new version); traffic flips between the two at cutover.
https://docs.mobingi.com/official/guide/bg-deploy
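The blue/green idea reduces to a router flip: both pools stay deployed, and only the pointer to the active one changes. A toy sketch (class and host names are made up):

```python
# Minimal sketch of a blue/green cutover: both pools stay deployed,
# and the router flips which one receives user traffic.
class BlueGreenRouter:
    def __init__(self, blue, green):
        self.pools = {"blue": blue, "green": green}
        self.active = "blue"          # old version serves traffic initially

    def route(self):
        return self.pools[self.active]

    def cutover(self):
        """Flip traffic; the idle pool stays warm for instant rollback."""
        self.active = "green" if self.active == "blue" else "blue"

router = BlueGreenRouter(blue=["blue-1", "blue-2"], green=["green-1", "green-2"])
router.cutover()   # new version validated -> send users to green
print(router.route())   # -> ['green-1', 'green-2']
router.cutover()   # something broke -> instant rollback to blue
```

The key property is that rollback is the same cheap operation as deploy, so a bad release costs seconds instead of a redeploy.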
16. Tip 3: Assume people make mistakes
Technical debt
• Every manual change
• Duplicate metrics
Scale down resources
• Terminate unused machines
• Janitor Monkey
• Understand the cost per job
• Jobs should not accumulate files on disk
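One way to act on "terminate unused machines" Janitor-Monkey style: flag hosts whose peak CPU never crossed a threshold over the sampling window, then review or terminate them. The hostnames and the 5% threshold are arbitrary examples:

```python
# Hypothetical sketch: flag machines whose peak CPU stayed under a
# threshold for the whole window - candidates for termination.
IDLE_CPU_PCT = 5.0

def idle_machines(cpu_samples):
    """cpu_samples: {hostname: [hourly peak CPU %, ...]} -> sorted idle hosts."""
    return sorted(h for h, samples in cpu_samples.items()
                  if samples and max(samples) < IDLE_CPU_PCT)

samples = {
    "train-01": [62.0, 80.5, 71.2],
    "train-02": [1.1, 0.4, 2.3],   # candidate for termination
    "train-03": [0.0, 0.0, 0.1],   # candidate for termination
}
print(idle_machines(samples))      # -> ['train-02', 'train-03']
```

Feeding the candidate list into a report rather than auto-terminating is the safer first step, since it also assumes people make mistakes.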
17. Tip 4: Changes should be auditable
Schaper - the tool to compare schemas
Qixiu “Q” Hu
https://www.linkedin.com/in/huqixiu/

Identical schema on both replicas:

CREATE TABLE myConferences (
    name text,
    city text,
    early_bird timeuuid,
    late_bird timeuuid,
    PRIMARY KEY ((name, city), early_bird)
) WITH CLUSTERING ORDER BY (early_bird DESC);

The same table after someone altered one replica (note the added discount_code column):

CREATE TABLE myConferences (
    name text,
    city text,
    early_bird timeuuid,
    late_bird timeuuid,
    discount_code text,
    PRIMARY KEY ((name, city), early_bird)
) WITH CLUSTERING ORDER BY (early_bird DESC);
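Schaper itself isn’t open sourced, but the core comparison can be sketched in a few lines: pull each replica’s columns as a {name: type} map and diff them. The column names below follow the example table; everything else is hypothetical:

```python
def schema_diff(replica_a, replica_b):
    """Compare {column: type} maps from two replicas; return a drift report."""
    added   = {c: t for c, t in replica_b.items() if c not in replica_a}
    removed = {c: t for c, t in replica_a.items() if c not in replica_b}
    changed = {c: (replica_a[c], replica_b[c])
               for c in replica_a.keys() & replica_b.keys()
               if replica_a[c] != replica_b[c]}
    return {"added": added, "removed": removed, "changed": changed}

east = {"name": "text", "city": "text",
        "early_bird": "timeuuid", "late_bird": "timeuuid"}
west = dict(east, discount_code="text")   # someone altered one replica

report = schema_diff(east, west)
print(report["added"])    # -> {'discount_code': 'text'}
```

A real tool would fetch these maps from each backend’s system tables (Cassandra, Elasticsearch, MongoDB, ...) and alert whenever the report is non-empty.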
20. Tip 5: Configuration management
Network Connectivity
• 20 parameters
User Access
• 50 parameters
Deploy cluster (e.g. Mesos)
• 20 non-default parameters
Deploy a microservice
• 50 parameters
Schedule a job
• 3 parameters
Sum × 3 regions × 20 metrics ≈ 6,000 parameters
21. Templates for Automation
Service discovery
Creating dashboards
• Prod, non-prod, …
Log queries
Cost analysis
Tip 6: Pick a naming convention
<service>.<environment>.<region>.<hostname>.<metric>
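A naming convention only pays off if it is enforced in code rather than by hand. A small sketch (the service, region, and host values are hypothetical) that builds and parses names of the form `<service>.<environment>.<region>.<hostname>.<metric>`:

```python
FIELDS = ("service", "environment", "region", "hostname", "metric")

def metric_name(**parts):
    """Build a dotted metric name following the convention above."""
    return ".".join(parts[f] for f in FIELDS)

def parse_metric(name):
    """Invert metric_name; the trailing metric segment may itself contain dots."""
    return dict(zip(FIELDS, name.split(".", len(FIELDS) - 1)))

name = metric_name(service="einstein", environment="prod",
                   region="us-west-2", hostname="node7",
                   metric="requests.count")
print(name)  # -> einstein.prod.us-west-2.node7.requests.count
```

With names this regular, dashboards, log queries, and cost reports can all be generated from templates instead of clicked together per environment.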
22. Tip 7: Permissions
Every user, service, & job should have specific, auditable permissions.
Cluster Manager / Scheduler / IAM
IAM Roles
• User has an IAM Role
• Job has an IAM Role
• IAM Roles determine read / write access to data
[Diagram: IAM gates data flowing in and out of the cluster, including logs]
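The role-based idea reduces to a small lookup: grants attach to roles, not individuals, so access is uniform and auditable. A toy sketch with made-up role names and buckets:

```python
# Toy sketch of role-based access: every user, service, and job gets a
# role, and roles (not individuals) carry read/write grants on data.
ROLES = {
    "training-job":    {"read": {"s3://features"}, "write": {"s3://models"}},
    "serving-service": {"read": {"s3://models"},   "write": set()},
}

def allowed(role, action, resource):
    """Return True iff the role's grants include (action, resource)."""
    return resource in ROLES.get(role, {}).get(action, set())

print(allowed("training-job", "write", "s3://models"))      # -> True
print(allowed("serving-service", "write", "s3://models"))   # -> False
# Every check is auditable: log (role, action, resource, decision).
```

In practice this table lives in the cloud provider’s IAM, but the audit question is the same: which role touched which data, and was it allowed to?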
23. Tip 8: Understand resource allocation
Understanding Memory Management in Spark For Fun And Profit - Shivnath Babu (Duke University, Unravel Data Systems) & Mayuresh Kunjir (Duke University)
[Diagram: an 8 GB container compared against node-level and cluster-level memory]
29. Getting started tips:
1. Plan for failure
2. Blue / Green Deployments
3. Assume people make mistakes
4. Changes should be auditable
5. Configuration management
6. Pick a naming convention
7. Permissions
• user, service, job
8. Understand resource allocation
9. Monitor multiple viewpoints
10. Infrastructure as Code
31. Did we just automate ourselves
out of our jobs?
Nope. Now we have time to take on new projects and grow…
32. More info:
Jos Boumans, Salesforce DMP - slides
Site Reliability Engineering: How Google Runs Production Systems (book)
James Ward, Engineering & Open Source Ambassador at Salesforce
High Performance Spark (book)
33. More info:
Real Time ML Pipelines in Multi-Tenant Environments
Director of Engineering Karl Skucha & Lead Engineer Yan Yang
Introduction to Machine Learning
Engineering & Open Source Ambassador James Ward
Fantastic ML apps and how to build them
Principal Engineer, Matthew Tovbin
Fireworks - lighting up the sky with millions of Sparks
Director of Engineering Thomas Gerber
Functional Linear Algebra in Scala
Engineer & Professor Vlad Patryshev
Panel: Functional Programming for Machine Learning
Saturday @ 2:10pm - Complex Machine Learning Pipelines Made Easy
Machine Learning Engineers Till Bergmann & Chris Rupley
What DevOps actually IS:
-- the cross section of infrastructure concerns,
-- all the things data scientists need to support themselves at scale
We need to build an infra that scales at the pace of Salesforce.
Salesforce Einstein is serving 475 million predictions per day, and growing. So how do we do this from an infra perspective?
Even if you do everything right, machines WILL fail.
Collect metrics by installing statsd on every machine.
Should I automate the file removal? Better: keep your files in a distributed, versioned storage system, and let the infra team monitor disk usage.
Lets say I have a database with one replica on the east coast, and one replica on the west coast.
My database schema, here represented as a table, is as follows.
Right now my schemas are identical across data centers.
But if someone changes the schema for one of my replicas, I want to know immediately.
So my schemas should be auditable.
Q on our SRE team built the tool schaper to compare schemas. Schaper is generic - it supports ElasticSearch, Cassandra, MongoDb, etc., and provides a report when there is a schema change. I NEED TO KNOW when my schema changes. Obviously this could be very important information. Wink, wink.
Schaper is also modular - it’s plug-n-play. So this is an example of how we ensure changes are auditable (e.g. Cassandra keyspaces, database replication).
Schaper is one example of the type of tools that could be built to audit changes. From the audit, we can automate some action, depending on the particular change or …
We haven’t open sourced this tool, yet, just an example
When to automate? Any task that’s done 10x per year should be automated.
IaC should be correct, comprehensible, and composable.
How the number of clicks can be so big: 20 clicks per cluster × 3 regions × 20 metrics.
IaC covers:
-- networking layer
-- provisioning
-- build and deploy
-- monitoring
-- manage
IAM definition: Identity and Access Management - authorization & authentication.
Ok, so I’ve got my container, which needs maybe 8 GB of RAM. Now I want to know if my container can launch on my cluster.
Say my cluster has 3 nodes with 8 GB of total RAM on each node. CAN MY 8 GB CONTAINER LAUNCH ON THIS CLUSTER?
Since 4 GB of RAM is already used on each node, the cluster-level memory available is 4 × 3 = 12 GB. If I only monitor cluster-level metrics, the cluster looks like it has room for my 8 GB container, but no single node has 8 GB free, so the launch will fail.
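Working through those numbers as code makes the trap explicit; the cluster-level check passes while the per-node check fails:

```python
# 3 nodes, 8 GB each, 4 GB already used on each node -> 4 GB free per node.
node_free_gb = [4, 4, 4]
container_gb = 8

cluster_free = sum(node_free_gb)                           # 12 GB total free
fits_on_some_node = any(f >= container_gb for f in node_free_gb)

print(cluster_free >= container_gb)   # -> True  (cluster-level check passes)
print(fits_on_some_node)              # -> False (the launch actually fails)
```

The lesson generalizes beyond memory: schedulers place containers on individual nodes, so aggregate capacity metrics alone will mislead you.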
The image above shows sample connectivity for development, staging and production environments. It helps us verify there are no unintended rules, etc.
Mention the three lone servers - should we review these? Are these supposed to be there?
This tool is not open sourced, but just an example of the internal tools we build - and you can too!
Double clicking a node shows its connectivity. This is useful for debugging issues.
We can filter by resource type, names, tags etc.
Taken together, hopefully I’ve convinced you that each piece of your infra should be deployed and managed as code.
This has been “Just enough devops for data scientists”