The pervasiveness of cloud and containers has led to systems that are much more distributed and dynamic in nature. Highly elastic microservice and serverless architectures mean containers spin up on demand and scale to zero when that demand goes away. In this world, servers are very much cattle, not pets. This shift has exposed deficiencies in some of the tools and practices we used in the world of servers-as-pets. Specifically, there are questions around how we monitor and debug these types of systems at scale. And with the rise of DevOps and the product mindset, making data-driven decisions is becoming increasingly important for agile development teams.
In this talk, we discuss a new approach to system monitoring and data collection: the observability pipeline. For organizations that are heavily siloed, this approach can help empower teams when it comes to operating their software. The observability pipeline provides a layer of abstraction that allows you to get operational data such as logs and metrics everywhere it needs to be without impacting developers and the core system. Unlocking this data can also be a huge win for the business with things like auditability, business analytics, and pricing. Lastly, it allows you to change backing data systems easily or evaluate several in parallel. With the amount of data and the number of tools modern systems demand these days, we'll see how the observability pipeline becomes just as essential to the operations of a service as the CI/CD pipeline.
38. @tyler_treat
[Diagram, built up over slides 38 to 45]
Regions (North America, Asia Pacific): BI Servers, Microservices
CDN
CI/CD: Repos, Builders, Artifacts, Deployers
Infrastructure: Load Balancers, Orchestrators, DNS, Configuration, . . .
“DevOps”
85. @tyler_treat
[Quadrant diagram, built up over slides 85 to 91; axes: Data Available, Understanding]
Known Knowns (FACTS)
• Things we are aware of and understand
• “The system has a 1GB memory limit”
Known Unknowns (HYPOTHESES)
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Unknown Knowns (ASSUMPTIONS)
• Things we understand but are not aware of
• “We implemented an orchestrator to ensure the system is always running”
Unknown Unknowns (DISCOVERIES)
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
92. @tyler_treat
[Quadrant diagram; axes: Data Available, Understanding]
Known Unknowns (HYPOTHESES): Monitoring
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Unknown Unknowns (DISCOVERIES): Observability
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
93. @tyler_treat
[Quadrant diagram; axes: Data Available, Understanding]
Known Unknowns (HYPOTHESES): Testing
• Things we are aware of but don’t understand
• “The system exceeded its memory limit and crashed, causing an outage”
Unknown Unknowns (DISCOVERIES): Exploring
• Things we are neither aware of nor understand
• “Instances churn because the orchestrator restarts the process when it approaches its memory limit, causing sporadic failures and slowdowns”
96. @tyler_treat
Observability Data
application logs
system logs
audit logs
application metrics
distributed traces
events
Some challenges…
- Locked up inside a single vendor’s solution
- Not readily available across the enterprise (or in some cases, too readily available)
- Many tools and products needed for different data and use cases
- Tool and data needs vary from team to team
- Ever-changing landscape of tools, products, and services
- Sheer volume of data can be overwhelming
114. [Diagram] Many System instances, each running its own agents and clients: Sumo Logic Collector, Universal Analytics Client, S3 Client, New Relic APM Agent, …
118. [Diagram] Many System instances, each running its own agents and clients (Sumo Logic Collector, Universal Analytics Client, S3 Client, New Relic APM Agent, …), now with Honeytail Agents added alongside them.
158. @tyler_treat
5. Data Router
We need a component to consume data from the pipeline, perform filtering, and write it to the appropriate backends.
159. @tyler_treat
The router may perform transformations and processing of data, but heavy processing (e.g. alerting or aggregations) should be the responsibility of a backend system.
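As a rough illustration of the router described above, here is a minimal Python sketch that filters events and writes each one to every matching backend. The names `Route`, `DataRouter`, and the in-memory lists standing in for backends are all hypothetical, chosen only for this example; a real router would consume from the streaming pipeline and call vendor clients instead.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

Event = Dict[str, object]  # one structured record read from the pipeline


@dataclass
class Route:
    """Pairs a filter with the backend writer that should receive matches."""
    name: str
    matches: Callable[[Event], bool]
    write: Callable[[Event], None]


@dataclass
class DataRouter:
    routes: List[Route] = field(default_factory=list)

    def handle(self, event: Event) -> List[str]:
        """Send the event to every route whose filter accepts it.

        Returns the names of the routes that received the event.
        """
        delivered = []
        for route in self.routes:
            if route.matches(event):
                route.write(event)
                delivered.append(route.name)
        return delivered


# Hypothetical in-memory backends standing in for real systems.
log_backend: List[Event] = []
metrics_backend: List[Event] = []
archive_backend: List[Event] = []

router = DataRouter(routes=[
    Route("logs", lambda e: e.get("type") == "log", log_backend.append),
    Route("metrics", lambda e: e.get("type") == "metric", metrics_backend.append),
    # The archive receives everything, e.g. for auditing.
    Route("archive", lambda e: True, archive_backend.append),
])

delivered_log = router.handle({"type": "log", "msg": "request served"})
delivered_metric = router.handle({"type": "metric", "name": "latency_ms", "value": 12})
```

Because each route is independent, adding a second backend for the same data (say, to evaluate two tools side by side) would only mean adding another `Route`; nothing in the producing system changes.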
171. @tyler_treat
Evolving to an Observability Pipeline
• Adopt structured logging
• Move log/data collection out of process
• Use a centralized logging system
• Introduce a streaming data solution
• Start adding data consumers
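The first two steps above can be sketched with Python's standard `logging` module: emit one JSON object per line, and write it to a stream that an out-of-process collector can read. This is only one possible approach; the `JsonFormatter` class, the `fields` key passed through `extra`, and the `orders` logger name are assumptions made for the example.

```python
import io
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Fields passed via `extra={"fields": {...}}` are merged in as-is.
        payload.update(getattr(record, "fields", {}))
        return json.dumps(payload, sort_keys=True)


def build_logger(stream) -> logging.Logger:
    """Create a logger that writes JSON lines to the given stream."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger("orders")  # hypothetical service name
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


# In a service the stream would be sys.stdout, so that an out-of-process
# collector can pick the lines up; a buffer is used here for inspection.
buffer = io.StringIO()
log = build_logger(buffer)
log.info("order placed", extra={"fields": {"order_id": "A-1", "total_cents": 1999}})
record = json.loads(buffer.getvalue())
```

The application only knows about its output stream. Which collector tails that stream, and where the data goes afterwards, can change without touching the service.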
182. @tyler_treat
Benefits
• The pattern can be adopted incrementally, with quick wins along the way
• Maps better to elastic and serverless architectures
• Empowers teams in siloed organizations and unlocks data for other parts of the business
• Enables teams to use the tools best suited to their needs
• Decoupling makes it easier to change tools or evaluate them side by side
• Minimizes impact on developers and the core system
184. @tyler_treat
Downsides
• Moving away from an agent-based model means we have to handle data routing ourselves
• Many of the Data Router components might need to be custom-made using various vendor SDKs or client libraries (assuming the vendors have APIs)
• This also means we might lose some of the value-add features of certain agents
• Unclear how well this maps to pull-based models (e.g. Prometheus)
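To give a sense of what a custom-made router component might involve, the sketch below wraps a vendor client behind a small common interface and adds batching, one of the things an agent would otherwise do for us. `FakeVendorClient` and its `send_batch` method are made up for illustration and do not correspond to any real SDK; the `Sink` interface is likewise an assumption.

```python
from typing import Dict, List, Protocol

Event = Dict[str, object]


class Sink(Protocol):
    """What the router needs from any backend."""
    def write(self, event: Event) -> None: ...
    def flush(self) -> None: ...


class FakeVendorClient:
    """Stand-in for a vendor SDK; `send_batch` is a made-up method."""
    def __init__(self) -> None:
        self.batches: List[List[Event]] = []

    def send_batch(self, events: List[Event]) -> None:
        self.batches.append(list(events))


class BatchingVendorSink:
    """Adapter that buffers events and forwards them in batches."""
    def __init__(self, client: FakeVendorClient, batch_size: int = 2) -> None:
        self._client = client
        self._batch_size = batch_size
        self._pending: List[Event] = []

    def write(self, event: Event) -> None:
        self._pending.append(event)
        if len(self._pending) >= self._batch_size:
            self.flush()

    def flush(self) -> None:
        if self._pending:
            self._client.send_batch(self._pending)
            self._pending = []


client = FakeVendorClient()
sink: Sink = BatchingVendorSink(client, batch_size=2)
for i in range(3):
    sink.write({"seq": i})
sink.flush()  # push the final partial batch
batch_sizes = [len(b) for b in client.batches]
```

Retries, backpressure, and vendor-specific enrichment are left out here; those are the kinds of value-add features that would have to be rebuilt or given up when leaving the agent-based model.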