Laurent Bernaille
How the OOM-Killer Deleted my Namespace
(and Other Kubernetes Tales)
@lbernail
Datadog
Over 400 integrations
Over 1,500 employees
Over 12,000 customers
Runs on millions of hosts
Trillions of data points per day
10,000s of hosts in our infra
10s of k8s clusters with 100-4000 nodes
Multi-cloud
Very fast growth
@lbernail
Context
● We have shared a few stories already
○ Kubernetes the Very Hard Way
○ 10 Ways to Shoot Yourself in the Foot with Kubernetes
○ Kubernetes DNS horror stories
● The scale of our Kubernetes infra keeps growing
○ New scalability issues
○ But also, much more complex problems
How The OOM-killer Deleted my Namespace
@lbernail
"My namespace and everything in it are
completely gone"
Symptoms
@lbernail
Feb 17 14:30: Namespace and workloads deleted
Investigation
DeleteCollection calls by resource (audit logs)
@lbernail
Audit logs
● Delete calls for all resources in the namespace
● No delete call for the namespace resource
Why was the namespace deleted??
Investigation
@lbernail
Audit logs for the namespace
Feb 13th (resource deletion happened on Feb 17th)
Successful namespace delete call, 4 days before the workloads were deleted
@lbernail
The engineer did not delete the namespace
But the engineer was migrating to a new deployment method
@lbernail
Spinnaker deploys (v1)
[Diagram: chart → deploy → kubectl apply]
@lbernail
Helm 3 deploys (v2)
[Diagram: chart → deploy → helm upgrade]
@lbernail
Big difference
[Diagram: chart resources (A, B, C, B', C') vs resources in the cluster, compared between v1 (kubectl apply) and v2 (helm upgrade)]
v2: removing a resource from the chart deletes it from the cluster
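The difference is easy to reproduce; a minimal sketch, assuming a release named "my-app" and a manifests/ directory rendered from the chart (names are hypothetical):
# v1: kubectl apply only creates/updates what is in the manifests;
# a resource removed from the chart stays in the cluster unless pruning is explicitly enabled
$ kubectl apply -f manifests/
# v2: helm tracks which resources belong to the release;
# a resource removed from the chart is deleted on upgrade
$ helm upgrade --install my-app ./chart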
@lbernail
What happened?
1- Chart refactoring: move the namespace out of it
2- Deployment: deletes the namespace
But why did it take 4 days?
@lbernail
Namespace Controller
Discover all API resources
Delete all resources in the namespace
@lbernail
Namespace Controller
=> If discovery fails, namespace content is not deleted
Kubernetes 1.14+: Delete as many resources as possible
Kubernetes 1.16+: Add Conditions to namespaces to surface errors
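On 1.16+, those conditions can be read off the stuck namespace; a sketch, assuming a namespace named "my-namespace" (hypothetical):
$ kubectl get namespace my-namespace -o jsonpath='{range .status.conditions[*]}{.type}{": "}{.message}{"\n"}{end}'
# a NamespaceDeletionDiscoveryFailure condition would carry the
# "unable to retrieve the complete list of server APIs" message seen below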
@lbernail
Namespace Controller logs
Controller-manager logs contained many occurrences of:
unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1:
the server is currently unable to handle the request
No context, so it's hard to link directly to the incident, but:
● This is the error from the Discovery call
● We have a likely cause: metrics.k8s.io/v1beta1
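The same discovery failure can be surfaced from the CLI when an aggregated API is down; a sketch (output abridged):
$ kubectl api-resources > /dev/null
error: unable to retrieve the complete list of server APIs: metrics.k8s.io/v1beta1: the server is currently unable to handle the request
$ kubectl get apiservice v1beta1.metrics.k8s.io
# AVAILABLE typically shows False (FailedDiscoveryCheck) when the backing service is unhealthy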
@lbernail
metrics-server
● Additional controller providing node/pod metrics
● Registers an ApiService
○ Additional Kubernetes APIs
○ Managed by the external controller
Additional APIs (group and version)
Where the apiserver should proxy calls for these APIs
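The registration is just an APIService object; a sketch of the fields to look at, assuming the usual metrics-server install in kube-system:
$ kubectl get apiservice v1beta1.metrics.k8s.io -o yaml
# spec.group / spec.version: the API this extension serves (metrics.k8s.io, v1beta1)
# spec.service.namespace / spec.service.name: where the apiserver proxies the calls
#   (typically kube-system/metrics-server)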
@lbernail
Events so far
● Feb 13th: namespace deletion call
● Feb 13-17th
○ Namespace controller can't discover metrics APIs
○ Namespace content is not deleted
○ At each sync, controller retries and fails
● Feb 17th
○ ?
○ Namespace controller is unblocked
○ Everything in the namespace is deleted
@lbernail
Events so far
● Feb 13th: error leads to namespace deletion call
● Feb 13-17th
○ Namespace controller can't discover metrics APIs
○ Namespace content is not deleted
○ At each sync, controller retries and fails
● Feb 17th
○ OOM-killer kills metrics-server and it is restarted
○ Metrics-server is now working fine
○ Namespace controller is unblocked
○ Everything in the namespace is deleted
@lbernail
Why was metrics-server failing?
@lbernail
Metrics-server setup
[Diagram: metrics-server container plus a certificate-management sidecar that signals metrics-server via pkill]
Pod with two containers
● metrics-server
● sidecar: rotate server certificate, and signal metrics-server to reload
Requires shared pid namespace
● shareProcessNamespace (pod spec)
● PodShareProcessNamespace (feature gate on Kubelet and Apiserver)
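A minimal sketch of that pod layout (names and images are hypothetical, not the actual manifest):
$ cat <<'EOF' | kubectl apply -f -
apiVersion: v1
kind: Pod
metadata:
  name: metrics-server-example
spec:
  shareProcessNamespace: true      # both containers see each other's processes
  containers:
  - name: metrics-server
    image: registry.example.com/metrics-server:v0.3.7
  - name: cert-rotator             # sidecar: rotates the cert, then pkills metrics-server
    image: registry.example.com/cert-rotator:latest
EOF
Without the feature gate on the node, the PID namespace is not shared, so the sidecar's pkill never reaches metrics-server and the certificate stops being reloaded.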
@lbernail
Metrics-server deployment
[Diagram: metrics-server, coredns, and other controllers]
Deployed on a pool of nodes dedicated to core controllers ("system")
@lbernail
Metrics-server deployment
[Diagram: "recent" nodes vs "old" nodes without PodShareProcessNamespace]
Subset of nodes using an old image without PodShareProcessNamespace
● metrics-server: fine until it was rescheduled on an old node
● At this point, the server certificate is no longer reloaded
@lbernail
Full chain of events
● Before Feb 13th
○ metrics-server scheduled on a "bad" node
○ controller discovery calls fail
● Feb 13th: error leads to namespace deletion call
● Feb 13-17th
○ Namespace controller can't discover metrics APIs
○ Namespace content is not deleted
○ At each sync, controller retries and fails
● Feb 17th
○ OOM-killer kills metrics-server and it is restarted
○ Metrics-server is now working fine (certificate up to date)
○ Namespace controller is unblocked
○ Everything in the namespace is deleted
@lbernail
Key take-away
● Apiservice extensions are great but can impact your cluster
○ If the API is not used, they can be down for days without impact
○ But some controllers need to discover all APIs and will fail
● In addition, Apiservice extensions security is hard
○ If you are curious, go and see
New node image does not work
and it's weird...
@lbernail
Context
● We build images for our Kubernetes nodes
● A small change was pushed
● Nodes become NotReady after a few days
● Change completely unrelated to kubelet / runtime
● We can't promote to Prod
@lbernail
Investigation
Conditions:
Type Status Reason Message
---- ------ ------ -------
OutOfDisk False KubeletHasSufficientDisk kubelet has sufficient disk space available
MemoryPressure False KubeletHasSufficientMemory kubelet has sufficient memory available
DiskPressure False KubeletHasNoDiskPressure kubelet has no disk pressure
PIDPressure False KubeletHasSufficientPID kubelet has sufficient PID available
Ready False KubeletNotReady container runtime is down
Node description
@lbernail
Runtime is down?
$ crictl pods
POD ID STATE NAME NAMESPACE
681bb3aec7fba Ready local-volume-provisioner-jtwzf local-volume-provisioner
53f122f351da2 Ready datadog-agent-qctt2-8ljgp datadog-agent
c1edb4f19713c Ready localusers-z26n9 prod-platform
01cf5cafd6bc9 Ready node-monitoring-ws2xf node-monitoring
39f01bdaade86 Ready kube2iam-zzdt6 kube2iam
3fcf520680a63 Ready kube-proxy-sr7tn kube-system
5ecd0966634f1 Ready node-local-dns-m2zbp coredns
Looks completely fine
Let's look at containerd
@lbernail
Kubelet logs
remote_runtime.go:434] Status from runtime service failed: rpc error: code = DeadlineExceeded desc =
context deadline exceeded
kubelet.go:2130] Container runtime sanity check failed: rpc error: code = DeadlineExceeded desc =
context deadline exceeded
kubelet.go:1803] skipping pod synchronization - [container runtime is down]
"Status from runtime service failed"
➢ Is there a CRI call for this?
Something is definitely wrong
@lbernail
Testing with crictl
● crictl has no "status" subcommand
● but "crictl info" looks close, let's check
@lbernail
Crictl info
$ crictl info
^C
● Hangs indefinitely
● But what does Status do?
● Almost nothing, except this
@lbernail
CNI status
● Really not much here right?
● But we have calls that hang and a lock
● Could this be related?
@lbernail
Containerd goroutine dump
containerd[737]: goroutine 107298383 [semacquire, 2 minutes]:
containerd[737]: goroutine 107290532 [semacquire, 4 minutes]:
containerd[737]: goroutine 107282673 [semacquire, 6 minutes]:
containerd[737]: goroutine 107274360 [semacquire, 8 minutes]:
containerd[737]: goroutine 107266554 [semacquire, 10 minutes]:
Blocked goroutines?
The runtime error from the kubelet also happens every 2 minutes
containerd[737]: goroutine 107298383 [semacquire, 2 minutes]:
...
containerd[737]: .../containerd/vendor/github.com/containerd/go-cni.(*libcni).Status()
containerd[737]: .../containerd/vendor/github.com/containerd/go-cni/cni.go:126
containerd[737]: .../containerd/vendor/github.com/containerd/cri/pkg/server.(*criService).Status(...)
containerd[737]: .../containerd/containerd/vendor/github.com/containerd/cri/pkg/server/status.go:45
Goroutine details match exactly the codepath we looked at
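For reference, the dump itself can be obtained from a running containerd; a sketch assuming the stock systemd unit (containerd logs its goroutine stacks when it receives SIGUSR1):
$ kill -USR1 "$(pidof containerd)"
$ journalctl -u containerd --since "2 min ago" | grep -B1 -A6 'goroutine .* \[semacquire'
# the stack frames show which mutex the blocked goroutines are waiting on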
@lbernail
Seems CNI related
Let's bypass containerd/kubelet
$ ip netns add debug
$ CNI_PATH=/opt/cni/bin cnitool add pod-network /var/run/netns/debug
Works great and we even have connectivity
$ ip netns exec debug ping 8.8.8.8
PING 8.8.8.8 (8.8.8.8) 56(84) bytes of data.
64 bytes from 8.8.8.8: icmp_seq=1 ttl=111 time=2.41 ms
64 bytes from 8.8.8.8: icmp_seq=2 ttl=111 time=1.70 ms
64 bytes from 8.8.8.8: icmp_seq=3 ttl=111 time=1.76 ms
64 bytes from 8.8.8.8: icmp_seq=4 ttl=111 time=1.72 ms
@lbernail
What about Delete?
$ CNI_PATH=/opt/cni/bin cnitool del pod-network /var/run/netns/debug
^C
● hangs indefinitely
● and we can even find the original process still running
ps aux | grep cni
root 367 0.0 0.1 410280 8004 ? Sl 15:37 0:00
/opt/cni/bin/cni-ipvlan-vpc-k8s-unnumbered-ptp
@lbernail
CNI plugin
● On this cluster we use the Lyft CNI plugin
● The delete call fails for cni-ipvlan-vpc-k8s-unnumbered-ptp
● After a few tests, we tracked the problem to this call
● Weird, this is in the netlink library!
@lbernail
The root cause
Many thanks to Daniel Borkmann!
@lbernail
Summary
● Packer configured to use the latest Ubuntu 18.04
● The default kernel changed from 4.15 to 5.4
● Netlink library had a bug for kernels 4.20+
● Bug only happened on pod deletion
@lbernail
Status
● Containerd
○ Status calls trigger a config file reload, which acquires an RWLock on the config mutex
○ CNI Add/Del calls acquire an RLock on the mutex
○ If CNI Add/Del hangs, Status calls block
○ Does not impact containerd 1.4+ (separate fsnotify.Watcher to reload config)
● cni-ipvlan-vpc-k8s ("Lyft" CNI plugin)
○ Bug in Netlink library with kernel 4.20+
○ Does not impact cni-ipvlan-vpc-k8s 0.6.2+ (different function used)
@lbernail
Key take-away
Nodes are not abstract compute resources
● Kernel
● Distribution
● Hardware
Error messages can be misleading
● For CNI errors, runtime usually reports a clear error
● Here we had nothing but a timeout
How a single app can overwhelm the
control plane
@lbernail
Context
● Users report connectivity issues to a cluster
● Apiservers are not doing well
@lbernail
What's with the Apiservers?
Apiservers can't reach etcd
Apiservers crash/restart
@lbernail
Etcd?
Etcd memory usage is spiking
Etcd is oom-killed
@lbernail
What we know
● Cluster size has not significantly changed
● The control plane has not been updated
● So it is very likely related to api calls
@lbernail
Apiserver requests
Spikes in inflight requests
Spikes in list calls
(very expensive)
@lbernail
Why are list calls expensive?
Understanding Apiserver caching
[Diagram: the apiserver keeps a cache of etcd data, kept up to date with a List + Watch]
@lbernail
Why are list calls expensive?
What happens for GET/LIST calls?
[Diagram: client → apiserver (handler → storage → cache, synced from etcd via List + Watch); GET/LIST with RV=X]
● Resources have versions (ResourceVersion, RV)
● For GET/LIST with RV=X
If cachedVersion >= X, return cachedVersion
else wait up to 3s, if cachedVersion>=X return cachedVersion
else Error: "Too large resource version"
➢ RV=X: "at least as fresh as X"
➢ RV=0: current version in cache
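These semantics can be exercised with raw API calls; a sketch (namespace and resourceVersion values are illustrative):
# "at least as fresh as X": served from the watch cache once it has caught up to X
$ kubectl get --raw '/api/v1/namespaces/datadog/pods?resourceVersion=12345' >/dev/null
# RV=0: whatever version is currently in the apiserver cache, no etcd read
$ kubectl get --raw '/api/v1/namespaces/datadog/pods?resourceVersion=0' >/dev/null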
@lbernail
Why are list calls expensive?
What about GET/LIST without a resourceVersion?
[Diagram: client → apiserver (handler → storage), read served directly from etcd; GET/LIST with RV=""]
● If RV is not set, apiserver performs a Quorum read against etcd (for consistency)
● This is the behavior of
○ kubectl get
○ client.CoreV1().Pods("").List() with default options (client-go)
@lbernail
Illustration
● Test on a large cluster with more than 30k pods
● Using table view ("application/json;as=Table;v=v1beta1;g=meta.k8s.io, application/json")
○ Only ~25MB of data to minimize transfer time (full JSON: ~1GB)
time curl 'https://cluster.dog/api/v1/pods'
real 0m4.631s
time curl 'https://cluster.dog/api/v1/pods?resourceVersion=0'
real 0m1.816s
@lbernail
What about label filters?
time curl 'https://cluster.dog/api/v1/pods?labelSelector=app=A'
real 0m3.658s
time curl 'https://cluster.dog/api/v1/pods?labelSelector=app=A&resourceVersion=0'
real 0m0.079s
● Call with RV="" is slightly faster (less data to send)
● Call with RV=0 is much much faster
> Filtering is performed on cached data
● When RV="", all pods are still retrieved from etcd and then filtered on apiservers
@lbernail
Why not filter in etcd?
etcd key structure
/registry/{resource type}/{namespace}/{resource name}
So we can ask etcd for:
● a specific resource
● all resources of a type in a namespace
● all resources of a type
● no other filtering / indexing
curl 'https://cluster.dog/api/v1/pods?labelSelector=app=A'
real 0m3.658s <== get all pods (30k) in etcd, filter on apiserver
curl 'https://cluster.dog/api/v1/namespaces/datadog/pods?labelSelector=app=A'
real 0m0.188s <== get pods from datadog namespace (1000) in etcd, filter on apiserver
curl 'https://cluster.dog/api/v1/namespaces/datadog/pods?labelSelector=app=A&resourceVersion=0'
real 0m0.058s <== get pods from datadog namespace in apiserver cache, filter
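The key layout is easy to check directly against etcd; a sketch, assuming etcdctl v3 access (endpoints and certificates omitted):
$ ETCDCTL_API=3 etcdctl get /registry/pods/datadog/ --prefix --keys-only | head
# prefix reads are the only server-side narrowing available:
# there is no label index, so a labelSelector still means reading every key under the prefix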
@lbernail
Informers instead of List
How do informers work?
[Diagram: Informer → apiserver (handler → storage → cache); 1- LIST RV=0, then 2- WATCH RV=X]
● Informers are much better because
○ They maintain a local cache, updated on changes
○ They start with a LIST using RV=0 to get data from the cache
● Edge case
○ Disconnections: can trigger a new LIST with RV="" (to avoid going back in time)
○ Kubernetes 1.20 will have EfficientWatchResumption (alpha)
https://github.com/kubernetes/enhancements/tree/master/keps/sig-api-machinery/1904-efficient-watch-resumption
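A rough shell analogue of that startup sequence, using raw API calls (resource versions shown are illustrative):
# 1- initial LIST, served from the apiserver watch cache
$ kubectl get --raw '/api/v1/namespaces/datadog/pods?resourceVersion=0&limit=500'
# 2- WATCH from the resourceVersion returned in the list metadata, streaming changes
$ kubectl get --raw '/api/v1/namespaces/datadog/pods?watch=true&resourceVersion=67890'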
@lbernail
Summary
● LIST calls go to etcd by default and can have a huge impact
● LIST calls with label filters still retrieve everything from etcd
● Avoid LIST, use Informers
● kubectl get uses LIST with RV=""
● kubectl get could allow setting RV=0
○ Much faster: better user experience
○ Much better for etcd and apiservers
○ Trade-off: small inconsistency window
Many, many thanks to Wojtek-t for his help getting all this right
@lbernail
Back to the incident
● We know the problem comes from LIST calls
● What application is issuing these calls?
@lbernail
Audit logs
Cumulated query time by user for list calls
A single user accounts for 2+ days of query time over 20 minutes!
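This kind of breakdown can be computed straight from the audit logs; a sketch assuming one JSON event per line in audit.log, with the standard audit event fields:
$ jq -r 'select(.stage == "ResponseComplete" and .verb == "list")
    | [.user.username,
       ((.stageTimestamp | sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601)
        - (.requestReceivedTimestamp | sub("\\.[0-9]+Z$"; "Z") | fromdateiso8601))]
    | @tsv' audit.log \
  | awk '{ total[$1] += $2 } END { for (u in total) print total[u], u }' \
  | sort -rn | head
# prints cumulated list-call latency (in seconds) per user, worst offenders first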
@lbernail
Audit logs
Cumulated query time by user for list/get calls over a week
@lbernail
Nodegroup controller?
● An in-house controller to manage pools of nodes
● Used extensively for 2 years
● But a recent upgrade added deletion protection
○ Check if pods are running on the pool of nodes
○ Deny nodegroup deletion if so
@lbernail
How did it work?
On nodegroup delete
1. List all nodes in this nodegroup based on labels
=> Some groups have 100+ nodes
2. List all pods on each node filtering on bound node
=> List all pods (30k) for node
=> Performed in parallel for all nodes in the group
=> The bigger the nodegroup, the worse the impact
All LIST calls here retrieve full node/pod list from etcd
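For reference, the shape of the calls the controller was effectively making (label and node names are hypothetical):
# 1. nodes in the group, selected by label: full node list read from etcd, filtered on the apiserver
$ kubectl get nodes -l nodegroup=my-group
# 2. for each node, pods bound to it: each call reads the full pod list (30k) from etcd
$ kubectl get pods --all-namespaces --field-selector spec.nodeName=my-node-1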
@lbernail
What we learned
● List calls are very dangerous
○ The volume of data can be very large
○ Filtering happens on the apiserver
○ Use Informers (whenever possible)
● Audit logs are extremely useful
○ Who did what when?
○ Which users are responsible for processing time
Conclusion
@lbernail
Conclusion
● Apiservice extensions are powerful but can harm your cluster
● Nodes are not abstract compute resources
● For large (1000+ nodes) clusters
○ Understanding Apiserver caching is important
○ Client interactions with Apiservers are critical
○ Avoid LIST calls
@lbernail
Thank you
We’re hiring!
https://www.datadoghq.com/careers/
laurent@datadoghq.com
@lbernail