Kubernetes is a very powerful and complicated system, and many users don’t understand the systems underneath it. Come learn how your users can abuse container runtimes, overwhelm your control plane, and cause outages - it’s actually quite easy!
In the last year, we have containerized hundreds of applications and deployed them in large-scale clusters (more than 1000 nodes). The journey was eventful and we learned a lot along the way. We’ll share stories of our ten favorite Kubernetes foot guns, including the dangers of cargo culting, rolling updates gone wrong, the pitfalls of initContainers, and nightmarish daemonset upgrades. The talk will present the solutions we adopted to avoid or work around some of these problems and will finally show several improvements we plan to deploy in the future.
Similar to the KubeCon talk of the same title, with a few new incidents.
10 ways to shoot yourself in the foot with Kubernetes, #9 will surprise you! (Container Day Paris)
1. 10 Ways to Shoot Yourself in the Foot with Kubernetes, #9 Will Surprise You!
Laurent Bernaille, @lbernail
Staff Engineer, Datadog
#ContainerDayFR
2. #ContainerDayFR
Who are we?
Cloud Native service provider: Datadog is a monitoring service (metrics, traces, logs, dashboards, alerts, etc.) - www.datadoghq.com
Cloud Native end user: we run the cloud infrastructure that powers the product.
7. @lbernail #ContainerDayFR
DNS is broken
CoreDNS rolling update and IPVS
Diagram: the app resolves DNS through an IPVS virtual server backed by Pods A, B, and C; the IPVS conntrack table maps client ports to backends (:X => A:53, :Y => B:53, :Z => C:53).
8-9. @lbernail #ContainerDayFR
DNS is broken
Under high load, ports are reused faster than conntrack entries expire
Diagram: Pod A has been replaced by Pod D (virtual server backends: B, C, D), but the conntrack entry :X => A:53 lingers with a 300s TTL; app traffic that reuses port :X is black-holed.
Relevant sysctl: net/ipv4/vs/expire_nodest_conn
10. @lbernail #ContainerDayFR
DNS is broken
Graceful termination?
Diagram: with graceful termination, Pod A remains as a backend with Weight=0 (B, C, D have Weight=1); the conntrack entry :X => A:53 (TTL 300s) still points at it, and traffic reusing :X ends up black-holed when A finally goes away.
11. @lbernail #ContainerDayFR
DNS is broken
● Before graceful termination
○ Backend removed immediately
○ Traffic reusing the port is black-holed
● After graceful termination
○ Under high load, ports are reused fast
○ The backend pod is never removed (its conntrack entries keep being refreshed)
○ When the pod finally terminates, traffic is black-holed
○ Fix: https://github.com/kubernetes/kubernetes/pull/77802
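A minimal sketch, assuming access to a node running kube-proxy in IPVS mode, of how one might inspect this state and enable the sysctl mentioned above (commands are illustrative, not from the talk):
# list IPVS virtual servers, their backends and weights
sudo ipvsadm -L -n
# list IPVS connection entries, including ones still pointing at removed backends
sudo ipvsadm -L -n -c
# expire connections whose backend has been removed instead of black-holing them
sudo sysctl -w net.ipv4.vs.expire_nodest_conn=1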
34. @lbernail #ContainerDayFR
DNS updates take 30+ minutes to propagate
● We use external-dns for cross-cluster communication
● Usually propagation is ~5mn
● But on one cluster... (from the external-dns logs)
"Retrieved 1671 records from registry in 525ns" (existing records, cached)
"Generated 1670 records from sources in 36mn12" (expected records, from k8s)
"Calculated plan in 37.8ms"
36-39. @lbernail #ContainerDayFR
Why does it take so long to generate records?
● To generate records, external-dns does the following:
a. Get all services with a specific annotation
b. For each headless service:
■ List all pods matching the service selector
■ Add the IPs of running pods to the record
c. Add the record as DNS round-robin
● In this cluster, we use a lot of headless services (1500+), so this loop is expensive (see the sketch below)
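A rough shell sketch of the API load this loop generates (this is not external-dns code; the selector file and timings are illustrative):
# one "list pods" call against the apiserver per headless service, per sync loop
for selector in $(cat headless-service-selectors.txt); do   # ~1600 selectors in this cluster
  kubectl get pods -l "$selector" --no-headers > /dev/null  # each LIST took ~1.2s here
done
# 1600 calls x ~1.2s per call => roughly half an hour per sync loop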
40. @lbernail #ContainerDayFR
What had changed?
“List pods” calls duration (from apiserver audit logs)
Listing pods 75ms => 1.25s
> external-dns does this 1600+ times in each sync loop
> 1600*1.2s => ~33mn
42. @lbernail #ContainerDayFR
Why was list so slow?
Number of pods in the main namespace: 900 => 4400
> List calls much slower
Number of “Evicted” pods: 3500
> They remain in the apiserver
43-44. @lbernail #ContainerDayFR
Why did the apiserver slow down so much?
Chart: apiserver memory usage (apiservers configured to use 24GB)
> No room for additional caching
Chart: ETCD traffic sent (a lot more reads in ETCD)
> 32MB/s => 70MB/s
> Much slower than serving from the apiserver cache
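Two hedged spot checks on apiserver pressure (metric names vary with Kubernetes version, so treat these as illustrations only):
# number of pod objects the apiserver is tracking (includes Evicted pods)
kubectl get --raw /metrics | grep 'apiserver_storage_objects{resource="pods"}'
# volume of requests the apiserver sends to etcd
kubectl get --raw /metrics | grep etcd_request_duration_seconds_count | head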
46. @lbernail #ContainerDayFR
Where are the evictions coming from?
Chart: evictions over time (from Kubernetes events) shows a lot of evictions
Chart: disk usage shows evictions were triggered because usage went above 85%
48. @lbernail #ContainerDayFR
Pod eviction under disk pressure
● The kubelet ranks pods according to disk usage
> local volumes + logs & writable layer
● It evicts the pods with the highest ranking
● What happened?
a. The application used a cache in an emptyDir
b. A change in traffic pattern made the cache usage much larger
c. The application was evicted repeatedly for a few hours
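For reference, the disk-pressure thresholds live in the kubelet configuration; a minimal sketch with illustrative values (not the talk's actual settings):
# evict pods when the node filesystem gets above ~85% used (i.e. <15% available)
kubelet \
  --eviction-hard='nodefs.available<15%,imagefs.available<15%' \
  --eviction-minimum-reclaim='nodefs.available=5Gi'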
49. @lbernail #ContainerDayFR
Fixes
1) Manually remove evicted pods and monitor (immediate win)
2) Tune the controller-manager garbage collection
“terminated-pod-gc-threshold” (default: 12500)
3) Update external-dns to use informers (cache) instead of list calls
> Fixed upstream
“Use k8s informer cache instead of making active API GET requests”
https://github.com/kubernetes-incubator/external-dns/pull/917
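A minimal sketch of the first two fixes (the GC threshold value here is arbitrary, not the one the talk settled on):
# 1) delete pods left in Failed phase (this is where Evicted pods end up)
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers \
  | while read ns name; do kubectl delete pod -n "$ns" "$name"; done
# 2) on the control plane, lower the garbage-collection threshold for terminated pods,
#    e.g. kube-controller-manager --terminated-pod-gc-threshold=500 (default is 12500)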
52. @lbernail #ContainerDayFR
Kernel Audit DDOS
For this account, log volume went up 20x in 30mn
> Kernel audit had been enabled with a daemonset
> Nodes running Kubernetes generate a huge amount of audit events (exec, clones, iptables…)
Chart: logs per source
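For context, a broad audit rule like the sketch below (illustrative, not the rule from this incident) fires on every exec, which Kubernetes nodes do constantly; auditctl can also rate-limit the flood:
# audit every execve: on a Kubernetes node this fires for container starts,
# exec/liveness probes, kube-proxy iptables calls, etc.
auditctl -a always,exit -F arch=b64 -S execve -k exec-monitoring
# cap the number of audit messages per second the kernel will emit
auditctl -r 1000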
60-63. @lbernail #ContainerDayFR
Ghosts in the Cassandra cluster
● 100+ node Cassandra cluster
● Deployed fine
● Broken the following morning
> 25% of pods pending: “Volume affinity issue”
> 25% of nodes had been deleted, with local volumes still bound to the deleted nodes
Chart: nodes per AZ > ASG rebalance (the autoscaling group rebalanced across AZs overnight, terminating nodes)
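One common mitigation for stateful nodepools (an illustration, not necessarily what was adopted here) is to stop the autoscaling group from rebalancing instances across availability zones; the group name is hypothetical:
# prevent the ASG from terminating instances just to even out AZs
aws autoscaling suspend-processes \
  --auto-scaling-group-name cassandra-nodepool \
  --scaling-processes AZRebalance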
73-75. @lbernail #ContainerDayFR
What happened?
● The cronjob moved to a new namespace
● Tolerations were fine, but the Vault permissions were tied to the old namespace
● InitContainers failed and entered CrashLoopBackOff
● Default “concurrencyPolicy” for cronjobs (Allow): a new job starts on each schedule even if the previous one is still running (see the sketch below)
● These pods consume resources; after a while new pods can’t be scheduled => Pending
● Actually an upstream bug: initContainer restarts were not counted against “backoffLimit” (job spec)
● Fixed upstream: https://github.com/kubernetes/kubernetes/pull/77647
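A minimal sketch of tightening the cronjob settings involved here (the cronjob name and values are hypothetical):
# stop overlapping runs from piling up when the job cannot finish
kubectl patch cronjob sync-job --type merge \
  -p '{"spec":{"concurrencyPolicy":"Forbid"}}'
# cap retries of the underlying job so failing pods stop accumulating
kubectl patch cronjob sync-job --type merge \
  -p '{"spec":{"jobTemplate":{"spec":{"backoffLimit":3}}}}'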
92-93. @lbernail #ContainerDayFR
Perf issues: slow deployments
Symptoms
● Deployments were slow
● Pods took 1+mn to start
Cause
● Load on the nodes running the deployment was high
● Because IOs were being rate-limited
Charts: load on nodepool, EBS burst balance, write iops
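One way to confirm this kind of throttling (illustrative; the volume ID and time window are made up) is to look at the EBS BurstBalance metric, which drops toward 0 when a gp2 volume has exhausted its IO credits:
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2019-06-01T00:00:00Z --end-time 2019-06-01T06:00:00Z \
  --period 300 --statistics Average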
95. @lbernail #ContainerDayFR
Chart: DNS queries
● Nodes were running CoreDNS pods
● An app started doing 5000+ DNS queries per second
● CoreDNS logs filled the disk
Why??
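For context, CoreDNS writes one log line per query only when the "log" plugin is enabled in its Corefile; a hedged way to check (assuming the usual coredns ConfigMap in kube-system):
# look for the "log" plugin in the Corefile; with it enabled, 5000+ queries/s
# become 5000+ log lines/s per replica
kubectl -n kube-system get configmap coredns -o yaml | grep -nw log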
101-102. @lbernail #ContainerDayFR
Weird outlier
Charts: iops by node in the group (full and zoomed), containers created by node
One node still had more iops than the others
> io spikes every minute
> That single host got all the pods for one cronjob
> Suspending the cronjob => iops dropped to 0
> Replacing the job with a sleep => iops dropped to 0
103-104. @lbernail #ContainerDayFR
Why is the job using so much IO?
Job (simplified):
ips=$(aws ec2 describe-instances --filter=consul)
kubectl update endpoints consul $ips
$ kubectl get componentstatuses
$ find $HOME/.kube/ | wc -l
163
$ kubectl -v=8 get componentstatuses | grep "GET https" -c
45
> Each kubectl invocation with a cold cache issues ~45 GET requests (mostly API discovery) and writes the results under ~/.kube/ (163 files here); the job did this every minute on that node, producing the io spikes.
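A hedged sketch of one possible mitigation (not necessarily what was done here): point the job's kubectl at a persistent cache directory so API discovery is not re-fetched and re-written on every run:
# kubectl keeps its discovery/HTTP caches under --cache-dir (default: ~/.kube/cache);
# a persistent, pre-warmed directory avoids rebuilding it every minute
kubectl --cache-dir=/var/cache/kubectl get endpoints consul -o yaml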
105. #ContainerDayFR
Key take-aways
1. Be very careful with Daemonsets
2. Cronjobs can easily create problems
3. DNS is hard
4. Cloud infra not transparent
5. Containers not really contained
6. Train users