Kubernetes is a very powerful and complicated system, and many users don’t understand the systems underneath it. Come learn how your users can abuse container runtimes, overwhelm your control plane, and cause outages - it’s actually quite easy!
In the last year, we have containerized hundreds of applications and deployed them in large-scale clusters (more than 1000 nodes). The journey was eventful and we learned a lot along the way. We’ll share stories of our ten favorite Kubernetes foot guns, including the dangers of cargo culting, rolling updates gone wrong, the pitfalls of initContainers, and nightmarish daemonset upgrades. The talk will present the solutions we adopted to avoid or work around some of these problems and will finally show several improvements we plan to deploy in the future.
Similar to the KubeCon talk of the same title, with a few new incidents.
10 ways to shoot yourself in the foot with Kubernetes, #9 will surprise you! (Container Day Paris)
1. 10 Ways to Shoot Yourself in the Foot with Kubernetes, #9 Will Surprise You!
Laurent Bernaille, @lbernail
Staff Engineer, Datadog
#ContainerDayFR
2. #ContainerDayFR
Who are we?
Cloud Native service provider: Datadog is a monitoring service (metrics, traces, logs, dashboards, alerts, etc.) - www.datadoghq.com
Cloud Native end user: we run the cloud infrastructure that powers the product.
7. @lbernail #ContainerDayFR
DNS is broken
CoreDNS rolling update and IPVS
Diagram: the app resolves DNS through an IPVS virtual server backed by Pods A, B, and C; the IPVS conntrack table maps client ports to backends (:X => A:53, :Y => B:53, :Z => C:53).
8-9. @lbernail #ContainerDayFR
DNS is broken
Under high load, ports are reused faster than conntrack entries expire
Diagram: Pod A has been replaced by Pod D (virtual server backends: B, C, D), but the conntrack entry :X => A:53 lingers with a 300s TTL; app traffic that reuses port :X is black-holed.
Relevant sysctl: net/ipv4/vs/expire_nodest_conn
10. @lbernail #ContainerDayFR
DNS is broken
Graceful termination?
Diagram: with graceful termination, Pod A remains as a backend with Weight=0 (B, C, D have Weight=1); the conntrack entry :X => A:53 (TTL 300s) still points at it, and traffic reusing :X ends up black-holed when A finally goes away.
11. @lbernail #ContainerDayFR
DNS is broken
● Before graceful termination
○ Backend removed immediately
○ Traffic reusing the port is black-holed
● After graceful termination
○ Under high load, ports are reused fast
○ The backend pod is never removed (its conntrack entries keep being refreshed)
○ When the pod finally terminates, traffic is black-holed
○ Fix: https://github.com/kubernetes/kubernetes/pull/77802
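A minimal sketch, assuming access to a node running kube-proxy in IPVS mode, of how one might inspect this state and enable the sysctl mentioned above (commands are illustrative, not from the talk):
# list IPVS virtual servers, their backends and weights
sudo ipvsadm -L -n
# list IPVS connection entries, including ones still pointing at removed backends
sudo ipvsadm -L -n -c
# expire connections whose backend has been removed instead of black-holing them
sudo sysctl -w net.ipv4.vs.expire_nodest_conn=1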
34. @lbernail #ContainerDayFR
DNS updates take 30+ minutes to propagate
● We use external-dns for cross-cluster communication
● Usually propagation is ~5mn
● But on one cluster... (from the external-dns logs)
"Retrieved 1671 records from registry in 525ns" (existing records, cached)
"Generated 1670 records from sources in 36mn12" (expected records, from k8s)
"Calculated plan in 37.8ms"
36-39. @lbernail #ContainerDayFR
Why does it take so long to generate records?
● To generate records, external-dns does the following:
a. Get all services with a specific annotation
b. For each headless service:
■ List all pods matching the service selector
■ Add the IPs of running pods to the record
c. Add the record as DNS round-robin
● In this cluster, we use a lot of headless services (1500+), so this loop is expensive (see the sketch below)
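A rough shell sketch of the API load this loop generates (this is not external-dns code; the selector file and timings are illustrative):
# one "list pods" call against the apiserver per headless service, per sync loop
for selector in $(cat headless-service-selectors.txt); do   # ~1600 selectors in this cluster
  kubectl get pods -l "$selector" --no-headers > /dev/null  # each LIST took ~1.2s here
done
# 1600 calls x ~1.2s per call => roughly half an hour per sync loop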
40. @lbernail #ContainerDayFR
What had changed?
“List pods” calls duration (from apiserver audit logs)
Listing pods 75ms => 1.25s
> external-dns does this 1600+ times in each sync loop
> 1600*1.2s => ~33mn
42. @lbernail #ContainerDayFR
Why was list so slow?
Number of pods in the main namespace: 900 => 4400
> List calls much slower
Number of “Evicted” pods: 3500
> They remain in the apiserver
43-44. @lbernail #ContainerDayFR
Why did the apiserver slow down so much?
Chart: apiserver memory usage (apiservers configured to use 24GB)
> No room for additional caching
Chart: ETCD traffic sent (a lot more reads in ETCD)
> 32MB/s => 70MB/s
> Much slower than serving from the apiserver cache
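Two hedged spot checks on apiserver pressure (metric names vary with Kubernetes version, so treat these as illustrations only):
# number of pod objects the apiserver is tracking (includes Evicted pods)
kubectl get --raw /metrics | grep 'apiserver_storage_objects{resource="pods"}'
# volume of requests the apiserver sends to etcd
kubectl get --raw /metrics | grep etcd_request_duration_seconds_count | head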
46. @lbernail #ContainerDayFR
Where are the evictions coming from?
Chart: evictions over time (from Kubernetes events) shows a lot of evictions
Chart: disk usage shows evictions were triggered because usage went above 85%
48. @lbernail #ContainerDayFR
Pod eviction under disk pressure
● The kubelet ranks pods according to disk usage
> local volumes + logs & writable layer
● It evicts the pods with the highest ranking
● What happened?
a. The application used a cache in an emptyDir
b. A change in traffic pattern made the cache usage much larger
c. The application was evicted repeatedly for a few hours
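For reference, the disk-pressure thresholds live in the kubelet configuration; a minimal sketch with illustrative values (not the talk's actual settings):
# evict pods when the node filesystem gets above ~85% used (i.e. <15% available)
kubelet \
  --eviction-hard='nodefs.available<15%,imagefs.available<15%' \
  --eviction-minimum-reclaim='nodefs.available=5Gi'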
49. @lbernail #ContainerDayFR
Fixes
1) Manually remove evicted pods and monitor (immediate win)
2) Tune the controller-manager garbage collection
“terminated-pod-gc-threshold” (default: 12500)
3) Update external-dns to use informers (cache) instead of list calls
> Fixed upstream
“Use k8s informer cache instead of making active API GET requests”
https://github.com/kubernetes-incubator/external-dns/pull/917
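A minimal sketch of the first two fixes (the GC threshold value here is arbitrary, not the one the talk settled on):
# 1) delete pods left in Failed phase (this is where Evicted pods end up)
kubectl get pods --all-namespaces --field-selector=status.phase=Failed \
  -o custom-columns=NS:.metadata.namespace,NAME:.metadata.name --no-headers \
  | while read ns name; do kubectl delete pod -n "$ns" "$name"; done
# 2) on the control plane, lower the garbage-collection threshold for terminated pods,
#    e.g. kube-controller-manager --terminated-pod-gc-threshold=500 (default is 12500)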
52. @lbernail #ContainerDayFR
Kernel Audit DDOS
For this account, log volume went up 20x in 30mn
> Kernel audit had been enabled with a daemonset
> Nodes running Kubernetes generate a huge amount of audit events (exec, clones, iptables…)
Chart: logs per source
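For context, a broad audit rule like the sketch below (illustrative, not the rule from this incident) fires on every exec, which Kubernetes nodes do constantly; auditctl can also rate-limit the flood:
# audit every execve: on a Kubernetes node this fires for container starts,
# exec/liveness probes, kube-proxy iptables calls, etc.
auditctl -a always,exit -F arch=b64 -S execve -k exec-monitoring
# cap the number of audit messages per second the kernel will emit
auditctl -r 1000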
60-63. @lbernail #ContainerDayFR
Ghosts in the Cassandra cluster
● 100+ node Cassandra cluster
● Deployed fine
● Broken the following morning
> 25% of pods pending: “Volume affinity issue”
> 25% of nodes had been deleted, with local volumes still bound to the deleted nodes
Chart: nodes per AZ > ASG rebalance (the autoscaling group rebalanced across AZs overnight, terminating nodes)
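One common mitigation for stateful nodepools (an illustration, not necessarily what was adopted here) is to stop the autoscaling group from rebalancing instances across availability zones; the group name is hypothetical:
# prevent the ASG from terminating instances just to even out AZs
aws autoscaling suspend-processes \
  --auto-scaling-group-name cassandra-nodepool \
  --scaling-processes AZRebalance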
73-75. @lbernail #ContainerDayFR
What happened?
● The cronjob moved to a new namespace
● Tolerations were fine, but the Vault permissions were tied to the old namespace
● InitContainers failed and entered CrashLoopBackOff
● Default “concurrencyPolicy” for cronjobs (Allow): a new job starts on each schedule even if the previous one is still running (see the sketch below)
● These pods consume resources; after a while new pods can’t be scheduled => Pending
● Actually an upstream bug: initContainer restarts were not counted against “backoffLimit” (job spec)
● Fixed upstream: https://github.com/kubernetes/kubernetes/pull/77647
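A minimal sketch of tightening the cronjob settings involved here (the cronjob name and values are hypothetical):
# stop overlapping runs from piling up when the job cannot finish
kubectl patch cronjob sync-job --type merge \
  -p '{"spec":{"concurrencyPolicy":"Forbid"}}'
# cap retries of the underlying job so failing pods stop accumulating
kubectl patch cronjob sync-job --type merge \
  -p '{"spec":{"jobTemplate":{"spec":{"backoffLimit":3}}}}'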
92-93. @lbernail #ContainerDayFR
Perf issues: slow deployments
Symptoms
● Deployments were slow
● Pods took 1+mn to start
Cause
● Load on the nodes running the deployment was high
● Because IOs were being rate-limited
Charts: load on nodepool, EBS burst balance, write iops
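One way to confirm this kind of throttling (illustrative; the volume ID and time window are made up) is to look at the EBS BurstBalance metric, which drops toward 0 when a gp2 volume has exhausted its IO credits:
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name BurstBalance \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2019-06-01T00:00:00Z --end-time 2019-06-01T06:00:00Z \
  --period 300 --statistics Average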
95. @lbernail #ContainerDayFR
Chart: DNS queries
● Nodes were running CoreDNS pods
● An app started doing 5000+ DNS queries per second
● CoreDNS logs filled the disk
Why??
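For context, CoreDNS writes one log line per query only when the "log" plugin is enabled in its Corefile; a hedged way to check (assuming the usual coredns ConfigMap in kube-system):
# look for the "log" plugin in the Corefile; with it enabled, 5000+ queries/s
# become 5000+ log lines/s per replica
kubectl -n kube-system get configmap coredns -o yaml | grep -nw log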
101-102. @lbernail #ContainerDayFR
Weird outlier
Charts: iops by node in the group (full and zoomed), containers created by node
One node still had more iops than the others
> io spikes every minute
> That single host got all the pods for one cronjob
> Suspending the cronjob => iops dropped to 0
> Replacing the job with a sleep => iops dropped to 0
103-104. @lbernail #ContainerDayFR
Why is the job using so much IO?
Job (simplified):
ips=$(aws ec2 describe-instances --filter=consul)
kubectl update endpoints consul $ips
$ kubectl get componentstatuses
$ find $HOME/.kube/ | wc -l
163
$ kubectl -v=8 get componentstatuses | grep "GET https" -c
45
> Each kubectl invocation with a cold cache issues ~45 GET requests (mostly API discovery) and writes the results under ~/.kube/ (163 files here); the job did this every minute on that node, producing the io spikes.
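A hedged sketch of one possible mitigation (not necessarily what was done here): point the job's kubectl at a persistent cache directory so API discovery is not re-fetched and re-written on every run:
# kubectl keeps its discovery/HTTP caches under --cache-dir (default: ~/.kube/cache);
# a persistent, pre-warmed directory avoids rebuilding it every minute
kubectl --cache-dir=/var/cache/kubectl get endpoints consul -o yaml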
105. #ContainerDayFR
Key take-aways
1. Be very careful with Daemonsets
2. Cronjobs can easily create problems
3. DNS is hard
4. Cloud infra not transparent
5. Containers not really contained
6. Train users