SlideShare une entreprise Scribd logo
1  sur  93
Télécharger pour lire hors ligne
Robert Boll @roboll_
Laurent Bernaille @lbernail
Making the Most Out of
Kubernetes Audit Logs
@lbernail @roboll_
Datadog
Monitoring service
Over 350 integrations
Over 1,200 employees
Over 8,000 customers
Runs on millions of hosts
Trillions of data points per day
10000s hosts in our infra
10s of Kubernetes clusters
Clusters from 50 to 3000 nodes
Multi-cloud
Very fast growth
@lbernail @roboll_
Understanding what happens can be hard
@lbernail @roboll_
Outline
1. Background: The Kubernetes API
2. Audit Logs
3. Configuring Audit Logs
4. 10000 foot view for a large cluster
5. Understanding Kubernetes Internals
6. Troubleshooting examples
@lbernail @roboll_
Outline
1. Background: The Kubernetes API
2. Audit Logs
3. Configuring Audit Logs
4. 10000 foot view for a large cluster
5. Understanding Kubernetes Internals
6. Troubleshooting examples
Background:
The Kubernetes API
@lbernail @roboll_
apiserver
etcd
Calls to the apiservers
@lbernail @roboll_
apiserver
etcd
controllers scheduler
Control plane
@lbernail @roboll_
apiserver
etcd
kubelet controllers scheduler
Kubelet
@lbernail @roboll_
apiserver
etcd
kubelet controllers scheduler
kube-proxy
DaemonSet: kube-proxy
@lbernail @roboll_
apiserver
etcd
kubelet controllers scheduler
kube-proxy
other node
apps
Other DaemonSets (cni, etc.)
@lbernail @roboll_
apiserver
etcd
kubelet controllers scheduler
kube-proxy
coredns
other node
apps
Cluster services: DNS
@lbernail @roboll_
apiserver
etcd
kubelet controllers scheduler
kube-proxy cluster-
autoscaler
ingress
controllers
coredns
other node
apps
Other cluster services
@lbernail @roboll_
apiserver
etcd
kubelet controllers scheduler
kube-proxy cluster-
autoscaler
ingress
controllers
other apps
coredns
other node
apps
Probably several other applications
@lbernail @roboll_
And users, of course
apiserver
etcd
kubelet controllers scheduler
kubectl
kube-proxy cluster-
autoscaler
ingress
controllers
other apps
coredns
other node
apps
@lbernail @roboll_
And, surprise, the apiserver itself
apiserver
etcd
kubelet controllers scheduler
kubectl
kube-proxy cluster-
autoscaler
ingress
controllers
other apps
coredns
other node
apps
@lbernail @roboll_
What happens when you kubectl?
$ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
@lbernail @roboll_
Let’s look at details
$ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
apiserver api version namespace resource type resource name
@lbernail @roboll_
A few more GET examples
$ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
$ kubectl get pods
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500 List
(paginated)
@lbernail @roboll_
A few more GET examples
$ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
$ kubectl get pods
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500
$ kubectl get pods --watch=true -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?resourceVersion=282725545&watch=true
List &
Watch
@lbernail @roboll_
Describe resource
kubectl describe pod echodeploy-77cf5c6f6-5wmw9 -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/events?
fieldSelector=involvedObject.name=echodeploy-77cf5c6f6-5wmw9,
involvedObject.namespace=datadog,
involvedObject.uid=770b3a5e-0631-11ea-bc60-12d7306f3c0c
[...]
ResponseBody
{
"kind": "EventList",
"items": [
{
"involvedObject": { "kind": "Pod", "namespace": "datadog", "name": "echodeploy-77cf5c6f6-5wmw9"},
"reason": "Scheduled",
"message": "Successfully assigned echodeploy-77cf5c6f6-5wmw9 to ip-10-128-205-156.ec2.internal",
"source": {
"component": "default-scheduler"
},
}
]
}
@lbernail @roboll_
Describe resource
kubectl describe pod echodeploy-77cf5c6f6-5wmw9 -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/events?
fieldSelector=involvedObject.name=echodeploy-77cf5c6f6-5wmw9,
involvedObject.namespace=datadog,
involvedObject.uid=770b3a5e-0631-11ea-bc60-12d7306f3c0c
[...]
ResponseBody
{
"kind": "EventList",
"items": [
{
"involvedObject": { "kind": "Pod", "namespace": "datadog", "name": "echodeploy-77cf5c6f6-5wmw9"},
"reason": "Scheduled",
"message": "Successfully assigned echodeploy-77cf5c6f6-5wmw9 to ip-10-128-205-156.ec2.internal",
"source": {
"component": "default-scheduler"
},
}
]
}
Get resource
@lbernail @roboll_
Describe resource
kubectl describe pod echodeploy-77cf5c6f6-5wmw9 -v=8
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9
[...]
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/events?
fieldSelector=involvedObject.name=echodeploy-77cf5c6f6-5wmw9,
involvedObject.namespace=datadog,
involvedObject.uid=770b3a5e-0631-11ea-bc60-12d7306f3c0c
[...]
ResponseBody
{
"kind": "EventList",
"items": [
{
"involvedObject": { "kind": "Pod", "namespace": "datadog", "name": "echodeploy-77cf5c6f6-5wmw9"},
"reason": "Scheduled",
"message": "Successfully assigned echodeploy-77cf5c6f6-5wmw9 to ip-10-128-205-156.ec2.internal",
"source": {
"component": "default-scheduler"
},
}
]
}
Get resource
Get events associated
with resource
@lbernail @roboll_
A few other examples
$ kubectl delete pod echodeploy-77cf5c6f6-brj76 -v=8
[...]
DELETE https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76&
resourceVersion=282733788&watch=true
Delete
+
List &
Watch
@lbernail @roboll_
A few other examples
$ kubectl delete pod echodeploy-77cf5c6f6-brj76 -v=8
[...]
DELETE https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76&
resourceVersion=282733788&watch=true
$ kubectl create deployment test --image=busybox -v=8
Request Body:
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"creationTimestamp":null,"labels":{"app":"test"},"name":"test"},"spec":{"replicas":1,"s
elector":{"matchLabels":{"app":"test"}},"strategy":{},"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"test"}},"spec":{"container
s":[{"image":"busybox","name":"busybox","resources":{}}]}}},"status":{}}
POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments
Minimal
deployment
spec
POST call
@lbernail @roboll_
A few other examples
$ kubectl delete pod echodeploy-77cf5c6f6-brj76 -v=8
[...]
DELETE https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76&
resourceVersion=282733788&watch=true
$ kubectl create deployment test --image=busybox -v=8
Request Body:
{"apiVersion":"apps/v1","kind":"Deployment","metadata":{"creationTimestamp":null,"labels":{"app":"test"},"name":"test"},"spec":{"replicas":1,"s
elector":{"matchLabels":{"app":"test"}},"strategy":{},"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"test"}},"spec":{"container
s":[{"image":"busybox","name":"busybox","resources":{}}]}}},"status":{}}
POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments
$ kubectl scale deploy test --replicas=2 -v=8
GET https://kubernetes.fury.us1.staging.dog/apis/extensions/v1beta1/namespaces/datadog/deployments/test
Request Body: {"spec":{"replicas":2}}
PATCH https://kubernetes.fury.us1.staging.dog/apis/extensions/v1beta1/namespaces/datadog/deployments/test/scale
GET current
PATCH body
+call
@lbernail @roboll_
Takeaways
● A lot of components are making calls
○ Control plane: controllers, scheduler
○ Node daemons: kubelet, kube-proxy
○ Other controllers: autoscaler, ingress
● “Simple” user ops translate to many API calls
How can we understand what is going on?
@lbernail @roboll_
Outline
1. Background: The Kubernetes API
2. Audit Logs
3. Configuring Audit Logs
4. 10000 foot view for a large cluster
5. Understanding Kubernetes Internals
6. Troubleshooting examples
Audit Logs
@lbernail @roboll_
What are Audit Logs?
● Rich Structured json logs output by the apiserver
● Configurable Verbosity for each resource
● Logging can happen at different processing stages
apiserverclient
1
3 2: Request processing
1: Apiserver receives request, Stage: RequestReceived
2: Apiserver processes request
3: Apiserver answers, Stage: ResponseComplete/ResponseStarted
@lbernail @roboll_
Content of Audit Logs
● What happened?
● Who initiated it?
● Why was it authorized?
● When did it happen?
● From where?
● Depending on verbosity, Request/Response
@lbernail @roboll_
GET example from earlier
Structured JSON log
A lot of information
$ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
@lbernail @roboll_
What happened?
$ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9
@lbernail @roboll_
Who initiated it?
User was laurent.bernaille@datadoghq.com and mapped to groups
- datadoghq.com
- system:authenticated
$ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
@lbernail @roboll_
Why was it authorized?
It was authorized because group datadoghq.com is bound to role
datadoghq:cluster-user by ClusterRoleBinding datadoghq:cluster-admin-binding
(and this role has the required permissions)
$ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
@lbernail @roboll_
When, and from where?
Request received at 20:33:26.757
Response completed at 20:33:26:771
Duration: 14ms
From IP: 10.X.Y.74
$ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
@lbernail @roboll_
Another GET call
$ kubectl get pods -v=8
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500
GET is mapped to different verbs (get/list)
@lbernail @roboll_
Watches
$ kubectl get pods -v=8 -w
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500
GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?resourceVersion=288656279&watch=true
Call 1 : list
stage: ResponseComplete
Call 2 : watch
get + watch parameters
136ms later
stage: ResponseStarted
@lbernail @roboll_
Create call
$ kubectl create deployment test --image=busybox -v=8
POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments
@lbernail @roboll_
Create call
$ kubectl create deployment test --image=busybox -v=8
POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments
@lbernail @roboll_
Takeaways
Audit logs contain information on all API calls
● What happened?
● Who initiated it?
● Why was it authorized?
● When did it happen?
● From where?
● Depending on verbosity, Request/Response
OK, how do I get them?
@lbernail @roboll_
Outline
1. Background: The Kubernetes API
2. Audit Logs
3. Configuring Audit Logs
4. 10000 foot view for a large cluster
5. Understanding Kubernetes Internals
6. Troubleshooting examples
Configuring Audit Logs
@lbernail @roboll_
Apiserver configuration
kube-apiserver
[...]
--audit-log-path=/var/log/kubernetes/apiserver/audit.log
--audit-policy-file=/etc/kubernetes/audit-policies/policy.yaml
[...]
Where to store them
What to collect
Minimum configuration
Advanced
● Rotation parameters (max size, backup options)
● Alternative backend (webhook)
● Batching mode
@lbernail @roboll_
apiVersion: audit.k8s.io/v1 kind: Policy
rules:
# Log pod changes at RequestResponse level
- level: RequestResponse
omitStages:
- "RequestReceived"
resources:
- group: "" # core API group
resources: ["pods"]
verbs: ["create", "patch"”, "update", "delete"]
# Log "pods/log", "pods/status" at Metadata level
- level: Metadata
omitStages:
- "RequestReceived"
resources:
- group: ""
resources: ["pods/log", "pods/status"]
Set of rules
Audit policy: what to log?
@lbernail @roboll_
apiVersion: audit.k8s.io/v1 kind: Policy
rules:
# Log pod changes at RequestResponse level
- level: RequestResponse
omitStages:
- "RequestReceived"
resources:
- group: "" # core API group
resources: ["pods"]
verbs: ["create", "patch"”, "update", "delete"]
# Log "pods/log", "pods/status" at Metadata level
- level: Metadata
omitStages:
- "RequestReceived"
resources:
- group: ""
resources: ["pods/log", "pods/status"]
Rules match api call
● api group / version
● resource
● verbs
> Similar to RBAC
Audit policy: what to log?
@lbernail @roboll_
apiVersion: audit.k8s.io/v1 kind: Policy
rules:
# Log pod changes at RequestResponse level
- level: RequestResponse
omitStages:
- "RequestReceived"
resources:
- group: "" # core API group
resources: ["pods"]
verbs: ["create", "patch"”, "update", "delete"]
# Log "pods/log", "pods/status" at Metadata level
- level: Metadata
omitStages:
- "RequestReceived"
resources:
- group: ""
resources: ["pods/log", "pods/status"]
For matching API calls
● Which verbosity?
● When? (stage)
Audit policy: when to log?
@lbernail @roboll_
apiVersion: audit.k8s.io/v1 kind: Policy
rules:
# Log pod changes at RequestResponse level
- level: RequestResponse
omitStages:
- "RequestReceived"
resources:
- group: "" # core API group
resources: ["pods"]
verbs: ["create", "patch"”, "update", "delete"]
# Log "pods/log", "pods/status" at Metadata level
- level: Metadata
omitStages:
- "RequestReceived"
resources:
- group: ""
resources: ["pods/log", "pods/status"]
Gotchas
Rules are evaluated in order
First matching rule sets level
Request/RequestResponse
> contain payload data
Careful with security implications
ex: tokenreviews calls
group: “” means core API only
Don’t forget to add
● 3rd party apiservices
● your apiservices
@lbernail @roboll_
Recommendations
● Ignore RequestReceived stage
● Use at least Metadata level for almost everything
○ Possibly ignore healthz, metrics
● Use Request/Response level for critical resource/verbs
○ Very valuable for retroactive debugging
○ Careful for large/sensitive request/response bodies
● Very complete example in GKE
https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh
● Documentation
https://kubernetes.io/docs/tasks/debug-application-cluster/audit/#audit-policy
@lbernail @roboll_
Takeaways
● Getting audit logs is “simple”: 2 flags
● Getting policies right is harder
● You will get a lot of logs
● Requires a solution to analyze them
Let’s look at an overview on a real large cluster
@lbernail @roboll_
Outline
1. Background: The Kubernetes API
2. Audit Logs
3. Configuring Audit Logs
4. 10000 foot view for a large cluster
5. Understanding Kubernetes Internals
6. Troubleshooting examples
10000 foot view for a
large cluster
@lbernail @roboll_
Total number of API calls
~900 calls/second on this 2500 nodes cluster
Number of audit logs per hour
@lbernail @roboll_
Top API users?
apiserver: 40 rps kube-proxy: 20 rps
local-volume-provisioner:20 rps
cronjob-controller: 15 rps
spinnaker: 5-10 rps
@lbernail @roboll_
Top list, missing “small” users
total: ~900rps
Top 25: 150 rps
900 calls/second on this 2500 nodes cluster
What is doing ~80% of API calls?
Total number of API calls
@lbernail @roboll_
Grouping by users is not helping
Calls from users in group “system:nodes”: 750 rps (~80% of API calls)
In this 2500-nodes cluster, this means 0.3 rps per node!
Calls by user group
group=system:nodes
@lbernail @roboll_
Why is “system:nodes” so high?
nodes: 500 rps configmaps: 75 rps secrets: 60 rps
Calls from group “system:nodes” by resource targeted
@lbernail @roboll_
Verbs on “node” for a kubelet
10s
Each node update its status every 10s
@lbernail @roboll_
Verbs on “configmaps”
Only GETs
Regularity but not clear pattern
@lbernail @roboll_
Group by resource name
Each configmap is refreshed every ~5mn => GET call
Similar for secrets
~5mn
@lbernail @roboll_
List latency by resource
Latency for list by resource
@lbernail @roboll_
List latency by resource
Latency for list by resource
apiserver restarts
@lbernail @roboll_
Compare cluster performances
Latency for get pod by cluster (ms)
large clusters (1500+ nodes)
smaller clusters (<700 nodes)
@lbernail @roboll_
Takeaways
● Biggest users are the ones running on each node
○ kubelet, daemonsets (kube-proxy)
○ A lot of effort upstream to reduce their load
○ Be extra careful of damonsets doing API calls
● Audit logs structure allow to filter and slice & dice
● Audit logs are verbose (1000 logs/s in our example)
@lbernail @roboll_
Outline
1. Background: The Kubernetes API
2. Audit Logs
3. Configuring Audit Logs
4. 10000 foot view for a large cluster
5. Understanding Kubernetes Internals
6. Troubleshooting examples
Understanding
Kubernetes Internals
@lbernail @roboll_
Creating a simple deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: nginx
labels:
app: nginx
spec:
replicas: 2
selector:
matchLabels:
app: nginx
template:
metadata:
labels:
app: nginx
spec:
containers:
- name: nginx
image: nginx
@lbernail @roboll_
Sequence
User creates
deploy
deployment controller
creates RS
RS controller
creates pods
Scheduler
bind pods
Kubelets
Update
pod status
@lbernail @roboll_
Sequence
Scheduler
bind pods
Scheduler callCreate
Binding
For nginx pod
To node
ip-10-x-y-123
@lbernail @roboll_
Actually a lot more
node 1 node 2 RS controller Deploy ctrl Scheduler apiserver user
Create of Deployment, RS, pods, binding + nodes accept
Nodes create pods and generate events
Pods are running
Deployments/RS update status
Additions
Apiserver verifies Quotas
Components also get/list
Creation of events + Update of status fields
Complete node workflow
~ 3s
@lbernail @roboll_
Node API calls
Initial calls
Get pod
Update pod/status: ContainerCreating
Get service account token
@lbernail @roboll_
Node API calls
Create events to show progression
MountVolume.SetUp succeeded for volume "default-token-dqmzw"
pulling image "nginx"
Successfully pulled image "nginx"
Created container
Started container
@lbernail @roboll_
Node API calls
Finalize pod creation
Get pod
Update container status to “Running”
@lbernail @roboll_
Takeaways
● A lot of interactions between kube components
● Audit logs give a great understanding of this!
● Events are spiky and get generate a lot of logs
● Events have a 1h default TTL, but stay in audit logs
Let’s identify some problems using audit logs
@lbernail @roboll_
Outline
1. Background: The Kubernetes API
2. Audit Logs
3. Configuring Audit Logs
4. 10000 foot view for a large cluster
5. Understanding Kubernetes Internals
6. Troubleshooting examples
Troubleshooting
examples
@lbernail @roboll_
Troubleshooting
● Understand what happened
○ “Why was a resource deleted?”
● Debug performance regressions/improve performances
○ “Which application is responsible for so many calls?”
● Also, identify issues by looking at HTTP status codes
@lbernail @roboll_
Status codes
Calls by status code
200 201 401
@lbernail @roboll_
4xx only
4xx by status code
401 403 404 422 409
@lbernail @roboll_
4xx only
4xx by status code
401?? (Unauthorized)
@lbernail @roboll_
Analyzing 401s
401s by source IP
4 nodes only, turns out they had expired certificates
@lbernail @roboll_
4xx only
4xx by status code
403?? (Forbidden)
@lbernail @roboll_
Analyzing 403s
403s by user
Most errors from Traefik serviceAccount
Bad RBAC configuration?
@lbernail @roboll_
Analyzing 403s for this user
403s for Traefik serviceAccount by resource
We use Traefik without Kubernetes secrets but it still tries to list them
@lbernail @roboll_
4xx only
4xx by status code
422?? (Invalid)
@lbernail @roboll_
What about 422?
422 by user
kube-system:generic-garbage-collector => ??
@lbernail @roboll_
What is failing?
generic-garbage-collector by verb / resource
patch on pod
@lbernail @roboll_
What is happening?
● Pods are in “Evicted” status and must be kept
● Controlling RS has been deleted and pods should be orphaned
● Garbage collector fails to orphan them (remove ownerRef)
● ReplicaSet has been Terminating for 2 months...
● Root cause: mutating webhook
○ We modify the pod spec at creation
○ Mutating webhook is registered on pods for CREATE/UPDATE
○ We modify immutable fields
○ Garbage collector patch triggers this...
@lbernail @roboll_
Takeaways
● Looking at calls triggering HTTP errors help find issues
○ Misconfigured RBAC (403)
○ Applications doing calls they shouldn’t (403)
○ Expired certificates (401)
○ Other weird things
Conclusion
@lbernail @roboll_
Conclusion
● Audit logs can be incredibly valuable
○ Low-level understanding of Kubernetes
○ Detection of misconfigurations
○ Troubleshooting of issues
○ Identify performance issues
● Taking advantage of them require some effort
○ Policies are not easy to get right
○ They are verbose and require a tool to analyze them
Thank you
We’re hiring!
Visit our Kubecon booth
or https://www.datadoghq.com/careers/
Or contact us directly:
@lbernail laurent@datadoghq.com
@roboll_ roboll@datadoghq.com

Contenu connexe

Tendances

(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf
(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf
(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdfssuserf8b8bd1
 
Building APIs with Amazon API Gateway
Building APIs with Amazon API GatewayBuilding APIs with Amazon API Gateway
Building APIs with Amazon API GatewayAmazon Web Services
 
Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...
Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...
Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...David J Rosenthal
 
Deep Dive - Amazon Kinesis Video Streams - AWS Online Tech Talks
Deep Dive - Amazon Kinesis Video Streams - AWS Online Tech TalksDeep Dive - Amazon Kinesis Video Streams - AWS Online Tech Talks
Deep Dive - Amazon Kinesis Video Streams - AWS Online Tech TalksAmazon Web Services
 
Spring cloud on kubernetes
Spring cloud on kubernetesSpring cloud on kubernetes
Spring cloud on kubernetesSangSun Park
 
소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안
소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안
소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안Jeongsang Baek
 
소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해
소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해
소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해Terry Cho
 
Google Cloud IAM 계정, 권한 및 조직 관리
Google Cloud IAM 계정, 권한 및 조직 관리Google Cloud IAM 계정, 권한 및 조직 관리
Google Cloud IAM 계정, 권한 및 조직 관리정명훈 Jerry Jeong
 
OpenAPI 3.0, And What It Means for the Future of Swagger
OpenAPI 3.0, And What It Means for the Future of SwaggerOpenAPI 3.0, And What It Means for the Future of Swagger
OpenAPI 3.0, And What It Means for the Future of SwaggerSmartBear
 
Operator SDK for K8s using Go
Operator SDK for K8s using GoOperator SDK for K8s using Go
Operator SDK for K8s using GoCloudOps2005
 
Introduction to Kubernetes RBAC
Introduction to Kubernetes RBACIntroduction to Kubernetes RBAC
Introduction to Kubernetes RBACKublr
 
Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)Chris Aniszczyk
 
Api gateway in microservices
Api gateway in microservicesApi gateway in microservices
Api gateway in microservicesKunal Hire
 
[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.
[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.
[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.sung ki choi
 
Monitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMonitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMartin Etmajer
 
EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...
EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...
EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...AWSKRUG - AWS한국사용자모임
 

Tendances (20)

(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf
(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf
(발표자료) CentOS EOL에 따른 대응 OS 검토 및 적용 방안.pdf
 
Building APIs with Amazon API Gateway
Building APIs with Amazon API GatewayBuilding APIs with Amazon API Gateway
Building APIs with Amazon API Gateway
 
Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...
Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...
Cloud Design Patterns - PRESCRIPTIVE ARCHITECTURE GUIDANCE FOR CLOUD APPLICAT...
 
Deep Dive - Amazon Kinesis Video Streams - AWS Online Tech Talks
Deep Dive - Amazon Kinesis Video Streams - AWS Online Tech TalksDeep Dive - Amazon Kinesis Video Streams - AWS Online Tech Talks
Deep Dive - Amazon Kinesis Video Streams - AWS Online Tech Talks
 
Spring cloud on kubernetes
Spring cloud on kubernetesSpring cloud on kubernetes
Spring cloud on kubernetes
 
소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안
소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안
소셜게임 서버 개발 관점에서 본 Node.js의 장단점과 대안
 
소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해
소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해
소프트웨어 개발 트랜드 및 MSA (마이크로 서비스 아키텍쳐)의 이해
 
Google Cloud IAM 계정, 권한 및 조직 관리
Google Cloud IAM 계정, 권한 및 조직 관리Google Cloud IAM 계정, 권한 및 조직 관리
Google Cloud IAM 계정, 권한 및 조직 관리
 
OpenAPI 3.0, And What It Means for the Future of Swagger
OpenAPI 3.0, And What It Means for the Future of SwaggerOpenAPI 3.0, And What It Means for the Future of Swagger
OpenAPI 3.0, And What It Means for the Future of Swagger
 
Kubernetes Basics
Kubernetes BasicsKubernetes Basics
Kubernetes Basics
 
Operator SDK for K8s using Go
Operator SDK for K8s using GoOperator SDK for K8s using Go
Operator SDK for K8s using Go
 
Docker Kubernetes Istio
Docker Kubernetes IstioDocker Kubernetes Istio
Docker Kubernetes Istio
 
Introduction to Kubernetes RBAC
Introduction to Kubernetes RBACIntroduction to Kubernetes RBAC
Introduction to Kubernetes RBAC
 
Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)Cloud Native Landscape (CNCF and OCI)
Cloud Native Landscape (CNCF and OCI)
 
Api gateway in microservices
Api gateway in microservicesApi gateway in microservices
Api gateway in microservices
 
Amazon ECS
Amazon ECSAmazon ECS
Amazon ECS
 
Ansible
AnsibleAnsible
Ansible
 
[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.
[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.
[111015/아꿈사] HTML5를 여행하는 비(非) 웹 개발자를 위한 안내서 - 1부 웹소켓.
 
Monitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on KubernetesMonitoring, Logging and Tracing on Kubernetes
Monitoring, Logging and Tracing on Kubernetes
 
EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...
EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...
EKS에서 Opentelemetry로 코드실행 모니터링하기 - 신재현 (인덴트코퍼레이션) :: AWS Community Day Online...
 

Similaire à Making the most out of kubernetes audit logs

Building Your Own IoT Platform using FIWARE GEis
Building Your Own IoT Platform using FIWARE GEisBuilding Your Own IoT Platform using FIWARE GEis
Building Your Own IoT Platform using FIWARE GEisFIWARE
 
Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...
Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...
Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...Amazon Web Services
 
FIWARE IoT Proposal & Community
FIWARE IoT Proposal & CommunityFIWARE IoT Proposal & Community
FIWARE IoT Proposal & CommunityFIWARE
 
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogicRakuten Group, Inc.
 
New and smart way to develop microservice for istio with micro profile
New and smart way to develop microservice for istio with micro profileNew and smart way to develop microservice for istio with micro profile
New and smart way to develop microservice for istio with micro profileEmily Jiang
 
Advanced #2 networking
Advanced #2   networkingAdvanced #2   networking
Advanced #2 networkingVitali Pekelis
 
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...apidays
 
Super-NetOps Source of Truth
Super-NetOps Source of TruthSuper-NetOps Source of Truth
Super-NetOps Source of TruthJoel W. King
 
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppet
 
Super-NetOps Source of Truth
Super-NetOps Source of TruthSuper-NetOps Source of Truth
Super-NetOps Source of TruthJoel W. King
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKel Cecil
 
CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016
CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016
CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016Cloud Native Day Tel Aviv
 
FrenchKit 2017: Server(less) Swift
FrenchKit 2017: Server(less) SwiftFrenchKit 2017: Server(less) Swift
FrenchKit 2017: Server(less) SwiftChris Bailey
 
OpenStack API's and WSGI
OpenStack API's and WSGIOpenStack API's and WSGI
OpenStack API's and WSGIMike Pittaro
 
NSA for Enterprises Log Analysis Use Cases
NSA for Enterprises   Log Analysis Use Cases NSA for Enterprises   Log Analysis Use Cases
NSA for Enterprises Log Analysis Use Cases WSO2
 
Automatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS Summit
Automatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS SummitAutomatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS Summit
Automatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS SummitAmazon Web Services
 
JDD2015: Kubernetes - Beyond the basics - Paul Bakker
JDD2015: Kubernetes - Beyond the basics - Paul BakkerJDD2015: Kubernetes - Beyond the basics - Paul Bakker
JDD2015: Kubernetes - Beyond the basics - Paul BakkerPROIDEA
 

Similaire à Making the most out of kubernetes audit logs (20)

Building Your Own IoT Platform using FIWARE GEis
Building Your Own IoT Platform using FIWARE GEisBuilding Your Own IoT Platform using FIWARE GEis
Building Your Own IoT Platform using FIWARE GEis
 
Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...
Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...
Automatically scaling your Kubernetes workloads - SVC210-S - Santa Clara AWS ...
 
FIWARE IoT Proposal & Community
FIWARE IoT Proposal & CommunityFIWARE IoT Proposal & Community
FIWARE IoT Proposal & Community
 
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic
[Rakuten TechConf2014] [C-5] Ichiba Architecture on ExaLogic
 
New and smart way to develop microservice for istio with micro profile
New and smart way to develop microservice for istio with micro profileNew and smart way to develop microservice for istio with micro profile
New and smart way to develop microservice for istio with micro profile
 
Advanced #2 networking
Advanced #2   networkingAdvanced #2   networking
Advanced #2 networking
 
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
apidays Paris 2022 - France Televisions : How we leverage API Platform for ou...
 
Super-NetOps Source of Truth
Super-NetOps Source of TruthSuper-NetOps Source of Truth
Super-NetOps Source of Truth
 
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NYPuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
PuppetDB: A Single Source for Storing Your Puppet Data - PUG NY
 
Super-NetOps Source of Truth
Super-NetOps Source of TruthSuper-NetOps Source of Truth
Super-NetOps Source of Truth
 
Mashing Up The Guardian
Mashing Up The GuardianMashing Up The Guardian
Mashing Up The Guardian
 
Kubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of ContainersKubernetes - Sailing a Sea of Containers
Kubernetes - Sailing a Sea of Containers
 
CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016
CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016
CI Provisioning with OpenStack - Gidi Samuels - OpenStack Day Israel 2016
 
FrenchKit 2017: Server(less) Swift
FrenchKit 2017: Server(less) SwiftFrenchKit 2017: Server(less) Swift
FrenchKit 2017: Server(less) Swift
 
OpenStack API's and WSGI
OpenStack API's and WSGIOpenStack API's and WSGI
OpenStack API's and WSGI
 
NSA for Enterprises Log Analysis Use Cases
NSA for Enterprises   Log Analysis Use Cases NSA for Enterprises   Log Analysis Use Cases
NSA for Enterprises Log Analysis Use Cases
 
Automatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS Summit
Automatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS SummitAutomatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS Summit
Automatically Scaling Your Kubernetes Workloads - SVC209-S - Anaheim AWS Summit
 
Mashing Up The Guardian
Mashing Up The GuardianMashing Up The Guardian
Mashing Up The Guardian
 
JDD2015: Kubernetes - Beyond the basics - Paul Bakker
JDD2015: Kubernetes - Beyond the basics - Paul BakkerJDD2015: Kubernetes - Beyond the basics - Paul Bakker
JDD2015: Kubernetes - Beyond the basics - Paul Bakker
 
What's Rio 〜Standalone〜
What's Rio 〜Standalone〜What's Rio 〜Standalone〜
What's Rio 〜Standalone〜
 

Plus de Laurent Bernaille

How the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My NamespaceHow the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My NamespaceLaurent Bernaille
 
Kubernetes DNS Horror Stories
Kubernetes DNS Horror StoriesKubernetes DNS Horror Stories
Kubernetes DNS Horror StoriesLaurent Bernaille
 
Evolution of kube-proxy (Brussels, Fosdem 2020)
Evolution of kube-proxy (Brussels, Fosdem 2020)Evolution of kube-proxy (Brussels, Fosdem 2020)
Evolution of kube-proxy (Brussels, Fosdem 2020)Laurent Bernaille
 
Kubernetes the Very Hard Way. Velocity Berlin 2019
Kubernetes the Very Hard Way. Velocity Berlin 2019Kubernetes the Very Hard Way. Velocity Berlin 2019
Kubernetes the Very Hard Way. Velocity Berlin 2019Laurent Bernaille
 
Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Laurent Bernaille
 
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...Laurent Bernaille
 
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!Laurent Bernaille
 
Optimizing kubernetes networking
Optimizing kubernetes networkingOptimizing kubernetes networking
Optimizing kubernetes networkingLaurent Bernaille
 
Kubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard wayKubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard wayLaurent Bernaille
 
Deep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksDeep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksLaurent Bernaille
 
Deeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay NetworksDeeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay NetworksLaurent Bernaille
 
Operational challenges behind Serverless architectures
Operational challenges behind Serverless architecturesOperational challenges behind Serverless architectures
Operational challenges behind Serverless architecturesLaurent Bernaille
 
Deep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksDeep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksLaurent Bernaille
 
Feedback on AWS re:invent 2016
Feedback on AWS re:invent 2016Feedback on AWS re:invent 2016
Feedback on AWS re:invent 2016Laurent Bernaille
 
Early recognition of encryted applications
Early recognition of encryted applicationsEarly recognition of encryted applications
Early recognition of encryted applicationsLaurent Bernaille
 
Early application identification. CONEXT 2006
Early application identification. CONEXT 2006Early application identification. CONEXT 2006
Early application identification. CONEXT 2006Laurent Bernaille
 

Plus de Laurent Bernaille (17)

How the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My NamespaceHow the OOM Killer Deleted My Namespace
How the OOM Killer Deleted My Namespace
 
Kubernetes DNS Horror Stories
Kubernetes DNS Horror StoriesKubernetes DNS Horror Stories
Kubernetes DNS Horror Stories
 
Evolution of kube-proxy (Brussels, Fosdem 2020)
Evolution of kube-proxy (Brussels, Fosdem 2020)Evolution of kube-proxy (Brussels, Fosdem 2020)
Evolution of kube-proxy (Brussels, Fosdem 2020)
 
Kubernetes the Very Hard Way. Velocity Berlin 2019
Kubernetes the Very Hard Way. Velocity Berlin 2019Kubernetes the Very Hard Way. Velocity Berlin 2019
Kubernetes the Very Hard Way. Velocity Berlin 2019
 
Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019Kubernetes the Very Hard Way. Lisa Portland 2019
Kubernetes the Very Hard Way. Lisa Portland 2019
 
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you! ...
 
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!
10 ways to shoot yourself in the foot with kubernetes, #9 will surprise you!
 
Optimizing kubernetes networking
Optimizing kubernetes networkingOptimizing kubernetes networking
Optimizing kubernetes networking
 
Kubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard wayKubernetes at Datadog the very hard way
Kubernetes at Datadog the very hard way
 
Deep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay NetworksDeep Dive in Docker Overlay Networks
Deep Dive in Docker Overlay Networks
 
Deeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay NetworksDeeper dive in Docker Overlay Networks
Deeper dive in Docker Overlay Networks
 
Discovering OpenBSD on AWS
Discovering OpenBSD on AWSDiscovering OpenBSD on AWS
Discovering OpenBSD on AWS
 
Operational challenges behind Serverless architectures
Operational challenges behind Serverless architecturesOperational challenges behind Serverless architectures
Operational challenges behind Serverless architectures
 
Deep dive in Docker Overlay Networks
Deep dive in Docker Overlay NetworksDeep dive in Docker Overlay Networks
Deep dive in Docker Overlay Networks
 
Feedback on AWS re:invent 2016
Feedback on AWS re:invent 2016Feedback on AWS re:invent 2016
Feedback on AWS re:invent 2016
 
Early recognition of encryted applications
Early recognition of encryted applicationsEarly recognition of encryted applications
Early recognition of encryted applications
 
Early application identification. CONEXT 2006
Early application identification. CONEXT 2006Early application identification. CONEXT 2006
Early application identification. CONEXT 2006
 

Dernier

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024Lonnie McRorey
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...panagenda
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demoHarshalMandlekar2
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfNeo4j
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...Wes McKinney
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Farhan Tariq
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch TuesdayIvanti
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...AliaaTarek5
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Scott Andery
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationKnoldus Inc.
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfpanagenda
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 

Dernier (20)

TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024TeamStation AI System Report LATAM IT Salaries 2024
TeamStation AI System Report LATAM IT Salaries 2024
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
Why device, WIFI, and ISP insights are crucial to supporting remote Microsoft...
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Sample pptx for embedding into website for demo
Sample pptx for embedding into website for demoSample pptx for embedding into website for demo
Sample pptx for embedding into website for demo
 
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyesAssure Ecommerce and Retail Operations Uptime with ThousandEyes
Assure Ecommerce and Retail Operations Uptime with ThousandEyes
 
Connecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdfConnecting the Dots for Information Discovery.pdf
Connecting the Dots for Information Discovery.pdf
 
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...Genislab builds better products and faster go-to-market with Lean project man...
Genislab builds better products and faster go-to-market with Lean project man...
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
2024 April Patch Tuesday
2024 April Patch Tuesday2024 April Patch Tuesday
2024 April Patch Tuesday
 
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
(How to Program) Paul Deitel, Harvey Deitel-Java How to Program, Early Object...
 
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
Enhancing User Experience - Exploring the Latest Features of Tallyman Axis Lo...
 
Data governance with Unity Catalog Presentation
Data governance with Unity Catalog PresentationData governance with Unity Catalog Presentation
Data governance with Unity Catalog Presentation
 
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdfSo einfach geht modernes Roaming fuer Notes und Nomad.pdf
So einfach geht modernes Roaming fuer Notes und Nomad.pdf
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 

Making the most out of kubernetes audit logs

  • 1.
  • 2. Robert Boll @roboll_ Laurent Bernaille @lbernail Making the Most Out of Kubernetes Audit Logs
  • 3. @lbernail @roboll_ Datadog Monitoring service Over 350 integrations Over 1,200 employees Over 8,000 customers Runs on millions of hosts Trillions of data points per day 10000s hosts in our infra 10s of Kubernetes clusters Clusters from 50 to 3000 nodes Multi-cloud Very fast growth
  • 5. @lbernail @roboll_ Outline 1. Background: The Kubernetes API 2. Audit Logs 3. Configuring Audit Logs 4. 10000 foot view for a large cluster 5. Understanding Kubernetes Internals 6. Troubleshooting examples
  • 6. @lbernail @roboll_ Outline 1. Background: The Kubernetes API 2. Audit Logs 3. Configuring Audit Logs 4. 10000 foot view for a large cluster 5. Understanding Kubernetes Internals 6. Troubleshooting examples
  • 11. @lbernail @roboll_ apiserver etcd kubelet controllers scheduler kube-proxy DaemonSet: kube-proxy
  • 12. @lbernail @roboll_ apiserver etcd kubelet controllers scheduler kube-proxy other node apps Other DaemonSets (cni, etc.)
  • 13. @lbernail @roboll_ apiserver etcd kubelet controllers scheduler kube-proxy coredns other node apps Cluster services: DNS
  • 14. @lbernail @roboll_ apiserver etcd kubelet controllers scheduler kube-proxy cluster- autoscaler ingress controllers coredns other node apps Other cluster services
  • 15. @lbernail @roboll_ apiserver etcd kubelet controllers scheduler kube-proxy cluster- autoscaler ingress controllers other apps coredns other node apps Probably several other applications
  • 16. @lbernail @roboll_ And users, of course apiserver etcd kubelet controllers scheduler kubectl kube-proxy cluster- autoscaler ingress controllers other apps coredns other node apps
  • 17. @lbernail @roboll_ And, surprise, the apiserver itself apiserver etcd kubelet controllers scheduler kubectl kube-proxy cluster- autoscaler ingress controllers other apps coredns other node apps
  • 18. @lbernail @roboll_ What happens when you kubectl? $ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76
  • 19. @lbernail @roboll_ Let’s look at details $ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76 apiserver api version namespace resource type resource name
  • 20. @lbernail @roboll_ A few more GET examples $ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76 $ kubectl get pods [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500 List (paginated)
  • 21. @lbernail @roboll_ A few more GET examples $ kubectl get pod echodeploy-77cf5c6f6-brj76 -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76 $ kubectl get pods [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500 $ kubectl get pods --watch=true -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?resourceVersion=282725545&watch=true List & Watch
  • 22. @lbernail @roboll_ Describe resource kubectl describe pod echodeploy-77cf5c6f6-5wmw9 -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/events? fieldSelector=involvedObject.name=echodeploy-77cf5c6f6-5wmw9, involvedObject.namespace=datadog, involvedObject.uid=770b3a5e-0631-11ea-bc60-12d7306f3c0c [...] ResponseBody { "kind": "EventList", "items": [ { "involvedObject": { "kind": "Pod", "namespace": "datadog", "name": "echodeploy-77cf5c6f6-5wmw9"}, "reason": "Scheduled", "message": "Successfully assigned echodeploy-77cf5c6f6-5wmw9 to ip-10-128-205-156.ec2.internal", "source": { "component": "default-scheduler" }, } ] }
  • 23. @lbernail @roboll_ Describe resource kubectl describe pod echodeploy-77cf5c6f6-5wmw9 -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/events? fieldSelector=involvedObject.name=echodeploy-77cf5c6f6-5wmw9, involvedObject.namespace=datadog, involvedObject.uid=770b3a5e-0631-11ea-bc60-12d7306f3c0c [...] ResponseBody { "kind": "EventList", "items": [ { "involvedObject": { "kind": "Pod", "namespace": "datadog", "name": "echodeploy-77cf5c6f6-5wmw9"}, "reason": "Scheduled", "message": "Successfully assigned echodeploy-77cf5c6f6-5wmw9 to ip-10-128-205-156.ec2.internal", "source": { "component": "default-scheduler" }, } ] } Get resource
  • 24. @lbernail @roboll_ Describe resource kubectl describe pod echodeploy-77cf5c6f6-5wmw9 -v=8 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9 [...] GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/events? fieldSelector=involvedObject.name=echodeploy-77cf5c6f6-5wmw9, involvedObject.namespace=datadog, involvedObject.uid=770b3a5e-0631-11ea-bc60-12d7306f3c0c [...] ResponseBody { "kind": "EventList", "items": [ { "involvedObject": { "kind": "Pod", "namespace": "datadog", "name": "echodeploy-77cf5c6f6-5wmw9"}, "reason": "Scheduled", "message": "Successfully assigned echodeploy-77cf5c6f6-5wmw9 to ip-10-128-205-156.ec2.internal", "source": { "component": "default-scheduler" }, } ] } Get resource Get events associated with resource
  • 25. @lbernail @roboll_ A few other examples $ kubectl delete pod echodeploy-77cf5c6f6-brj76 -v=8 [...] DELETE https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76& resourceVersion=282733788&watch=true Delete + List & Watch
  • 26. @lbernail @roboll_ A few other examples $ kubectl delete pod echodeploy-77cf5c6f6-brj76 -v=8 [...] DELETE https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76& resourceVersion=282733788&watch=true $ kubectl create deployment test --image=busybox -v=8 Request Body: {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"creationTimestamp":null,"labels":{"app":"test"},"name":"test"},"spec":{"replicas":1,"s elector":{"matchLabels":{"app":"test"}},"strategy":{},"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"test"}},"spec":{"container s":[{"image":"busybox","name":"busybox","resources":{}}]}}},"status":{}} POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments Minimal deployment spec POST call
  • 27. @lbernail @roboll_ A few other examples $ kubectl delete pod echodeploy-77cf5c6f6-brj76 -v=8 [...] DELETE https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-brj76 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?fieldSelector=metadata.name%3Dechodeploy-77cf5c6f6-brj76& resourceVersion=282733788&watch=true $ kubectl create deployment test --image=busybox -v=8 Request Body: {"apiVersion":"apps/v1","kind":"Deployment","metadata":{"creationTimestamp":null,"labels":{"app":"test"},"name":"test"},"spec":{"replicas":1,"s elector":{"matchLabels":{"app":"test"}},"strategy":{},"template":{"metadata":{"creationTimestamp":null,"labels":{"app":"test"}},"spec":{"container s":[{"image":"busybox","name":"busybox","resources":{}}]}}},"status":{}} POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments $ kubectl scale deploy test --replicas=2 -v=8 GET https://kubernetes.fury.us1.staging.dog/apis/extensions/v1beta1/namespaces/datadog/deployments/test Request Body: {"spec":{"replicas":2}} PATCH https://kubernetes.fury.us1.staging.dog/apis/extensions/v1beta1/namespaces/datadog/deployments/test/scale GET current PATCH body +call
  • 28. @lbernail @roboll_ Takeaways ● A lot of components are making calls ○ Control plane: controllers, scheduler ○ Node daemons: kubelet, kube-proxy ○ Other controllers: autoscaler, ingress ● “Simple” user ops translate to many API calls How can we understand what is going on?
  • 29. @lbernail @roboll_ Outline 1. Background: The Kubernetes API 2. Audit Logs 3. Configuring Audit Logs 4. 10000 foot view for a large cluster 5. Understanding Kubernetes Internals 6. Troubleshooting examples
  • 31. @lbernail @roboll_ What are Audit Logs? ● Rich Structured json logs output by the apiserver ● Configurable Verbosity for each resource ● Logging can happen at different processing stages apiserverclient 1 3 2: Request processing 1: Apiserver receives request, Stage: RequestReceived 2: Apiserver processes request 3: Apiserver answers, Stage: ResponseComplete/ResponseStarted
  • 32. @lbernail @roboll_ Content of Audit Logs ● What happened? ● Who initiated it? ● Why was it authorized? ● When did it happen? ● From where? ● Depending on verbosity, Request/Response
  • 33. @lbernail @roboll_ GET example from earlier Structured JSON log A lot of information $ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
  • 34. @lbernail @roboll_ What happened? $ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wmw9
  • 35. @lbernail @roboll_ Who initiated it? User was laurent.bernaille@datadoghq.com and mapped to groups - datadoghq.com - system:authenticated $ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
  • 36. @lbernail @roboll_ Why was it authorized? It was authorized because group datadoghq.com is bound to role datadoghq:cluster-user by ClusterRoleBinding datadoghq:cluster-admin-binding (and this role has the required permissions) $ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
  • 37. @lbernail @roboll_ When, and from where? Request received at 20:33:26.757 Response completed at 20:33:26:771 Duration: 14ms From IP: 10.X.Y.74 $ kubectl get pod echodeploy-77cf5c6f6-5wmw9 -v=8 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods/echodeploy-77cf5c6f6-5wamw9
  • 38. @lbernail @roboll_ Another GET call $ kubectl get pods -v=8 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500 GET is mapped to different verbs (get/list)
  • 39. @lbernail @roboll_ Watches $ kubectl get pods -v=8 -w GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?limit=500 GET https://kubernetes.fury.us1.staging.dog/api/v1/namespaces/datadog/pods?resourceVersion=288656279&watch=true Call 1 : list stage: ResponseComplete Call 2 : watch get + watch parameters 136ms later stage: ResponseStarted
  • 40. @lbernail @roboll_ Create call $ kubectl create deployment test --image=busybox -v=8 POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments
  • 41. @lbernail @roboll_ Create call $ kubectl create deployment test --image=busybox -v=8 POST https://kubernetes.fury.us1.staging.dog/apis/apps/v1/namespaces/datadog/deployments
  • 42. @lbernail @roboll_ Takeaways Audit logs contain information on all API calls ● What happened? ● Who initiated it? ● Why was it authorized? ● When did it happen? ● From where? ● Depending on verbosity, Request/Response OK, how do I get them?
  • 43. @lbernail @roboll_ Outline 1. Background: The Kubernetes API 2. Audit Logs 3. Configuring Audit Logs 4. 10000 foot view for a large cluster 5. Understanding Kubernetes Internals 6. Troubleshooting examples
  • 45. @lbernail @roboll_ Apiserver configuration kube-apiserver [...] --audit-log-path=/var/log/kubernetes/apiserver/audit.log --audit-policy-file=/etc/kubernetes/audit-policies/policy.yaml [...] Where to store them What to collect Minimum configuration Advanced ● Rotation parameters (max size, backup options) ● Alternative backend (webhook) ● Batching mode
  • 46. @lbernail @roboll_ apiVersion: audit.k8s.io/v1 kind: Policy rules: # Log pod changes at RequestResponse level - level: RequestResponse omitStages: - "RequestReceived" resources: - group: "" # core API group resources: ["pods"] verbs: ["create", "patch"”, "update", "delete"] # Log "pods/log", "pods/status" at Metadata level - level: Metadata omitStages: - "RequestReceived" resources: - group: "" resources: ["pods/log", "pods/status"] Set of rules Audit policy: what to log?
  • 47. @lbernail @roboll_ apiVersion: audit.k8s.io/v1 kind: Policy rules: # Log pod changes at RequestResponse level - level: RequestResponse omitStages: - "RequestReceived" resources: - group: "" # core API group resources: ["pods"] verbs: ["create", "patch"”, "update", "delete"] # Log "pods/log", "pods/status" at Metadata level - level: Metadata omitStages: - "RequestReceived" resources: - group: "" resources: ["pods/log", "pods/status"] Rules match api call ● api group / version ● resource ● verbs > Similar to RBAC Audit policy: what to log?
  • 48. @lbernail @roboll_ apiVersion: audit.k8s.io/v1 kind: Policy rules: # Log pod changes at RequestResponse level - level: RequestResponse omitStages: - "RequestReceived" resources: - group: "" # core API group resources: ["pods"] verbs: ["create", "patch"”, "update", "delete"] # Log "pods/log", "pods/status" at Metadata level - level: Metadata omitStages: - "RequestReceived" resources: - group: "" resources: ["pods/log", "pods/status"] For matching API calls ● Which verbosity? ● When? (stage) Audit policy: when to log?
  • 49. @lbernail @roboll_ apiVersion: audit.k8s.io/v1 kind: Policy rules: # Log pod changes at RequestResponse level - level: RequestResponse omitStages: - "RequestReceived" resources: - group: "" # core API group resources: ["pods"] verbs: ["create", "patch"”, "update", "delete"] # Log "pods/log", "pods/status" at Metadata level - level: Metadata omitStages: - "RequestReceived" resources: - group: "" resources: ["pods/log", "pods/status"] Gotchas Rules are evaluated in order First matching rule sets level Request/RequestResponse > contain payload data Careful with security implications ex: tokenreviews calls group: “” means core API only Don’t forget to add ● 3rd party apiservices ● your apiservices
  • 50. @lbernail @roboll_ Recommendations ● Ignore RequestReceived stage ● Use at least Metadata level for almost everything ○ Possibly ignore healthz, metrics ● Use Request/Response level for critical resource/verbs ○ Very valuable for retroactive debugging ○ Careful for large/sensitive request/response bodies ● Very complete example in GKE https://github.com/kubernetes/kubernetes/blob/master/cluster/gce/gci/configure-helper.sh ● Documentation https://kubernetes.io/docs/tasks/debug-application-cluster/audit/#audit-policy
  • 51. @lbernail @roboll_ Takeaways ● Getting audit logs is “simple”: 2 flags ● Getting policies right is harder ● You will get a lot of logs ● Requires a solution to analyze them Let’s look at an overview on a real large cluster
  • 52. @lbernail @roboll_ Outline 1. Background: The Kubernetes API 2. Audit Logs 3. Configuring Audit Logs 4. 10000 foot view for a large cluster 5. Understanding Kubernetes Internals 6. Troubleshooting examples
  • 53. 10000 foot view for a large cluster
  • 54. @lbernail @roboll_ Total number of API calls ~900 calls/second on this 2500 nodes cluster Number of audit logs per hour
  • 55. @lbernail @roboll_ Top API users? apiserver: 40 rps kube-proxy: 20 rps local-volume-provisioner:20 rps cronjob-controller: 15 rps spinnaker: 5-10 rps
  • 56. @lbernail @roboll_ Top list, missing “small” users total: ~900rps Top 25: 150 rps 900 calls/second on this 2500 nodes cluster What is doing ~80% of API calls? Total number of API calls
  • 57. @lbernail @roboll_ Grouping by users is not helping Calls from users in group “system:nodes”: 750 rps (~80% of API calls) In this 2500-nodes cluster, this means 0.3 rps per node! Calls by user group group=system:nodes
  • 58. @lbernail @roboll_ Why is “system:nodes” so high? nodes: 500 rps configmaps: 75 rps secrets: 60 rps Calls from group “system:nodes” by resource targeted
  • 59. @lbernail @roboll_ Verbs on “node” for a kubelet 10s Each node update its status every 10s
  • 60. @lbernail @roboll_ Verbs on “configmaps” Only GETs Regularity but not clear pattern
  • 61. @lbernail @roboll_ Group by resource name Each configmap is refreshed every ~5mn => GET call Similar for secrets ~5mn
  • 62. @lbernail @roboll_ List latency by resource Latency for list by resource
  • 63. @lbernail @roboll_ List latency by resource Latency for list by resource apiserver restarts
  • 64. @lbernail @roboll_ Compare cluster performances Latency for get pod by cluster (ms) large clusters (1500+ nodes) smaller clusters (<700 nodes)
  • 65. @lbernail @roboll_ Takeaways ● Biggest users are the ones running on each node ○ kubelet, daemonsets (kube-proxy) ○ A lot of effort upstream to reduce their load ○ Be extra careful of damonsets doing API calls ● Audit logs structure allow to filter and slice & dice ● Audit logs are verbose (1000 logs/s in our example)
  • 66. @lbernail @roboll_ Outline 1. Background: The Kubernetes API 2. Audit Logs 3. Configuring Audit Logs 4. 10000 foot view for a large cluster 5. Understanding Kubernetes Internals 6. Troubleshooting examples
  • 68. @lbernail @roboll_ Creating a simple deployment apiVersion: apps/v1 kind: Deployment metadata: name: nginx labels: app: nginx spec: replicas: 2 selector: matchLabels: app: nginx template: metadata: labels: app: nginx spec: containers: - name: nginx image: nginx
  • 69. @lbernail @roboll_ Sequence User creates deploy deployment controller creates RS RS controller creates pods Scheduler bind pods Kubelets Update pod status
  • 70. @lbernail @roboll_ Sequence Scheduler bind pods Scheduler callCreate Binding For nginx pod To node ip-10-x-y-123
  • 71. @lbernail @roboll_ Actually a lot more node 1 node 2 RS controller Deploy ctrl Scheduler apiserver user Create of Deployment, RS, pods, binding + nodes accept Nodes create pods and generate events Pods are running Deployments/RS update status Additions Apiserver verifies Quotas Components also get/list Creation of events + Update of status fields Complete node workflow ~ 3s
  • 72. @lbernail @roboll_ Node API calls Initial calls Get pod Update pod/status: ContainerCreating Get service account token
  • 73. @lbernail @roboll_ Node API calls Create events to show progression MountVolume.SetUp succeeded for volume "default-token-dqmzw" pulling image "nginx" Successfully pulled image "nginx" Created container Started container
  • 74. @lbernail @roboll_ Node API calls Finalize pod creation Get pod Update container status to “Running”
  • 75. @lbernail @roboll_ Takeaways ● A lot of interactions between kube components ● Audit logs give a great understanding of this! ● Events are spiky and get generate a lot of logs ● Events have a 1h default TTL, but stay in audit logs Let’s identify some problems using audit logs
  • 76. @lbernail @roboll_ Outline 1. Background: The Kubernetes API 2. Audit Logs 3. Configuring Audit Logs 4. 10000 foot view for a large cluster 5. Understanding Kubernetes Internals 6. Troubleshooting examples
  • 78. @lbernail @roboll_ Troubleshooting ● Understand what happened ○ “Why was a resource deleted?” ● Debug performance regressions/improve performances ○ “Which application is responsible for so many calls?” ● Also, identify issues by looking at HTTP status codes
  • 79. @lbernail @roboll_ Status codes Calls by status code 200 201 401
  • 80. @lbernail @roboll_ 4xx only 4xx by status code 401 403 404 422 409
  • 81. @lbernail @roboll_ 4xx only 4xx by status code 401?? (Unauthorized)
  • 82. @lbernail @roboll_ Analyzing 401s 401s by source IP 4 nodes only, turns out they had expired certificates
  • 83. @lbernail @roboll_ 4xx only 4xx by status code 403?? (Forbidden)
  • 84. @lbernail @roboll_ Analyzing 403s 403s by user Most errors from Traefik serviceAccount Bad RBAC configuration?
  • 85. @lbernail @roboll_ Analyzing 403s for this user 403s for Traefik serviceAccount by resource We use Traefik without Kubernetes secrets but it still tries to list them
  • 86. @lbernail @roboll_ 4xx only 4xx by status code 422?? (Invalid)
  • 87. @lbernail @roboll_ What about 422? 422 by user kube-system:generic-garbage-collector => ??
  • 88. @lbernail @roboll_ What is failing? generic-garbage-collector by verb / resource patch on pod
  • 89. @lbernail @roboll_ What is happening? ● Pods are in “Evicted” status and must be kept ● Controlling RS has been deleted and pods should be orphaned ● Garbage collector fails to orphan them (remove ownerRef) ● ReplicaSet has been Terminating for 2 months... ● Root cause: mutating webhook ○ We modify the pod spec at creation ○ Mutating webhook is registered on pods for CREATE/UPDATE ○ We modify immutable fields ○ Garbage collector patch triggers this...
  • 90. @lbernail @roboll_ Takeaways ● Looking at calls triggering HTTP errors help find issues ○ Misconfigured RBAC (403) ○ Applications doing calls they shouldn’t (403) ○ Expired certificates (401) ○ Other weird things
  • 92. @lbernail @roboll_ Conclusion ● Audit logs can be incredibly valuable ○ Low-level understanding of Kubernetes ○ Detection of misconfigurations ○ Troubleshooting of issues ○ Identify performance issues ● Taking advantage of them require some effort ○ Policies are not easy to get right ○ They are verbose and require a tool to analyze them
  • 93. Thank you We’re hiring! Visit our Kubecon booth or https://www.datadoghq.com/careers/ Or contact us directly: @lbernail laurent@datadoghq.com @roboll_ roboll@datadoghq.com