Evolution of kube-proxy
Laurent Bernaille
@lbernail
Datadog
Monitoring service
Over 350 integrations
Over 1,200 employees
Over 8,000 customers
Runs on millions of hosts
Trillions of data points per day
10,000s of hosts in our infra
10s of Kubernetes clusters
Clusters from 50 to 3000 nodes
Multi-cloud
Very fast growth
What is kube-proxy?
● Network proxy running on each Kubernetes node
● Implements the Kubernetes service abstraction
○ Map service Cluster IPs (virtual IPs) to pods
○ Enable ingress traffic
Kube-proxy
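A quick way to check which mode kube-proxy is running in on a given node (a sketch; it assumes kube-proxy exposes its default metrics port 10249, which depends on how it is deployed):

# Ask the local kube-proxy which proxier it uses (userspace, iptables or ipvs)
curl -s http://localhost:10249/proxyMode
# Inspect the flags kube-proxy was started with
ps aux | grep [k]ube-proxy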
Deployment and service
apiVersion: apps/v1
kind: Deployment
metadata:
name: echodeploy
labels:
app: echo
spec:
replicas: 3
selector:
matchLabels:
app: echo
template:
metadata:
labels:
app: echo
spec:
containers:
- name: echopod
image: lbernail/echo:0.5
apiVersion: v1
kind: Service
metadata:
name: echo
labels:
app: echo
spec:
type: ClusterIP
selector:
app: echo
ports:
- name: http
protocol: TCP
port: 80
targetPort: 5000
Created resources
Deployment ReplicaSet Pod 1
label: app=echo
Pod 2
label: app=echo
Pod 3
label: app=echo
Service
Selector: app=echo
kubectl get all
NAME AGE
deploy/echodeploy 16s
NAME AGE
rs/echodeploy-75dddcf5f6 16s
NAME READY
po/echodeploy-75dddcf5f6-jtjts 1/1
po/echodeploy-75dddcf5f6-r7nmk 1/1
po/echodeploy-75dddcf5f6-zvqhv 1/1
NAME TYPE CLUSTER-IP
svc/echo ClusterIP 10.200.246.139
The endpoint object
Deployment ReplicaSet Pod 1
label: app=echo
Pod 2
label: app=echo
Pod 3
label: app=echo
kubectl describe endpoints echo
Name: echo
Namespace: datadog
Labels: app=echo
Annotations: <none>
Subsets:
Addresses: 10.150.4.10,10.150.6.16,10.150.7.10
NotReadyAddresses: <none>
Ports:
Name Port Protocol
---- ---- --------
http 5000 TCP
Endpoints
Addresses:
10.150.4.10
10.150.6.16
10.150.7.10
Service
Selector: app=echo
Readiness
readinessProbe:
httpGet:
path: /ready
port: 5000
periodSeconds: 2
successThreshold: 2
failureThreshold: 2
● A pod can be started but not ready to serve requests
○ Initialization
○ Connection to backends
● Kubernetes provides an abstraction for this: Readiness Probes
Accessing the service
kubectl run -it test --image appropriate/curl ash
# while true ; do curl 10.200.246.139 ; sleep 1 ; done
Container: 10.150.7.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.6.16 | Source: 10.150.6.17 | Version: v2
Container: 10.150.4.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.7.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.6.16 | Source: 10.150.6.17 | Version: v2
Container: 10.150.4.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.7.10 | Source: 10.150.6.17 | Version: v2
Container: 10.150.6.16 | Source: 10.150.6.17 | Version: v2
Container: 10.150.4.10 | Source: 10.150.6.17 | Version: v2
Endpoints only consist of pods that are Ready
How does it work?
Pod statuses
API Server
Node
kubelet pod
HC
Status updates
Node
kubelet pod
HC
ETCD
pods
Endpoints
API Server
Node
kubelet pod
HC
Status updates
Controller Manager
Watch
- pods
- services
endpoint
controller
Node
kubelet pod
HC
Sync endpoints:
- list pods matching selector
- add IP to endpoints
ETCD
pods
services
endpoints
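The endpoint controller can be observed directly with kubectl (commands for the echo service defined earlier; output will vary as pods start and change readiness):

# Watch the endpoints object being updated as pods come and go
kubectl get endpoints echo --watch
# Compare with the pods matching the service selector
kubectl get pods -l app=echo -o wide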
Accessing services
API Server
Node A
Controller Manager
Watch
- pods
- services
endpoint
controller
Sync endpoints:
- list pods matching selector
- add IP to endpoints
ETCD
pods
services
endpoints
kube-proxy proxier
Watch
- services
- endpoints
Accessing services
API Server
Node A
Controller Manager
endpoint
controller
ETCD
pods
services
endpoints
kube-proxy proxier
client Node Bpod 1
Node Cpod 2
Implementations
userspace
proxy-mode=userspace
API Server
Node A
kube-proxy
iptables client
Node B
Node C
pod 1
pod 2
kube-proxy binds 1 port per service
Client traffic redirected to kube-proxy
kube-proxy connects to pods
proxy-mode=userspace
PREROUTING
any / any =>
KUBE-PORTALS-CONTAINER
All traffic is processed by kube chains
proxy-mode=userspace
PREROUTING
any / any =>
KUBE-PORTALS-CONTAINER
All traffic is processed by kube chains
KUBE-PORTALS-CONTAINER
any / VIP:PORT => NodeIP:PortX
Portal chain
One rule per service
One port allocated for each service (and bound by kube-proxy)
● Performance (userland proxying)
● Source IP cannot be kept
● Not recommended anymore
● Since Kubernetes 1.2, default is iptables
Limitations
iptables
proxy-mode=iptables
API Server
Node A
kube-proxy iptables
client
Node B
Node C
pod 1
pod 2
Outgoing traffic
1. Client to Service IP
2. DNAT: Client to Pod1 IP
Reverse path
1. Pod1 IP to Client
2. Reverse NAT: Service IP to client
proxy-mode=iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
All traffic is processed by kube chains
proxy-mode=iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
Global Service chain
Identify service and jump to appropriate service chain
proxy-mode=iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
Service chain (one per service)
Use statistic iptables module (probability of rule being applied)
Rules are evaluated sequentially (hence the 33%, 50%, 100%)
proxy-mode=iptables
PREROUTING / OUTPUT
any / any => KUBE-SERVICES
KUBE-SERVICES
any / VIP:PORT => KUBE-SVC-XXX
KUBE-SVC-XXX
any / any proba 33% => KUBE-SEP-AAA
any / any proba 50% => KUBE-SEP-BBB
any / any => KUBE-SEP-CCC
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
Endpoint Chain
Mark hairpin traffic (client = target) for SNAT
DNAT to the endpoint
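These chains can be inspected directly on a node; a minimal sketch (the KUBE-SVC/KUBE-SEP suffixes are hashes generated by kube-proxy, so the names below are placeholders):

# Dump every NAT rule managed by kube-proxy
iptables-save -t nat | grep KUBE
# Walk the chain hierarchy for one service
iptables -t nat -nL KUBE-SERVICES
iptables -t nat -nL KUBE-SVC-XXX   # replace XXX with the hash shown above
iptables -t nat -nL KUBE-SEP-AAA   # one chain per endpoint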
Hairpin traffic?
API Server
Node A
kube-proxy iptables
pod 1
Node B
Node C
pod 2
pod 3
Client can also be a destination
After DNAT:
Src IP= Pod1, Dst IP= Pod1
No reverse NAT possible
=> SNAT on host for this traffic
1. Pod1 IP => SVC IP
2. SNAT: HostIP => SVC IP
3. DNAT: HostIP => Pod1 IP
Reverse path
1. Pod1 IP => Host IP
2. Reverse NAT: SVC IP => Pod1IP
Persistency
spec:
type: ClusterIP
sessionAffinity: ClientIP
sessionAffinityConfig:
clientIP:
timeoutSeconds: 600
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
recent : set rsource KUBE-SEP-AAA
Use “recent” module
Add Source IP to set named KUBE-SEP-AAA
Persistency
KUBE-SEP-AAA
endpoint IP / any => KUBE-MARK-MASQ
any / any => DNAT endpoint IP:Port
recent : set rsource KUBE-SEP-AAA
Use “recent” module
Add Source IP to set named KUBE-SEP-AAA
KUBE-SVC-XXX
any / any recent: rcheck KUBE-SEP-AAA => KUBE-SEP-AAA
any / any recent: rcheck KUBE-SEP-BBB => KUBE-SEP-BBB
any / any recent: rcheck KUBE-SEP-CCC => KUBE-SEP-CCC
Load-balancing rules
Use recent module
If Source IP is in set named KUBE-SEP-AAA,
jump to KUBE-SEP-AAA
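On a node, the generated rules look roughly like the following (a hedged reconstruction in iptables-save syntax; the IP, port, chain names and 600s timeout are illustrative):

# Endpoint chain: remember the client source IP in the "recent" set, then DNAT
-A KUBE-SEP-AAA -s 10.150.4.10/32 -j KUBE-MARK-MASQ
-A KUBE-SEP-AAA -p tcp -m recent --set --name KUBE-SEP-AAA --rsource -m tcp -j DNAT --to-destination 10.150.4.10:5000
# Service chain: if the source IP is already in a set, jump straight to that endpoint
-A KUBE-SVC-XXX -m recent --rcheck --seconds 600 --reap --name KUBE-SEP-AAA --rsource -j KUBE-SEP-AAA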
● iptables was not designed for load-balancing
● hard to debug (very quickly, 10000s of rules)
● performance impact
○ control plane: syncing rules
○ data plane: going through rules
Limitations
“Scale Kubernetes to Support 50,000 Services” Haibin Xie & Quinton Hoole
Kubecon Berlin 2017
ipvs
● L4 load-balancer built into the Linux Kernel
● Many load-balancing algorithms
● Native persistency
● Fast atomic update (add/remove endpoint)
● Still relies on iptables for some use cases (SNAT)
● GA since 1.11
proxy-mode=ipvs
proxy-mode=ipvs
Pod X
Pod Y
Pod Z
Service ClusterIP:Port S
Backed by pod:port X, Y, Z
Virtual Server S
Realservers
- X
- Y
- Z
kube-proxy apiserver
Watches endpoints
Updates IPVS
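The resulting IPVS configuration can be checked on the node with ipvsadm (a sketch; addresses and ports will be the cluster IP and pod IPs of your service):

# List virtual servers and their realservers, numeric output
ipvsadm -Ln
# List the IPVS connection table (the "IPVS conntrack" shown in the diagrams)
ipvsadm -Lnc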
New connection
Pod X
Pod Y
Pod Z
Virtual Server S
Realservers
- X
- Y
- Z
kube-proxy apiserver
IPVS conntrack
App A => S >> A => X
App establishes connection to S
IPVS associates Realserver X
Pod X deleted
Pod X
Pod Y
Pod Z
Virtual Server S
Realservers
- Y
- Z
kube-proxy apiserver
IPVS conntrack
App A => S >> A => X
Apiserver removes X from S endpoints
Kube-proxy removes X from realservers
Kernel drops traffic (no realserver)
Clean-up
Pod Y
Pod Z
Virtual Server S
Realservers
- Y
- Z
kube-proxy apiserver
IPVS conntrack
App A => S >> A => X
net/ipv4/vs/expire_nodest_conn
Delete conntrack entry on next packet
Forcefully terminate (RST)
RST
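The clean-up behavior above is controlled by the expire_nodest_conn IPVS setting referenced on the slide; a sketch of checking and enabling it on a node:

# When set to 1, IPVS deletes the conntrack entry on the next packet
# once the realserver is gone, and the client receives a RST
sysctl net.ipv4.vs.expire_nodest_conn
sysctl -w net.ipv4.vs.expire_nodest_conn=1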
Limit?
● No graceful termination
● As soon as a pod is Terminating, its connections are destroyed
● Addressing this took time
Graceful termination (1.12)
Pod X
Pod Y
Pod Z
Virtual Server S
Realservers
- X Weight:0
- Y
- Z
kube-proxy apiserver
IPVS conntrack
App A => S >> A => X
Apiserver removes X from S endpoints
Kube-proxy sets Weight to 0
No new connection
Garbage collection
Pod X
Pod Y
Pod Z
Virtual Server S
Realservers
- Y
- Z
kube-proxy apiserver
IPVS conntrack
App A => S >> A => X
Pod exits and sends FIN
Kube-proxy removes the realserver once it
has no active connections
FIN
What if X doesn’t send FIN?
Pod Y
Pod Z
Virtual Server S
Realservers
- X Weight:0
- Y
- Z
kube-proxy
IPVS conntrack
App A => S >> A => X
Conntrack entries expire after 900s
If A sends traffic, it never expires
Traffic blackholed until the app detects the issue
~15 min
> Mitigation: lower tcp_retries2
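The tcp_retries2 mitigation is a kernel sysctl; a hedged example of lowering it so that blackholed connections fail in well under 15 minutes (the value 8 is illustrative, not a recommendation from the talk):

# Default is 15, which translates to roughly 15 minutes of retransmissions
sysctl net.ipv4.tcp_retries2
sysctl -w net.ipv4.tcp_retries2=8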
● Default timeouts
○ tcp / tcpfin : 900s / 120s
○ udp: 300s
● Under high load, ports can be reused quickly, keeping the entry active
○ Especially with conn_reuse_mode=0
● Very bad for DNS => No graceful termination for UDP
IPVS connection tracking
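Both the timeouts and the port-reuse behavior can be inspected on a node (a sketch; values reported by your kernel may differ):

# Show the IPVS tcp / tcpfin / udp timeouts (900s / 120s / 300s by default)
ipvsadm -L --timeout
# Port-reuse behavior mentioned above
sysctl net.ipv4.vs.conn_reuse_mode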
● Very recent feature: allows configurable timeouts
● Ideally we’d want:
○ Terminating pod: set weight to 0
○ Deleted pod: remove backend
> This would require an API change for Endpoints
IPVS connection tracking
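At the time of the talk this was surfacing as kube-proxy flags; to the best of my knowledge they look like the following, but the exact names should be checked against your kube-proxy version:

kube-proxy --proxy-mode=ipvs \
  --ipvs-tcp-timeout=900s \
  --ipvs-tcpfin-timeout=120s \
  --ipvs-udp-timeout=300s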
● Works pretty well at large scale
● Not 100% feature parity (a few edge cases)
● Tuning the IPVS parameters took time
● Improvements under discussion
IPVS status
Common challenges
Control plane scalability
● On each update to a service (new pod, readiness change)
○ The full endpoint object is recomputed
○ The endpoint is sent to all kube-proxy instances
● Example of a single endpoint change
○ Service with 2000 endpoints, ~100B / endpoint, 5000 nodes
○ Each node gets 2000 x 100B = 200kB
○ Apiservers send 5000 x 200kB = 1GB
● Rolling update?
○ Total traffic => 2000 x 1GB = 2TB
=> Endpoint Slices
● Service potentially mapped to multiple EndpointSlices
● Maximum 100 endpoints in each slice
● Much more efficient for services with many endpoints
● Beta in 1.17
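The slices backing a service can be listed by the service-name label; a sketch for the echo service used earlier (the slice name itself is generated):

# EndpointSlices carry a label pointing back to their service
kubectl get endpointslices -l kubernetes.io/service-name=echo
kubectl describe endpointslice <generated-name>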
Conntrack size
● kube-proxy relies on the conntrack
○ Not great for services receiving a lot of connections
○ Pretty bad for the cluster DNS
● DNS strategy
○ Use node-local cache and force TCP for upstream
○ Much more manageable
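Conntrack pressure is easy to check on a node; a sketch of the relevant counters (kube-proxy also exposes flags such as --conntrack-min and --conntrack-max-per-core to size the table, as far as I recall):

# Current number of tracked connections vs the configured maximum
sysctl net.netfilter.nf_conntrack_count
sysctl net.netfilter.nf_conntrack_max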
Conntrack and UDP
● Very good surprise with Kernel 5.0+
Recent features
New features
● Dual-stack support
○ Alpha in 1.16
● Topology aware routing
○ Use topology keys to route to services
○ Alpha in 1.17
Conclusion
Conclusion
● kube-proxy works at significant scale
● A lot of ongoing efforts to improve kube-proxy
● iptables and IPVS are not perfect matches for the service abstraction
○ Not designed for client-side load-balancing
● Very promising eBPF alternative: Cilium
○ No need for iptables / IPVS
○ Works without conntrack
○ See Daniel Borkmann’s talk
Thank you