YouTube 16 July 2021
hosted by Duffie Cooley
- Detecting Kernel Hooking using eBPF
- Facebook Enforcing encryption at scale
- Halving the TezEdge node’s memory usage with an eBPF-based memory profiler More context: search for tezedge in https://ebpf.io/blog/ebpf-updates-2020-12
- eBPF Summit on Twitter and LinkedIn
netkat with Antonio Ojea
netkat is a netcat clone that uses raw sockets to avoid iptables and/or other OS filtering mechanisms.
Kubernetes environments have a LOT of iptables, hence the necessity to bypass iptables sometimes, for testing, debugging, troubleshooting, ...
![](https://i.imgur.com/ZwEMrbQ.jpg =400x)
sudo ./bin/netkat
Usage: nk [options] [hostname] [port]
-debug
Debug
-interface string
Specify interface to use. Default interface with the route to the specified [hostname]
-listen
Bind and listen for incoming connections
-source-port int
Specify source port to use
-udp
Use UDP instead of default TCP
netkat works as the original netcat, but using raw sockets, so it is not affected by the netfilter hooks.
A RAW socket receives a copy of all packets, however, we don't need all of them, just the ones used by the netkat connection.
On the socket we filter the packets that we don't WANT On ingress we filter the packets that we WANT, so the host doesn't close our connection
Netkat can obtain the tuple of the connection in advance, so it can filter using BFP and the socket SO_ATTACH_FILTER option, and the traffic control eBPF filtering capability.
The packets received on the socket, are raw packets (with ethernet headers), but since they are bypassing the kernel TCP/IP stack, something need to reassemble the packets and obtain a data stream. This is achieved using an userspace TCP/IP stack https://pkg.go.dev/gvisor.dev/gvisor/pkg/tcpip/stack
At high level, the code does:
- obtain the connection details: source and destination IP and port, and host interface
- parametrize the eBPF code with the connection details, bpf2go generate the eBPF code
- inject the BPF code to the raw socket and to the ingress interface
As result, we have a new version of netcat that bypass iptables (needs CAP_NET_RAW) It has a bonus, with the -d flag it has a sniffer XD
sudo ./netkat -d -l 127.0.0.1 80
2021/07/16 01:58:08 Creating raw socket
2021/07/16 01:58:08 Adding ebpf ingress filter on interface lo
2021/07/16 01:58:08 filter {LinkIndex: 1, Handle: 0:1, Parent: ffff:fff2, Priority: 0, Protocol: 3}
2021/07/16 01:58:08 Creating user TCP/IP stack
2021/07/16 01:58:08 Listening on 127.0.0.1:80
I0716 01:58:31.206864 22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ee flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715804214 TSEcr:0 SACKPermitted:true}
I0716 01:58:31.206937 22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ee flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715804214 TSEcr:0 SACKPermitted:true}
I0716 01:58:32.239938 22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ef flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715805247 TSEcr:0 SACKPermitted:true}
I0716 01:58:32.240044 22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5ef flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715805247 TSEcr:0 SACKPermitted:true}
I0716 01:58:34.287962 22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5f0 flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715807295 TSEcr:0 SACKPermitted:true}
I0716 01:58:34.288096 22926 sniffer.go:418] recv tcp 127.0.0.1:44644 -> 127.0.0.1:80 len:0 id:a5f0 flags: S seqnum: 1321663292 ack: 0 win: 65495 xsum:0xfe30 options: {MSS:65495 WS:7 TS:true TSVal:2715807295 TSEcr:0 SACKPermitted:true}
^C2021/07/16 01:58:35 Exiting: received signal
2021/07/16 01:58:35 Done
I really like that Antonio wrote tests for this let's check them out.
- Create a multinode cluster with KIND
cat <<EOF | kind create cluster --config=-
kind: Cluster
apiVersion: kind.x-k8s.io/v1alpha4
nodes:
- role: control-plane
- role: worker
- role: worker
EOF
- Run a fake loadbalancer controller
go get -v github.com/aojea/networking-controllers
cd $GOPATH/github.com/aojea/networking-controllers
cd cmd/loadbalancer
go build .
kind get kubeconfig > kind.conf
./loadbalancer --kubeconfig kind.conf --iprange "10.111.111.0/24"
I0716 02:58:43.535494 15454 controller.go:80] Starting controller fake-loadbalancer-controller
I0716 02:58:43.535567 15454 controller.go:84] Waiting for informer caches to sync
I0716 02:58:43.535589 15454 shared_informer.go:240] Waiting for caches to sync for fake-loadbalancer-controller
I0716 02:58:43.636251 15454 shared_informer.go:247] Caches are synced for fake-loadbalancer-controller
I0716 02:58:43.636267 15454 controller.go:90] Starting workers
I0716 02:58:43.636285 15454 controller.go:149] Processing sync for service
- Create a deployment with one pod and expose it with a Loadbalancer
kubectl apply -f https://gist.githubusercontent.com/aojea/369ad6a5d4cbb6b0fbcdd3dd909d9887/raw/0ecd0564c1db4c7103de07accce99da1c7bf91c3/loadbalancer.yaml
- Obtain the LoadBalancer assigned IP
kubectl get service lb-service -o jsonpath='{.status.loadBalancer.ingress[0].ip}'
10.111.111.129
The loadbalancer has externalTrafficPolicy: Local
so it should only be able to work
if the traffic is directed to the nodes that contains pods belonging to that Service.
kubectl get pods -l app=MyApp -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
test-deployment-6f45696bd5-xlp96 1/1 Running 0 10m 10.244.2.2 kind-worker2 <none> <none>
- Emulate the external loadbalancer installing a route in the host to the loadbalancer IP through the node with the pod
Get the nodes IPs
kubectl get nodes -o wide
NAME STATUS ROLES AGE VERSION INTERNAL-IP EXTERNAL-IP OS-IMAGE KERNEL-VERSION CONTAINER-RUNTIME
kind-control-plane Ready control-plane,master 34m v1.21.2 172.18.0.4 <none> Ubuntu 21.04 4.18.0-301.1.el8.x86_64 containerd://1.5.2
kind-worker Ready <none> 33m v1.21.2 172.18.0.3 <none> Ubuntu 21.04 4.18.0-301.1.el8.x86_64 containerd://1.5.2
kind-worker2 Ready <none> 33m v1.21.2 172.18.0.2 <none> Ubuntu 21.04 4.18.0-301.1.el8.x86_64 containerd://1.5.2
example: pod is in kind-worker2 with ip 172.18.0.2 and loadbalancer IP 10.111.111.129
sudo ip route add 10.111.111.129 via 172.18.0.2
- Check it preserves the source ip ...
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc 10.111.111.129 80
HTTP/1.1 200 OK
Date: Thu, 15 Jul 2021 23:30:35 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8
172.18.0.1:34028
- and that it doesn't work if we target a node without pods backing that service
sudo ip route del 10.111.111.129 via 172.18.0.2
sudo ip route add 10.111.111.129 via 172.18.0.3
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
- However, if we use
externalTrafficPolicy: Cluster
it works ... but the source IP is not preserved
kubectl patch service lb-service -p '{"spec":{"externalTrafficPolicy":"Cluster"}}'
service/lb-service patched
$ echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connected to 10.111.111.129:80.
HTTP/1.1 200 OK
Date: Thu, 15 Jul 2021 23:34:22 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8
172.18.0.3:14569
Ncat: 36 bytes sent, 133 bytes received in 0.01 seconds.
- Set
externalTrafficPolicy: Local
again
kubectl patch service lb-service -p '{"spec":{"externalTrafficPolicy":"Local"}}'
service/lb-service patched
$ echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Ncat: Version 7.70 ( https://nmap.org/ncat )
Ncat: Connection timed out.
- Test the service from within node without the pod
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | nc -v -w 3 10.111.111.129 80
Connection to 10.111.111.129 80 port [tcp/http] succeeded!
HTTP/1.1 200 OK
Date: Thu, 15 Jul 2021 23:36:50 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8
We think it shoudn't work because we are in a node without pods, however, since the traffic comes from a node within the cluster the traffic is considered internal and externalTrafficPolicy doesn't apply. The reason is that kube-proxy installs some iptables rules that capture the traffic from within the node.
This causes a problem to test LoadBalancers, because it requires traffic to be sent from an external node, something that is not easy: Kubernetes cluster use to be isolated, expose apiserver endpoint e2e should not assume direct connectivity from the e2e binary
- Run netkat in a pod (using hostnetwork) we are going to simulate an external connecting from inside a node
kubectl apply -f https://gist.githubusercontent.com/aojea/952c82a58da625fbd9b8aca35f0e63f1/raw/0c5c625e037eb6431947c93152a3182c47004aa3/netkat.yaml
pod/netkat created
- Check where is running
$ kubectl get pods -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
netkat 1/1 Running 0 9s 172.18.0.3 kind-worker <none> <none>
test-deployment-6f45696bd5-xlp96 1/1 Running 0 63m 10.244.2.2 kind-worker2 <none> <none>
- Login to the container
kubectl exec -it netkat ash
kubectl exec [POD] [COMMAND] is DEPRECATED and will be removed in a future version. Use kubectl kubectl exec [POD] -- [COMMAND] instead.
/ # ip route add 10.111.111.15 via 172.18.0.2
echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | ./netkat 10.111.111.15 80
2021/07/16 01:06:54 routes {Ifindex: 18 Dst: 10.111.111.15/32 Src: 172.18.0.3 Gw: 172.18.0.2 Flags: [] Table: 254}
2021/07/16 01:06:54 Creating raw socket
2021/07/16 01:06:54 Adding ebpf ingress filter on interface eth0
2021/07/16 01:06:54 filter {LinkIndex: 18, Handle: 0:1, Parent: ffff:fff2, Priority: 0, Protocol: 3}
2021/07/16 01:06:54 Creating user TCP/IP stack
2021/07/16 01:06:54 Dialing ...
2021/07/16 01:06:54 Connection established
2021/07/16 01:06:54 Connection error: <nil>
HTTP/1.1 200 OK
Date: Fri, 16 Jul 2021 01:06:54 GMT
Content-Length: 16
Content-Type: text/plain; charset=utf-8
172.18.0.3:26913
we can see it preserves the source ip
- Check that if we target a different node it doesn't work as expected because the service has externalTrafficPolicy: local # and it considers the connection as external
/ # ip route del 10.111.111.15 via 172.18.0.3
/ # ip route add 10.111.111.15 via 172.18.0.4
/ # echo -e "GET /clientip HTTP/1.1\nhost:myhost\n" | ./netkat 10.111.111.15 80
2021/07/16 01:11:38 routes {Ifindex: 18 Dst: 10.111.111.15/32 Src: 172.18.0.2 Gw: 172.18.0.4 Flags: [] Table: 254}
2021/07/16 01:11:38 Creating raw socket
2021/07/16 01:11:38 Adding ebpf ingress filter on interface eth0
2021/07/16 01:11:38 filter {LinkIndex: 18, Handle: 0:1, Parent: ffff:fff2, Priority: 0, Protocol: 3}
2021/07/16 01:11:38 Creating user TCP/IP stack
2021/07/16 01:11:38 Dialing ...
2021/07/16 01:11:43 Dialing error: context deadline exceeded
2021/07/16 01:11:43 Can't connect to server: context deadline exceeded
- https://en.wikipedia.org/wiki/Netcat
- https://github.com/aojea/netkat
- https://man7.org/linux/man-pages/man7/raw.7.html
- https://man7.org/linux/man-pages/man7/packet.7.html
- https://man7.org/linux/man-pages/man8/tc-bpf.8.html
- https://pkg.go.dev/github.com/cilium/ebpf/cmd/bpf2go
- https://docs.cilium.io/en/stable/bpf/
- https://developers.redhat.com/blog/2021/04/01/get-started-with-xdp
- https://blog.cloudflare.com/tag/ebpf/
- https://www.asykim.com/blog/deep-dive-into-kubernetes-external-traffic-policies