Using eBPF in Kubernetes

Introduction

Kubernetes provides a high-level API and a set of components that hides almost all of the intricate and—to some of us—interesting details of what happens at the systems level. Application developers are not required to have knowledge of the machines' IP tables, cgroups, namespaces, seccomp, or, nowadays, even the

This article focuses on a core Linux functionality increasingly used in networking, security and auditing, and tracing and monitoring tools. This functionality is called

Note: In this article we use both acronyms: eBPF and BPF. The former is used for the extended BPF functionality, and the latter for "classic" BPF functionality.

What is BPF?

BPF is a mini-VM residing in the Linux kernel that runs BPF programs. Before running, BPF programs are loaded with the

BPF Superpowers

BPF programs are event-driven by definition, an incredibly powerful concept, and executes code in the kernel when an event occurs. Netflix's Brendan Gregg refers to BPF as a .

The 'e' in eBPF

Traditionally, BPF could only be attached to sockets for socket filtering. BPF’s first use case was in tcpdump. When you run tcpdump the filter is compiled into a BPF program and attached to a raw AF_PACKET socket in order to print out filtered packets.

But over the years, eBPF added the ability to attach to

Existing Use Cases of eBPF with Kubernetes

Several open-source Kubernetes tools already use eBPF and many use cases warrant a closer look, especially in areas such as networking, monitoring and security tools.

Dynamic Network Control and Visibility with Cilium

The Cilium Agent runs on each host. Instead of managing IP tables, it translates network policy definitions to BPF programs that are loaded into the kernel and attached to a container's virtual ethernet device. These programs are executed—rules applied—on each packet that is sent or received.

This diagram shows how the Cilium project works:

Depending on what network rules are applied, BPF programs may be attached with

If you'd like to learn more about how Cilium uses eBPF, take a look at the project's .

Tracking TCP Connections in Weave Scope

Weave Scope employs an agent that runs on each node of a cluster. The agent monitors the system, generates a report and sends it to the app server. The app server compiles the reports it receives and presents the results in the Weave Scope UI.

To accurately draw connections between containers, the agent attaches a BPF program to kprobes that track socket events: opening and closing connections. The BPF program,

(As a side note, Weave Scope also has a plugin that make use of eBPF: .)

To learn more about how this works and why it's done this way, read a recent talk about the topic.

Limiting syscalls with seccomp-bpf

Linux has more than 300 system calls (read, write, open, close, etc.) available for use—or misuse. Most applications only need a small subset of syscalls to function properly.

The original implementation of seccomp was highly restrictive. Once applied, if an application attempted to do anything beyond reading and writing to files it had already opened, seccomp sent a SIGKILL signal.

seccomp-bpf is widely used in Kubernetes tools and exposed in Kubernetes itself. For example, seccomp-bpf is used in Docker to apply custom .

But in all of these cases the use of BPF is hidden behind

Potential Use Cases for eBPF with Kubernetes

eBPF is a relatively new Linux technology. As such, there are many uses that remain unexplored. eBPF itself is also evolving: new features are being added in eBPF that will enable new use cases that aren’t currently possible. In the following sections, we're going to look at some of these that have only recently become possible and ones on the horizon. Our hope is that these features will be leveraged by open source tooling.

Pod and container level network statistics

BPF socket filtering is nothing new, but BPF socket filtering per cgroup is. Introduced in Linux 4.10,

A

Generally, such statistics are collected by periodically checking the relevant file in Linux's /sys directory or using Netlink. By using BPF programs attached to cgroups for this, we can get much more detailed statistics: for example, how many packets/bytes on tcp port 443, or how many packets/bytes from IP 10.2.3.4. In general, because BPF programs have a kernel context, they can safely and efficiently deliver more detailed information to user space.

To explore the idea, the Kinvolk team implemented a proof-of-concept: Prometheus.

There are of course other interesting possibilities, like doing actual packet filtering. But the obstacle currently standing in the way of this is having cgroup v2 support—required by cgroup-bpf—in

Application-applied LSM

In some ways, it is similar to seccomp-bpf: using a BPF program, seccomp-bpf allows unprivileged processes to restrict what system calls they can perform. Landlock will be more powerful and provide more flexibility. Consider the following system call:

C  
fd = open(“myfile.txt”, O\_RDWR);

The first argument is a “char *”, a pointer to a memory address, such as 0xab004718.

With seccomp, a BPF program only has access to the parameters of the syscall but cannot dereference the pointers, making it impossible to make security decisions based on a file. seccomp also uses classic BPF, meaning it cannot make use of eBPF maps, the mechanism for interfacing with user space. This restriction means security policies cannot be changed in seccomp-bpf based on a configuration in an eBPF map.

BPF programs with Landlock don’t receive the arguments of the syscalls but a reference to a kernel object. In the example above, this means it will have a reference to the file, so it does not need to dereference a pointer, consider relative paths, or perform chroots.

Use Case: Landlock in Kubernetes-based serverless frameworks

In Kubernetes, the unit of deployment is a pod. Pods and containers are the main unit of isolation. In serverless frameworks, however, the main unit of deployment is a function. Ideally, the unit of deployment equals the unit of isolation. This puts serverless frameworks like

To achieve the best possible isolation, each function call would have to happen in its own container—ut what's good for isolation is not always good for performance. Inversely, if we run function calls within the same container, we increase the likelihood of collisions.

By using Landlock, we could isolate function calls from each other within the same container, making a temporary file created by one function call inaccessible to the next function call, for example. Integration between Landlock and technologies like Kubernetes-based serverless frameworks would be a ripe area for further exploration.

Auditing kubectl-exec with eBPF

In Kubernetes 1.7 the audit proposal started making its way in. It's currently pre-stable with plans to be stable in the 1.10 release. As the name implies, it allows administrators to log and audit events that take place in a Kubernetes cluster.

While these events log Kubernetes events, they don't currently provide the level of visibility that some may require. For example, while we can see that someone has used kubectl exec to enter a container, we are not able to see what commands were executed in that session. With eBPF one can attach a BPF program that would record any commands executed in the kubectl exec session and pass those commands to a user-space program that logs those events. We could then play that session back and know the exact sequence of events that took place.

Learn more about eBPF

If you're interested in learning more about eBPF, here are some resources:

Conclusion

We are just starting to see the Linux superpowers of eBPF being put to use in Kubernetes tools and technologies. We will undoubtedly see increased use of eBPF. What we have highlighted here is just a taste of what you might expect in the future. What will be really exciting is seeing how these technologies will be used in ways that we have not yet thought about. Stay tuned!

The Kinvolk team will be hanging out at the Kinvolk booth at KubeCon in Austin. Come by to talk to us about all things, Kubernetes, Linux, container runtimes and yeah, eBPF.