Quality-of-Service for Memory Resources
Authors: Tim Xu (Tencent Cloud)
Kubernetes v1.22, released in August 2021, introduced a new alpha feature that improves how Linux nodes implement memory resource requests and limits.
In prior releases, Kubernetes did not support memory quality guarantees. For example, if you set container resources as follows:
apiVersion: v1
kind: Pod
metadata:
  name: example
spec:
  containers:
  - name: nginx
    resources:
      requests:
        memory: "64Mi"
        cpu: "250m"
      limits:
        memory: "64Mi"
        cpu: "500m"
spec.containers[].resources.requests (e.g. cpu, memory) is designed for scheduling. When you create a Pod, the Kubernetes scheduler selects a node for the Pod to run on. Each node has a maximum capacity for each of the resource types: the amount of CPU and memory it can provide for Pods. The scheduler ensures that, for each resource type, the sum of the resource requests of the scheduled Containers is less than the capacity of the node.
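The fit check described above can be sketched in a few lines. This is a simplified illustration, not the scheduler's actual code: the function name `fits`, the dictionary layout, and the choice to treat a sum exactly equal to capacity as fitting are all assumptions made for the example.

```python
# Simplified sketch of a per-resource fit check (hypothetical helper, not scheduler code).
# Units are plain integers: CPU in millicores, memory in bytes.
MI = 1024 * 1024


def fits(node_capacity, scheduled_requests, new_request):
    """Return True if adding new_request keeps every resource within node capacity."""
    for resource, capacity in node_capacity.items():
        # Sum the requests of everything already scheduled on the node.
        used = sum(r.get(resource, 0) for r in scheduled_requests)
        # Assumption: a total exactly equal to capacity still counts as fitting.
        if used + new_request.get(resource, 0) > capacity:
            return False
    return True


node = {"cpu": 1000, "memory": 256 * MI}
running = [{"cpu": 250, "memory": 64 * MI}, {"cpu": 250, "memory": 64 * MI}]
candidate = {"cpu": 250, "memory": 64 * MI}  # same requests as the example Pod

print(fits(node, running, candidate))
```

Only requests enter this calculation; limits play no part in placement.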
spec.containers[].resources.limits is passed to the container runtime when the kubelet starts a container. CPU is considered a "compressible" resource. If your app starts hitting your CPU limits, Kubernetes starts throttling your container, giving your app potentially worse performance. However, it won’t be terminated. That is what "compressible" means.
In cgroup v1, and prior to this feature, the container runtime effectively ignored spec.containers[].resources.requests["memory"]. This is unlike CPU, for which the container runtime considers both requests and limits. Furthermore, memory can't be compressed in cgroup v1. Because there is no way to throttle memory usage, a container that goes past its memory limit is terminated by the kernel with an OOM (Out of Memory) kill.
Fortunately, cgroup v2 brings a new design and implementation that can fully protect memory. The new feature relies on cgroup v2, which most current Linux operating system releases already provide. With this experimental feature, quality-of-service for pods and containers extends to cover not just CPU time but memory as well.
How does it work?
Memory QoS uses the memory controller of cgroup v2 to guarantee memory resources in Kubernetes. The memory requests and limits of the containers in a pod are used to set the memory.min and memory.high interfaces provided by the memory controller. When memory.min is set to the memory request, that memory is reserved and never reclaimed by the kernel; this is how Memory QoS ensures the availability of memory for Kubernetes pods. If a container sets a memory limit, the system needs to limit that container's memory usage, so Memory QoS uses memory.high to throttle a workload approaching its memory limit, ensuring that the system is not overwhelmed by instantaneous memory allocation.
The following table details the specific functions of these two parameters and how they correspond to Kubernetes container resources.
File | Description |
---|---|
memory.min | memory.min specifies a minimum amount of memory the cgroup must always retain, i.e., memory that can never be reclaimed by the system. If the cgroup's memory usage reaches this low limit and can’t be increased, the system OOM killer will be invoked. We map it to the container's memory request. |
memory.high | memory.high is the memory usage throttle limit. This is the main mechanism to control a cgroup's memory use. If a cgroup's memory use goes over the high boundary specified here, the cgroup’s processes are throttled and put under heavy reclaim pressure. The default is max, meaning there is no limit. We use a formula to calculate memory.high, depending on the container's memory limit or node allocatable memory (if the container's memory limit is empty) and a throttling factor. Please refer to the KEP for more details on the formula. |
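To make the two interface files more concrete, here is a minimal sketch of reading them from a cgroup directory. The helper name `read_cgroup_memory_value` is hypothetical, and the demo writes to a temporary directory that merely imitates a cgroup v2 directory; on a real node the files live under the cgroup filesystem and their exact path depends on the cgroup driver and runtime.

```python
import tempfile
from pathlib import Path


def read_cgroup_memory_value(cgroup_dir, filename):
    """Read memory.min or memory.high; return None for 'max' (no limit), else bytes."""
    text = (Path(cgroup_dir) / filename).read_text().strip()
    return None if text == "max" else int(text)


# Demo against a throwaway directory standing in for a container cgroup.
with tempfile.TemporaryDirectory() as fake_cgroup:
    Path(fake_cgroup, "memory.min").write_text("67108864\n")  # 64Mi request
    Path(fake_cgroup, "memory.high").write_text("max\n")      # default: no throttle
    print(read_cgroup_memory_value(fake_cgroup, "memory.min"))
    print(read_cgroup_memory_value(fake_cgroup, "memory.high"))
```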
When a container specifies a memory request, the kubelet passes memory.min to the back-end CRI runtime (such as containerd or CRI-O) via the Unified field in CRI during container creation. The memory.min in the container-level cgroup is set to:
memory.min = pod.spec.containers[i].resources.requests[memory]
i: the ith container in one pod
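Since the memory.min interface requires that all ancestor cgroup directories are set as well, the pod-level and node-level cgroup directories need to be set correctly.
memory.min in the pod-level cgroup:
memory.min = Σ_i pod.spec.containers[i].resources.requests[memory]
i: the ith container in one pod
memory.min in the node-level cgroup:
memory.min = Σ_i Σ_j pod[i].spec.containers[j].resources.requests[memory]
i: the ith pod in one node, j: the jth container in one pod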
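The three levels can be illustrated with a short calculation. This is a sketch under stated assumptions rather than kubelet code: the dictionary layout and function names are hypothetical, the pod-level and node-level values are taken to be sums of the memory requests beneath them (as the index legends suggest), and a container without a memory request is assumed to contribute zero.

```python
MI = 1024 * 1024


def container_memory_min(container):
    """memory.min for one container: its memory request (assumed 0 if unset)."""
    return container.get("requests", {}).get("memory", 0)


def pod_memory_min(pod):
    """memory.min for a pod cgroup: sum over the pod's containers."""
    return sum(container_memory_min(c) for c in pod["containers"])


def node_memory_min(pods):
    """memory.min for the node-level cgroup: sum over all pods and their containers."""
    return sum(pod_memory_min(p) for p in pods)


pods = [
    {"containers": [{"requests": {"memory": 64 * MI}},
                    {"requests": {"memory": 32 * MI}}]},
    {"containers": [{"requests": {"memory": 128 * MI}},
                    {}]},  # no memory request on this container
]

for pod in pods:
    print(pod_memory_min(pod) // MI, "Mi")
print(node_memory_min(pods) // MI, "Mi")
```

Each ancestor value is at least as large as the values below it, which is what lets the protection configured at the container level take effect.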
The kubelet manages the pod-level and node-level cgroups directly using the runc libcontainer library, while container cgroup limits are managed by the container runtime.
For memory limits, in addition to the original way of limiting memory usage, Memory QoS adds memory allocation throttling. A throttling factor is introduced as a multiplier (default is 0.8). If the result of multiplying the memory limit by the factor is greater than the memory request, the kubelet sets memory.high to that value and passes it through the Unified field in CRI. If the container does not specify a memory limit, the kubelet uses node allocatable memory instead. The memory.high in the container-level cgroup is set to:
memory.high = (pod.spec.containers[i].resources.limits[memory] or node allocatable memory) * memory throttling factor
i: the ith container in one pod
This can help improve stability when pod memory usage increases, ensuring that memory is throttled as it approaches the memory limit.
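The rule for memory.high can be worked through with a small sketch. It follows the description above and is not the kubelet's implementation: the function name and signature are hypothetical, truncating the product to a whole number of bytes is an assumption, and returning None stands for "memory.high is left unset".

```python
MI = 1024 * 1024


def container_memory_high(request, limit, node_allocatable, throttling_factor=0.8):
    """Return the memory.high value in bytes, or None if it would not be set.

    request and limit are in bytes; limit is None when the container has no
    memory limit, in which case node allocatable memory is used instead.
    """
    base = limit if limit is not None else node_allocatable
    high = int(base * throttling_factor)  # assumption: truncate to whole bytes
    # memory.high is only applied when it lies above the memory request.
    return high if high > request else None


node_allocatable = 4096 * MI

# Request 32Mi, limit 64Mi: throttling starts at 80% of the limit.
print(container_memory_high(32 * MI, 64 * MI, node_allocatable))
# Request equals limit (as in the example Pod): 80% of the limit is below the request.
print(container_memory_high(64 * MI, 64 * MI, node_allocatable))
# No limit: node allocatable memory is the base.
print(container_memory_high(32 * MI, None, node_allocatable))
```

Under this reading, a container whose request equals its limit gets no memory.high throttling, because the scaled limit does not exceed the request.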
How do I use it?
Here are the prerequisites for enabling Memory QoS on your Linux node; some of them relate to cgroup v2 support.
- Kubernetes since v1.22
- Linux kernel minimum version: 4.15, recommended version: 5.2+
- A Linux image with cgroup v2 enabled, or cgroup v2 enabled manually via unified_cgroup_hierarchy
OCI runtimes such as runc and crun already support cgroup v2.
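Two of these prerequisites can be checked from a script. The sketch below is illustrative only: the function names are hypothetical, the cgroup v2 check relies on the unified hierarchy exposing a cgroup.controllers file at its root (commonly mounted at /sys/fs/cgroup, which is assumed here), and the kernel check compares only the leading major.minor of a release string.

```python
import os
import re


def cgroup_v2_enabled(cgroup_root="/sys/fs/cgroup"):
    """Return True if cgroup_root looks like a cgroup v2 unified hierarchy."""
    return os.path.isfile(os.path.join(cgroup_root, "cgroup.controllers"))


def kernel_at_least(release, minimum):
    """Compare the major.minor prefix of a kernel release string to a (major, minor) tuple."""
    match = re.match(r"(\d+)\.(\d+)", release)
    if match is None:
        raise ValueError(f"unrecognized kernel release: {release!r}")
    return (int(match.group(1)), int(match.group(2))) >= minimum


print("cgroup v2 enabled:", cgroup_v2_enabled())
# Example release string; on a node you could pass platform.release() instead.
print("meets minimum 4.15:", kernel_at_least("5.4.0-42-generic", (4, 15)))
print("meets recommended 5.2:", kernel_at_least("5.4.0-42-generic", (5, 2)))
```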
With those prerequisites met, you can enable the MemoryQoS feature gate (see Set kubelet parameters via a config file).
How can I learn more?
You can find more details in the KEP for this feature.
How do I get involved?
You can reach SIG Node by several means, or contact me directly.