Kubernetes Memory Manager moves to beta
Authors: Artyom Lukianov (Red Hat), Cezary Zukowski (Samsung)
The blog post explains some of the internals of the Memory Manager, a beta feature of Kubernetes 1.22. In Kubernetes, the Memory Manager is a kubelet subcomponent. It provides guaranteed memory (and hugepages) allocation for pods in the Guaranteed QoS class.
This blog post covers:
- Why do you need it?
- The internal details of how the Memory Manager works
- Current limitations of the Memory Manager
- Future work for the Memory Manager
Why do you need it?
Some Kubernetes workloads run on nodes with non-uniform memory access (NUMA).
To get the best performance and latency for your workload, container CPUs,
peripheral devices, and memory should all be aligned to the same NUMA
locality.
Before Kubernetes v1.22, the kubelet already provided a set of managers to
align CPUs and PCI devices, but you did not have a way to align memory.
The Linux kernel was able to make best-effort attempts to allocate
memory for tasks from the same NUMA node where the container is
executing, but without any guarantee about that placement.
How does it work?

The Memory Manager does two main things:
- it provides topology hints to the Topology Manager
- it allocates memory for containers and updates its internal state

The overall sequence of the Memory Manager under the kubelet is as follows.

During the Admission phase:
1. When the kubelet first handles a new pod, it calls the Topology Manager's Admit() method.
2. The Topology Manager calls GetTopologyHints() for every hint provider, including the Memory Manager.
3. The Memory Manager calculates all the possible NUMA node combinations for every container in the pod and returns the hints to the Topology Manager.
4. The Topology Manager calls Allocate() for every hint provider, including the Memory Manager.
5. The Memory Manager records the memory allocation in its state, based on the hint selected by the Topology Manager.
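To make the admission flow more concrete, here is a minimal sketch in Go. The types and functions in it (TopologyHint, HintProvider, admitPod) are simplified stand-ins for illustration, not the kubelet's real interfaces, and hint merging is reduced to taking the first preferred hint.

```go
package main

import "fmt"

// TopologyHint is a simplified stand-in for a topology hint: a set of NUMA
// nodes plus a flag saying whether that placement is preferred.
type TopologyHint struct {
	NUMANodes []int
	Preferred bool
}

// HintProvider is a simplified version of the contract that managers such as
// the Memory Manager fulfil during admission: report possible placements,
// then commit an allocation for the placement chosen by the Topology Manager.
type HintProvider interface {
	GetTopologyHints(pod, container string) []TopologyHint
	Allocate(pod, container string, hint TopologyHint) error
}

// admitPod mimics the Admission phase described above: collect hints from
// every provider, pick a placement, and ask every provider to allocate for it.
func admitPod(pod string, containers []string, providers []HintProvider) error {
	for _, container := range containers {
		var best TopologyHint
	hintSearch:
		for _, provider := range providers {
			for _, hint := range provider.GetTopologyHints(pod, container) {
				// A real implementation merges hints across providers; this
				// sketch just takes the first preferred hint it finds.
				if hint.Preferred {
					best = hint
					break hintSearch
				}
			}
		}
		for _, provider := range providers {
			if err := provider.Allocate(pod, container, best); err != nil {
				return fmt.Errorf("admission of %s/%s failed: %w", pod, container, err)
			}
		}
	}
	return nil
}

func main() {
	// With no providers registered, admission trivially succeeds.
	fmt.Println(admitPod("demo-pod", []string{"app"}, nil))
}
```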
During Pod creation:
1. The kubelet calls PreCreateContainer().
2. For each container, the Memory Manager looks up the NUMA nodes where it allocated memory for that container and returns this information to the kubelet.
3. The kubelet creates the container, via CRI, using a container specification that incorporates the Memory Manager's NUMA affinity.
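On Linux, the NUMA affinity that the Memory Manager reports ends up restricting the container's memory placement through the cpuset cgroup controller. The helper below is a hypothetical illustration of that final translation step, turning a set of NUMA node IDs into the comma-separated list form used by cpuset.mems; it is not code from the kubelet.

```go
package main

import (
	"fmt"
	"sort"
	"strconv"
	"strings"
)

// cpusetMems turns a set of NUMA node IDs into the comma-separated list form
// accepted by the cpuset cgroup controller (for example "0,1"), which is how
// a container ends up pinned to the memory of specific NUMA nodes.
func cpusetMems(numaNodes []int) string {
	sort.Ints(numaNodes)
	parts := make([]string, 0, len(numaNodes))
	for _, id := range numaNodes {
		parts = append(parts, strconv.Itoa(id))
	}
	return strings.Join(parts, ",")
}

func main() {
	// A container whose memory was pinned to NUMA nodes 0 and 1.
	fmt.Println(cpusetMems([]int{1, 0})) // "0,1"
}
```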
Let's talk about the configuration

By default, the Memory Manager runs with the None policy, meaning it will just relax and not do anything. To make use of the Memory Manager, you should set two command line options for the kubelet:
--memory-manager-policy=Static
--reserved-memory="<numaNodeID>:<resourceName>=<quantity>"

The value for --memory-manager-policy is straightforward: Static. Deciding what to specify for --reserved-memory takes more thought. To configure it correctly, you should follow two main rules:
- The amount of reserved memory for the memory resource must be greater than zero.
- The amount of reserved memory for a resource, summed over all NUMA nodes, must be equal to the node's reservation (kube-reserved + system-reserved + eviction-hard) for the resource.

You can read more about memory reservations in Reserve Compute Resources for System Daemons.
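To see the second rule with concrete numbers, here is a small sketch; the helper name and the amounts in it are hypothetical and only demonstrate the arithmetic; this is not kubelet code.

```go
package main

import "fmt"

// validateReservedMemory checks the second rule above: the memory reserved per
// NUMA node must sum exactly to the node-level memory reservation, i.e.
// kube-reserved + system-reserved + eviction-hard for the memory resource.
func validateReservedMemory(perNUMANodeMiB map[int]int64, kubeReserved, systemReserved, evictionHard int64) error {
	var total int64
	for _, mib := range perNUMANodeMiB {
		total += mib
	}
	want := kubeReserved + systemReserved + evictionHard
	if total != want {
		return fmt.Errorf("reserved %d MiB across NUMA nodes, but the node reserves %d MiB", total, want)
	}
	return nil
}

func main() {
	// Hypothetical example: 500Mi kube-reserved + 123Mi system-reserved +
	// 100Mi eviction-hard = 723Mi, split as 500Mi on NUMA node 0 and 223Mi
	// on NUMA node 1.
	err := validateReservedMemory(map[int]int64{0: 500, 1: 223}, 500, 123, 100)
	fmt.Println(err) // <nil>, the configuration is consistent
}
```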
Current limitations

The 1.22 release and promotion to beta brings along enhancements and fixes, but the Memory Manager still has several limitations.

Single vs Cross NUMA node allocation

A NUMA node can not have both single and cross NUMA node allocations. When the container memory is pinned to two or more NUMA nodes, we can not know from which NUMA node the container will consume the memory. For example:
1. container1 starts on NUMA node 0 and requests 5Gi of memory, but is currently consuming only 3Gi of it.
2. The memory request of container2 is larger than any single NUMA node can provide, so its memory is pinned to both NUMA node 0 and NUMA node 1.
3. container2 consumes 3.5Gi of memory from NUMA node 0; once container1 requires more memory, it will not have it, and the kernel will kill one of the containers with an OOM error.
To prevent such issues, the Memory Manager will fail the admission of container2 until the machine has two NUMA nodes without a single NUMA node allocation.
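The admission failure in this example can be pictured with the sketch below. It is a simplified model invented for illustration, not the Memory Manager's real bookkeeping: a NUMA node that already serves single-node allocations cannot join a cross-node group, and a node that belongs to a cross-node group cannot take new single-node allocations.

```go
package main

import "fmt"

// numaState is a simplified record of how a NUMA node is currently used:
// either by containers pinned to this node alone, or as part of a group of
// nodes serving cross-NUMA allocations.
type numaState struct {
	singleNodeAllocations int   // containers pinned to this node only
	crossNodeGroup        []int // other NUMA nodes it is grouped with, if any
}

// canAdmit mirrors the limitation described above: a request that needs a group
// of NUMA nodes is rejected if any node in the group already holds single-node
// allocations, and a single-node request is rejected on a grouped node.
func canAdmit(nodes map[int]*numaState, requested []int) bool {
	for _, id := range requested {
		st, ok := nodes[id]
		if !ok {
			continue // an unused NUMA node can always be taken
		}
		if len(requested) > 1 && st.singleNodeAllocations > 0 {
			return false // mixing cross-node with existing single-node allocations
		}
		if len(requested) == 1 && len(st.crossNodeGroup) > 0 {
			return false // node is already part of a cross-node group
		}
	}
	return true
}

func main() {
	nodes := map[int]*numaState{
		0: {singleNodeAllocations: 1}, // container1 is pinned to NUMA node 0
		1: {},
	}
	// container2 needs both NUMA nodes, but node 0 already has a single-node
	// allocation, so admission fails.
	fmt.Println(canAdmit(nodes, []int{0, 1})) // false
}
```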
Works only for Guaranteed pods

The Memory Manager can not guarantee memory allocation for Burstable pods, even when the Burstable pod has specified equal memory limit and request.
Let's assume you have two Burstable pods: pod1 has containers with equal memory request and limits, and pod2 has containers only with a memory request set. You want to guarantee memory allocation for pod1. To the Linux kernel, processes in either pod have the same OOM score, so once the kernel finds that it does not have enough memory, it can kill processes that belong to pod1.
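Since the Memory Manager only acts on Guaranteed pods, it helps to recall the rule that makes a pod Guaranteed: every container must set CPU and memory limits, and requests must equal limits. The sketch below is a simplified illustration of that rule, not the kubelet's actual QoS code; it also assumes, purely for illustration, that pod1 is Burstable because it sets no CPU limit.

```go
package main

import "fmt"

// resources holds requests and limits for one container, in arbitrary units.
// A zero value means "not set".
type resources struct {
	cpuRequest, cpuLimit, memRequest, memLimit int64
}

// isGuaranteed is a simplified version of the Guaranteed QoS rule: every
// container must set CPU and memory limits, and its requests must equal its
// limits (unset requests default to the limits, which we fold in here).
func isGuaranteed(containers []resources) bool {
	for _, c := range containers {
		if c.cpuLimit == 0 || c.memLimit == 0 {
			return false
		}
		if c.cpuRequest == 0 {
			c.cpuRequest = c.cpuLimit
		}
		if c.memRequest == 0 {
			c.memRequest = c.memLimit
		}
		if c.cpuRequest != c.cpuLimit || c.memRequest != c.memLimit {
			return false
		}
	}
	return true
}

func main() {
	// pod1: memory request equals limit, but no CPU limit -> Burstable.
	pod1 := []resources{{memRequest: 2048, memLimit: 2048}}
	// pod2: only a memory request -> Burstable as well.
	pod2 := []resources{{memRequest: 1024}}
	fmt.Println(isGuaranteed(pod1), isGuaranteed(pod2)) // false false
}
```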
Memory fragmentation

The sequence of Pods and containers that start and stop can fragment the memory on NUMA nodes. The alpha implementation of the Memory Manager does not have any mechanism to balance pods and defragment memory back.
Future work for the Memory Manager

We do not want to stop with the current state of the Memory Manager and are looking to make improvements, including in the following areas.
Make the Memory Manager allocation algorithm smarter

The current algorithm ignores distances between NUMA nodes during the calculation of the allocation. If same-node placement isn't available, we can still provide better performance compared to the current implementation, by changing the Memory Manager to prefer the closest NUMA nodes for cross-node allocation.
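As a sketch of what preferring the closest NUMA nodes could look like (a hypothetical illustration, not an agreed design), the kernel already exposes a NUMA distance matrix, for example under /sys/devices/system/node/node*/distance, which a future algorithm could use to order candidate nodes for a cross-node allocation.

```go
package main

import (
	"fmt"
	"sort"
)

// closestNodes returns candidate NUMA nodes ordered by their distance from the
// starting node, using a distance matrix like the one the kernel reports.
func closestNodes(start int, distance [][]int) []int {
	var others []int
	for id := range distance[start] {
		if id != start {
			others = append(others, id)
		}
	}
	sort.Slice(others, func(i, j int) bool {
		return distance[start][others[i]] < distance[start][others[j]]
	})
	return others
}

func main() {
	// Hypothetical 4-node machine: nodes 0 and 1 are close, 2 and 3 are remote.
	distance := [][]int{
		{10, 11, 20, 20},
		{11, 10, 20, 20},
		{20, 20, 10, 11},
		{20, 20, 11, 10},
	}
	// For a cross-node allocation starting on node 0, node 1 (distance 11)
	// is preferred over the remote nodes 2 and 3 (distance 20).
	fmt.Println(closestNodes(0, distance))
}
```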
Reduce the number of admission errors

The default Kubernetes scheduler is not aware of the node's NUMA topology, and it can be a reason for many admission errors during the pod start. We're hoping to add a KEP (Kubernetes Enhancement Proposal) to cover improvements in this area. Follow the work on topology-aware scheduling to see how this idea progresses.
Conclusion

With the promotion of the Memory Manager to beta in 1.22, we encourage everyone to give it a try and look forward to any feedback you may have. While there are still several limitations, we have a set of enhancements planned to address them and look forward to providing you with many new features in upcoming releases. If you have ideas for additional enhancements or a desire for certain features, please let us know. The team is always open to suggestions to enhance and improve the Memory Manager.
We hope you have found this blog informative and helpful! Let us know if you have any questions or comments. You can contact us via: