The Essentials of Resource Management in Kubernetes

The resource requests and limits that we set for containers in a Pod spec are the key settings we can use to influence how Kubernetes schedules pods and manages the computational resources, typically CPU and memory, of nodes. Understanding how resource requests and limits work is essential to understanding resource management in Kubernetes.
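
As a minimal sketch (the pod name and image are placeholders, the values illustrative), requests and limits are set per container in the Pod spec:

    apiVersion: v1
    kind: Pod
    metadata:
      name: web-demo
    spec:
      containers:
        - name: web
          image: nginx:1.25
          resources:
            requests:          # what the scheduler reserves for this container
              cpu: 250m
              memory: 128Mi
            limits:            # hard caps enforced on the node
              cpu: 500m
              memory: 256Mi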

Resource requests and limits

The Kubernetes scheduler uses the resource requests as one of the factors to decide which node a pod can be scheduled on. Rather than looking at the actual resource usage on each node, the scheduler compares the node allocatable with the sum of the resource requests of all pods running on the node. If you don’t set resource requests, your pod may be scheduled on any node that still has unallocated resources. On the other hand, your pod may not get enough resources to run, or may even be terminated if the node is under resource pressure. Setting resource requests ensures containers get the minimum amount of resources they need. It also helps the kubelet determine the eviction order when necessary.
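
You can see what the scheduler compares by running kubectl describe node <node-name>; the output includes the node allocatable and the requests already allocated to pods on the node (abbreviated here, all values illustrative):

    Allocatable:
      cpu:     3860m
      memory:  12899052Ki
    ...
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource  Requests     Limits
      --------  --------     ------
      cpu       750m (19%)   1 (25%)
      memory    140Mi (1%)   340Mi (2%)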

With resource limits, you set a hard cap on the resources a container can use: the container cannot use more than its limits, and there are consequences if it tries. Because CPU is compressible, a container that attempts to use more CPU than its limit has its CPU time throttled; because memory is incompressible, a container that attempts to use more memory than its limit is terminated with an OOMKilled error. If you don’t set resource limits, the container can use all available resources on the node. On the other hand, it could become a noisy neighbor and could be terminated when the node is under resource pressure. Setting resource limits caps the maximum amount of resources a container can use.
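
If a container is killed for exceeding its memory limit, the reason is recorded in the pod’s status; for example, with the hypothetical pod from the sketch above:

    kubectl get pod web-demo \
      -o jsonpath='{.status.containerStatuses[0].lastState.terminated.reason}'
    # prints OOMKilled if the container's last termination was due to its memory limit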

If you specify resource limits for a container but don’t specify resource requests, Kubernetes automatically assigns requests that match the limits. The different combinations of these two settings also define the QoS class of the pod.
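
As a quick reference: a pod whose containers all have requests equal to limits for both CPU and memory is Guaranteed; a pod that sets at least one request or limit but doesn’t meet the Guaranteed criteria is Burstable; a pod with no requests or limits at all is BestEffort. You can check the class Kubernetes assigned, e.g. for the hypothetical pod above:

    kubectl get pod web-demo -o jsonpath='{.status.qosClass}'
    # prints Guaranteed, Burstable, or BestEffort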

Since the scheduler only looks at the resource requests when scheduling pods, a node can be overcommitted, which means the sum of the resource limits of all pods on the node exceeds the node allocatable. Such a node can come under resource pressure, and if that happens, especially if the node is under memory pressure, the pods running on it could be evicted.
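
For example (the numbers are illustrative): on a node with 8 GiB of allocatable memory, four pods that each request 1 GiB but are limited to 4 GiB all fit from the scheduler’s point of view, since 4 × 1 GiB = 4 GiB of requests is within the allocatable memory. Their combined limits, however, are 4 × 4 GiB = 16 GiB, twice the allocatable memory, so if they all burst toward their limits at once the node comes under memory pressure.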

Eviction of pods

From the resource management perspective, there are two situations in which a pod can be evicted:

  1. a pod attempts to use more memory than its limit.
  2. a node is under resource pressure.

Pods can also be evicted for other reasons, such as pod priority and preemption, which I won’t discuss in this post. When a pod is evicted, Kubernetes restarts it if it can be restarted.

When a pod with resource limits is scheduled on a node, the kubelet passes the limits to the container runtime as CPU/memory constraints, and the container runtime sets these constraints on the container’s cgroup. When the memory usage of the container goes over its limit, the OOM killer of the Linux kernel kicks in and kills it, and you will see an OOMKilled error in the pod’s status. Since the kernel enforces cgroup resource usage, this happens whether or not the node is under resource pressure.
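
As a rough sketch of how these constraints look on the node (assuming cgroup v2; the values correspond to limits of cpu: 500m and memory: 256Mi):

    # cpu.max: quota and period in microseconds; 500m allows 50ms of CPU time per 100ms period
    cpu.max:    50000 100000
    # memory.max: hard cap in bytes (256Mi); exceeding it triggers the kernel OOM killer
    memory.max: 268435456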

On the other hand, the kubelet monitors the resource usage of the node. When the resource usage of the node reaches a certain level (defined by the kubelet’s eviction thresholds; see the configuration sketch after the list below), it marks the node’s condition, tries to reclaim node-level resources, and eventually evicts pods running on the node to reclaim resources. When the kubelet has to evict pods, it uses the following order to decide which pod is evicted first:

  1. Whether the pod’s resource usage exceeds its requests
  2. Pod priority
  3. The pod’s resource usage relative to its requests
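
A minimal sketch of the hard eviction thresholds in the kubelet configuration that define that level (the values shown are the common Linux defaults, but they are cluster-dependent):

    apiVersion: kubelet.config.k8s.io/v1beta1
    kind: KubeletConfiguration
    evictionHard:
      memory.available: "100Mi"
      nodefs.available: "10%"
      imagefs.available: "15%"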

The kubelet doesn’t use the pod’s QoS class directly to determine the eviction order. The QoS class is more of a tool that helps us, humans, estimate the likely eviction order; the key factor is the pod’s resource requests. From the list above we know that, leaving pod priority aside:

  • BestEffort pods are evicted first, since they define no requests at all: their resource usage always exceeds their requests, and their usage relative to requests is effectively unbounded.
  • Burstable pods whose resource usage exceeds their requests are evicted next.
  • Guaranteed pods, and Burstable pods whose usage doesn’t exceed their requests, are last in the eviction order.

Although the QoS class doesn’t affect how the kubelet determines the pod eviction order, it does affect the oom_score_adj value that Kubernetes sets for each container, which feeds into the oom_score the Linux kernel’s OOM killer uses to decide which processes to kill first when the node itself runs out of memory. The oom_score_adj value for each QoS class is shown in the table below.

QoS Class     oom_score_adj
Guaranteed    -997
BestEffort    1000
Burstable     2 – 999
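
For Burstable pods, the value scales with the container’s memory request relative to the node’s memory capacity, roughly:

    oom_score_adj ≈ min(max(2, 1000 - (1000 × memoryRequestBytes) / machineMemoryCapacityBytes), 999)

So a Burstable container with a tiny memory request on a large node ends up close to 1000 (killed earlier), while one that requests most of the node’s memory ends up close to 2 (killed later).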

Takeaways

Now we know how resource requests and limits work in Kubernetes. Here are some best practices you can use when defining pods.

  • All pods should have resource requests and limits specified. You can leverage Kubernetes features such as ResourceQuota and LimitRange to enforce this at the namespace level (see the LimitRange sketch after this list). If you are on AKS, you can also use Azure Policy to enforce it.
  • For critical pods whose chances of being evicted you want to minimize, make sure their QoS class is Guaranteed.
  • To reduce the side effects of user pods on system pods, run system pods and user pods on separate nodes/node pools. If you are on AKS, create system and user node pools in the cluster.
  • If the computational resources of your Kubernetes cluster are not a constraint, enable the HPA and the cluster autoscaler for your workloads.
  • On a node of a Kubernetes cluster, you should not deploy any components/software outside of Kubernetes. If you have to install additional components/software, use a Kubernetes-native way, such as a DaemonSet.
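
For the first takeaway, a minimal LimitRange sketch that fills in default requests and limits for containers that don’t declare their own (the name, namespace and values are illustrative):

    apiVersion: v1
    kind: LimitRange
    metadata:
      name: default-resources
      namespace: my-namespace
    spec:
      limits:
        - type: Container
          defaultRequest:      # applied when a container has no requests
            cpu: 100m
            memory: 128Mi
          default:             # applied when a container has no limits
            cpu: 500m
            memory: 512Mi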

Reference