Schedule GPUs
Kubernetes v1.10 [beta]
Kubernetes includes experimental support for managing AMD and NVIDIA GPUs (graphics processing units) across several nodes.
This page describes how users can consume GPUs across different Kubernetes versions and the current limitations.
Using device plugins
Kubernetes implements Device Plugins to let Pods access specialized hardware features such as GPUs.
As an administrator, you have to install GPU drivers from the corresponding hardware vendor on the nodes and run the corresponding device plugin from the GPU vendor (AMD or NVIDIA).
When the above conditions are true, Kubernetes will expose amd.com/gpu or nvidia.com/gpu as a schedulable resource.
You can consume these GPUs from your containers by requesting <vendor>.com/gpu the same way you request cpu or memory.
However, there are some limitations in how you specify the resource requirements when using GPUs:
- GPUs are only supposed to be specified in the limits section, which means:
  - You can specify GPU limits without specifying requests because Kubernetes will use the limit as the request value by default.
  - You can specify GPU in both limits and requests but these two values must be equal.
  - You cannot specify GPU requests without specifying limits.
- Containers (and Pods) do not share GPUs. There's no overcommitting of GPUs.
- Each container can request one or more GPUs. It is not possible to request a fraction of a GPU.
Here's an example:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1 # requesting 1 GPU
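The rule that requests and limits must match can also be illustrated with a manifest that sets both explicitly. The following is a minimal sketch, not an official example: the Pod name is hypothetical, the image is reused from the example above, and it assumes a node that advertises at least two nvidia.com/gpu devices.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: gpu-requests-equal-limits # hypothetical name, for illustration only
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      image: "k8s.gcr.io/cuda-vector-add:v0.1" # same image as the example above
      resources:
        requests:
          nvidia.com/gpu: 2 # optional; if present, it must equal the limit
        limits:
          nvidia.com/gpu: 2 # whole GPUs only; fractions are not allowed
```

Omitting the requests block would give the same result, because Kubernetes uses the GPU limit as the request value by default.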
Deploying AMD GPU device plugin
The AMD GPU device plugin has the following requirements:
- Kubernetes nodes have to be pre-installed with the AMD GPU Linux driver.
To deploy the AMD device plugin once your cluster is running and the above requirements are satisfied:
kubectl create -f https://raw.githubusercontent.com/RadeonOpenCompute/k8s-device-plugin/v1.10/k8s-ds-amdgpu-dp.yaml
You can report issues with this third-party device plugin by logging an issue in the RadeonOpenCompute/k8s-device-plugin repository.
Deploying NVIDIA GPU device plugin
There are currently two device plugin implementations for NVIDIA GPUs:
Official NVIDIA GPU device plugin
The official NVIDIA GPU device plugin has the following requirements:
- nvidia-container-runtime must be configured as the default runtime for Docker, instead of runc.
To deploy the NVIDIA device plugin once your cluster is running and the above requirements are satisfied:
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/1.0.0-beta4/nvidia-device-plugin.yml
You can report issues with this third-party device plugin by logging an issue in the NVIDIA/k8s-device-plugin repository.
NVIDIA GPU device plugin used by GCE
The NVIDIA GPU device plugin used by GCE doesn't require using nvidia-docker and should work with any container runtime that is compatible with the Kubernetes Container Runtime Interface (CRI). It's tested on Container-Optimized OS and has experimental code for Ubuntu from 1.9 onwards.
You can use the following commands to install the NVIDIA drivers and device plugin:
# Install NVIDIA drivers on Container-Optimized OS:
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/daemonset.yaml
# Install NVIDIA drivers on Ubuntu (experimental):
kubectl create -f https://raw.githubusercontent.com/GoogleCloudPlatform/container-engine-accelerators/stable/nvidia-driver-installer/ubuntu/daemonset.yaml
# Install the device plugin:
kubectl create -f https://raw.githubusercontent.com/kubernetes/kubernetes/release-1.14/cluster/addons/device-plugins/nvidia-gpu/daemonset.yaml
You can report issues with using or deploying this third-party device plugin by logging an issue in the GoogleCloudPlatform/container-engine-accelerators repository.
Google publishes its own instructions for using NVIDIA GPUs on GKE.
Clusters containing different types of GPUs
If different nodes in your cluster have different types of GPUs, then you can use Node Labels and Node Selectors to schedule pods to appropriate nodes.
For example:
# Label your nodes with the accelerator type they have.
kubectl label nodes <node-with-k80> accelerator=nvidia-tesla-k80
kubectl label nodes <node-with-p100> accelerator=nvidia-tesla-p100
Automatic node labelling
If you're using AMD GPU devices, you can deploy Node Labeller. Node Labeller is a controller that automatically labels your nodes with GPU device properties.
At the moment, that controller can add labels for properties such as the compute unit count, device ID, GPU family, SIMD count, and VRAM size, as the following node description shows:
kubectl describe node cluster-node-23
Name:               cluster-node-23
Roles:              <none>
Labels:             beta.amd.com/gpu.cu-count.64=1
                    beta.amd.com/gpu.device-id.6860=1
                    beta.amd.com/gpu.family.AI=1
                    beta.amd.com/gpu.simd-count.256=1
                    beta.amd.com/gpu.vram.16G=1
                    beta.kubernetes.io/arch=amd64
                    beta.kubernetes.io/os=linux
                    kubernetes.io/hostname=cluster-node-23
Annotations:        kubeadm.alpha.kubernetes.io/cri-socket: /var/run/dockershim.sock
                    node.alpha.kubernetes.io/ttl: 0
…
With the Node Labeller in use, you can specify the GPU type in the Pod spec:
apiVersion: v1
kind: Pod
metadata:
  name: cuda-vector-add
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda-vector-add
      # https://github.com/kubernetes/kubernetes/blob/v1.7.11/test/images/nvidia-cuda/Dockerfile
      image: "k8s.gcr.io/cuda-vector-add:v0.1"
      resources:
        limits:
          nvidia.com/gpu: 1
  nodeSelector:
    accelerator: nvidia-tesla-p100 # or nvidia-tesla-k80 etc.
This will ensure that the Pod will be scheduled to a node that has the GPU type you specified.
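For AMD GPUs, the labels that Node Labeller adds can be used in a nodeSelector in the same way. The manifest below is a sketch under stated assumptions rather than a tested configuration: the Pod name, container name, and image are hypothetical placeholders, and the label key and value are taken from the cluster-node-23 output shown on this page.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: amd-gpu-family-example # hypothetical name, for illustration only
spec:
  restartPolicy: OnFailure
  containers:
    - name: gpu-workload # hypothetical container name
      image: "registry.example.com/gpu-workload:latest" # placeholder; substitute your own image
      resources:
        limits:
          amd.com/gpu: 1 # resource exposed by the AMD device plugin
  nodeSelector:
    # Label as reported for cluster-node-23; the value must be a quoted string.
    beta.amd.com/gpu.family.AI: "1"
```

Selecting on a property label such as the GPU family avoids maintaining hand-applied accelerator labels, at the cost of depending on the label names that Node Labeller generates.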