Kubernetes 1.14: Local Persistent Volumes GA
Authors: Michelle Au (Google), Matt Schallert (Uber), Celina Ward (Uber)
The Local Persistent Volumes feature has been promoted to GA in Kubernetes 1.14. It was first introduced as alpha in Kubernetes 1.7, and then beta in Kubernetes 1.10. The GA milestone indicates that Kubernetes users may depend on the feature and its API for production use, and GA features are protected from incompatible changes by the Kubernetes deprecation policy.
What is a Local Persistent Volume?
A local persistent volume represents a local disk directly-attached to a single Kubernetes Node.
Kubernetes provides a powerful volume plugin system that enables Kubernetes workloads to use a wide variety of block and file storage to persist data. Most of these plugins enable remote storage -- these remote storage systems persist data independently of the Kubernetes node where the data originated. Remote storage usually cannot offer the consistent high performance guarantees of local directly-attached storage. With the Local Persistent Volume plugin, Kubernetes workloads can now consume high performance local storage using the same volume APIs that app developers have become accustomed to.
How is it different from a HostPath Volume?
To better understand the benefits of a Local Persistent Volume, it is useful to compare it to a HostPath volume. HostPath volumes mount a file or directory from the host node’s filesystem into a Pod. Similarly, a Local Persistent Volume mounts a local disk or partition into a Pod.
The biggest difference is that the Kubernetes scheduler understands which node a Local Persistent Volume belongs to. With HostPath volumes, a pod referencing a HostPath volume may be moved by the scheduler to a different node, resulting in data loss. But with Local Persistent Volumes, the Kubernetes scheduler ensures that a pod using a Local Persistent Volume is always scheduled to the same node.
While HostPath volumes may be referenced via a Persistent Volume Claim (PVC) or directly inline in a pod definition, Local Persistent Volumes can only be referenced via a PVC. This provides additional security benefits since Persistent Volume objects are managed by the administrator, preventing Pods from being able to access any path on the host.
Additional benefits include support for formatting of block devices during mount, and volume ownership using fsGroup.
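For illustration, here is a minimal sketch of what a local PersistentVolume looks like; the required nodeAffinity is what tells the scheduler which node the volume belongs to (the node name, capacity, and /mnt/disks/ssd1 path below are assumptions):
apiVersion: v1
kind: PersistentVolume
metadata:
  name: example-local-pv
spec:
  capacity:
    storage: 368Gi
  accessModes:
  - ReadWriteOnce
  persistentVolumeReclaimPolicy: Delete
  storageClassName: local-storage
  local:
    path: /mnt/disks/ssd1
  nodeAffinity:
    required:
      nodeSelectorTerms:
      - matchExpressions:
        - key: kubernetes.io/hostname
          operator: In
          values:
          - example-node-1
Pods never reference this object directly; they go through a PVC bound to it, as shown later in this post.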
What's New With GA?
Since 1.10, we have mainly focused on improving stability and scalability of the feature so that it is production ready.
The only major feature addition is the ability to specify a raw block device and have Kubernetes automatically format and mount the filesystem. This reduces the previous burden of having to format and mount devices before handing them to Kubernetes.
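With GA, the same kind of PV can point at an unformatted block device instead of a pre-mounted filesystem directory. A sketch of the relevant fields, assuming a device at /dev/vdb (only the spec fields that differ from the example above are shown; fsType is optional):
spec:
  volumeMode: Filesystem
  local:
    path: /dev/vdb    # a raw block device; Kubernetes formats it before mounting
    fsType: ext4      # filesystem to create and mount (an assumption; omit to use the default)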
Limitations of GA
At GA, Local Persistent Volumes do not support dynamic volume provisioning. However, there is an external controller available to help manage the local PersistentVolume lifecycle for individual disks on your nodes. This includes creating the PersistentVolume objects, and cleaning up and reusing disks once they have been released by the application.
How to Use a Local Persistent Volume?
Workloads can request a local persistent volume using the same PersistentVolumeClaim interface as remote storage backends. This makes it easy to swap out the storage backend across clusters, clouds, and on-prem environments.
First, a StorageClass should be created that sets volumeBindingMode: WaitForFirstConsumer to enable volume topology-aware scheduling. This mode instructs Kubernetes to wait to bind a PVC until a Pod using it is scheduled.
kind: StorageClass
apiVersion: storage.k8s.io/v1
metadata:
  name: local-storage
provisioner: kubernetes.io/no-provisioner
volumeBindingMode: WaitForFirstConsumer
Then, the external static provisioner can be configured and run to create PVs for all the local disks on your nodes.
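As a rough sketch, the provisioner is configured with the directory where local disks are discovered on each node and the StorageClass to assign them; the ConfigMap name, namespace, and the /mnt/disks discovery path here are assumptions, so consult the provisioner's documentation for the exact format:
apiVersion: v1
kind: ConfigMap
metadata:
  name: local-provisioner-config
  namespace: kube-system
data:
  storageClassMap: |
    local-storage:
      hostDir: /mnt/disks    # where the local disks (or bind-mounted directories) live on the node
      mountDir: /mnt/disks   # the same path as seen inside the provisioner pods
A DaemonSet runs the provisioner on every node, and each discovered disk shows up as an Available PV: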
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-27c0f084 368Gi RWO Delete Available local-storage 8s
local-pv-3796b049 368Gi RWO Delete Available local-storage 7s
local-pv-3ddecaea 368Gi RWO Delete Available local-storage 7s
Afterwards, workloads can start using the PVs by creating a PVC and Pod or a StatefulSet with volumeClaimTemplates.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: local-test
spec:
  serviceName: "local-service"
  replicas: 3
  selector:
    matchLabels:
      app: local-test
  template:
    metadata:
      labels:
        app: local-test
    spec:
      containers:
      - name: test-container
        image: k8s.gcr.io/busybox
        command:
        - "/bin/sh"
        args:
        - "-c"
        - "sleep 100000"
        volumeMounts:
        - name: local-vol
          mountPath: /usr/test-pod
  volumeClaimTemplates:
  - metadata:
      name: local-vol
    spec:
      accessModes: [ "ReadWriteOnce" ]
      storageClassName: "local-storage"
      resources:
        requests:
          storage: 368Gi
Once the StatefulSet is up and running, the PVCs are all bound:
$ kubectl get pvc
NAME STATUS VOLUME CAPACITY ACCESS MODES STORAGECLASS AGE
local-vol-local-test-0 Bound local-pv-27c0f084 368Gi RWO local-storage 3m45s
local-vol-local-test-1 Bound local-pv-3ddecaea 368Gi RWO local-storage 3m40s
local-vol-local-test-2 Bound local-pv-3796b049 368Gi RWO local-storage 3m36s
When the disk is no longer needed, the PVC can be deleted. The external static provisioner will clean up the disk and make the PV available for use again.
$ kubectl patch sts local-test -p '{"spec":{"replicas":2}}'
statefulset.apps/local-test patched
$ kubectl delete pvc local-vol-local-test-2
persistentvolumeclaim "local-vol-local-test-2" deleted
$ kubectl get pv
NAME CAPACITY ACCESS MODES RECLAIM POLICY STATUS CLAIM STORAGECLASS REASON AGE
local-pv-27c0f084 368Gi RWO Delete Bound default/local-vol-local-test-0 local-storage 11m
local-pv-3796b049 368Gi RWO Delete Available local-storage 7s
local-pv-3ddecaea 368Gi RWO Delete Bound default/local-vol-local-test-1 local-storage 19m
You can find full documentation for the feature on the Kubernetes website.
What Are Suitable Use Cases?
The primary benefit of Local Persistent Volumes over remote persistent storage is performance: local disks usually offer higher IOPS and throughput and lower latency compared to remote storage systems.
However, there are important limitations and caveats to consider when using Local Persistent Volumes:
- Using local storage ties your application to a specific node, making your application harder to schedule. Applications that use local storage should specify a high priority so that lower-priority pods that don’t require local storage can be preempted if necessary (see the PriorityClass sketch below).
- If that node or local volume encounters a failure and becomes inaccessible, then that pod also becomes inaccessible. Manual intervention, external controllers, or operators may be needed to recover from these situations.
- While most remote storage systems implement synchronous replication, most local disk offerings do not provide data durability guarantees, meaning that loss of the disk or node may result in loss of all the data on that disk.
For these reasons, local persistent storage should only be considered for workloads that handle data replication and backup at the application layer, thus making the applications resilient to node or data failures and unavailability despite the lack of such guarantees at the individual disk level.
Examples of good workloads include software defined storage systems and replicated databases. Other types of applications should continue to use highly available, remotely accessible, durable storage.
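On the scheduling point above, a minimal sketch of giving local-storage workloads a higher priority might look like the following (the class name and value are assumptions, not a recommendation):
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: local-storage-critical    # hypothetical name
value: 1000000
globalDefault: false
description: "Higher priority for pods pinned to local persistent volumes."
The workload’s pod template would then set priorityClassName: local-storage-critical so that, under resource pressure, lower-priority pods without local storage are preempted first.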
How Uber Uses Local Storage
Prior to the pilot with local persistent volumes, M3DB ran exclusively in Uber-managed environments. Over time, internal use cases arose that required the ability to run M3DB in environments with fewer dependencies. So the team began to explore options.
As an open-source project, we wanted to provide the community with a way to run M3DB as easily as possible, with an open-source stack, while meeting M3DB’s requirements for high-throughput, low-latency storage, and the ability to scale itself out.
The Kubernetes Local Persistent Volume interface, with its high-performance, low-latency guarantees, quickly emerged as the perfect abstraction to build on top of. With Local Persistent Volumes, individual M3DB instances can comfortably handle up to 600k writes per second. This leaves plenty of headroom for spikes on clusters that typically process a few million metrics per second.
Because M3DB also gracefully handles losing a single node or volume, the limited data durability guarantees of Local Persistent Volumes are not an issue. If a node fails, M3DB finds a suitable replacement and the new node begins streaming data from its two peers.
Thanks to the Kubernetes scheduler’s intelligent handling of volume topology, M3DB is able to programmatically evenly disperse its replicas across multiple local persistent volumes in all available cloud zones, or, in the case of on-prem clusters, across all available server racks.
Uber's Operational Experience
As mentioned above, while Local Persistent Volumes provide many benefits, they also require careful planning and careful consideration of constraints before committing to them in production. When thinking about our local volume strategy for M3DB, there were a few things Uber had to consider.
For one, we had to take into account the hardware profiles of the nodes in our Kubernetes cluster. For example, how many local disks would each node in the cluster have? How would they be partitioned?
When first testing local volumes, we wanted to have a thorough understanding of the effect disruptions (voluntary and involuntary) would have on pods using local storage, and so we began testing some failure scenarios. We found that when a local volume becomes unavailable while the node remains available (such as when performing maintenance on the disk), a pod using the local volume will be stuck in a ContainerCreating state until it can mount the volume. If a node becomes unavailable, for example if it is removed from the cluster or is drained, then pods using local volumes on that node are stuck in an Unknown or Pending state, depending on whether or not the node was removed gracefully.
Recovering pods from these interim states means having to delete the PVC binding the pod to its local volume and then delete the pod in order for it to be rescheduled (or wait until the node and disk are available again). We took this into account when building our operator for M3DB, which makes changes to the cluster topology when a pod is rescheduled such that the new one gracefully streams data from the remaining two peers. Eventually we plan to automate the deletion and rescheduling process entirely.
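In kubectl terms, the manual recovery flow described above is roughly the following (the PVC and pod names are simply the ones from the earlier StatefulSet example, used here for illustration):
# Release the binding to the unavailable local volume, then delete the pod so the
# StatefulSet controller recreates it and it can bind to a healthy PV.
$ kubectl delete pvc local-vol-local-test-1
$ kubectl delete pod local-test-1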
Alerts on pod states can help call attention to stuck local volumes, and workload-specific controllers or operators can remediate them automatically. Because of these constraints, it’s best to exclude nodes with local volumes from automatic upgrades or repairs, and in fact some cloud providers explicitly mention this as a best practice.
Portability Between On-Prem and Cloud
Local Volumes played a big role in Uber’s decision to build orchestration for M3DB using Kubernetes, in part because it is a storage abstraction that works the same across on-prem and cloud environments. Remote storage solutions have different characteristics across cloud providers, and some users may prefer not to use networked storage at all in their own data centers. On the other hand, local disks are relatively ubiquitous and provide more predictable performance characteristics.
By orchestrating M3DB using local disks in the cloud, where it was easier to get up and running with Kubernetes, we gained confidence that we could still use our operator to run M3DB in our on-prem environment without any modifications. As we continue to work on how we’d run Kubernetes on-prem, having solved such an important pending question is a big relief.
What's Next for Local Persistent Volumes?
As we’ve seen with Uber’s M3DB, local persistent volumes have successfully been used in production environments. As adoption of local persistent volumes continues to increase, SIG Storage continues to seek feedback for ways to improve the feature.
One of the most frequent asks has been for a controller that can help with recovery from failed nodes or disks, which is currently a manual process (or something that has to be built into an operator). SIG Storage is investigating creating a common controller that can be used by workloads with simple and similar recovery processes.
Another popular ask has been to support dynamic provisioning using lvm. This can simplify disk management and improve disk utilization. SIG Storage is evaluating the performance tradeoffs for the viability of this feature.
Getting Involved
If you have feedback for this feature or are interested in getting involved with the design and development, join the Kubernetes Storage Special Interest Group (SIG). We’re rapidly growing and always welcome new contributors.
Special thanks to all the contributors that helped bring this feature to GA, including Chuqiang Li (lichuqiang), Dhiraj Hedge (dhirajh), Ian Chakeres (ianchakeres), Jan Šafránek (jsafrane), Michelle Au (msau42), Saad Ali (saad-ali), Yecheng Fu (cofyc) and Yuquan Ren (nickrenren).