Gardener Project Update

Monday, December 02, 2019

Authors: Rafael Franzke (SAP), Vasu Chandrasekhara (SAP)

Last year, we introduced Kubernetes Community Meeting and in a post on the Kubernetes Blog. At SAP, we have been running Gardener for more than two years, and are successfully managing thousands of clusters in various versions on all major hyperscalers as well as in numerous infrastructures and private clouds that typically join an enterprise via acquisitions.

We are often asked why a handful of dynamically scalable clusters would not suffice. We also started our journey into Kubernetes with a similar mindset. But we realized that applying the architecture and principles of Kubernetes to productive scenarios, our internal and external customers very quickly required the rational separation of concerns and ownership, which in most circumstances led to the use of multiple clusters. Therefore, a scalable and managed Kubernetes as a service solution is often also the basis for adoption. Particularly, when a larger organization runs multiple products on different providers and in different regions, the number of clusters will quickly rise to the hundreds or even thousands.

Today, we want to give an update on what we have implemented in the past year regarding extensibility and customizability, and what we plan to work on for our next milestone.

Short Recap: What Is Gardener?

Gardener's main principle is to leverage Kubernetes primitives for all of its operations, commonly described as inception or kubeception. The feedback from the community was that initially our architecture diagram looks "overwhelming", but after some little digging into the material, everything we do is the "Kubernetes way". One can re-use all learnings with respect to APIs, control loops, etc.
The essential idea is that so-called seed clusters are used to host the control planes of end-user clusters (botanically named shoots).
Gardener provides vanilla Kubernetes clusters as a service independent of the underlying infrastructure provider in a homogenous way, utilizing the upstream provided k8s.gcr.io/* images as open distribution. The project is built entirely on top of Kubernetes extension concepts, and as such adds a custom API server, a controller-manager, and a scheduler to create and manage the lifecycle of Kubernetes clusters. It extends the Kubernetes API with custom resources, most prominently the Gardener cluster specification (Shoot resource), that can be used to "order" a Kubernetes cluster in a declarative way (for day-1, but also reconcile all management activities for day-2).

By leveraging Kubernetes as base infrastructure, we were able to devise a combined that, when configured with custom heuristics, scales all control plane components up/down or out/in automatically. This enables a fast scale-out, even beyond the capacity of typically some fixed number of master nodes. This architectural feature is one of the main differences compared to many other Kubernetes cluster provisioning tools. But in our production, Gardener does not only effectively reduce the total costs of ownership by bin-packing control planes. It also simplifies implementation of "day-2 operations" (like cluster updates or robustness qualities). Again, essentially by relying on all the mature Kubernetes features and capabilities.

The newly introduced extension concepts for Gardener now enable providers to only maintain their specific extension without the necessity to develop inside the core source tree.

Extensibility

As result of its growth over the past years, the Kubernetes code base contained a numerous amount of provider-specific code that is now being externalized from its core source tree. The same has happened with Project Gardener: over time, lots of specifics for cloud providers, operating systems, network plugins, etc. have been accumulated. Generally, this leads to a significant increase of efforts when it comes to maintainability, testability, or to new releases. Our community member Gardener support for their infrastructure in-tree, and suffered from the mentioned downsides.

Consequently, similar to how the Kubernetes community decided to move their cloud-controller-managers out-of-tree, or volumes plugins to CSI, etc., the Gardener community and implemented likewise extension concepts. The Gardener core source-tree is now devoid of any provider specifics, allowing vendors to solely focus on their infrastructure specifics, and enabling core contributors becoming more agile again.

Typically, setting up a cluster requires a flow of interdependent steps, beginning with the generation of certificates and preparation of the infrastructure, continuing with the provisioning of the control plane and the worker nodes, and ending with the deployment of system components. We would like to emphasize here that all these steps are necessary (cf. ) and all Kubernetes cluster creation tools implement the same steps (automated to some degree) in one way or another.

The general idea of Gardener's extensibility concept was to make this flow more generic and to carve out custom resources for each step which can serve as ideal extension points.

Figure 1 Cluster reconciliation flow with extension points.

With Gardener's flow framework we implicitly have a reproducible state machine for all infrastructures and all possible states of a cluster.

The Gardener extensibility approach defines custom resources that serve as ideal extension points for the following categories:

DNS providers (e.g., Route53, CloudDNS, ...),
Blob storage providers (e.g., S3, GCS, ABS,...),
Infrastructure providers (e.g., AWS, GCP, Azure, ...),
Operating systems (e.g., CoreOS Container Linux, Ubuntu, FlatCar Linux, ...),
Network plugins (e.g., Calico, Flannel, Cilium, ...),
Non-essential extensions (e.g., Let's Encrypt certificate service).

Extension Points

Besides leveraging custom resource definitions, we also effectively use mutating / validating webhooks in the seed clusters. Extension controllers themselves run in these clusters and react on CRDs and workload resources (like Deployment, StatefulSet, etc.) they are responsible for. Similar to the Cluster API's approach, these CRDs may also contain provider specific information.

The steps 2. - 10. [cf. Figure 1] involve infrastructure specific meta data referring to infrastructure specific implementations, e.g. for DNS records there might be aws-route53, google-clouddns, or for isolated networks even openstack-designate, and many more. We are going to examine the steps 4 and 6 in the next paragraphs as examples for the general concepts (based on the implementation for AWS). If you're interested you can read up the fully documented API contract in our extensibility documents.

Example: `Infrastructure` CRD

Kubernetes clusters on AWS require a certain infrastructure preparation before they can be used. This includes, for example, the creation of a VPC, subnets, etc. The purpose of the Infrastructure CRD is to trigger this preparation:

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foobar--aws
spec:
  type: aws
  region: eu-west-1
  secretRef:
    name: cloudprovider
    namespace: shoot--foobar—aws
  sshPublicKey: c3NoLXJzYSBBQUFBQ...
  providerConfig:
    apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureConfig
    networks:
      vpc:
        cidr: 10.250.0.0/16
      zones:
      - name: eu-west-1a
        internal: 10.250.112.0/22
        public: 10.250.96.0/22
        workers: 10.250.0.0/19

Based on the Shoot resource, Gardener creates this Infrastructure resource as part of its reconciliation flow. The AWS-specific providerConfig is part of the end-user's configuration in the Shoot resource and not evaluated by Gardener but just passed to the extension controller in the seed cluster.

In its current implementation, the AWS extension creates a new VPC and three subnets in the eu-west-1a zones. Also, it creates a NAT and an internet gateway, elastic IPs, routing tables, security groups, IAM roles, instances profiles, and an EC2 key pair.

After it has completed its tasks it will report the status and some provider-specific output:

apiVersion: extensions.gardener.cloud/v1alpha1
kind: Infrastructure
metadata:
  name: infrastructure
  namespace: shoot--foobar--aws
spec: ...
status:
  lastOperation:
    type: Reconcile
    state: Succeeded
  providerStatus:
    apiVersion: aws.provider.extensions.gardener.cloud/v1alpha1
    kind: InfrastructureStatus
    ec2:
      keyName: shoot--foobar--aws-ssh-publickey
    iam:
      instanceProfiles:
      - name: shoot--foobar--aws-nodes
        purpose: nodes
      roles:
      - arn: "arn:aws:iam::<accountID>:role/shoot..."
        purpose: nodes
    vpc:
      id: vpc-0815
      securityGroups:
      - id: sg-0246
        purpose: nodes
      subnets:
      - id: subnet-1234
        purpose: nodes
        zone: eu-west-1b
      - id: subnet-5678
        purpose: public
        zone: eu-west-1b

The information inside the providerStatus can be used in subsequent steps, e.g. to configure the cloud-controller-manager or to instrument the machine-controller-manager.

Example: Deployment of the Cluster Control Plane

One of the major features of Gardener is the homogeneity of the clusters it manages across different infrastructures. Consequently, it is still in charge of deploying the provider-independent control plane components into the seed cluster (like etcd, kube-apiserver). The deployment of provider-specific control plane components like cloud-controller-manager or CSI controllers is triggered by a dedicated ControlPlane CRD. In this paragraph, however, we want to focus on the customization of the standard components.

Let's focus on both the kube-apiserver and the kube-controller-manager Deployments. Our AWS extension for Gardener is not yet using CSI but relying on the in-tree EBS volume plugin. Hence, it needs to enable the PersistentVolumeLabel admission plugin and to provide the cloud provider config to the kube-apiserver. Similarly, the kube-controller-manager will be instructed to use its in-tree volume plugin.

The kube-apiserver Deployment incorporates the kube-apiserver container and is deployed by Gardener like this:

containers:
- command:
  - /hyperkube
  - apiserver
  - --enable-admission-plugins=Priority,...,NamespaceLifecycle
  - --allow-privileged=true
  - --anonymous-auth=false
  ...

Using a MutatingWebhookConfiguration the AWS extension injects the mentioned flags and modifies the spec as follows:

containers:
- command:
  - /hyperkube
  - apiserver
  - --enable-admission-plugins=Priority,...,NamespaceLifecycle,PersistentVolumeLabel
  - --allow-privileged=true
  - --anonymous-auth=false
  ...
  - --cloud-provider=aws
  - --cloud-config=/etc/kubernetes/cloudprovider/cloudprovider.conf
  - --endpoint-reconciler-type=none
  ...
  volumeMounts:
  - mountPath: /etc/kubernetes/cloudprovider
    name: cloud-provider-config
volumes:
- configMap:
    defaultMode: 420
    name: cloud-provider-config
  name: cloud-provider-config

The kube-controller-manager Deployment is handled in a similar way.

Webhooks in the seed cluster can be used to mutate anything related to the shoot cluster control plane deployed by Gardener or any other extension. There is a similar webhook concept for resources in shoot clusters in case extension controllers need to customize system components deployed by Gardener.

Registration of Extension Controllers

The Gardener API uses two special resources to register and install extensions. The registration itself is declared via the ControllerRegistration resource. The easiest option is to define the Helm chart as well as some values to render the chart, however, any other deployment mechanism is supported via custom code as well.

Gardener determines whether an extension controller is required in a specific seed cluster, and creates a ControllerInstallation that is used to trigger the deployment.

To date, every registered extension controller is deployed to every seed cluster which is not necessary in general. In the future, Gardener will become more selective to only deploy those extensions required on the specific seed clusters.

Our dynamic registration approach allows to add or remove extensions in the running system - without the necessity to rebuild or restart any component.

Figure 2 Gardener architecture with extension controllers.

Status Quo

We have recently introduced the new core.gardener.cloud API group that incorporates fully forwards and backwards compatible Shoot resources, and that allows providers to use Gardener without modifying anything in its core source tree.

We have already adapted all controllers to use this new API group and have deprecated the old API. Eventually, after a few months we will remove it, so end-users are advised to start migrating to the new API soon.

Apart from that, we have enabled all relevant extensions to contribute to the shoot health status and implemented the respective contract. The basic idea is that the CRDs may have .status.conditions that are picked up by Gardener and merged with its standard health checks into the Shoot status field.

Also, we want to implement some easy-to-use library functions facilitating defaulting and validation webhooks for the CRDs in order to validate the providerConfig field controlled by end-users.

Finally, we will split the repository into separate repositories and keep it only for the generic library functions that can be used to write extension controllers.

Next Steps

Kubernetes has externalized many of the infrastructural management challenges. The inception design solves most of them by delegating lifecycle operations to a separate management plane (seed clusters). But what if the garden cluster or a seed cluster goes down? How do we scale beyond tens of thousands of managed clusters that need to be reconciled in parallel? We are further investing into hardening the Gardener scalability and disaster recovery features. Let's briefly highlight three of the features in more detail:

Gardenlet

Right from the beginning of the Gardener Project we started implementing the operator pattern: We have a custom controller-manager that acts on our own custom resources. Now, when you start thinking about the Gardener architecture, you will recognize some interesting similarity with respect to the Kubernetes architecture: Shoot clusters can be compared with pods, and seed clusters can be seen as worker nodes. Guided by this observation we introduced the gardener-scheduler. Its main task is to find an appropriate seed cluster to host the control-plane for newly ordered clusters, similar to how the kube-scheduler finds an appropriate node for newly created pods. By providing multiple seed clusters for a region (or provider) and distributing the workload, we reduce the blast-radius of potential hick-ups as well.

Figure 3 Similarities between Kubernetes and Gardener architecture.

Yet, there is still a significant difference between the Kubernetes and the Gardener architectures: Kubernetes runs a primary "agent" on every node, the kubelet, which is mainly responsible for managing pods and containers on its particular node. Gardener uses its controller-manager which is responsible for all shoot clusters on all seed clusters, and it is performing its reconciliation loops centrally from the garden cluster.

While this works well at scale for thousands of clusters today, our goal is to enable true scalability following the Kubernetes principles (beyond the capacity of a single controller-manager): We are now working on distributing the logic (or the Gardener operator) into the seed cluster and will introduce a corresponding component, adequately named the gardenlet. It will be Gardener's primary "agent" on every seed cluster and will be only responsible for shoot clusters located in its particular seed cluster.

The gardener-controller-manager will still keep its control loops for other resources of the Gardener API, however, it will no longer talk to seed/shoot clusters.

Reversing the control flow will even allow placing seed/shoot clusters behind firewalls without the necessity of direct accessibility (via VPN tunnels) anymore.

Figure 4 Detailed architecture with Gardenlet.

Control Plane Migration between Seed Clusters

When a seed cluster fails, the user's static workload will continue to operate. However, administrating the cluster won't be possible anymore because the shoot cluster's API server running in the failed seed is no longer reachable.

We have implemented the relocation of failed control planes hit by some seed disaster to another seed and are now working on fully automating this unique capability. In fact, this approach is not only feasible, we have performed the fail-over procedure multiple times in our production.

The automated failover capability will enable us to implement even more comprehensive disaster recovery and scalability qualities, e.g., the automated provisioning and re-balancing of seed clusters or automated migrations for all non-foreseeable cases. Again, think about the similarities with Kubernetes with respect to pod eviction and node drains.

Gardener Ring

The Gardener Ring is our novel approach for provisioning and managing Kubernetes clusters without relying on an external provision tool for the initial cluster. By using Kubernetes in a recursive manner, we can drastically reduce the management complexity by avoiding imperative tool sets, while creating new qualities with a self-stabilizing circular system.

The Ring approach is conceptually different from self-hosting and static pod based deployments. The idea is to create a ring of three (or more) shoot clusters that each host the control plane of its successor.

An outage of one cluster will not affect the stability and availability of the Ring, and as the control plane is externalized the failed cluster can be automatically recovered by Gardener's self-healing capabilities. As long as there is a quorum of at least n/2+1 available clusters the Ring will always stabilize itself. Running these clusters on different cloud providers (or at least in different regions / data centers) reduces the potential for quorum losses.

Figure 5 Self-stabilizing ring of Kubernetes clusters.

The way how the distributed instances of Gardener can share the same data is by deploying separate kube-apiserver instances talking to the same etcd cluster. These kube-apiservers are forming a node-less Kubernetes cluster that can be used as "data container" for Gardener and its associated applications.

We are running test landscapes internally protected by the ring and it has saved us from manual interventions. With the automated control plane migration in place we can easily bootstrap the Ring and will solve the "initial cluster problem" as well as improve the overall robustness.