The Machines Can Do the Work, a Story of Kubernetes Testing, CI, and Automating the Contributor Experience

Wednesday, August 29, 2018

Author: Aaron Crickenberger (Google) and Benjamin Elder (Google)

“Large projects have a lot of less exciting, yet, hard work. We value time spent automating repetitive work more highly than toil. Where that work cannot be automated, it is our culture to recognize and reward all types of contributions. However, heroism is not sustainable.” -

Like many open source projects, Kubernetes is hosted on GitHub. We felt the barrier to participation would be lowest if the project lived where developers already worked, using tools and processes developers already knew. Thus the project embraced the service fully: it was the basis of our workflow, our issue tracker, our documentation, our blog platform, our team structure, and more.

This strategy worked. It worked so well that the project quickly scaled past its contributors’ capacity as humans. What followed was an incredible journey of automation and innovation. We didn’t just need to rebuild our airplane mid-flight without crashing, we needed to convert it into a rocketship and launch into orbit. We needed machines to do the work.

The Work

Initially, we focused on the fact that we needed to support the sheer volume of tests mandated by a complex distributed system such as Kubernetes. Real world failure scenarios had to be exercised via end-to-end (e2e) tests to ensure proper functionality. Unfortunately, e2e tests were susceptible to flakes (random failures) and took anywhere from an hour to a day to complete.

Further experience revealed other areas where machines could do the work for us:

PR Workflow
- Did the contributor sign our CLA?
- Did the PR pass tests?
- Is the PR mergeable?
- Did the merge commit pass tests?
Triage
- Who should be reviewing PRs?
- Is there enough information to route an issue to the right people?
- Is an issue still relevant?
Project Health
- What is happening in the project?
- What should we be paying attention to?

As we developed automation to improve our situation, we followed a few guiding principles:

Follow the push/poll control loop patterns that worked well for Kubernetes
Prefer stateless loosely coupled services that do one thing well
Prefer empowering the entire community over empowering a few core contributors
Eat our own dogfood and avoid reinventing wheels

Enter Prow

This led us to create

Prow lets us do things like:

Allow our community to triage issues/PRs by commenting commands such as “/priority critical-urgent”, “/assign mary” or “/close”

Auto-label PRs based on how much code they change, or which files they touch

Age out issues/PRs that have remained inactive for too long

Auto-merge PRs that meet our PR workflow requirements

Run CI jobs defined as
Enforce org-wide and per-repo GitHub policies like

Prow was initially developed by the engineering productivity team building Google Kubernetes Engine, and is actively contributed to by multiple members of Kubernetes SIG Testing. Prow has been adopted by several other open source projects, including Istio, JetStack, Knative and OpenShift.

Once we had Prow in place, we began to hit other scaling bottlenecks, and so produced additional tooling to support testing at the scale required by Kubernetes, including:

Scaling Project Health

With workflow automation addressed, we turned our attention to project health. We chose to use Google Cloud Storage (GCS) as our source of truth for all test data, allowing us to lean on established infrastructure, and allowed the community to contribute results. We then built a variety of tools to help individuals and the project as a whole make sense of this data, including:

We approached the Cloud Native Computing Foundation (CNCF) to develop DevStats to glean insights from our GitHub events such as:

Into the Beyond

Today, the Kubernetes project spans over 125 repos across five orgs. There are 31 Special Interests Groups and 10 Working Groups coordinating development within the project. In the last year the project has had participation from over 13,800 unique developers on GitHub.

On any given weekday our Prow instance runs over 10,000 CI jobs; from March 2017 to March 2018 it ran 4.3 million jobs. Most of these jobs involve standing up an entire Kubernetes cluster, and exercising it using real world scenarios. They allow us to ensure all supported releases of Kubernetes work across cloud providers, container engines, and networking plugins. They make sure the latest releases of Kubernetes work with various optional features enabled, upgrade safely, meet performance requirements, and work across architectures.

With today’s

Want to find out more? Come check out these resources:

Automation and the Kubernetes Contributor Experience

←Previous