SIG Node CI Subproject Celebrates Two Years of Test Improvements

Authors: Sergey Kanzhelev (Google), Elana Hashman (Red Hat)

Ensuring the reliability of SIG Node upstream code is a continuous effort that takes a lot of behind-the-scenes effort from many contributors. There are frequent releases of Kubernetes, base operating systems, container runtimes, and test infrastructure that result in a complex matrix that requires attention and steady investment to "keep the lights on." In May 2020, the Kubernetes node special interest group ("SIG Node") organized a new subproject for continuous integration (CI) for node-related code and tests. Since its inauguration, the SIG Node CI subproject has run a weekly meeting, and even the full hour is often not enough to complete triage of all bugs, test-related PRs and issues, and discuss all related ongoing work within the subgroup.

Over the past two years, we've fixed merge-blocking and release-blocking tests, reducing time to merge Kubernetes contributors' pull requests thanks to reduced test flakes. When we started, Node test jobs only passed 42% of the time, and through our efforts, we now ensure a consistent >90% job pass rate. We've closed 144 test failure issues and merged 176 pull requests just in kubernetes/kubernetes. And we've helped subproject participants ascend the Kubernetes contributor ladder, with 3 new org members, 6 new reviewers, and 2 new approvers.

The Node CI subproject is an approachable first stop to help new contributors get started with SIG Node. There is a low barrier to entry for new contributors to address high-impact bugs and test fixes, although there is a long road before contributors can climb the entire contributor ladder: it took over a year to establish two new approvers for the group. The complexity of all the different components that power Kubernetes nodes and its test infrastructure requires a sustained investment over a long period for developers to deeply understand the entire system, both at high and low levels of detail.

We have several regular contributors at our meetings, however; our reviewers and approvers pool is still small. It is our goal to continue to grow contributors to ensure a sustainable distribution of work that does not just fall to a few key approvers.

It's not always obvious how subprojects within SIGs are formed, operate, and work. Each is unique to its sponsoring SIG and tailored to the projects that the group is intended to support. As a group that has welcomed many first-time SIG Node contributors, we'd like to share some of the details and accomplishments over the past two years, helping to demystify our inner workings and celebrate the hard work of all our dedicated contributors!

Timeline

May 2020. SIG Node CI group was formed on May 11, 2020, with more than 30 volunteers signed up, to improve SIG Node CI signal and overall observability. Victor Pickard focused on getting original group charter document. The SIG Node chairs sponsored group creation with Victor as a subproject lead. Sergey Kanzhelev joined Victor shortly after as a co-lead.

At the kick-off meeting, we discussed which tests to concentrate on fixing first and discussed merge-blocking and release-blocking tests, many of which were failing due to infrastructure issues or buggy test code.

The subproject launched weekly hour-long meetings to discuss ongoing work discussion and triage.

June 2020. Morgan Bauer, Karan Goel, and Jorge Alarcon Ochoa were recognized as reviewers for the SIG Node CI group for their contributions, helping significantly with the early stages of the subproject. David Porter and Roy Yang also joined the SIG test failures GitHub team.

August 2020. All merge-blocking and release-blocking tests were passing, with some flakes. However, only 42% of all SIG Node test jobs were green, as there were many flakes and failing tests.

October 2020. Amim Knabben becomes a Kubernetes org member for his contributions to the subproject.

January 2021. With healthy presubmit and critical periodic jobs passing, the subproject discussed its goal for cleaning up the rest of periodic tests and ensuring they passed without flakes.

Elana Hashman joined the subproject, stepping up to help lead it after Victor's departure.

February 2021. Artyom Lukianov becomes a Kubernetes org member for his contributions to the subproject.

August 2021. After SIG Node successfully ran a to clean up its bug backlog, the scope of the meeting was extended to include bug triage to increase overall reliability, anticipating issues before they affect the CI signal.

Subproject leads Elana Hashman and Sergey Kanzhelev are both recognized as approvers on all node test code, supported by SIG Node and SIG Testing.

September 2021. After significant deflaking progress with serial tests in the 1.22 release spearheaded by Francesco Romani, the subproject set a goal for getting the serial job fully passing by the 1.23 release date.

Mike Miranda becomes a Kubernetes org member for his contributions to the subproject.

November 2021. Throughout 2021, SIG Node had no merge or release-blocking test failures. Many flaky tests from past releases are removed from release-blocking dashboards as they had been fully cleaned up.

Danielle Lancashire was recognized as a reviewer for SIG Node's subgroup, test code.

The final node serial tests were completely fixed. The serial tests consist of many disruptive and slow tests which tend to be flakey and are hard to troubleshoot. By the 1.23 release freeze, the last serial tests were fixed and the job was passing without flakes.

The 1.23 release got a special shout out for the tests quality and CI signal. The SIG Node CI subproject was proud to have helped contribute to such a high-quality release, in part due to our efforts in identifying and fixing flakes in Node and beyond.

December 2021. An estimated 90% of test jobs were passing at the time of the 1.23 release (up from 42% in August 2020).

Dockershim code was removed from Kubernetes. This affected nearly half of SIG Node's test jobs and the SIG Node CI subproject reacted quickly and retargeted all the tests. SIG Node was the first SIG to complete test migrations off dockershim, providing examples for other affected SIGs. The vast majority of new jobs passed at the time of introduction without further fixes required. The ) from Kubernetes is ongoing. There are still some wrinkles from the dockershim removal as we uncover more dependencies on dockershim, but we plan to stabilize all test jobs by the 1.24 release.

Statistics

Our regular meeting attendees and subproject participants for the past few months:

  • Aditi Sharma
  • Artyom Lukianov
  • Arnaud Meukam
  • Danielle Lancashire
  • David Porter
  • Davanum Srinivas
  • Elana Hashman
  • Francesco Romani
  • Matthias Bertschy
  • Mike Miranda
  • Paco Xu
  • Peter Hunt
  • Ruiwen Zhao
  • Ryan Phillips
  • Sergey Kanzhelev
  • Skyler Clark
  • Swati Sehgal
  • Wenjun Wu

The

Triaged issues and PRs on CI board (including triaging away from the subgroup scope):

  • 2020 (since May): 132
  • 2021: 532

Future

Just "keeping the lights on" is a bold task and we are committed to improving this experience. We are working to simplify the triage and review processes for SIG Node.

Specifically, we are working on better test organization, naming, and tracking:

We are also constantly making progress on improved tests debuggability and de-flaking.

If any of this interests you, we'd love for you to join us! There's plenty to learn in debugging test failures, and it will help you gain familiarity with the code that SIG Node maintains.

You can always find information about the group on the KubeCon + CloudNative North America 2021. Join us in our mission to keep the kubelet and other SIG Node components reliable and ensure smooth and uneventful releases!