Reporting Errors from Control Plane to Applications Using Kubernetes Events

At .

One use case of Box’s control plane is

| | | Figure1: Block Diagram of the PKI flow |

If an application needs a new certificate, the application owner explicitly adds a Custom Resource Definition (CRD) to the application’s Kubernetes config [1]. This CRD specifies parameters for the SSL certificate: name, common name, and others. A microservice in the control plane watches CRDs and triggers some processing for SSL certificate generation [2]. Once the certificate is ready, the same control plane service sends it to the API server in a Kubernetes Secret [3]. After that, the application containers access their certificates using Kubernetes Secret VolumeMounts [4]. You can see a working demo of this system in our

The rest of this post covers the error scenarios in this “triggered” processing in the control plane. In particular, we are especially concerned with user input errors. Because the SSL certificate parameters come from the application’s config file in a CRD format, what should happen if there is an error in that CRD specification? Even a typo results in a failure of the SSL certificate creation. The error information is available in the control plane even though the root cause is most probably inside the application’s config file. The application owner does not have access to the control plane’s state or logs.

Providing the right diagnosis to the application owner so she can fix the mistake becomes a serious productivity problem at scale. Box’s rapid migration to microservices results in several new deployments every week. Numerous first time users, who do not know every detail of the infrastructure, need to succeed in deploying their services and troubleshooting problems easily. As the owners of the infrastructure, we do not want to be the bottleneck while reading the errors from the control plane logs and passing them on to application owners. If something in an owner’s configuration causes an error somewhere else, owners need a fully empowering diagnosis. This error data must flow automatically without any human involvement.

After considerable thought and experimentation, we found that

We experimented with a status web page for the control plane service as an alternative to Kubernetes Events. We determined that the status page could update every time after processing an SSL certificate, and that application owners could probe the status page and get the diagnosis from there. After experimenting with a status page initially, we have seen that this does not work as effectively as the Kubernetes Events solution. The status page becomes a new interface to learn for the application owner, a new web address to remember, and one more context switch to a distinct tool during troubleshooting efforts. On the other hand, Kubernetes Events show up cleanly at the kubectl describe output, which is easily recognized by the developers.

Here is a simplified example showing how we used Kubernetes Events for error reporting across distinct services. We have open sourced a

The sample application executes this

// eventRecorder returns an EventRecorder type that can be  
// used to post Events to different object's lifecycles.  
func eventRecorder(  
   kubeClient \*kubernetes.Clientset) (record.EventRecorder, error) {  
   eventBroadcaster := record.NewBroadcaster()  
   eventBroadcaster.StartLogging(glog.Infof)  
   eventBroadcaster.StartRecordingToSink(  
      &typedcorev1.EventSinkImpl{  
         Interface: kubeClient.CoreV1().Events("")})  
   recorder := eventBroadcaster.NewRecorder(  
      scheme.Scheme,  
      v1.EventSource{Component: "controlplane"})  
   return recorder, nil  
}

After the one-time setup, the following

ref, err := reference.GetReference(scheme.Scheme, &pod)  
if err != nil {  
   glog.Fatalf("Could not get reference for pod %v: %v\n",  
      pod.Name, err)  
}  
recorder.Event(ref, v1.EventTypeWarning, "pki ServiceName error",  
   fmt.Sprintf("ServiceName: %s in pki: %s is not found in"+  
      " allowedNames: %s", pki.Spec.ServiceName, pki.Name,  
      allowedNames))

Further implementation details can be understood by running the sample application.

As mentioned previously, here is the relevant kubectl describe output for the application owner.

Events:  
  FirstSeen   LastSeen   Count   From         SubObjectPath   Type      Reason         Message  
  ---------   --------   -----   ----         -------------   --------   ------     
  ....  
  1d      1m      24   controlplane            Warning      pki ServiceName error   ServiceName: appp1 in pki: app1-pki is not found in allowedNames: [app1 app2]  
  ....  

We have demonstrated a practical use case with Kubernetes Events. The automated feedback to programmers in the case of configuration errors has significantly improved our troubleshooting efforts. In the future, we plan to use Kubernetes Events in various other applications under similar use cases. The recently created

If you have a Kubernetes experience you’d like to share,

Special thanks for Greg Lyons and Mohit Soni for their contributions.
Hakan Baba, Sr. Software Engineer, Box