K8s KPIs with Kuberhealthy
Authors: Joshulyne Park (Comcast), Eric Greer (Comcast)
Building Onward from Kuberhealthy v2.0.0
Last November at KubeCon San Diego 2019, we announced the release of
Kuberhealthy 2.0.0 - transforming Kuberhealthy into a Kubernetes operator
for synthetic monitoring. This new ability granted developers the means to create their own Kuberhealthy check
containers to synthetically monitor their applications and clusters. The community was quick to adopt this new feature and we're grateful for everyone who implemented and tested Kuberhealthy 2.0.0 in their clusters. Thanks to all of you who reported
issues and contributed to discussions on the #kuberhealthy Slack channel. We quickly set to work to address all your feedback
with a newer version of Kuberhealthy. Additionally, we created a guide on how to easily install and use Kuberhealthy in order to capture some helpful synthetic
To install Kuberhealthy, make sure you have . See additional details about configuring the appropriate scrape annotations in the section Prometheus Integration Details below. Running the Helm command should automatically install the newest version of Kuberhealthy (v2.2.0) along with a few basic checks. If you run Next, you can run To view other available external checks, check out the
Kuberhealthy check pods should start running shortly after Kuberhealthy starts running (1-2 minutes). Additionally, the check-reaper cronjob runs every few minutes to ensure there are no more than 5 completed checker pods left lying around at a time. To get status page view of these checks, you'll need to either expose the This JSON page displays all Kuberhealthy checks running in your cluster. If you have Kuberhealthy checks running in different namespaces, you can filter them by adding the Kuberhealthy is designed to be extended with custom check containers that can be written by anyone to check anything. These checks can be written in any language as long as they are packaged in a container. This makes Kuberhealthy an excellent platform for creating your own synthetic checks! Creating your own check is a great way to validate your client library, simulate real user workflow, and create a high level of confidence in your service or system uptime. To learn more about writing your own checks, along with simple examples, check the
When enabling Prometheus (not the operator), the Kuberhealthy service gets the following annotations added: In your prometheus configuration, add the following example scrape_config that scrapes the Kuberhealthy service given the added prometheus annotation: You can also specify the target endpoint to be scraped using this example job: Once the appropriate prometheus configurations are applied, you should be able to see the following Kuberhealthy metrics: Using these Kuberhealthy metrics, our team has been able to collect KPIs based on the following definitions, calculations, and PromQL queries. Availability We define availability as the K8s cluster control plane being up and functioning as expected. This is measured by our ability to create a deployment, do a rolling update, and delete the deployment within a set period of time. We calculate this by measuring Kuberhealthy's
Availability = Uptime / (Uptime * Downtime) Uptime = Number of Deployment Check Passes * Check Run Interval Downtime = Number of Deployment Check Fails * Check Run Interval Check Run Interval = how often the check runs ( PromQL Query (Availability % over the past 30 days): Utilization We define utilization as user uptake of product (k8s) and its resources (pods, services, etc.). This is measured by how many nodes, deployments, statefulsets, persistent volumes, services, pods, and jobs are being utilized by our customers.
We calculate this by counting the total number of nodes, deployments, statefulsets, persistent volumes, services, pods, and jobs. Duration (Latency) We define duration as the control plane's capacity and utilization of throughput. We calculate this by capturing the average run duration of a Kuberhealthy
Errors / Alerts We define errors as all k8s cluster and Kuberhealthy related alerts. Every time one of our Kuberhealthy check fails, we are alerted of this failure. Thanks again to everyone in the community for all of your contributions and help! We are excited to see what you build. As always, if you find an issue, have a feature request, or need to open a pull request, please
Deploying Kuberhealthy
To install using Helm 3:
1. Create namespace "kuberhealthy" in the desired Kubernetes cluster/context:
kubectl create namespace kuberhealthy
2. Set your current namespace to "kuberhealthy":
kubectl config set-context --current --namespace=kuberhealthy
3. Add the kuberhealthy repo to Helm:
helm repo add kuberhealthy https://comcast.github.io/kuberhealthy/helm-repos
4. Depending on your Prometheus implementation, install Kuberhealthy using the appropriate command for your cluster:
helm install kuberhealthy kuberhealthy/kuberhealthy --set prometheus.enabled=true,prometheus.enableAlerting=true,prometheus.enableScraping=true,prometheus.serviceMonitor=true
helm install kuberhealthy kuberhealthy/kuberhealthy --set prometheus.enabled=true,prometheus.enableAlerting=true,prometheus.enableScraping=true
helm install kuberhealthy kuberhealthy/kuberhealthy
kubectl get pods
, you should see two Kuberhealthy pods. These are the pods that create, coordinate, and track test pods. These two Kuberhealthy pods also serve a JSON status page as well as a /metrics
endpoint. Every other pod you see created is a checker pod designed to execute and shut down when done.Configuring Additional Checks
kubectl get khchecks
. You should see three Kuberhealthy checks installed by default:kuberhealthy
service externally by editing the service kuberhealthy
and setting Type: LoadBalancer
or use kubectl port-forward service/kuberhealthy 8080:80
. When viewed, the service endpoint will display a JSON status page that looks like this:{
"OK": true,
"Errors": [],
"CheckDetails": {
"kuberhealthy/daemonset": {
"OK": true,
"Errors": [],
"RunDuration": "22.512278967s",
"Namespace": "kuberhealthy",
"LastRun": "2020-04-06T23:20:31.7176964Z",
"AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
"uuid": "9abd3ec0-b82f-44f0-b8a7-fa6709f759cd"
},
"kuberhealthy/deployment": {
"OK": true,
"Errors": [],
"RunDuration": "29.142295647s",
"Namespace": "kuberhealthy",
"LastRun": "2020-04-06T23:20:31.7176964Z",
"AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
"uuid": "5f0d2765-60c9-47e8-b2c9-8bc6e61727b2"
},
"kuberhealthy/dns-status-internal": {
"OK": true,
"Errors": [],
"RunDuration": "2.43940936s",
"Namespace": "kuberhealthy",
"LastRun": "2020-04-06T23:20:44.6294547Z",
"AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
"uuid": "c85f95cb-87e2-4ff5-b513-e02b3d25973a"
}
},
"CurrentMaster": "kuberhealthy-7cf79bdc86-m78qr"
}
GET
variable namespace
parameter: ?namespace=kuberhealthy,kube-system
onto the status page URL.Writing Your Own Checks
Prometheus Integration Details
prometheus.io/path: /metrics
prometheus.io/port: "80"
prometheus.io/scrape: "true"
- job_name: 'kuberhealthy'
scrape_interval: 1m
honor_labels: true
metrics_path: /metrics
kubernetes_sd_configs:
- role: service
namespaces:
names:
- kuberhealthy
relabel_configs:
- source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
action: keep
regex: true
- job_name: kuberhealthy
scrape_interval: 1m
honor_labels: true
metrics_path: /metrics
static_configs:
- targets:
- kuberhealthy.kuberhealthy.svc.cluster.local:80
kuberhealthy_check
kuberhealthy_check_duration_seconds
kuberhealthy_cluster_states
kuberhealthy_running
Creating Key Performance Indicators
runInterval
set in your KuberhealthyCheck Spec)1 - (sum(count_over_time(kuberhealthy_check{check="kuberhealthy/deployment", status="0"}[30d])) OR vector(0)) / sum(count_over_time(kuberhealthy_check{check="kuberhealthy/deployment", status="1"}[30d]))
avg(kuberhealthy_check_duration_seconds{check="kuberhealthy/deployment"})
Thank You!