The highest tagged major version is v2.

kuberhealthy

Version: v1.0.2 (latest) | Published: Feb 27, 2019 | License: Apache-2.0

Repository

github.com/Comcast/kuberhealthy

README

Easy synthetic testing for Kubernetes clusters. Supplements other solutions like Prometheus nicely.

Installation

To install the Helm chart without Prometheus:

helm install stable/kuberhealthy

To install the Helm chart with Prometheus:

helm install stable/kuberhealthy --set prometheus.enabled=true

To install the Helm chart with Prometheus Operator:

helm install stable/kuberhealthy --set prometheus.enabled=true --set prometheus.serviceMonitor=true

After installation, Kuberhealthy will only be available from within the cluster (Type: ClusterIP) at the service URL kuberhealthy.kuberhealthy. To expose Kuberhealthy to an external checking service, you must edit the service kuberhealthy and set Type: LoadBalancer.

RBAC bindings and roles are included in all configurations.

Kuberhealthy is currently tested on Kubernetes 1.9.x, 1.10.x, and 1.11.x.

Prometheus Alerts

A ServiceMonitor configuration is available at deploy/servicemonitor.yaml.

Grafana Dashboard

A Grafana dashboard is available at deploy/grafana/dashboard.json.

To install this dashboard, follow the instructions here.

What is Kuberhealthy?

Kuberhealthy performs synthetic tests from within Kubernetes clusters in order to catch issues that would otherwise go unnoticed. Instead of trying to identify everything that could potentially go wrong, Kuberhealthy replicates real workflows and watches carefully for the expected Kubernetes behavior to occur. Kuberhealthy serves both a JSON status page and a Prometheus metrics endpoint for integration into your choice of alerting solution. More checks will be added in future versions to better cover service provisioning, DNS resolution, disk provisioning, and more.

Some examples of errors Kuberhealthy has detected in production:

Nodes where new pods get stuck in Terminating due to CNI communication failures
Nodes where new pods get stuck in ContainerCreating due to disk scheduler errors
Nodes where new pods get stuck in Pending due to Docker daemon errors
Nodes where Docker or Kubelet crashes or has restarted
A node that cannot provision or terminate pods quickly enough due to high IO wait
A pod in the kube-system namespace that is restarting too quickly
A Kubernetes component that is in a non-ready state
Intermittent failures to access or create custom resources
Kubernetes system services remaining technically "healthy" while their underlying pods are crashing too often:
- kube-scheduler
- kube-apiserver
- kube-dns

Deployment and Status Page

Deploying Kuberhealthy is as simple as installing the Helm chart in this repository:

cd helm
helm install .

Status Page

If you choose to alert from the JSON status page, you can access the status at http://kuberhealthy.kuberhealthy.svc.cluster.local. The status page displays server status in the format shown below. The boolean OK field can be used to indicate up/down status, while the Errors array contains a list of error descriptions. Granular, per-check information, including the last time a check was run and the Kuberhealthy pod that ran that specific check, is available under the CheckDetails object.

{
  "OK": true,
  "Errors": [],
  "CheckDetails": {
    "ComponentStatusChecker": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:32:16.921733843Z",
      "AuthorativePod": "kuberhealthy-7cf79bdc86-m78qr"
    },
    "DaemonSetChecker": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:31:33.845218901Z",
      "AuthorativePod": "kuberhealthy-7cf79bdc86-m78qr"
    },
    "PodRestartChecker namespace kube-system": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:31:16.45395092Z",
      "AuthorativePod": "kuberhealthy-7cf79bdc86-m78qr"
    },
    "PodStatusChecker namespace kube-system": {
      "OK": true,
      "Errors": [],
      "LastRun": "2018-06-21T17:32:16.453911089Z",
      "AuthorativePod": "kuberhealthy-7cf79bdc86-m78qr"
    }
  },
  "CurrentMaster": "kuberhealthy-7cf79bdc86-m78qr"
}
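
For alerting from the JSON status page, the format above can be decoded with Go's standard library. The following is a minimal sketch, not Kuberhealthy's own code: the type and function names (Status, CheckDetail, ParseStatus, FailingChecks) are assumptions made for illustration, and only the JSON keys are taken from the sample. The sample body in main is an abbreviated, hand-edited variant of the output above with one check marked as failing.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
	"time"
)

// CheckDetail mirrors one entry under "CheckDetails" in the sample output.
// The Go type names are hypothetical; the JSON keys are copied from the sample.
type CheckDetail struct {
	OK             bool      `json:"OK"`
	Errors         []string  `json:"Errors"`
	LastRun        time.Time `json:"LastRun"`
	AuthorativePod string    `json:"AuthorativePod"` // spelling as served by the status page
}

// Status mirrors the top-level status page document.
type Status struct {
	OK            bool                   `json:"OK"`
	Errors        []string               `json:"Errors"`
	CheckDetails  map[string]CheckDetail `json:"CheckDetails"`
	CurrentMaster string                 `json:"CurrentMaster"`
}

// ParseStatus decodes a status page body.
func ParseStatus(body []byte) (Status, error) {
	var s Status
	err := json.Unmarshal(body, &s)
	return s, err
}

// FailingChecks returns the sorted names of checks whose OK field is false.
func FailingChecks(s Status) []string {
	var names []string
	for name, detail := range s.CheckDetails {
		if !detail.OK {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	// Abbreviated sample with a made-up error message for illustration.
	body := []byte(`{
	  "OK": false,
	  "Errors": ["example error"],
	  "CheckDetails": {
	    "DaemonSetChecker": {
	      "OK": false,
	      "Errors": ["example error"],
	      "LastRun": "2018-06-21T17:31:33.845218901Z",
	      "AuthorativePod": "kuberhealthy-7cf79bdc86-m78qr"
	    },
	    "ComponentStatusChecker": {
	      "OK": true,
	      "Errors": [],
	      "LastRun": "2018-06-21T17:32:16.921733843Z",
	      "AuthorativePod": "kuberhealthy-7cf79bdc86-m78qr"
	    }
	  },
	  "CurrentMaster": "kuberhealthy-7cf79bdc86-m78qr"
	}`)

	status, err := ParseStatus(body)
	if err != nil {
		fmt.Println("decode failed:", err)
		return
	}
	fmt.Println("overall OK:", status.OK)
	fmt.Println("failing checks:", FailingChecks(status))
}
```

In a real monitor the body would come from an HTTP GET against the status URL from inside the cluster; that part is left out here to keep the sketch self-contained.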

High Availability

Kuberhealthy scales horizontally in order to be fault tolerant. By default, two instances are used with a pod disruption budget and RollingUpdate strategy to ensure high availability.

Centralized Check State

The state of checks is centralized as custom resource records for each check. This allows Kuberhealthy to always serve the same result, no matter which node in the pool you hit. The current master running checks is calculated by all nodes in the deployment by simply querying the Kubernetes API for 'Ready' Kuberhealthy pods of the correct label, and sorting them alphabetically by name. The node that comes first is master.
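
The master selection rule described above (take the 'Ready' pods, sort them by name, pick the first) can be sketched in a few lines of Go. This is an illustration under stated assumptions, not the code in the masterCalculation package: the Pod type is a simplified stand-in for a Kubernetes pod record, and the pod names other than the one from the sample status output are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// Pod is a hypothetical, simplified stand-in for a Kubernetes pod record.
type Pod struct {
	Name  string
	Ready bool
}

// CalculateMaster returns the name of the Ready pod that sorts first
// alphabetically. The boolean is false when no pod is Ready.
func CalculateMaster(pods []Pod) (string, bool) {
	var ready []string
	for _, p := range pods {
		if p.Ready {
			ready = append(ready, p.Name)
		}
	}
	if len(ready) == 0 {
		return "", false
	}
	sort.Strings(ready)
	return ready[0], true
}

func main() {
	pods := []Pod{
		{Name: "kuberhealthy-7cf79bdc86-x9y8z", Ready: true},
		{Name: "kuberhealthy-7cf79bdc86-m78qr", Ready: true},
		{Name: "kuberhealthy-7cf79bdc86-a1b2c", Ready: false}, // sorts first but is not Ready
	}
	master, ok := CalculateMaster(pods)
	fmt.Println(master, ok) // prints "kuberhealthy-7cf79bdc86-m78qr true"
}
```

Because every instance applies the same deterministic rule to the same API data, the instances agree on the master without talking to each other.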

Checks

Kuberhealthy performs the following checks in parallel at all times:

Daemonset Deployment and Termination

Deploys a daemonset to the kuberhealthy namespace, waits for all pods to be in the 'Ready' state, then terminates them and ensures all pod terminations were successful. Containers are deployed with their resource requirements set to 0 cores and 0 memory and use the pause container from Google (gcr.io/google_containers/pause:0.8.0). Because kubelet already uses the pause container for various tasks, the image should be cached on your nodes at all times. The node-role.kubernetes.io/master NoSchedule taint is tolerated by daemonset testing pods. If a failure occurs anywhere in the daemonset deployment or teardown, an error is shown on the status page describing the issue.

Namespace: kuberhealthy
Timeout: 5 minutes
Check Interval: 15 minutes
Check name: daemonSet
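
Each check in this section lists a timeout. One common way to enforce a per-run timeout in Go is context.WithTimeout, sketched below. This is an assumption about how such a limit could be applied, not a description of Kuberhealthy's internals: CheckFunc, RunWithTimeout, quickCheck, stuckCheck, and demoTimeout are all hypothetical names, and the demo uses a 50 millisecond timeout instead of the 5 minutes listed above so that it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// CheckFunc is a hypothetical signature for a single run of a check.
type CheckFunc func(ctx context.Context) error

// RunWithTimeout runs check once. It returns the check's own error, or a
// timeout error if the check has not finished when the timeout elapses.
func RunWithTimeout(check CheckFunc, timeout time.Duration) error {
	ctx, cancel := context.WithTimeout(context.Background(), timeout)
	defer cancel()

	done := make(chan error, 1) // buffered so the goroutine never blocks on send
	go func() { done <- check(ctx) }()

	select {
	case err := <-done:
		return err
	case <-ctx.Done():
		return fmt.Errorf("check timed out after %s: %w", timeout, ctx.Err())
	}
}

// demoTimeout is deliberately short; the daemonset check above lists 5 minutes.
const demoTimeout = 50 * time.Millisecond

// quickCheck succeeds immediately.
func quickCheck(ctx context.Context) error { return nil }

// stuckCheck never completes on its own and only returns once the context ends.
func stuckCheck(ctx context.Context) error {
	<-ctx.Done()
	return ctx.Err()
}

func main() {
	fmt.Println("quick check error:", RunWithTimeout(quickCheck, demoTimeout))

	err := RunWithTimeout(stuckCheck, demoTimeout)
	fmt.Println("stuck check timed out:", errors.Is(err, context.DeadlineExceeded))
}
```

Passing the context into the check lets a well-behaved check stop its own work (for example, abandon a wait for pods to become Ready) as soon as the deadline passes.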

Component Health

Checks the state of cluster componentstatuses. Kubernetes components include the etcd and etcd-event deployments, the Kubernetes scheduler, and the Kubernetes controller manager. This is almost the same as running kubectl get componentstatuses. If a componentstatus is down for 5 minutes, an alert is shown on the status page.

Timeout: 1 minute
Check Interval: 2 minutes
Downtime toleration: 5 minutes
Check name: componentStatus

Excessive Pod Restarts

Checks for excessive pod restarts in the kube-system namespace. If a pod has restarted more than five times in an hour, an error is indicated on the status page. The exact pod's name will be shown as one of the Error field's strings.

The command line flag --podCheckNamespaces can optionally contain a comma-separated list of namespaces on which to run the podRestarts checks. The default value is kube-system. Each namespace for which the check is configured requires the get and list RBAC verbs on the pods resource within that namespace.
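
As an illustration of how a comma-separated namespace flag can be consumed, here is a minimal sketch using Go's standard flag package. Only the flag name and its default value come from the text above; the ParseNamespaces helper, the usage string, and the trimming of whitespace and empty entries are assumptions, not Kuberhealthy's actual parsing behavior.

```go
package main

import (
	"flag"
	"fmt"
	"strings"
)

// ParseNamespaces splits a comma-separated flag value into namespace names,
// trimming surrounding whitespace and dropping empty entries.
func ParseNamespaces(value string) []string {
	var namespaces []string
	for _, ns := range strings.Split(value, ",") {
		if ns = strings.TrimSpace(ns); ns != "" {
			namespaces = append(namespaces, ns)
		}
	}
	return namespaces
}

func main() {
	// Flag name and default are from the README; the usage text is illustrative.
	raw := flag.String("podCheckNamespaces", "kube-system",
		"comma-separated list of namespaces to run pod checks in")
	flag.Parse()

	for _, ns := range ParseNamespaces(*raw) {
		fmt.Println("would run pod checks in namespace:", ns)
	}
}
```

Running the sketch with --podCheckNamespaces=kube-system,monitoring would list both namespaces; each one would still need the RBAC permissions described above.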

Namespace: kube-system
Timeout: 3 minutes
Check Interval: 5 minutes
Tolerated restarts per pod over 1 hour: 5
Check name: podRestarts
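
The restart tolerance above (more than five restarts within one hour is an error) can be expressed as a sliding-window count. The sketch below is one possible way to model the rule, not the podRestarts package itself: RestartLog, its methods, demoNow, and the pod names are hypothetical, and only the one-hour window and the tolerance of 5 come from the list above.

```go
package main

import (
	"fmt"
	"sort"
	"time"
)

// Values taken from the settings listed above.
const (
	RestartWindow     = time.Hour
	ToleratedRestarts = 5
)

// RestartLog maps a pod name to the times at which it was seen restarting.
// Hypothetical structure for illustration.
type RestartLog map[string][]time.Time

// Record notes one restart of pod at the given time.
func (l RestartLog) Record(pod string, at time.Time) {
	l[pod] = append(l[pod], at)
}

// Excessive returns the sorted names of pods with more than ToleratedRestarts
// restarts inside the RestartWindow that ends at now.
func (l RestartLog) Excessive(now time.Time) []string {
	cutoff := now.Add(-RestartWindow)
	var offenders []string
	for pod, times := range l {
		count := 0
		for _, t := range times {
			if t.After(cutoff) && !t.After(now) {
				count++
			}
		}
		if count > ToleratedRestarts {
			offenders = append(offenders, pod)
		}
	}
	sort.Strings(offenders)
	return offenders
}

// demoNow is a fixed reference time so the example is deterministic.
var demoNow = time.Date(2018, 6, 21, 17, 32, 0, 0, time.UTC)

func main() {
	restarts := RestartLog{}
	for i := 0; i < 6; i++ {
		// Six restarts, ten minutes apart, all within the last hour.
		restarts.Record("kube-dns-abc", demoNow.Add(-time.Duration(i)*10*time.Minute))
	}
	restarts.Record("kube-proxy-xyz", demoNow.Add(-30*time.Minute)) // a single restart

	fmt.Println(restarts.Excessive(demoNow)) // prints "[kube-dns-abc]"
}
```

Exactly five restarts in the window is still tolerated in this sketch; only the sixth pushes a pod onto the error list, matching the "more than five times" wording.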

Pod Status

Checks for pods older than ten minutes in the kube-system namespace that are in an incorrect lifecycle phase (anything that is not 'Ready'). If the podStatus check detects a pod down for 5 minutes, an alert is shown on the status page. When a pod is found to be in error, the exact pod's name will be shown as one of the Error field's strings.

The command line flag --podCheckNamespaces can optionally contain a comma-separated list of namespaces on which to run the podStatus checks. The default value is kube-system. Each namespace for which the check is configured requires the get and list RBAC verbs on the pods resource within that namespace.

Namespace: kube-system
Timeout: 1 minute
Check Interval: 2 minutes
Error state toleration: 5 minutes
Check name: podStatus

Security Considerations

By default, Kuberhealthy exposes an insecure (non-HTTPS) status endpoint without authentication. You should never expose this endpoint to the public internet, because the error details displayed on the status page could reveal private cluster information.

Directories

Path	Synopsis
cmd
  kuberhealthy	Kuberhealthy is an enhanced health check for Kubernetes clusters.
pkg (module)
  checks/componentStatus	Package componentStatus implements a componentstatus checker.
  checks/daemonSet	Package daemonSet contains a Kuberhealthy check for the ability to roll out a daemonset to a cluster.
  checks/podRestarts
  checks/podStatus	Package podStatus implements a pod health checker for Kuberhealthy.
  health
  khstatecrd
  kubeClient
  masterCalculation	Package masterCalculation determines the master pod in multi-pod Kuberhealthy deployments.
  metrics