This feature is currently in beta. It is enabled by default.
Multi-Cluster Diagnostics offers a single view into the health of all the clusters you currently monitor with Kubecost.
Health checks include, but are not limited to:
Whether Kubecost is correctly emitting metrics
Whether Kubecost is being scraped by Prometheus
Whether Prometheus has scraped the required metrics
Whether Kubecost's ETL files are healthy
Configuration
# This is an abridged example. Full example in link below.
diagnostics:
  enabled: true
  isDiagnosticsPrimary:
    enabled: true  # Only enable this on your primary cluster

# Ensure you have configured a unique CLUSTER_ID.
prometheus:
  server:
    global:
      external_labels:
        cluster_id: YOUR_CLUSTER_ID

# Ensure you have configured a storage config secret. Using `.Values.thanos.storeSecretName` would also work here.
kubecostModel:
  federatedStorageConfigSecret: federated-store
Additional configuration options can be found in the values.yaml under diagnostics:.
Architecture
The multi-cluster diagnostics feature is run as an independent deployment (i.e. deployment/kubecost-diagnostics). Each diagnostics deployment monitors the health of Kubecost and sends that health data to the central object store at the /diagnostics filepath.
The below diagram depicts these interactions. This diagram is specific to the requests required for diagnostics only. For additional diagrams, see our multi-cluster guide.
API usage
The diagnostics API can be accessed through /model/multi-cluster-diagnostics?window=2d (or /model/mcd for short).
The window query parameter is required. The API returns all diagnostics recorded within the specified time window.
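As a rough sketch of calling this endpoint, the Python snippet below builds the request URL and fetches the JSON body using only the standard library. The base URL http://localhost:9090 is an assumption (for example, a local port-forward to the Kubecost frontend), and the helper names build_mcd_url and fetch_diagnostics are illustrative, not part of Kubecost.

```python
import json
import urllib.parse
import urllib.request

# Assumed address; replace with wherever your Kubecost frontend is reachable.
BASE_URL = "http://localhost:9090"


def build_mcd_url(base_url: str, window: str) -> str:
    """Return the multi-cluster diagnostics URL for the given window."""
    # urlencode escapes characters such as ':' and ',' in date-pair windows.
    query = urllib.parse.urlencode({"window": window})
    return f"{base_url.rstrip('/')}/model/multi-cluster-diagnostics?{query}"


def fetch_diagnostics(base_url: str, window: str) -> dict:
    """GET the diagnostics endpoint and decode the JSON response body."""
    url = build_mcd_url(base_url, window)
    with urllib.request.urlopen(url, timeout=30) as resp:
        return json.load(resp)
```

With a reachable Kubecost instance, fetch_diagnostics(BASE_URL, "2d") would return the decoded response for the last two days.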
The Multi-cluster Diagnostics API provides a single view into the health of all the clusters you currently monitor with Kubecost.
Query Parameters
| Name | Type | Description |
|------|------|-------------|
| window* | string | Duration of time over which to query. Accepts words like today, week, month, yesterday, lastweek, lastmonth; durations like 30m, 12h, 7d; comma-separated RFC3339 date pairs like 2021-01-02T15:04:05Z,2021-02-02T15:04:05Z; comma-separated Unix timestamp (seconds) pairs like 1578002645,1580681045. |
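To make the accepted window formats concrete, here is a minimal client-side sketch that classifies a window string into one of the four forms described above. It is not Kubecost's own parser: it assumes only the words and the m/h/d duration units listed here, and only the Z-suffixed RFC3339 form shown in the example, so the server may well accept values this helper reports as "unknown". The function name classify_window is hypothetical.

```python
import re
from datetime import datetime

# Only the words listed in the parameter description; the API may accept more.
KEYWORDS = {"today", "week", "month", "yesterday", "lastweek", "lastmonth"}
# Durations such as 30m, 12h, 7d (units assumed from the examples).
DURATION_RE = re.compile(r"\d+[mhd]")


def _is_rfc3339_utc(value: str) -> bool:
    """True for timestamps in the Z-suffixed form, e.g. 2021-01-02T15:04:05Z."""
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ")
        return True
    except ValueError:
        return False


def classify_window(window: str) -> str:
    """Return 'keyword', 'duration', 'rfc3339_pair', 'unix_pair', or 'unknown'."""
    if window in KEYWORDS:
        return "keyword"
    if DURATION_RE.fullmatch(window):
        return "duration"
    parts = window.split(",")
    if len(parts) == 2:
        if all(_is_rfc3339_utc(p) for p in parts):
            return "rfc3339_pair"
        if all(p.isdigit() for p in parts):
            return "unix_pair"
    return "unknown"
```

A check like this can catch typos before a request is sent, but the API response remains the authority on what is valid.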
{
  "code": 200,
  "data": {
    "overview": {
      "kubecostEmittingMetricDiagnosticPassed": true,
      "prometheusHasKubecostMetricDiagnosticPassed": true,
      "prometheusHasCadvisorMetricDiagnosticPassed": true,
      "prometheusHasKSMMetricDiagnosticPassed": true,
      "dailyAllocationEtlHealthyDiagnosticPassed": true,
      "dailyAssetEtlHealthyDiagnosticPassed": true,
      "kubecostPodsNotOOMKilledDiagnosticPassed": true,
      "kubecostPodsNotPendingDiagnosticPassed": false
    },
    "clusters": {
      "cluster_one": {
        "latestRun": "2023-12-12T22:42:32Z",
        "kubecostEmittingMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasKubecostMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasCadvisorMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasKSMMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "dailyAllocationEtlHealthy": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "dailyAssetEtlHealthy": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "kubecostPodsNotOOMKilled": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "kubecostPodsNotPending": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""}
      },
      "cluster_two": {
        "latestRun": "2023-12-12T22:40:17Z",
        "kubecostEmittingMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasKubecostMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasCadvisorMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasKSMMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "dailyAllocationEtlHealthy": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "dailyAssetEtlHealthy": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "kubecostPodsNotOOMKilled": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "kubecostPodsNotPending": {
          "diagnosticPassed": false,
          "numFailures": 52,
          "firstFailureDate": "2023-12-12T18:25:09Z",
          "diagnosticOutput": "RunDiagnostic: checkKubecostPodsNotPending: queryPrometheusCheckResultEmpty: the following query returned a non-empty result sum(kube_pod_status_phase{namespace='kubecost-etl-fed', phase='Pending'}) by (pod,namespace) > 0"
        }
      },
      "cluster_three": {
        "latestRun": "2023-12-12T22:40:15Z",
        "kubecostEmittingMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasKubecostMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasCadvisorMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "prometheusHasKSMMetric": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "dailyAllocationEtlHealthy": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "dailyAssetEtlHealthy": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "kubecostPodsNotOOMKilled": {"diagnosticPassed": true, "numFailures": 0, "firstFailureDate": "", "diagnosticOutput": ""},
        "kubecostPodsNotPending": {
          "diagnosticPassed": false,
          "numFailures": 52,
          "firstFailureDate": "2023-12-12T18:24:42Z",
          "diagnosticOutput": "RunDiagnostic: checkKubecostPodsNotPending: queryPrometheusCheckResultEmpty: the following query returned a non-empty result sum(kube_pod_status_phase{namespace='kubecost-etl-fed', phase='Pending'}) by (pod,namespace) > 0"
        }
      }
    }
  }
}
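A response of this shape can be reduced to just the failing checks per cluster. The following is a minimal sketch that assumes the field names shown in the example response (data, clusters, diagnosticPassed); the function name failing_diagnostics and the trimmed sample are illustrative only.

```python
def failing_diagnostics(response: dict) -> dict:
    """Map each cluster name to the list of diagnostics that did not pass."""
    failures = {}
    clusters = response.get("data", {}).get("clusters", {})
    for cluster, checks in clusters.items():
        failed = [
            name
            for name, result in checks.items()
            # "latestRun" is a timestamp string, not a diagnostic result.
            if isinstance(result, dict) and not result.get("diagnosticPassed", False)
        ]
        if failed:
            failures[cluster] = failed
    return failures


# Trimmed-down sample shaped like the example response (output text shortened).
sample = {
    "code": 200,
    "data": {
        "clusters": {
            "cluster_one": {
                "latestRun": "2023-12-12T22:42:32Z",
                "kubecostPodsNotPending": {
                    "diagnosticPassed": True,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": "",
                },
            },
            "cluster_two": {
                "latestRun": "2023-12-12T22:40:17Z",
                "kubecostPodsNotPending": {
                    "diagnosticPassed": False,
                    "numFailures": 52,
                    "firstFailureDate": "2023-12-12T18:25:09Z",
                    "diagnosticOutput": "shortened for the example",
                },
            },
        }
    },
}

print(failing_diagnostics(sample))  # {'cluster_two': ['kubecostPodsNotPending']}
```

Clusters with no failing checks are omitted from the result, so an empty dictionary means every reported diagnostic passed.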