Multi-Cluster Diagnostics

This feature is currently in beta. It is enabled by default.

Multi-Cluster Diagnostics offers a single view into the health of all the clusters you currently monitor with Kubecost.

Health checks include, but are not limited to:

  1. Whether Kubecost is correctly emitting metrics

  2. Whether Kubecost is being scraped by Prometheus

  3. Whether Prometheus has scraped the required metrics

  4. Whether Kubecost's ETL files are healthy

Configuration

# This is an abridged example. Full example in link below.
diagnostics:
  enabled: true
  isDiagnosticsPrimary:
    enabled: true  # Only enable this on your primary cluster

# Ensure you have configured a unique CLUSTER_ID.
prometheus:
  server:
    global:
      external_labels:
        cluster_id: YOUR_CLUSTER_ID

# Ensure you have configured a storage config secret. Using `.Values.thanos.storeSecretName` would also work here.
kubecostModel:
  federatedStorageConfigSecret: federated-store

Additional configuration options can be found in the values.yaml under diagnostics:.

Architecture

The multi-cluster diagnostics feature runs as an independent deployment (deployment/kubecost-diagnostics). Each diagnostics deployment monitors the health of Kubecost and sends that health data to the central object store at the /diagnostics filepath.

The diagram below depicts these interactions. It covers only the requests required for diagnostics. For additional diagrams, see our multi-cluster guide.

API usage

The diagnostics API can be accessed at /model/multi-cluster-diagnostics?window=2d (or /model/mcd for short).

The window query parameter is required. The API returns all diagnostics within the specified time window.
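As a minimal sketch of how a client might assemble the request URL, the following uses only Python's standard library. The function name `build_diagnostics_url` and the address `http://localhost:9090` are illustrative assumptions, not part of Kubecost; only the two paths and the `window` parameter come from this page.

```python
from urllib.parse import urlencode


def build_diagnostics_url(base_url: str, window: str, short: bool = False) -> str:
    """Return the multi-cluster diagnostics URL for the given window.

    base_url is the address where Kubecost is reachable (hypothetical here).
    short=True uses the /model/mcd alias mentioned above.
    """
    if not window:
        # The window query parameter is required by the API.
        raise ValueError("the window query parameter is required")
    path = "/model/mcd" if short else "/model/multi-cluster-diagnostics"
    # urlencode percent-encodes characters such as the comma in date pairs.
    return base_url.rstrip("/") + path + "?" + urlencode({"window": window})


if __name__ == "__main__":
    # Placeholder address; substitute your own Kubecost address.
    print(build_diagnostics_url("http://localhost:9090", "2d"))
```

The resulting URL can be passed to any HTTP client, such as `urllib.request.urlopen` or `curl`.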

Multi-Cluster Diagnostics API

GET http://<your-kubecost-address>/model/multi-cluster-diagnostics

The Multi-Cluster Diagnostics API provides a single view into the health of all the clusters you currently monitor with Kubecost.

Query Parameters

Name | Type | Description
window* | string | Duration of time over which to query. Accepts words like today, week, month, yesterday, lastweek, lastmonth; durations like 30m, 12h, 7d; comma-separated RFC3339 date pairs like 2021-01-02T15:04:05Z,2021-02-02T15:04:05Z; comma-separated Unix timestamp (seconds) pairs like 1578002645,1580681045.
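The two date-pair formats accepted by window can be produced from ordinary datetime values. The sketch below is a client-side convenience only; the helper names `rfc3339_window` and `unix_window` are hypothetical and say nothing about how Kubecost parses the value on the server.

```python
from datetime import datetime, timezone


def rfc3339_window(start: datetime, end: datetime) -> str:
    """Format a comma-separated RFC3339 pair in UTC, e.g. 2021-01-02T15:04:05Z,..."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if start >= end:
        raise ValueError("start must be before end")
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    # Convert both ends to UTC so the trailing "Z" is accurate.
    return ",".join(t.astimezone(timezone.utc).strftime(fmt) for t in (start, end))


def unix_window(start: datetime, end: datetime) -> str:
    """Format a comma-separated pair of Unix timestamps in whole seconds."""
    if start.tzinfo is None or end.tzinfo is None:
        raise ValueError("start and end must be timezone-aware")
    if start >= end:
        raise ValueError("start must be before end")
    return f"{int(start.timestamp())},{int(end.timestamp())}"


if __name__ == "__main__":
    start = datetime(2021, 1, 2, 15, 4, 5, tzinfo=timezone.utc)
    end = datetime(2021, 2, 2, 15, 4, 5, tzinfo=timezone.utc)
    print(rfc3339_window(start, end))
    print(unix_window(start, end))
```

Requiring timezone-aware inputs avoids silently interpreting a naive datetime in the local timezone of the machine that builds the query.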

{
    "code": 200,
    "data": {
        "overview": {
            "kubecostEmittingMetricDiagnosticPassed": true,
            "prometheusHasKubecostMetricDiagnosticPassed": true,
            "prometheusHasCadvisorMetricDiagnosticPassed": true,
            "prometheusHasKSMMetricDiagnosticPassed": true,
            "dailyAllocationEtlHealthyDiagnosticPassed": true,
            "dailyAssetEtlHealthyDiagnosticPassed": true,
            "kubecostPodsNotOOMKilledDiagnosticPassed": true,
            "kubecostPodsNotPendingDiagnosticPassed": false
        },
        "clusters": {
            "cluster_one": {
                "latestRun": "2023-12-12T22:42:32Z",
                "kubecostEmittingMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKubecostMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasCadvisorMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKSMMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAllocationEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAssetEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotOOMKilled": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotPending": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                }
            },
            "cluster_two": {
                "latestRun": "2023-12-12T22:40:17Z",
                "kubecostEmittingMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKubecostMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasCadvisorMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKSMMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAllocationEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAssetEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotOOMKilled": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotPending": {
                    "diagnosticPassed": false,
                    "numFailures": 52,
                    "firstFailureDate": "2023-12-12T18:25:09Z",
                    "diagnosticOutput": "RunDiagnostic: checkKubecostPodsNotPending: queryPrometheusCheckResultEmpty: the following query returned a non-empty result sum(kube_pod_status_phase{namespace='kubecost-etl-fed', phase='Pending'}) by (pod,namespace) > 0"
                }
            },
            "cluster_three": {
                "latestRun": "2023-12-12T22:40:15Z",
                "kubecostEmittingMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKubecostMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasCadvisorMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "prometheusHasKSMMetric": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAllocationEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "dailyAssetEtlHealthy": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotOOMKilled": {
                    "diagnosticPassed": true,
                    "numFailures": 0,
                    "firstFailureDate": "",
                    "diagnosticOutput": ""
                },
                "kubecostPodsNotPending": {
                    "diagnosticPassed": false,
                    "numFailures": 52,
                    "firstFailureDate": "2023-12-12T18:24:42Z",
                    "diagnosticOutput": "RunDiagnostic: checkKubecostPodsNotPending: queryPrometheusCheckResultEmpty: the following query returned a non-empty result sum(kube_pod_status_phase{namespace='kubecost-etl-fed', phase='Pending'}) by (pod,namespace) > 0"
                }
            }
        }
    }
}
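A response like the one above can be reduced to a short list of failing checks. The following is a minimal sketch that assumes the response has already been parsed into a Python dict (for example with `json.loads`) and that it has the shape shown in the sample; the function name `failing_diagnostics` is hypothetical.

```python
from __future__ import annotations


def failing_diagnostics(response: dict) -> list[tuple[str, str, int, str]]:
    """Return (cluster, check, numFailures, firstFailureDate) for each failed check."""
    failures = []
    clusters = response.get("data", {}).get("clusters", {})
    for cluster_id, checks in sorted(clusters.items()):
        for name, result in checks.items():
            # Entries such as "latestRun" are plain strings, not check results.
            if not isinstance(result, dict):
                continue
            if not result.get("diagnosticPassed", True):
                failures.append(
                    (
                        cluster_id,
                        name,
                        result.get("numFailures", 0),
                        result.get("firstFailureDate", ""),
                    )
                )
    return failures


if __name__ == "__main__":
    # Trimmed copy of the cluster_two entry from the sample response.
    sample = {
        "code": 200,
        "data": {
            "clusters": {
                "cluster_two": {
                    "latestRun": "2023-12-12T22:40:17Z",
                    "kubecostPodsNotPending": {
                        "diagnosticPassed": False,
                        "numFailures": 52,
                        "firstFailureDate": "2023-12-12T18:25:09Z",
                        "diagnosticOutput": "",
                    },
                }
            }
        },
    }
    for row in failing_diagnostics(sample):
        print(row)
```

Applied to the full sample response, this would report the kubecostPodsNotPending check for cluster_two and cluster_three, which matches the single false value in the overview object.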
