Kubecost leverages the open-source Prometheus project as a time series database and post-processes the data in Prometheus to perform cost allocation calculations and provide optimization insights for your Kubernetes clusters. Prometheus runs as a single, statically-resourced container, so as your cluster grows or scales out, it can exceed the scraping capacity of a single Prometheus server. In this doc, you will learn how Kubecost integrates with Google Cloud Managed Service for Prometheus (GMP), a managed Prometheus-compatible monitoring service, to make it easy to monitor Kubernetes costs at scale.
This integration requires GMP with managed collection to be enabled on your GKE cluster. Kubecost is then installed in your GKE cluster and uses the GMP Prometheus binary to seamlessly ingest metrics into the GMP database. In this setup, the Kubecost deployment also automatically creates a Prometheus proxy that allows Kubecost to query the metrics from the GMP database for cost allocation calculations.
This integration is currently in beta.
You have a GCP account/subscription.
You have permission to manage GKE clusters and GCP monitoring services.
You have an existing GKE cluster with GMP enabled. You can learn more here.
You can use the following command to install Kubecost on your GKE cluster and integrate with GMP:
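A hedged sketch of that command, assuming the standard Kubecost Helm repository (project ID, cluster name, and the GMP image tag are placeholders; check Google's documentation for the current GMP Prometheus image tag):

```sh
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set global.gmp.enabled=true \
  --set global.gmp.gmpProxy.projectId=<YOUR_PROJECT_ID> \
  --set prometheus.server.image.repository=gke.gcr.io/prometheus-engine/prometheus \
  --set prometheus.server.image.tag=<GMP_IMAGE_TAG> \
  --set prometheus.server.global.external_labels.cluster_id=<YOUR_CLUSTER_NAME> \
  --set kubecostProductConfigs.clusterName=<YOUR_CLUSTER_NAME>
```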
In this installation command, the following additional flags are set so that Kubecost works with GMP:
- `prometheus.server.image.repository` and `prometheus.server.image.tag` replace the standard Prometheus image with the GMP-specific image.
- `global.gmp.enabled` and `global.gmp.gmpProxy.projectId` enable the GMP integration.
- `prometheus.server.global.external_labels.cluster_id` and `kubecostProductConfigs.clusterName` set the name for your Kubecost setup.
You can find additional configurations at our main values.yaml file.
Your Kubecost setup now writes and collects data from GMP. Data should be ready for viewing within 15 minutes.
Run the following command to enable port-forwarding to expose the Kubecost dashboard:
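A minimal example, assuming the default deployment name and namespace; the dashboard is then available at http://localhost:9090:

```sh
kubectl port-forward --namespace kubecost deployment/kubecost-cost-analyzer 9090
```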
To verify that the integration is set up, go to Settings in the Kubecost UI, and check the Prometheus Status section.
From your GCP Monitoring > Metrics explorer console, you can run the following query to verify that Kubecost metrics are collected:
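For example, querying a Kubecost-emitted metric such as the following should return series once ingestion is working:

```
node_total_hourly_cost
```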
The below queries must return data for Kubecost to calculate costs correctly. For the queries to work, set the environment variables:
Verify the connection to GMP and that the metric `container_memory_working_set_bytes` is available:
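A sketch of such a check (the service name `kubecost-prometheus-server.kubecost` assumes a default install; adjust it to the proxy endpoint of your setup):

```sh
kubectl exec -i -t -n kubecost deployments/kubecost-cost-analyzer -c cost-analyzer -- \
  curl -sG http://kubecost-prometheus-server.kubecost/api/v1/query \
  --data-urlencode "query=container_memory_working_set_bytes{cluster_id=\"${CLUSTER_ID}\"}"
```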
If you have set `kubecostModel.promClusterIDLabel` in the Helm chart, you will need to change the query (`CLUSTER_ID`) to match the label.
Verify Kubecost metrics are available in GMP:
You should receive an output similar to:
If `id` returns a blank value, you can set the following Helm value to force-set `cluster` as the Prometheus cluster ID label:
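For example, as a Helm flag (mirroring the value referenced above):

```sh
--set kubecostModel.promClusterIDLabel=cluster
```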
If the above queries fail, check the following:
Check the logs of the `sigv4proxy` container (it may be in the Kubecost deployment or the Prometheus server deployment, depending on your setup):
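For example, assuming the proxy runs as a sidecar of the cost-analyzer deployment in the `kubecost` namespace:

```sh
kubectl logs deployment/kubecost-cost-analyzer -c sigv4proxy -n kubecost
```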
In a working `sigv4proxy`, there will be very few logs.
Correctly working log output:
Check the logs of the `cost-model` container for Prometheus connection issues:
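For example (deployment and namespace names assume a default install):

```sh
kubectl logs deployment/kubecost-cost-analyzer -c cost-model -n kubecost | grep -i -e "prometheus" -e "error"
```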
Example errors:
Additionally, read our Custom Prometheus integration troubleshooting guide if you run into any other errors while setting up the integration. For support from GCP, you can submit a support request at the GCP support hub.
In the standard deployment of Kubecost, Kubecost is deployed with a bundled Prometheus instance to collect and store metrics from your Kubernetes cluster. Kubecost also provides the flexibility to connect to your own time series database or storage. Grafana Mimir is an open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage of Prometheus metrics.
This document will show you how to integrate Grafana Mimir with Kubecost for long-term metrics retention. In this setup, you use the Grafana Agent to collect metrics from Kubecost and your Kubernetes cluster. The metrics are then remote-written to your existing Grafana Mimir setup without an authenticating reverse proxy.
You have access to a running Kubernetes cluster
You have an existing Grafana Mimir setup
Install the Grafana Agent for Kubernetes on your cluster. On the existing K8s cluster where you intend to install Kubecost, run the following commands to install the Grafana Agent to scrape metrics from the Kubecost `/metrics` endpoint. The script below installs the Grafana Agent with the necessary scrape configuration for Kubecost; you may want to add additional scrape configuration for your setup.
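As an illustration, the agent's configuration takes this general shape (the Mimir endpoint, tenant ID, and Kubecost service target are placeholders; Mimir's `X-Scope-OrgID` header carries the tenant since no authenticating reverse proxy is used):

```yaml
metrics:
  wal_directory: /var/lib/agent/wal
  global:
    scrape_interval: 60s
    external_labels:
      cluster: <YOUR_CLUSTER_NAME>
  configs:
    - name: integrations
      remote_write:
        - url: http://<your-mimir-host>/api/v1/push
          headers:
            X-Scope-OrgID: <your-tenant-id>
      scrape_configs:
        - job_name: kubecost
          metrics_path: /metrics
          static_configs:
            - targets: ['kubecost-cost-analyzer.kubecost.svc:9003']
```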
You can also verify that `grafana-agent` is scraping data with the following command (optional):
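One way to check, assuming the agent was installed as a StatefulSet named `grafana-agent` in the `kubecost` namespace:

```sh
kubectl logs -n kubecost statefulset/grafana-agent | grep -i -e "scrape" -e "error"
```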
To learn more about how to install and configure the Grafana Agent, as well as additional scrape configuration, please refer to the Grafana Agent documentation, or you can view the Kubecost Prometheus scrape config at this GitHub repository.
Run the following command to deploy Kubecost. Remember to update the environment variable values with your Mimir setup information.
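A hedged sketch of that deployment, pointing Kubecost at Mimir's Prometheus-compatible query endpoint and disabling the bundled Prometheus (the endpoint is a placeholder):

```sh
export MIMIR_PROM_ENDPOINT="http://<your-mimir-host>/prometheus"

helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn=${MIMIR_PROM_ENDPOINT}
```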
The process is complete. By now, you should have successfully integrated Kubecost with your Grafana Mimir setup.
There are several considerations when disabling Kubecost's included Prometheus deployment. Kubecost strongly recommends installing Kubecost with the bundled Prometheus in most environments.
The Kubecost Prometheus deployment is optimized to not interfere with other observability instrumentation and by default only contains metrics that are useful to the Kubecost product. This results in 70-90% fewer metrics than a Prometheus deployment using default settings.
Additionally, if multi-cluster metric aggregation is required, Kubecost provides a turnkey solution that is highly tuned and simple to support using the included Prometheus deployment.
This feature is accessible to all users. However, please note that comprehensive support is provided with a paid support plan.
Kubecost requires the following minimum versions:
Prometheus: v2.18 (v2.13-2.17 supported with limited functionality)
kube-state-metrics: v1.6.0+
cAdvisor: kubelet v1.11.0+
node-exporter: v0.16+ (Optional)
If you have node-exporter and/or KSM running on your cluster, follow this step to disable the Kubecost included versions. Additional detail on KSM requirements.
In contrast to our recommendation above, we do recommend disabling Kubecost's bundled node-exporter and kube-state-metrics if you already have them running in your cluster.
This process is not recommended. Before continuing, review the Bring your own Prometheus section if you haven't already.
Pass the following parameters in your Helm install:
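Based on the chart values referenced throughout this doc, those parameters are:

```sh
--set global.prometheus.fqdn=http://<your.prometheus.service>:9090 \
--set global.prometheus.enabled=false
```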
The FQDN can be a full path, such as `https://prometheus-prod-us-central-x.grafana.net/api/prom/` if you use Grafana Cloud-managed Prometheus. Learn more in the Grafana Cloud Integration for Kubecost doc.
Have your Prometheus scrape the cost-model `/metrics` endpoint. These metrics are needed for reporting accurate pricing data. Here is an example scrape config:
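A sketch consistent with Kubecost's published scrape job (the service name assumes Kubecost runs in the `kubecost` namespace):

```yaml
- job_name: kubecost
  honor_labels: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
    - names:
        - kubecost-cost-analyzer.kubecost
      type: 'A'
      port: 9003
```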
This config needs to be added under `extraScrapeConfigs` in the Prometheus configuration. See the example extraScrapeConfigs.yaml.
By default, the Prometheus chart included with Kubecost (bundled Prometheus) contains scrape configs optimized for Kubecost-required metrics. You need to add those scrape jobs to your existing Prometheus setup so that Kubecost can provide more accurate cost data while minimizing the additional load on your existing Prometheus.
You can find the full scrape configs of our bundled-Prometheus here. You can check Prometheus documentation for more information about the scrape config, or read this documentation if you are using Prometheus Operator.
This step is optional. If you do not set up Kubecost's CPU usage recording rule, Kubecost will fall back to a PromQL subquery which may put unnecessary load on your Prometheus.
The Kubecost-bundled Prometheus includes a recording rule used to calculate maximum CPU usage, a critical component of the request right-sizing recommendation functionality. Add the recording rules to reduce query load here.
Alternatively, if your environment supports `serviceMonitors` and `prometheusRules`, pass these values to your Helm install:
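For example:

```sh
--set serviceMonitor.enabled=true \
--set prometheusRule.enabled=true
```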
To confirm this job is successfully scraped by Prometheus, you can view the Targets page in Prometheus and look for a job named `kubecost`.
This step is optional, and only impacts certain efficiency metrics. View issue/556 for a description of what will be missing if this step is skipped.
You'll need to add the following relabel config to the job that scrapes the node exporter DaemonSet.
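One common form of that relabel config, assuming the job discovers node-exporter pods via Kubernetes service discovery:

```yaml
relabel_configs:
  - source_labels: [__meta_kubernetes_pod_node_name]
    action: replace
    target_label: kubernetes_node
```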
This does not override the source label. It creates a new label called `kubernetes_node` and copies the value of pod into it.
In order to distinguish between multiple clusters, Kubecost needs to know which label is used in Prometheus to identify the cluster name. Use the `.Values.kubecostModel.promClusterIDLabel` Helm value. The default cluster label is `cluster_id`, though many environments use the key `cluster`.
By default, metric retention is 91 days; however, data retention can be increased with the configurable value `etlDailyStoreDurationDays`. You can find this value here. Increasing the default `etlDailyStoreDurationDays` value will naturally result in greater memory usage. At higher values, this can cause errors when trying to display this information in the Kubecost UI. You can remedy this by increasing the Step size when using the Allocations dashboard.
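For example, in your Helm values (a sketch; this assumes the value lives under `kubecostModel` in the chart's values.yaml):

```yaml
kubecostModel:
  etlDailyStoreDurationDays: "120"
```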
The Diagnostics page (Settings > View Full Diagnostics) provides diagnostic info on your integration. Scroll down to Prometheus Status to verify that your configuration is successful.
Below you can find solutions to common Prometheus configuration problems. View the Kubecost Diagnostics doc for more information.
This is evidenced by the pod error message `No valid prometheus config file at ...` and by the init pods hanging. We recommend running `curl <your_prometheus_url>/api/v1/status/config` from a pod in the cluster to confirm that your Prometheus config is returned. Here is an example, but this needs to be updated based on your pod name and Prometheus address:
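A hedged example (pod/deployment names assume a default install; replace the URL with your Prometheus address):

```sh
kubectl exec -n kubecost deploy/kubecost-cost-analyzer -c cost-analyzer -- \
  curl -s http://<your_prometheus_url>/api/v1/status/config
```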
In the above example, `<your_prometheus_url>` may include a port number and/or namespace, for example: `http://prometheus-operator-kube-p-prometheus.monitoring:9090/api/v1/status/config`.
If the config file is not returned, this is an indication that an incorrect Prometheus address has been provided. If a config file is returned from one pod in the cluster but not the Kubecost pod, then the Kubecost pod likely has its access restricted by a network policy, service mesh, etc.
Network policies, mesh networks, or other security-related tooling can block network traffic between Prometheus and Kubecost, which will result in the Kubecost scrape target showing as down in the Prometheus Targets UI. To troubleshoot this type of error, you can use the `curl` command from within the cost-analyzer container to try to reach the Prometheus target. Note that the namespace and deployment name in this command may need to be updated to match your environment; this example uses the default Kubecost Prometheus deployment.
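A sketch of such a check, querying the bundled Prometheus service from the cost-analyzer container (service, namespace, and deployment names assume a default install):

```sh
kubectl exec -n kubecost deploy/kubecost-cost-analyzer -c cost-analyzer -- \
  curl -sG http://kubecost-prometheus-server.kubecost/api/v1/query \
  --data-urlencode 'query={job="kubecost"}'
```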
When successful, this command should return all of the metrics that Kubecost uses. Failures may indicate that network traffic is being blocked.
- Ensure Prometheus isn't being CPU throttled due to a low resource request.
- Review the Dependency Requirements section above.
- Visit the Prometheus Targets page (screenshot above).
- Make sure that `honor_labels` is enabled.
Ensure results are not null for both queries below.
- Make sure Prometheus is scraping Kubecost by searching metrics for: `node_total_hourly_cost`
- Ensure kube-state-metrics are available: `kube_node_status_capacity`
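One way to run both checks is via the Prometheus HTTP API (the endpoint assumes the bundled Prometheus service; substitute your own):

```sh
curl -sG http://kubecost-prometheus-server.kubecost/api/v1/query \
  --data-urlencode 'query=node_total_hourly_cost'

curl -sG http://kubecost-prometheus-server.kubecost/api/v1/query \
  --data-urlencode 'query=kube_node_status_capacity'
```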
For both queries, verify nodes are returned. A successful response should look like:
An error will look like:
Ensure that all clusters and nodes have values; output should be similar to the above Single Cluster Tests.
- Make sure Prometheus is scraping Kubecost by searching metrics for: `node_total_hourly_cost`. On macOS, change `date -d '1 day ago'` to `date -v '-1d'`.
- Ensure kube-state-metrics are available: `kube_node_status_capacity`
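A sketch of a time-bounded, per-cluster version of these checks (the endpoint and the `cluster_id` aggregation label are assumptions; see the macOS note above for the `date` flag):

```sh
curl -sG http://kubecost-prometheus-server.kubecost/api/v1/query \
  --data-urlencode 'query=avg(node_total_hourly_cost) by (cluster_id)' \
  --data-urlencode "time=$(date -d '1 day ago' +%s)"
```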
For both queries, verify nodes are returned. A successful response should look like:
An error will look like:
Kubecost leverages the open-source Prometheus project as a time series database and post-processes the data in Prometheus to perform cost allocation calculations and provide optimization insights for your Kubernetes clusters, such as Amazon Elastic Kubernetes Service (Amazon EKS). Prometheus runs as a single, statically-resourced container, so as your cluster grows or scales out, it can exceed the scraping capacity of a single Prometheus server. In collaboration with Amazon Web Services (AWS), Kubecost integrates with Amazon Managed Service for Prometheus (AMP), a managed Prometheus-compatible monitoring service, to enable customers to easily monitor Kubernetes costs at scale.
The architecture of this integration is similar to Amazon EKS cost monitoring with Kubecost, which is described in the previous blog post, with some enhancements as follows:
In this integration, an additional AWS SigV4 container is added to the cost-analyzer pod, acting as a proxy to help query metrics from Amazon Managed Service for Prometheus using the AWS SigV4 signing process. It enables passwordless authentication to reduce the risk of exposing your AWS credentials.
When the Amazon Managed Service for Prometheus integration is enabled, the bundled Prometheus server in the Kubecost Helm chart is configured in remote_write mode. The bundled Prometheus server sends the collected metrics to Amazon Managed Service for Prometheus using the AWS SigV4 signing process. All metrics and data are stored in Amazon Managed Service for Prometheus, and Kubecost queries the metrics directly from Amazon Managed Service for Prometheus instead of the bundled Prometheus. This relieves customers of the burden of maintaining and scaling a local Prometheus instance.
There are two architectures you can deploy:
- The Quick-Start architecture supports a small multi-cluster setup of up to 100 clusters.
- The Federated architecture supports a large multi-cluster setup of over 100 clusters.
The infrastructure can manage up to 100 clusters. The following architecture diagram illustrates the small-scale infrastructure setup:
To support the large-scale infrastructure of over 100 clusters, Kubecost leverages a Federated ETL architecture. In addition to the Amazon Prometheus workspace, Kubecost stores its extract, transform, and load (ETL) data in a central S3 bucket. Kubecost's ETL data is a computed cache based on Prometheus's metrics, from which users can perform all possible Kubecost queries. By storing the ETL data in an S3 bucket, this integration offers resiliency for your cost allocation data, improves performance, and enables a highly available architecture for your Kubecost setup.
The following architecture diagram illustrates the large-scale infrastructure setup:
- You have an existing AWS account.
- You have IAM credentials to create Amazon Managed Service for Prometheus workspaces and IAM roles programmatically.
- You have an existing Amazon EKS cluster with OIDC enabled.
- Your Amazon EKS clusters have the Amazon EBS CSI driver installed.
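First, create the Amazon Managed Service for Prometheus workspace. A minimal sketch using the AWS CLI (the alias is a placeholder):

```sh
aws amp create-workspace --alias kubecost-amp --region $AWS_REGION
```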
The example output should be in this format:
The Amazon Managed Service for Prometheus workspace should be created in a few seconds. Run the following command to get the workspace ID:
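For example (assuming the alias used above):

```sh
export AMP_WORKSPACE_ID=$(aws amp list-workspaces --alias kubecost-amp \
  --query "workspaces[0].workspaceId" --output text)
echo $AMP_WORKSPACE_ID
```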
Run the following command to set environment variables for integrating Kubecost with Amazon Managed Service for Prometheus:
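A sketch of the variables used throughout the rest of this setup (values are placeholders; the remote-write URL follows AMP's standard endpoint scheme):

```sh
export AWS_REGION=<YOUR_AWS_REGION>
export CLUSTER_NAME=<YOUR_CLUSTER_NAME>
export AWS_ACCOUNT_ID=$(aws sts get-caller-identity --query Account --output text)
export REMOTEWRITEURL="https://aps-workspaces.${AWS_REGION}.amazonaws.com/workspaces/${AMP_WORKSPACE_ID}/api/v1/remote_write"
```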
Note: You can ignore Step 2 for the small-scale infrastructure setup.
a. Create an object store (S3 bucket) to store Kubecost ETL metrics. Run the following command in your workspace:
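For example (bucket names must be globally unique):

```sh
export KC_BUCKET=kubecost-etl-metrics-${AWS_ACCOUNT_ID}
aws s3 mb s3://${KC_BUCKET} --region ${AWS_REGION}
```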
b. Create an IAM policy to grant access to the S3 bucket. The following policy is for demo purposes only. You may need to consult your security team and make appropriate changes depending on your organization's requirements.
Run the following command in your workspace:
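A sketch of such a policy and its creation (scope and actions are illustrative; tighten them per your security requirements):

```sh
cat > kubecost-s3-policy.json << EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Action": ["s3:ListBucket", "s3:GetObject", "s3:PutObject", "s3:DeleteObject"],
      "Resource": [
        "arn:aws:s3:::${KC_BUCKET}",
        "arn:aws:s3:::${KC_BUCKET}/*"
      ]
    }
  ]
}
EOF

aws iam create-policy \
  --policy-name kubecost-s3-federated-policy \
  --policy-document file://kubecost-s3-policy.json
```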
c. Create a Kubernetes secret to allow Kubecost to write ETL files to the S3 bucket. Run the following command in your workspace:
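A sketch, assuming Kubecost's object-store convention of a `federated-store.yaml` file (the secret name is an assumption; match it to your Helm values):

```sh
cat > federated-store.yaml << EOF
type: S3
config:
  bucket: "${KC_BUCKET}"
  endpoint: "s3.amazonaws.com"
  region: "${AWS_REGION}"
EOF

kubectl create namespace kubecost
kubectl create secret generic kubecost-object-store \
  --namespace kubecost --from-file federated-store.yaml
```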
The following commands automate these tasks:
- Create an IAM role with the AWS-managed IAM policy and a trust policy for the following service accounts: `kubecost-cost-analyzer-amp`, `kubecost-prometheus-server-amp`.
- Modify the current K8s service accounts with annotations to attach the new IAM role.
Run the following command in your workspace:
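A sketch using eksctl and the AWS-managed AMP policies (service account names match those listed above):

```sh
for SA in kubecost-cost-analyzer-amp kubecost-prometheus-server-amp; do
  eksctl create iamserviceaccount \
    --name ${SA} \
    --namespace kubecost \
    --cluster ${CLUSTER_NAME} --region ${AWS_REGION} \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusQueryAccess \
    --attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
    --override-existing-serviceaccounts \
    --approve
done
```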
For more information, you can check AWS documentation at IAM roles for service accounts and learn more about the Amazon Managed Service for Prometheus managed policy at Identity-based policy examples for Amazon Managed Service for Prometheus.
Run the following command to create a file called config-values.yaml, which contains the default settings Kubecost will use to connect to your Amazon Managed Service for Prometheus workspace.
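A sketch of that file's shape (the `global.amp.*` values mirror the integration described above; the localhost:8005 proxy port is an assumption based on the SigV4 sidecar's default in this setup):

```sh
cat > config-values.yaml << EOF
global:
  amp:
    enabled: true
    prometheusServerEndpoint: http://localhost:8005/workspaces/${AMP_WORKSPACE_ID}
    remoteWriteService: https://aps-workspaces.${AWS_REGION}.amazonaws.com/workspaces/${AMP_WORKSPACE_ID}/api/v1/remote_write
    sigv4:
      region: ${AWS_REGION}
EOF
```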
Run this command to install Kubecost on the primary cluster and integrate it with the Amazon Managed Service for Prometheus workspace:
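A hedged sketch of that install (the service-account chart values are assumptions; adjust to your chart version):

```sh
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  -f config-values.yaml \
  --set kubecostProductConfigs.clusterName=${CLUSTER_NAME} \
  --set prometheus.server.global.external_labels.cluster_id=${CLUSTER_NAME} \
  --set serviceAccount.create=false \
  --set serviceAccount.name=kubecost-cost-analyzer-amp \
  --set prometheus.serviceAccounts.server.create=false \
  --set prometheus.serviceAccounts.server.name=kubecost-prometheus-server-amp
```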
These installation steps are similar to those for the primary cluster setup, except that you do not need to follow the steps in the section "Create Amazon Managed Service for Prometheus workspace", and you need to update the environment variables below to match your additional clusters. Please note that `AMP_WORKSPACE_ID` and `KC_BUCKET` are the same as for the primary cluster.
Run this command to install Kubecost on each additional cluster and integrate it with the Amazon Managed Service for Prometheus workspace:
Your Kubecost setup is now writing and collecting data from AMP. Data should be ready for viewing within 15 minutes.
To verify that the integration is set up, go to Settings in the Kubecost UI, and check the Prometheus Status section.
Read our Custom Prometheus integration troubleshooting guide if you run into any errors while setting up the integration. For support from AWS, you can submit a support request through your existing AWS support contract.
You can add these recording rules to improve performance. Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their results as a new set of time series. Querying the precomputed result is often much faster than running the original expression every time it is needed. Follow these instructions to add the following rules:
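As an illustration of the recording-rule format (a generic example, not Kubecost's published rule set):

```yaml
groups:
  - name: CPU
    rules:
      - record: cluster:cpu_usage:rate5m
        expr: avg(rate(container_cpu_usage_seconds_total{container!=""}[5m]))
```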
The queries below must return data for Kubecost to calculate costs correctly. For the queries to work, set the following environment variables:
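For example (values are placeholders):

```sh
export KUBECOST_NAMESPACE=kubecost
export AMP_WORKSPACE_ID=<YOUR_WORKSPACE_ID>
export CLUSTER_ID=<YOUR_CLUSTER_NAME>
```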
Verify the connection to AMP and that the metric `container_memory_working_set_bytes` is available:
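A sketch of this check, issued from the cost-analyzer container through the SigV4 proxy sidecar (the localhost:8005 proxy port is an assumption based on this setup):

```sh
kubectl exec -i -t -n ${KUBECOST_NAMESPACE} deployments/kubecost-cost-analyzer -c cost-analyzer -- \
  curl -sG "http://localhost:8005/workspaces/${AMP_WORKSPACE_ID}/api/v1/query" \
  --data-urlencode "query=container_memory_working_set_bytes{cluster_id=\"${CLUSTER_ID}\"}"
```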
If you have set `kubecostModel.promClusterIDLabel`, you will need to change the query (`CLUSTER_ID`) to match the label (typically `cluster` or `alpha_eksctl_io_cluster_name`).
The output should contain a JSON entry similar to the following.
The value of `cluster_id` should match the value of `kubecostProductConfigs.clusterName`.
Verify Kubecost metrics are available in AMP:
The output should contain a JSON entry similar to:
If the above queries fail, check the following:
Check the logs of the `sigv4proxy` container (it may be in the Kubecost deployment or the Prometheus server deployment, depending on your setup):
In a working `sigv4proxy`, there will be very few logs.
Correctly working log output:
Check the logs of the `cost-model` container for Prometheus connection issues:
Example errors:
Grafana Cloud is a composable observability platform that integrates metrics, traces, and logs with Grafana. Customers can leverage the best open-source observability software without the overhead of installing, maintaining, and scaling their own observability stack.
This document will show you how to integrate the Grafana Cloud Prometheus metrics service with Kubecost.
You have access to a running Kubernetes cluster
You have created a Grafana Cloud account
You have permissions to create Grafana Cloud API keys
Install the Grafana Agent for Kubernetes on your cluster. On the existing K8s cluster where you intend to install Kubecost, run the following commands to install the Grafana Agent to scrape metrics from the Kubecost `/metrics` endpoint. The script below installs the Grafana Agent with the necessary scraping configuration for Kubecost; you may want to add additional scrape configuration for your setup. Please remember to replace the following values with your actual Grafana Cloud values (see the illustrative configuration after this list):
- `REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-ENDPOINT`
- `REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-USERNAME`
- `REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-API-KEY`
- `REPLACE-WITH-YOUR-CLUSTER-NAME`
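As an illustration, the agent's scrape and remote-write configuration takes this general shape, using the placeholders above (the Kubecost service target assumes a default install):

```yaml
metrics:
  wal_directory: /var/lib/agent/wal
  global:
    scrape_interval: 60s
    external_labels:
      cluster: REPLACE-WITH-YOUR-CLUSTER-NAME
  configs:
    - name: integrations
      remote_write:
        - url: REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-ENDPOINT
          basic_auth:
            username: REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-USERNAME
            password: REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-API-KEY
      scrape_configs:
        - job_name: kubecost
          metrics_path: /metrics
          static_configs:
            - targets: ['kubecost-cost-analyzer.kubecost.svc:9003']
```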
You can also verify that `grafana-agent` is scraping data with the following command (optional):
Create a K8s secret named `dbsecret` to allow Kubecost to query the metrics from Grafana Cloud Prometheus. First, create two files in your working directory, called `USERNAME` and `PASSWORD` respectively.
Verify that you can run queries against your Grafana Cloud Prometheus query endpoint (optional):
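A sketch of such a query using the credential files created above (the endpoint is a placeholder):

```sh
curl -sG "https://<your-grafana-prom-endpoint>/api/prom/api/v1/query" \
  -u "$(cat USERNAME):$(cat PASSWORD)" \
  --data-urlencode "query=up"
```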
Create the K8s secret named `dbsecret`:
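For example, from the directory containing the two files:

```sh
kubectl create secret generic dbsecret \
  --namespace kubecost \
  --from-file=USERNAME \
  --from-file=PASSWORD
```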
Verify that the credentials appear correctly (optional):
After installing cortextool, create a file called `kubecost_rules.yaml` with the following command:
Then, make sure you are in the same directory as your `kubecost_rules.yaml`, and load the rules using cortextool. Replace the address with your Grafana Cloud's Prometheus endpoint (remember to omit the `/api/prom` path from the endpoint URL).
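A sketch of the load command (tenant ID and API key are placeholders):

```sh
cortextool rules load \
  --address=https://<your-grafana-prom-endpoint> \
  --id=<your-tenant-id> \
  --key=<your-api-key> \
  kubecost_rules.yaml
```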
Print out the rules to verify that they’ve been loaded correctly:
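For example:

```sh
cortextool rules print \
  --address=https://<your-grafana-prom-endpoint> \
  --id=<your-tenant-id> \
  --key=<your-api-key>
```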
Install Kubecost on your K8s cluster with the Grafana Cloud Prometheus query endpoint and the `dbsecret` you created in Step 2.
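A hedged sketch of that install (the basic-auth secret value name is an assumption; the endpoint keeps the `/api/prom` path for queries):

```sh
helm upgrade --install kubecost kubecost/cost-analyzer \
  --namespace kubecost --create-namespace \
  --set global.prometheus.enabled=false \
  --set global.prometheus.fqdn=https://<your-grafana-prom-endpoint>/api/prom \
  --set global.prometheus.queryServiceBasicAuthSecretName=dbsecret
```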
The process is complete. By now, you should have successfully completed the Kubecost integration with Grafana Cloud.
To learn more about how to install and configure the Grafana Agent, as well as additional scrape configuration, please refer to the Grafana Agent documentation, or you can view the Kubecost Prometheus scrape config at this GitHub repository.
To set up recording rules in Grafana Cloud, download the cortextool CLI. While they are optional, they offer improved performance.
Optionally, you can also add our Kubecost Grafana dashboard to your organization to visualize your cloud costs in Grafana.