Kubecost Aggregator

Aggregator is the primary query backend for Kubecost and is enabled in all configurations. In a default installation, it runs within the cost-analyzer pod; a multi-cluster installation requires the additional settings described below. Multi-cluster Kubecost uses the Federated ETL configuration without Thanos, with Aggregator replacing the Federator component.

Unless otherwise specified, existing Kubecost API documentation uses endpoints for non-Aggregator environments. Those endpoints remain compatible after Aggregator is configured.

Configuring Aggregator

Prerequisites

  • Multi-cluster Aggregator can only be configured in a Federated ETL environment.

  • All clusters in your Federated ETL environment must be configured to build & push ETL files to the object store via .Values.federatedETL.federatedCluster and .Values.kubecostModel.federatedStorageConfigSecret. See our Federated ETL doc for more details.

  • If you've enabled Cloud Integration, it must be configured via the cloud integration secret. Other methods are now deprecated. See our Multi-Cloud Integrations doc for more details.

  • This documentation is for Kubecost v2.0 and higher.

If you are upgrading to Kubecost v2.0 from the following environments, see our specialized migration guides instead:

Basic configuration

kubecostAggregator:
  replicas: 1
  deployMethod: statefulset
  cloudCost:
    enabled: true
federatedETL:
  federatedCluster: true
kubecostModel:
  containerStatsEnabled: true
  cloudCost:
    enabled: false
  federatedStorageConfigSecret: federated-store
kubecostProductConfigs:
  clusterName: YOUR_CLUSTER_NAME
  cloudIntegrationSecret: cloud-integration
  productKey:
    enabled: true
    key: YOUR_KEY
prometheus:
  server:
    global:
      external_labels:
        cluster_id: YOUR_CLUSTER_NAME
# when using managed identity/irsa, set the service account accordingly:
serviceAccount:
  create: false
  name: kubecost-irsa-sa

Aggregator Optimizations

For larger deployments of Kubecost, Aggregator can be tuned.

Aggregator is a memory and disk-intensive process. Ensure that your cluster has enough resources to support the configuration below.

Because the Aggregator PV is relatively small, the least expensive performance gain is to move its storage class to a faster SSD. The storageClass name varies by provider; common terms include gp3, extreme, and premium.
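
Before changing the storage class, it can help to see which classes the cluster offers and which one the Aggregator volume currently uses. The following is a minimal sketch, assuming Kubecost is installed in the kubecost namespace; the function names are illustrative and not part of Kubecost.

```shell
# Assumption: Kubecost is installed in the "kubecost" namespace.
NAMESPACE="${NAMESPACE:-kubecost}"

# List every StorageClass in the cluster with its provisioner.
list_storage_classes() {
  kubectl get storageclass \
    -o custom-columns='NAME:.metadata.name,PROVISIONER:.provisioner'
}

# Show the storage class and requested size of each PVC in the namespace,
# which includes the Aggregator's database volume.
show_pvc_storage() {
  kubectl get pvc -n "$NAMESPACE" \
    -o custom-columns='NAME:.metadata.name,CLASS:.spec.storageClassName,SIZE:.spec.resources.requests.storage'
}
```

Run list_storage_classes and show_pvc_storage from a shell that has access to the cluster, then compare the class in use against the faster options your provider offers.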

The settings below are in addition to the basic configuration above.

kubecostAggregator:
  env:
    # This interval defines how long the Aggregator spends ingesting ETL data
    # from the federated store bucket into SQL tables, before exiting its job to
    # enter the derivation step. If set too low for large scale users, the
    # Aggregator may not have enough time to ingest all new data that exists in
    # the federated store bucket. If set too high, there will be a delay in data
    # between the Kubecost Agents and the Aggregator.
    #
    # Note that the default value is set to 10m to optimize for the
    # first-install experience of Kubecost (i.e. it prioritizes small data
    # becoming available more quickly).
    #
    # default: 10m
    DB_BUCKET_REFRESH_INTERVAL: 1h

    # How much data to ingest from the federated store bucket, and how much data
    # to keep in the DB before rolling the data off.
    # 
    # Note: If increasing this value to backfill historical data, it will take
    # time to gradually ingest & process those historical ETL files. Consider
    # also increasing the resources available to the aggregator as well as the
    # refresh & concurrency env vars.
    # 
    # default: 91
    ETL_DAILY_STORE_DURATION_DAYS: "365"
    
    # How many threads to use when ingesting Asset/Allocation/CloudCost data
    # from the federated store bucket. In most cases the default is sufficient,
    # but can be increased if trying to backfill historical data.
    # default: 3
    DB_CONCURRENT_INGESTION_COUNT: "5"

    # log level
    # default: info
    LOG_LEVEL: info
  aggregatorDbStorage:
    # governs storage size of aggregator DB storage
    # !!NOTE!! disk performance is _critically important_ to aggregator performance;
    # ensure the disk is specced high enough, and check for bottlenecks
    # default: 128Gi
    storageRequest: 512Gi
  resources:
    requests:
      cpu: 1000m
      memory: 1Gi
    limits:
      # cpu: 2000m
      memory: 16Gi

There is no fixed threshold for what counts as a larger deployment; it depends on load times in your Kubecost environment.

Running the upgrade

If you have not already, create the required Kubernetes secrets. Refer to the Federated ETL doc and Cloud Integration doc for more details.

kubectl create secret generic federated-store -n kubecost --from-file=federated-store.yaml
kubectl create secret generic cloud-integration -n kubecost --from-file=cloud-integration.json

Finally, upgrade your existing Kubecost installation. This command will install Kubecost if it does not already exist.

If you are upgrading from an existing installation, make sure to append your existing values.yaml configurations to the ones described above. The command below expects the combined values in a file named aggregator.yaml.

helm upgrade --install "kubecost" \
  --repo https://kubecost.github.io/cost-analyzer/ cost-analyzer \
  --namespace kubecost \
  -f aggregator.yaml

Validating that the Aggregator pod is running successfully

When first enabled, the Aggregator pod ingests the last 90 days (if applicable) of ETL data from the federated-store. Aggregator ignores the combined folder, so the legacy Federator pod is not used here, although it can still run if needed. As ETL_DAILY_STORE_DURATION_DAYS increases, so does the time Aggregator takes to make data available. You can run kubectl get pods to confirm that the Aggregator pod is running, but you should still wait for all data to be ingested.
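
The pod check can be scripted. The sketch below assumes Kubecost runs in the kubecost namespace and that the Aggregator pod's name contains "aggregator"; both are assumptions to adjust for your installation, and the function names are illustrative. A Running phase only shows that the pod has started, not that ingestion has finished.

```shell
# Assumption: Kubecost is installed in the "kubecost" namespace.
NAMESPACE="${NAMESPACE:-kubecost}"

# Print the name and phase of every pod whose name contains "aggregator".
# Assumption: the Aggregator pod name contains that string.
aggregator_pods() {
  kubectl get pods -n "$NAMESPACE" --no-headers \
    -o custom-columns='NAME:.metadata.name,PHASE:.status.phase' |
    grep aggregator
}

# Succeed only if at least one Aggregator pod exists and all are Running.
aggregator_running() {
  pods="$(aggregator_pods)" || return 1   # grep fails when no pod matches
  printf '%s\n' "$pods" |
    awk '$2 != "Running" { bad = 1 } END { exit bad }'
}
```

For example, `aggregator_running && echo "Aggregator pod is running"` prints the message only when every matching pod reports the Running phase.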

Troubleshooting Aggregator

Resetting Aggregator StatefulSet data

When the Aggregator is deployed as a StatefulSet, its data can be reset. The Aggregator's local database is built entirely from the ETL files in object storage, which remains the source of truth. A reset therefore involves removing the Aggregator's local storage and allowing it to re-ingest data from the object store. The procedure is as follows:

  1. Scale down the Aggregator StatefulSet to 0

  2. When the Aggregator pod is gone, delete the aggregator-db-storage-xxx-0 PVC

  3. Scale the Aggregator StatefulSet back to 1. This will re-create the PVC, empty.

  4. Wait for Kubecost to re-ingest data from the object store. This could take from several minutes to several hours, depending on your data size and retention settings.
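
The first three steps above can be sketched as a shell function. This is an illustration under stated assumptions, not an official procedure: the function name is hypothetical, the StatefulSet and PVC names must be supplied by you (find them with kubectl get statefulset,pvc -n kubecost), and the namespace defaults to kubecost. It relies on the Kubernetes convention that a StatefulSet's first pod is named after the StatefulSet with a -0 suffix.

```shell
# Usage: reset_aggregator STATEFULSET-NAME PVC-NAME [NAMESPACE]
# All names are supplied by the caller; nothing here is Kubecost-specific.
reset_aggregator() {
  if [ "$#" -lt 2 ]; then
    echo "usage: reset_aggregator STATEFULSET PVC [NAMESPACE]" >&2
    return 2
  fi
  sts="$1"
  pvc="$2"
  ns="${3:-kubecost}"

  # Step 1: scale the StatefulSet down to 0.
  kubectl scale statefulset "$sts" -n "$ns" --replicas=0 || return 1

  # Step 2: wait until the pod is gone, then delete the PVC.
  kubectl wait --for=delete "pod/${sts}-0" -n "$ns" --timeout=300s || return 1
  kubectl delete pvc "$pvc" -n "$ns" || return 1

  # Step 3: scale back to 1, which re-creates an empty PVC.
  kubectl scale statefulset "$sts" -n "$ns" --replicas=1
}
```

Each step stops the function on failure, so the PVC is never deleted while the pod may still be using it. Step 4, waiting for re-ingestion, still has to be observed separately.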

Aggregator not displaying any data to frontend after several hours

One reason you may not yet see data in the frontend is that the Aggregator is still processing the ETL files in the federated store bucket into SQL tables.

If you are seeing a lot of the following logs, it could be an indicator that your .Values.kubecostAggregator.env.DB_BUCKET_REFRESH_INTERVAL may be set too low, causing the Aggregator to continuously restart its data ingestion process:

INF asset worker context cancelled: context canceled
INF allocation worker context cancelled: context canceled

To fix this, increase the environment variable's value step by step until these log lines no longer appear. We recommend starting with 1h. The environment variable is described in more detail under Aggregator Optimizations above.

kubecostAggregator:
  env:
    DB_BUCKET_REFRESH_INTERVAL: 1h
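
To judge whether a new interval has helped, you can count how often the cancellation messages appear in recent logs. The sketch below is a hypothetical helper that reads log lines from standard input and matches the two messages shown above; the pod name in the usage comment is a placeholder, and a multi-container pod may also need a -c CONTAINER argument.

```shell
# Count log lines that report a cancelled asset or allocation ingestion worker.
# Reads from standard input and always prints a number (0 when none match).
count_cancelled() {
  grep -c 'worker context cancelled: context canceled' || true
}

# Usage (replace the placeholder pod name with your own):
#   kubectl logs KUBECOST-AGGREGATOR-POD-NAME -n kubecost --since=1h | count_cancelled
```

A count that keeps growing between checks suggests the interval is still too low for the amount of data in the bucket.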

Checking the database for node metadata

Confirming whether node metadata exists in your database can be useful when troubleshooting missing data. Run the following command to open a shell in the Aggregator pod:

kubectl exec -it KUBECOST-AGGREGATOR-POD-NAME -n kubecost -- sh

Change to the directory that contains your database:

cd /var/configs/waterfowl/duckdb/v0_9_2
ls -lah

Copy the database to a new file for testing, to avoid modifying the original data:

cp kubecost-example.duckdb.read kubecost-example.duckdb.read.kubecost.copy

Open a DuckDB REPL pointed at the copied database:

duckdb kubecost-example.duckdb.read.kubecost.copy

Run the following debugging queries to check if node data is available:

.maxrows 100
show tables;
describe node_1h;
select * from node_1h;
select providerid,windowstart,windowend,* from node_1h;
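
For a quick, non-interactive check, the same table can be summarized in one command. The sketch below is an assumption-laden illustration: the function name is hypothetical, it runs inside the Aggregator pod where the duckdb binary is available, and it relies on the node_1h table and its windowstart and windowend columns as used in the queries above. The -readonly flag opens the file without write access, but running against the copy is still the safer choice.

```shell
# Print the row count and the covered time range of the node_1h table.
# $1: path to the copied database file.
node_summary() {
  duckdb -readonly "$1" \
    "select count(*) as row_count, min(windowstart) as first_window, max(windowend) as last_window from node_1h;"
}

# Usage, from the directory containing the copy:
#   node_summary kubecost-example.duckdb.read.kubecost.copy
```

A row count of zero, or a time range that stops well before the present, points to node data not having been ingested yet.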
