Thanos Federation

This feature is only officially supported on Kubecost Enterprise plans.

Thanos is a tool to aggregate Prometheus metrics to a central object storage (S3 compatible) bucket. Thanos is implemented as a sidecar on the Prometheus pod on all clusters. Thanos Federation is one of two primary methods to aggregate all cluster information back to a single view as described in our Multi-Cluster article.

The preferred method for multi-cluster is ETL Federation. The configuration guide below is for Kubecost Thanos Federation, which may not scale as well as ETL Federation in large environments.

This guide will cover how to enable Thanos on your primary cluster, and on any additional secondary clusters.

Configuration

Follow the steps in Configuring Thanos to enable all required Thanos components on a Kubecost primary cluster, including the Prometheus sidecar.
Each additional cluster needs only the Thanos sidecar.

Consider the following Thanos recommendations for secondaries:

* Reuse your existing storage bucket and access credentials.
* Do not deploy multiple instances of `thanos-compact`.
* Optionally deploy `thanos-bucket` in each additional cluster, but it is not required.
* Optionally disable `thanos.store` and `thanos.query`. Clusters with store and query disabled only have access to their own metrics but still write to the global bucket.

Thanos modules can be disabled in any of three ways: in [thanos/values.yaml](https://github.com/kubecost/cost-analyzer-helm-chart/blob/master/cost-analyzer/charts/thanos/values.yaml); in [values-thanos.yaml](https://github.com/kubecost/cost-analyzer-helm-chart/blob/develop/cost-analyzer/values-thanos.yaml), if you override these values from a values-thanos.yaml file supplied on the command line (`helm upgrade kubecost -f values.yaml -f values-thanos.yaml`); or by passing the parameters directly to Helm install or upgrade as follows:

```
--set thanos.compact.enabled=false --set thanos.bucket.enabled=false
```

You can also optionally disable `thanos.store`, `thanos.query` and `thanos.queryFrontend` with thanos/values.yaml or with these flags:

```
--set thanos.query.enabled=false --set thanos.store.enabled=false --set thanos.queryFrontend.enabled=false
```

Provide a unique identifier for `prometheus.server.global.external_labels.cluster_id` (e.g. `cluster-two`) so that each additional cluster is visible in the Kubecost product.

`cluster_id` can be replaced with another label (e.g. `cluster`) by modifying `.Values.kubecostModel.promClusterIDLabel`.
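
As a rough sketch of how the settings above fit together on a secondary cluster, the snippet below writes them into a single Helm values override. The file name `values-secondary.yaml` and the ID `cluster-two` are placeholders, and `kubecostProductConfigs.clusterName` is set to the same ID on the assumption that you follow Step 3 of Configuring Thanos. Which components you disable remains your choice.

```shell
# Hypothetical cluster ID for this secondary cluster; it must be unique per cluster.
CLUSTER_ID="cluster-two"

# Write a Helm values override for a secondary cluster (file name is an assumption).
cat > values-secondary.yaml <<EOF
kubecostProductConfigs:
  clusterName: ${CLUSTER_ID}
prometheus:
  server:
    global:
      external_labels:
        cluster_id: ${CLUSTER_ID}
thanos:
  compact:
    enabled: false   # avoid multiple thanos-compact instances
  bucket:
    enabled: false   # optional on secondaries
  store:
    enabled: false
  query:
    enabled: false
  queryFrontend:
    enabled: false
EOF

# Apply it on top of the Thanos values, for example:
#   helm upgrade kubecost kubecost/cost-analyzer --install --namespace kubecost \
#     -f values-thanos.yaml -f values-secondary.yaml
```

Keeping the overrides in a file rather than in `--set` flags makes it easier to review what differs between the primary and each secondary.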

Follow the same verification steps available here.

Sample configurations for each cloud provider can be found here.

Architecture diagram

Configuring Thanos

This feature is only officially supported on Kubecost Enterprise plans.

Kubecost leverages Thanos and durable storage for three different purposes:

* Centralize metric data for a global multi-cluster view into Kubernetes costs via a Prometheus sidecar
* Allow for unlimited data retention
* Back up Kubecost ETL data

To enable Thanos, follow these steps:

Step 1: Create object-store.yaml

This step creates the object-store.yaml file that contains your durable storage target (e.g. GCS, S3, etc.) configuration and access credentials. The details of this file are documented thoroughly in Thanos documentation.
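
For orientation, a minimal S3-flavored file might look like the sketch below. The `type`/`config` layout and the `bucket`, `endpoint`, and `region` keys follow the Thanos object storage format; the values are placeholders, and access keys are left out on the assumption that the Pod's service account already grants access to the bucket. Treat the guide for your provider as authoritative.

```shell
# Write a minimal S3 object-store.yaml.
# Bucket, endpoint, and region are placeholder values; replace them with your own.
cat > object-store.yaml <<'EOF'
type: S3
config:
  bucket: "kc-thanos-store"
  endpoint: "s3.amazonaws.com"
  region: "us-east-1"
EOF
```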

We have guides for using cloud-native storage for the largest cloud providers. Other providers can be similarly configured.

Use the appropriate guide for your cloud provider:

Step 2: Create object-store secret

Create a secret with the .yaml file generated in the previous step:

kubectl create secret generic kubecost-thanos -n kubecost --from-file=./object-store.yaml
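
Before creating the secret, it can help to confirm that the file exists and has the two top-level keys a Thanos object store config is expected to carry. The helper below is a hypothetical convenience, not part of Kubecost or Thanos, and it only checks for the presence of `type:` and `config:` lines rather than validating the YAML.

```shell
# Return 0 if the file exists and has top-level "type:" and "config:" keys.
check_object_store() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "not found: $file" >&2
    return 1
  fi
  for key in type config; do
    if ! grep -q "^${key}:" "$file"; then
      echo "missing top-level key '${key}:' in $file" >&2
      return 1
    fi
  done
}

# Example: only create the secret when the check passes.
#   check_object_store ./object-store.yaml &&
#     kubectl create secret generic kubecost-thanos -n kubecost --from-file=./object-store.yaml
```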

Step 3: Unique Cluster ID

Each cluster needs to be labelled with a unique Cluster ID, which is done in two places.

values-clusterName.yaml

kubecostProductConfigs:
  clusterName: kubecostProductConfigs_clusterName
prometheus:
  server:
    global:
      external_labels:
        cluster_id: kubecostProductConfigs_clusterName
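
Because the ID has to match in both places, a quick consistency check can catch a mismatch before deploying. The sketch below uses two hypothetical helpers and naive line-based matching, so it assumes each key appears once in the file with its value on the same line and no trailing comment.

```shell
# Print the value of the first "<key>: <value>" line in a file, quotes removed.
yaml_value() {
  sed -n "s/^[[:space:]]*$1:[[:space:]]*//p" "$2" | head -n 1 | tr -d "\"'"
}

# Succeed only if clusterName and cluster_id are both set and identical.
check_cluster_id() {
  name=$(yaml_value clusterName "$1")
  label=$(yaml_value cluster_id "$1")
  [ -n "$name" ] && [ "$name" = "$label" ]
}

# Example:
#   check_cluster_id values-clusterName.yaml || echo "cluster ID mismatch" >&2
```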

Step 4: Deploying Kubecost with Thanos

The Thanos subchart includes thanos-bucket, thanos-query, thanos-store, thanos-compact, and service discovery for thanos-sidecar. These components are recommended when deploying Thanos on the primary cluster.

These values can be adjusted under the thanos block in values-thanos.yaml. Available options are here: thanos/values.yaml

helm upgrade kubecost kubecost/cost-analyzer \
    --install \
    --namespace kubecost \
    -f https://raw.githubusercontent.com/kubecost/cost-analyzer-helm-chart/v1.108/cost-analyzer/values-thanos.yaml \
    -f values-clusterName.yaml

The thanos-store container is configured to request 2.5GB of memory; this may be reduced for smaller deployments. thanos-store is only used on the primary Kubecost cluster.

To verify installation, check to see all Pods are in a READY state. View Pod logs for more detail and see common troubleshooting steps below.

Troubleshooting

Thanos sends data to the bucket every 2 hours. Once 2 hours have passed, logs should indicate if data has been sent successfully or not.

You can monitor the logs with:

kubectl logs --namespace kubecost -l app=prometheus,component=server --prefix=true --container thanos-sidecar --tail=-1 | grep uploaded

Monitoring logs this way should return results like this:

[pod/kubecost-prometheus-server-xxx/thanos-sidecar] level=debug ts=2022-06-09T13:00:10.084904136Z caller=objstore.go:206 msg="uploaded file" from=/data/thanos/upload/BUCKETID/chunks/000001 dst=BUCKETID/chunks/000001 bucket="tracing: kc-thanos-store"
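
If the raw lines are hard to scan, a small filter can reduce them to the uploaded object paths. This is only a sketch: `uploaded_destinations` is a hypothetical helper, and it assumes the `dst=` field is unquoted and contains no spaces, as in the sample line above.

```shell
# Read thanos-sidecar log lines on stdin and print the dst= field of each
# 'msg="uploaded file"' entry, one per line.
uploaded_destinations() {
  grep 'msg="uploaded file"' | sed -n 's/.* dst=\([^ ]*\).*/\1/p'
}

# Example (requires cluster access):
#   kubectl logs --namespace kubecost -l app=prometheus,component=server \
#     --container thanos-sidecar --tail=-1 | uploaded_destinations
```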

As an aside, you can validate the Prometheus metrics are all configured with correct cluster names with:

kubectl logs --namespace kubecost -l app=prometheus,component=server --prefix=true --container thanos-sidecar --tail=-1 | grep external_labels

To troubleshoot the IAM role attached to the service account, you can create a Pod using the same service account used by the thanos-sidecar (default is kubecost-prometheus-server):

s3-pod.yaml

apiVersion: v1
kind: Pod
metadata:
  labels:
    run: s3-pod
  name: s3-pod
spec:
  serviceAccountName: kubecost-prometheus-server
  containers:
  - image: amazon/aws-cli
    name: my-aws-cli
    command: ['sleep', '500']

kubectl apply -f s3-pod.yaml
kubectl exec -i -t s3-pod -- aws s3 ls s3://kc-thanos-store

This should return a list of objects (or at least not give a permission error).

Cluster not writing data to Thanos bucket

If a cluster is not successfully writing data to the bucket, review thanos-sidecar logs with the following command:

kubectl logs kubecost-prometheus-server-<your-pod-id> -n kubecost -c thanos-sidecar

Logs in the following format are evidence of a successful bucket write:

level=debug ts=2019-12-20T20:38:32.288251067Z caller=objstore.go:91 msg="uploaded file" from=/data/thanos/upload/BUCKET-ID/meta.json dst=debug/metas/BUCKET-ID.json bucket=kc-thanos

Stores not listed at the `/stores` endpoint

If thanos-query can't connect to both the sidecar and the store, you may want to directly specify the store gRPC service address instead of using DNS discovery (the default). You can quickly test if this is the issue by running:

kubectl edit deployment kubecost-thanos-query -n kubecost

and adding

--store=kubecost-thanos-store-grpc.kubecost:10901

to the container args. This will cause a query restart and you can visit /stores again to see if the store has been added.

If it has, you'll want to use these addresses instead of DNS more permanently by setting .Values.thanos.query.stores in values-thanos.yaml.

...
thanos:
  store:
    enabled: true
    grpcSeriesMaxConcurrency: 20
    blockSyncConcurrency: 20
    extraEnv:
      - name: GOGC
        value: "100"
    resources:
      requests:
        memory: "2.5Gi"
  query:
    enabled: true
    timeout: 3m
    # Maximum number of queries processed concurrently by query node.
    maxConcurrent: 8
    # Maximum number of select requests made concurrently per a query.
    maxConcurrentSelect: 2
    resources:
      requests:
        memory: "2.5Gi"
    autoDownsampling: false
    extraEnv:
      - name: GOGC
        value: "100"
    stores:
      - "kubecost-thanos-store-grpc.kubecost:10901"

Additional Troubleshooting

A common error is as follows, which means you do not have the correct access to the supplied bucket:

thanos-svc-account@project-227514.iam.gserviceaccount.com does not have storage.objects.list access to thanos-bucket., forbidden"

Assuming pods are running, use port forwarding to connect to the thanos-query-http endpoint:

kubectl port-forward svc/kubecost-thanos-query-http 8080:10902 --namespace kubecost

Then navigate to http://localhost:8080 in your browser. This page should look very similar to the Prometheus console.

If you navigate to Stores using the top navigation bar, you should be able to see the status of both the thanos-store and thanos-sidecar which accompanied the Prometheus server:

Also note that the sidecar should identify with the unique cluster_id provided in your values.yaml in the previous step. Default value is cluster-one.

By default, data is moved into object storage every 2 hours, a setting based on Thanos suggested values, so it will be 2 hours before data is first written to the provided bucket.

You do not need to wait 2 hours to confirm that Thanos is configured correctly. The default log level for Thanos workloads is debug (logging is very light even at debug), so you can check the logs of thanos-sidecar, which is part of the prometheus-server Pod, and of thanos-store. The logs should clearly indicate whether there was a problem consuming the secret and what the issue is. For more on Thanos architecture, view this resource.
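
Since debug output mixes routine entries with failures, one option is to filter the stream down to warnings and errors. The helper below is a hypothetical sketch that assumes the logfmt-style `level=` field seen in the sample log lines in this guide; it makes no claim about which messages Thanos emits for a given failure.

```shell
# Read Thanos log lines on stdin and print only warn- and error-level entries.
# "|| true" keeps the exit status at 0 when nothing matches.
thanos_problems() {
  grep -E '(^| )level=(warn|error)( |$)' || true
}

# Example (requires cluster access):
#   kubectl logs kubecost-prometheus-server-<your-pod-id> -n kubecost -c thanos-sidecar | thanos_problems
```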

Thanos Upgrade

Kubecost v1.67.0+ uses Thanos 0.15.0. If you're upgrading to Kubecost v1.67.0+ from an older version and using Thanos, with AWS S3 as your backing storage for Thanos, you'll need to make a small change to your Thanos Secret in order to bump the Thanos version to 0.15.0 before you upgrade Kubecost.

Thanos 0.15.0 has over 10x performance improvements, so this is recommended.

This is simplified if you're using our default values-thanos.yaml, which has the new configs already.

For the Thanos Secret you're using, the encrypt-sse line needs to be removed. Everything else should stay the same.

For example, view this sample config:

The easiest way to do this is to delete the existing secret and upload a new one:

kubectl delete secret -n kubecost kubecost-thanos

Update your secret .yaml file as above, and save it as object-store.yaml.
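
One way to make that edit without touching the original file is to filter the line out into a new copy. This is a sketch with a hypothetical helper and file names; the pattern matches `encrypt-sse` as written above and also the underscore spelling `encrypt_sse`, in case your file uses that form.

```shell
# Print a copy of an object-store file without its SSE line.
# The pattern accepts both "encrypt-sse" and "encrypt_sse".
strip_encrypt_sse() {
  grep -v -E '^[[:space:]]*encrypt[-_]sse:' "$1"
}

# Example: keep the old file and write the cleaned copy used for the new secret.
#   strip_encrypt_sse object-store.old.yaml > object-store.yaml
```

Review the resulting file before creating the secret, since everything else is expected to stay the same.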

kubectl create secret generic kubecost-thanos -n kubecost --from-file=./object-store.yaml

Once this is done, you're ready to upgrade!
