To get started with Kubecost and OpenCost, visit our Installation page which will take you step by step through getting Kubecost set up.
This installation method is available for free and leverages the Kubecost Helm Chart. It provides access to all OpenCost and Kubecost community functionality and can scale to large clusters. This will also provide a token for trialing and retaining data across different Kubecost product tiers.
You can also install directly with the Kubecost Helm Chart with Helm v3.1+ using the following commands. This provides the same functionality as the step above but doesn't generate a product token for managing tiers or upgrade trials.
You can run Helm Template against the Kubecost Helm Chart to generate local YAML output. This requires extra effort when compared to directly installing the Helm Chart but is more flexible than deploying a flat manifest.
You can install via flat manifest. This install path is not recommended because it has limited flexibility for managing your deployment and future upgrades.
Lastly, you can deploy the open-source OpenCost project directly as a Pod. This install path provides a subset of free functionality and is available here. Specifically, this install path deploys the underlying cost allocation model without the same UI or access to enterprise functionality: cloud provider billing integration, RBAC/SAML support, and scale improvements in Kubecost.
Kubecost has a number of product configuration options that you can specify at install time in order to minimize the number of settings changes required within the product UI. This makes it simple to redeploy Kubecost. These values can be configured under `kubecostProductConfigs` in our values.yaml. These parameters are passed to a ConfigMap that Kubecost detects and writes to its `/var/configs`.
If you encounter any errors while installing Kubecost, first visit our Troubleshoot Install doc. If the error you are experiencing is not already documented here, or a solution is not found, contact our Support team at support@kubecost.com for more help.
Kubecost releases are scheduled on a near-monthly basis. You can keep up to date with new Kubecost updates and patches by following our release notes here.
After installing Kubecost, you will be able to update Kubecost with the following command, which will upgrade you to the most recent version:
You can upgrade or downgrade to a specific version of Kubecost with the following command:
To uninstall Kubecost and its dependencies, run the following command:
After successfully installing Kubecost, first time users should review our First Time User Guide to start immediately seeing the benefits of the product while also ensuring their workspace is properly set up.
Often while using and configuring Kubecost, our documentation may ask you to pass certain Helm flag values. There are three different approaches for passing custom Helm values into your Kubecost product, which are explained in this doc. In these examples, we are updating the `kubecostProductConfigs.productKey.key` Helm value, which enables Kubecost Enterprise; however, these methods will work for all other Helm flags.
Method 1: `--set` command-line flags
For example, you can pass only a product key if that is all you need to configure.
Method 2: `values` file
Similar to Method 1, you can create a separate values file that contains only the parameters needed.
Your values.yaml should look like this:
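The relationship between a dotted `--set` path and the nesting used in a values file can be illustrated with a short Python sketch. It is not part of Kubecost or Helm: `nest_helm_value` is a hypothetical helper, `<PRODUCT_KEY>` is a placeholder, and the output is printed as JSON (which YAML parsers also accept) purely for illustration. Paths that contain escaped dots are not handled.

```python
import json


def nest_helm_value(dotted_path, value):
    """Expand a dotted Helm value path into the nested mapping a values file uses."""
    nested = value
    # Walk the path from the innermost key outwards, wrapping one level at a time.
    for key in reversed(dotted_path.split(".")):
        nested = {key: nested}
    return nested


# "<PRODUCT_KEY>" is a placeholder, not a real product key.
values = nest_helm_value("kubecostProductConfigs.productKey.key", "<PRODUCT_KEY>")
print(json.dumps(values, indent=2))
```

Each dot in the flag path becomes one level of indentation in the values file, which is why a file with only these three nested keys is equivalent to the single `--set` flag.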
Then run your install command:
This file contains the default Helm values that come with your Kubecost install. Taking this approach means you may need to sync with the repo to use the latest release. Be careful when applying certain Helm values related to your UI configuration to your secondary clusters. For more information, see this section in our Multi-Cluster doc about primary and secondary clusters.
Once you have familiarized yourself with Kubecost and integrated with any cloud providers, it's time to move on to more advanced concepts. This doc provides commonly used product configurations and feature overviews to help get you up and running after the Kubecost product has been installed. You may be redirected to other Kubecost docs to learn more about specific concepts or follow tutorials.
The default Kubecost installation has a 32 GB persistent volume and a 15-day retention period for Prometheus metrics. This is enough space to retain data for roughly 300 pods, depending on your exact node and container count. See the Kubecost Helm chart configuration options to adjust both the retention period and storage size.
To determine the appropriate disk size, you can use this formula to approximate:
needed_disk_space = retention_time_seconds * ingested_samples_per_second * bytes_per_sample
Where ingested samples can be measured as the average over a recent period, e.g. `sum(avg_over_time(scrape_samples_post_metric_relabeling[24h]))`. On average, Prometheus uses around 1.5-2 bytes per sample. So, ingesting 100k samples per minute and retaining them for 15 days would demand around 4 GB (about 2.16 billion samples at 2 bytes each). It's recommended to add another 20-30% capacity for headroom and WAL. More info on disk sizing here.
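The sizing arithmetic can be checked with a small Python sketch. It assumes the usual Prometheus estimate (retention time in seconds multiplied by ingested samples per second and bytes per sample); the function name and the `headroom` parameter are illustrative, not part of Kubecost or Prometheus. With the example inputs used here (100k samples per minute, 15 days, 2 bytes per sample), the arithmetic gives about 4.3 GB before headroom.

```python
def prometheus_disk_bytes(samples_per_second, retention_days,
                          bytes_per_sample=2.0, headroom=0.3):
    """Estimate Prometheus disk usage in bytes, including fractional headroom."""
    retention_seconds = retention_days * 24 * 60 * 60
    base = retention_seconds * samples_per_second * bytes_per_sample
    return base * (1 + headroom)


# 100k samples per minute expressed as samples per second.
samples_per_second = 100_000 / 60

base_gb = prometheus_disk_bytes(samples_per_second, 15, 2.0, headroom=0.0) / 1e9
padded_gb = prometheus_disk_bytes(samples_per_second, 15, 2.0, headroom=0.3) / 1e9
print(f"{base_gb:.2f} GB base, {padded_gb:.2f} GB with 30% headroom")
```

Treat the result as a rough estimate: real usage depends on series churn and compression, so measure your own ingestion rate before resizing the volume.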
More than 30 days of data should not be stored in Prometheus for larger clusters. For long-term data retention, contact us at support@kubecost.com about Kubecost with durable storage enabled. More info on Kubecost storage here.
Users should set and/or update resource requests and limits before taking Kubecost into production at scale. These inputs can be configured in the Kubecost values.yaml for Kubecost modules and subcharts.
The exact recommended values for these parameters depend on the size of your cluster, availability requirements, and usage of the Kubecost product. Suggested values for each container can be found within Kubecost itself on the namespace page. More info on these recommendations is available here.
For best results, run Kubecost for up to seven days on a production cluster, then tune resource requests/limits based on resource consumption.
To broaden usage to other teams or departments within your Kubecost environment, basic security measures will usually be required. There are a number of options for protecting your workspace depending on your Kubecost product tier.
Establishing an ingress controller will allow for control of access for your workspace. Learn more about enabling external access in Kubecost with our Ingress Examples doc.
SSO/SAML/RBAC/OIDC are only officially supported on Kubecost Enterprise plans.
You can configure SSO and RBAC on a separate baseline deployment, which will not only shorten the deployment time of security features, but it will also avoid unwanted access denial. This is helpful when using only one developer deployment. See our user management guides below:
For teams already running node exporter on the default port, our bundled node exporter may remain in a `Pending` state. You can optionally use an existing node exporter DaemonSet by setting the `prometheus.nodeExporter.enabled` and `prometheus.serviceAccounts.nodeExporter.create` Kubecost Helm chart config options to `false`. This requires your existing node exporter endpoint to be visible from the namespace where Kubecost is installed. More config options are shown here.
You may optionally pass the following Helm flags to install Kubecost and its bundled dependencies without any persistent volumes. However, any time the Prometheus server pod is restarted, all historical billing data will be lost unless Thanos or other long-term storage is enabled in the Kubecost product.
Efficiency and idle costs can teach you more about the cost-value of your Kubernetes spend by showing you how efficiently your resources are used. To learn more about pod resource efficiency and cluster idle costs, see Efficiency and Idle.
Kubecost requires a Kubernetes cluster to be deployed.
Users should be running Kubernetes 1.20+.
Kubernetes 1.28 is officially supported as of v1.105.
Versions outside of the stated compatibility range may work, depending on individual configurations, but are untested.
Managed Kubernetes clusters (e.g. EKS, GKE, AKS), which are the most common
Kubernetes distributions (e.g. OpenShift, DigitalOcean, Rancher, Tanzu)
Bootstrapped Kubernetes cluster
On-prem and air-gapped using custom pricing sheets
| Cloud provider | Supported regions | Architectures |
|---|---|---|
| AWS (Amazon Web Services) | All regions supported, as shown in opencost/pkg/cloud/awsprovider.go | x86, ARM |
| GCP (Google Cloud Platform) | All regions supported, as shown in opencost/pkg/cloud/gcpprovider.go | x86 |
| Azure (Microsoft) | All regions supported, as shown in opencost/pkg/cloud/azureprovider.go | x86 |
This list is certainly not exhaustive! This is simply a list of observations as to where our users run Kubecost based on their questions and feedback. Please contact us with any questions!
Kubecost helps you monitor and manage cost and capacity in Kubernetes environments. We integrate with your infrastructure to help your team track, manage, and reduce spend.
Below are frequently visited Kubecost documentation pages for both the Commercial Kubecost product and OpenCost.
On this site, you’ll find everything you need to set up Kubecost for your team.
Kubecost provides real-time cost visibility and insights for teams using Kubernetes, helping you continuously reduce your cloud costs.
Cost allocation
Flexible, customizable cost breakdown and resource allocation for accurate showbacks, chargebacks, and ongoing monitoring
Unified cost monitoring
See all of your Kubernetes and out-of-cluster spend in one place, with full cloud service billing integration
Optimization Insights
Get customized recommendations based on your own environment and behavior patterns
Alerts and governance
Achieve peak application performance and improve reliability with customizable alerts, configurable Availability Tiers, and real-time updates.
Purpose-built for teams running Kubernetes
Running containers on Kubernetes requires a new approach for visualizing and optimizing spend. Kubecost is designed from the ground up for Kubernetes and the Cloud Native ecosystem.
Own & control all of your own data
Kubecost is fully deployed in your infrastructure—we don’t require you to egress any data to a remote service. It’s deeply important to us that users are able to retain and control access to their own private information, e.g. sensitive cloud spend data.
Built on open source
Kubecost began as an open source project with a goal of giving small engineering teams access to great cost visibility. As a result, our solution is tightly integrated with the open source cloud native ecosystem, e.g. Kubernetes, Prometheus, and Grafana.
For larger teams and companies with more complex infrastructure, you need the right features in place for efficiency, administration, and security. Kubecost Enterprise offers even more features and control so that any team can use our products, according to your entire organization’s standards.
Unified visibility across all Kubernetes clusters
View aggregate spend allocation across all environments by cluster, namespace, label, team, service, etc. As an example, this functionality allows you to see the cost of a namespace or set of labels across all of your clusters. An unlimited number of clusters is supported.
Long-term metric retention
Retain data for years with various durable storage options. Provides record keeping on spend, allocation, and efficiency metrics with simple backup & restore functionality.
Access control with SSO/SAML
Finely manage read and/or admin access by individual users or user groups. Learn more about [configuring user management](install-and-configure/advanced-configuration/user-management-saml/README.md).
High availability mode
Use multiple Kubecost replica pods with a Leader/Follower implementation to ensure one leader always exists across all replicas to run high availability mode. Learn more.
Advanced custom pricing
Advanced custom pricing pipelines give teams the ability to set custom per-asset pricing for resources. This is typically used for on-prem and air-gapped environments, but can also be applied to teams that want to allocate internal costs differently than cloud provider defaults.
Advanced integrations
Connect internal alerting, monitoring, and BI solutions to Kubecost metrics and reporting.
Enterprise Support
Dedicated SRE support via private Slack channel and video calls. Expert technical support and guidance based on your specific goals.
Check out our Installation guide to review your install options and get started on your Kubecost journey. Installation and onboarding only take a few minutes.
Once Kubecost has been successfully installed, check out our First Time User Guide, which will get you started with connecting to your cluster's cloud service provider, reviewing your data, and setting up multi-cluster environments.
If your Kubecost installation was not successful, go to our Troubleshoot Install doc, which will walk you through some of the most common installation-related issues.
Additionally, check out our blog to learn more about best practices with Kubecost's cost monitoring.
You can stay up-to-date with Kubecost by following releases on GitHub and the official Kubecost blog.
Follow us on social media here.
Contact us via email (support@kubecost.com) or join us on Slack if you have questions!
After successfully installing Kubecost, new users should familiarize themselves with these onboarding steps to begin immediately realizing value. This doc will explain to you the core features and options you will have access to and direct you to other necessary docs groups that will help you get set up.
While certain steps in this article may be optional depending on your setup, these are recommended best practices for seeing the most value out of Kubecost as soon as possible.
Many Kubernetes adopters may have billing with cloud service providers (CSPs) that differs from public pricing. By default, Kubecost will detect the CSP of the cluster where it is installed and pull list prices for nodes, storage, and LoadBalancers across all major CSPs: Azure, AWS, and GCP.
However, Kubecost is also able to integrate these CSPs to receive the most accurate billing data. By completing a cloud integration, Kubecost is able to reconcile costs with your actual cloud bill to reflect enterprise discounts, Spot market prices, commitment discounts, and more.
New users should seek to integrate any and all CSPs they use into Kubecost. For an overview of cloud integrations and getting started, see our Cloud Billing Integrations doc. Once you have completed all necessary integrations, return to this article.
Due to the frequency of updates from providers, it can take anywhere from 24 to 48 hours to see adjusted costs.
Now that your base install and CSP integrations are complete, it's time to determine the accuracy against your cloud bill. Based on different methods of cost aggregation, Kubecost should assess your billing data within a 3-5% margin of error.
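One way to check the margin is to compare a Kubecost total against the matching total on the cloud bill for the same window. The Python sketch below shows that comparison; the function name, the dollar figures, and the 5% threshold applied here are illustrative assumptions, not output from Kubecost.

```python
def billing_error_pct(kubecost_total, cloud_bill_total):
    """Return the absolute percentage difference relative to the cloud bill."""
    if cloud_bill_total == 0:
        raise ValueError("cloud bill total must be non-zero")
    return abs(kubecost_total - cloud_bill_total) / cloud_bill_total * 100


# Hypothetical totals for the same billing window, in the same currency.
error_pct = billing_error_pct(9_700.0, 10_000.0)
within_margin = error_pct <= 5.0
print(f"{error_pct:.1f}% difference, within 5%: {within_margin}")
```

Compare like with like: use the same date range and the same set of accounts on both sides, and wait until reconciliation has had time to run.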
After enabling port-forwarding, you should have access to the Kubecost UI. Explore the different pages in the left navigation, starting with the Monitor dashboards. These pages, including Allocations, Assets, Clusters, and Cloud Costs, are comprised of different categories of cost spending, and allow you to apply customized queries for specific billing data. These queries can then be saved in the form of reports for future quick access. Each page of the Kubecost UI has more dedicated information in the Navigating the Kubecost UI section.
It's important to take precautions to ensure your billing data is preserved, and you know how to monitor your infrastructure's health.
Metrics reside in Prometheus, but extracting information directly from this store, whether for the UI or for API responses, is not performant at scale. For this reason, the data is optimized and stored through a process called extract, transform, load (ETL). In Kubecost documentation, ETL usually refers to this process.
Like any other system, backup of critical data is a must, and backing up ETL is no exception. To address this, we offer a number of different options based on your product tier. Descriptions and instructions for our backup functionalities can be found in our ETL Backup doc.
Similar to most systems, monitoring health is vital. For this, we offer several means of monitoring the health of both Kubecost and the host cluster.
Alerts can be configured to enable a proactive approach to monitoring your spend, and can be distributed across different workplace communication tools including email, Slack, and Microsoft Teams. Alerts can establish budgets for your different types of spend and cost-efficiency, and warn you if those budgets are reached. These Alerts are able to be configured via Helm or directly in your Kubecost UI.
The Health page will display an overall cluster health score which assesses how reliably and efficiently your infrastructure is performing. Scores start at 100 and decrease based on how severe any present errors are.
Kubecost has multiple ways of supporting multi-cluster environments, which vary based on your Kubecost product tier.
Kubecost Free will only allow you to view a single cluster at a time in the Kubecost UI. However, you can connect multiple different clusters and switch through them using Kubecost's context switcher.
Kubecost Enterprise provides a "single-pane-of-glass" view which combines metrics across all clusters into a shared storage bucket. One cluster is designated as the primary cluster from which you view the UI, with all other clusters considered secondary. Attempting to view the UI through a secondary cluster will not display metrics across your entire environment.
It is recommended to complete the steps above for your primary cluster before adding any secondary clusters. To learn more about advanced multi-cluster/Federated configurations, see our Multi-Cluster doc.
After completing these primary steps, you are well on your way to being proficient in Kubecost. However, managing Kubernetes infrastructure can be complicated, and for that we have plenty more documentation to help. For advanced or optional configuration options, see our Next Steps with Kubecost guide which will introduce you to additional concepts.
Multi-cloud integrations are only officially supported on Kubecost Enterprise plans.
This document outlines how to set up cloud integration for accounts on multiple cloud service providers (CSPs), or multiple accounts on the same cloud provider. This configuration can be used independently of, or in addition to, other cloud integration configurations provided by Kubecost. Once configured, Kubecost will display cloud assets for all configured accounts and perform reconciliation for all federated clusters that have their respective accounts configured.
For each cloud account that you would like to configure, you will need to make sure that it is exporting cost data to its respective service to allow Kubecost to gain access to it.
Azure: Set up cost data export following this guide.
GCP: Set up BigQuery billing data exports with this guide.
AWS: Follow steps 1-3 to set up and configure a Cost and Usage Report (CUR) in our guide.
Alibaba: Create a user account with access to the QueryInstanceBill API.
The secret should contain a file named cloud-integration.json with the following format (only containing applicable CSPs in your setup):
This method of cloud integration supports multiple configurations per cloud provider simply by adding each cost export to their respective arrays in the .json file. The structure and required values for the configuration objects for each cloud provider are described below. Once you have filled in the configuration object, use the command:
Once the secret is created, set `.Values.kubecostProductConfigs.cloudIntegrationSecret` to `<SECRET_NAME>` and upgrade Kubecost via Helm.
A GitHub repository with sample files required can be found here. Select the folder with the name of the cloud service you are configuring.
The following values can be located in the Azure Portal under Cost Management > Exports, or Storage accounts:
`azureSubscriptionID` is the Subscription ID belonging to the Storage account which stores your exported Azure cost report data.
`azureStorageAccount` is the name of the Storage account where the exported Azure cost report data is being stored.
`azureStorageAccessKey` can be found by selecting Access Keys from the navigation sidebar then selecting Show keys. Using either of the two keys will work.
`azureStorageContainer` is the name that you chose for the exported cost report when you set it up. This is the name of the container where the CSV cost reports are saved in your Storage account.
`azureContainerPath` is an optional value which should be used if there is more than one billing report that is exported to the configured container. The path provided should have only one billing export because Kubecost will retrieve the most recent billing report for a given month found within the path.
`azureCloud` is an optional value which denotes the cloud where the storage account exists. Possible values are `public` and `gov`. The default is `public`.
Set these values into the following object and add them to the Azure array:
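As a rough illustration, the Python sketch below assembles one Azure configuration object from the keys described above and places it in an array. The helper name, the placeholder values, and the top-level `azure` key are assumptions made for this example; confirm the exact file layout against the sample files before using it.

```python
import json

# Keys described above; the first four are treated as required in this sketch.
REQUIRED_KEYS = (
    "azureSubscriptionID",
    "azureStorageAccount",
    "azureStorageAccessKey",
    "azureStorageContainer",
)


def build_azure_config(values):
    """Validate the supplied values and return one Azure configuration object."""
    missing = [key for key in REQUIRED_KEYS if not values.get(key)]
    if missing:
        raise ValueError(f"missing required Azure values: {missing}")
    config = {key: values[key] for key in REQUIRED_KEYS}
    # Optional: only needed when several exports share one container.
    if values.get("azureContainerPath"):
        config["azureContainerPath"] = values["azureContainerPath"]
    # Optional: defaults to the public cloud.
    cloud = values.get("azureCloud", "public")
    if cloud not in ("public", "gov"):
        raise ValueError("azureCloud must be 'public' or 'gov'")
    config["azureCloud"] = cloud
    return config


# All values below are placeholders.
example = build_azure_config({
    "azureSubscriptionID": "<SUBSCRIPTION_ID>",
    "azureStorageAccount": "<STORAGE_ACCOUNT_NAME>",
    "azureStorageAccessKey": "<STORAGE_ACCESS_KEY>",
    "azureStorageContainer": "<CONTAINER_NAME>",
})
cloud_integration = {"azure": [example]}  # "azure" as the array key is an assumption
print(json.dumps(cloud_integration, indent=2))
```

Additional Azure exports would be appended to the same array, one object per export.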
If you don't already have a GCP service key for any of the projects you would like to configure, you can run the following commands in your command line to generate and export one. Make sure your GCP project is where your external costs are being run.
You can then get your service account key to paste into the UI:
`<KEY_JSON>` is the GCP service key created above. This value should be left as JSON when inserted into the configuration object.
`<PROJECT_ID>` is the Project ID in the GCP service key.
`<BILLING_DATA_DATASET>` requires a BigQuery dataset prefix (e.g. `billing_data`) in addition to the BigQuery table name. A full example is `billing_data.gcp_billing_export_v1_018AIF_74KD1D_534A2`.
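A quick way to confirm that a `<BILLING_DATA_DATASET>` value carries both parts is to split it on the first dot. The Python helper below is a hypothetical check written for this guide, not a Kubecost function.

```python
def split_billing_dataset(value):
    """Split '<dataset>.<table>' and fail loudly if either part is missing."""
    dataset, separator, table = value.partition(".")
    if not separator or not dataset or not table:
        raise ValueError(
            "expected '<dataset>.<table>', e.g. 'billing_data.gcp_billing_export_v1_...'"
        )
    return dataset, table


dataset, table = split_billing_dataset(
    "billing_data.gcp_billing_export_v1_018AIF_74KD1D_534A2"
)
print(dataset, table)
```

A value that contains only the table name, with no dataset prefix, raises an error instead of being passed through silently.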
Set these values into the following object and add it to the GCP array:
Many of these values in this config can be generated using the following command:
For each AWS account that you would like to configure, create an Access Key for the Kubecost user who has access to the CUR. Navigate to IAM Management Console dashboard, and select Access Management > Users. Find the Kubecost user and select Security Credentials > Create Access Key. Note the Access Key ID and Secret access key.
Gather each of these values from the AWS console for each account you would like to configure.
`<ACCESS_KEY_ID>` is the ID of the Access Key created in the previous step.
`<ACCESS_KEY_SECRET>` is the secret of the Access Key created in the previous step.
`<ATHENA_BUCKET_NAME>` is the S3 bucket storing Athena query results which Kubecost has permission to access. The name of the bucket should match `s3://aws-athena-query-results-*`, so the IAM roles defined above will automatically allow access to it. The bucket can have a canned ACL set to Private or other permissions as needed.
`<ATHENA_REGION>` is the AWS region Athena is running in.
`<ATHENA_DATABASE>` is the name of the database created by the Athena setup. The Athena database name is available as the value (physical id) of `AWSCURDatabase` in the CloudFormation stack created above.
`<ATHENA_TABLE>` is the name of the table created by the Athena setup. The table name is typically the database name with the leading `athenacurcfn_` removed (but is not available as a CloudFormation stack resource).
`<ATHENA_WORKGROUP>` is the workgroup assigned to be used with Athena. Default value is `Primary`.
`<ATHENA_PROJECT_ID>` is the AWS AccountID where the Athena CUR is. For example: `530337586277`.
`<MASTER_PAYER_ARN>` is an optional value which should be set if you are using a multi-account billing set-up and are not accessing Athena through the primary account. It should be set to the ARN of the role in the management (formerly master payer) account, for example: `arn:aws:iam::530337586275:role/KubecostRole`.
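Two of the values above follow naming patterns that are easy to check before filling in the configuration. The Python sketch below derives the typical table name from a database name and tests whether a bucket name matches the expected prefix. Both helpers and the sample names are hypothetical, and since the table name is only typically derived this way, verify the result in Athena.

```python
ATHENA_DB_PREFIX = "athenacurcfn_"
ATHENA_RESULTS_PREFIX = "s3://aws-athena-query-results-"


def guess_athena_table(database):
    """Return the usual table name: the database name without the leading prefix."""
    if database.startswith(ATHENA_DB_PREFIX):
        return database[len(ATHENA_DB_PREFIX):]
    return database


def matches_results_bucket_pattern(bucket):
    """Check that the bucket looks like s3://aws-athena-query-results-<suffix>."""
    return (bucket.startswith(ATHENA_RESULTS_PREFIX)
            and len(bucket) > len(ATHENA_RESULTS_PREFIX))


# Hypothetical names used only for illustration.
table_guess = guess_athena_table("athenacurcfn_kubecost_cur")
bucket_ok = matches_results_bucket_pattern("s3://aws-athena-query-results-example")
print(table_guess, bucket_ok)
```

A bucket that does not match the pattern can still work, but its access then has to be granted explicitly rather than through the default role policy.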
Set these values into the following object and add them to the AWS array in the cloud-integration.json:
Additionally, set the `kubecostProductConfigs.athenaProjectID` Helm value to the AWS account that Kubecost is being installed in.
Kubecost does not support complete integrations with Alibaba, but you will still be able to view accurate list prices for cloud resources. Gather these following values from the Alibaba Cloud Console for your account:
`clusterRegion` is the most used region.
`accountID` is your Alibaba account ID.
`serviceKeyName` is the RAM user key name.
`serviceKeySecret` is the RAM user secret.
Set these values into the following object and add them to the Alibaba array in your cloud-integration.json:
Enabling external access to the Kubecost product requires exposing access to port 9090 on the `kubecost-cost-analyzer` pod. Exposing this endpoint will handle routing to Grafana as well. There are multiple ways to do this, including Ingress or Service definitions.
Please exercise caution when exposing Kubecost via an Ingress controller especially if there is no authentication in use. Consult your organization's internal recommendations.
Common samples below and others can be found on our GitHub repository.
The following example definitions use the NGINX Ingress Controller.
Here is a second basic auth example that uses a Kubernetes Secret.
When deploying Grafana on a non-root URL, you also need to update your grafana.ini to reflect this. More info can be found in values.yaml.
Once an AWS Load Balancer (ALB) Controller is installed, you can use the following Ingress resource manifest pointed at the Kubecost cost-analyzer service:
Integrating Kubecost with your AWS data provides the ability to allocate out-of-cluster (OOC) costs, e.g. RDS instances and S3 buckets, back to Kubernetes concepts like namespace and deployment as well as reconcile cluster assets back to your billing data. The latter is especially helpful when teams are using Reserved Instances, Savings Plans, or Enterprise Discounts. All billing data remains on your cluster when using this functionality and is not shared externally. Read our Cloud Integrations doc for more information on how Kubecost connects with Cloud Service Providers.
The following guide provides the steps required for enabling OOC costs allocation and accurate pricing, e.g. reserved instance price allocation. In a multi-account organization, all of the following steps will need to be completed in the payer account.
You can learn how to perform this using our AWS Cloud Integration doc.
Kubecost utilizes AWS tagging to allocate the costs of AWS resources outside of the Kubernetes cluster to specific Kubernetes concepts, such as namespaces, pods, etc. These costs are then shown in a unified dashboard within the Kubecost interface.
To allocate external AWS resources to a Kubernetes concept, use the following tag naming scheme:
| Kubernetes concept | AWS tag key | AWS tag value |
|---|---|---|
| Cluster | kubernetes_cluster | cluster-name |
| Namespace | kubernetes_namespace | namespace-name |
| Deployment | kubernetes_deployment | deployment-name |
| Label | kubernetes_label_NAME* | label-value |
| DaemonSet | kubernetes_daemonset | daemonset-name |
| Pod | kubernetes_pod | pod-name |
| Container | kubernetes_container | container-name |
In the `kubernetes_label_NAME` tag key, the `NAME` portion should appear exactly as the tag appears inside of Kubernetes. For example, for the tag `app.kubernetes.io/name`, this tag key would appear as `kubernetes_label_app.kubernetes.io/name`.
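The naming scheme above can be expressed as a small lookup. The Python sketch below returns the AWS tag key for a given Kubernetes concept; the function name and the lowercase concept names used as dictionary keys are illustrative choices, not a Kubecost API.

```python
# Fixed tag keys from the naming scheme above.
TAG_KEYS = {
    "cluster": "kubernetes_cluster",
    "namespace": "kubernetes_namespace",
    "deployment": "kubernetes_deployment",
    "daemonset": "kubernetes_daemonset",
    "pod": "kubernetes_pod",
    "container": "kubernetes_container",
}


def external_tag_key(concept, label_name=None):
    """Return the AWS tag key for a Kubernetes concept or a specific label."""
    if concept == "label":
        if not label_name:
            raise ValueError("label_name is required for the 'label' concept")
        # The label name is kept exactly as it appears in Kubernetes.
        return f"kubernetes_label_{label_name}"
    return TAG_KEYS[concept]


label_key = external_tag_key("label", "app.kubernetes.io/name")
namespace_key = external_tag_key("namespace")
print(label_key, namespace_key)
```

The tag value is then the name of the cluster, namespace, or workload, or the label value, that the AWS resource should be allocated to.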
To use an alternative or existing AWS tag schema, you may supply these in your values.yaml under `kubecostProductConfigs.labelMappingConfigs.<aggregation>_external_label`. Also be sure to set `kubecostProductConfigs.labelMappingConfigs.enabled=true`.
For more information, consult AWS' Tag your Amazon EC2 resources.
Tags may take several hours to show up in the Cost Allocations Tags section described in the next step.
Tags that contain `:` in the key may be converted to `_` in the Kubecost UI due to Prometheus readability. To use AWS Label Mapping Configs, use this mapping format:
To view examples of common label mapping configs, see here.
In order to make the custom Kubecost AWS tags appear on the CURs, and therefore in Kubecost, individual cost allocation tags must be enabled. Details on which tags to enable can be found in Step 2.
For instructions on enabling user-defined cost allocation tags, consult AWS' Activating user-defined cost allocation tags
Account-level tags are applied (as labels) to all the Assets built from resources defined under a given AWS account. You can filter AWS resources in the Kubecost Assets View (or API) by account-level tags by adding them ('tag:value') in the Label/Tag filter.
If a resource has a label with the same name as an account-level tag, the resource label value will take precedence.
Modifications incurred on account-level tags may take several hours to update on Kubecost.
Your AWS account will need to support the `organizations:ListAccounts` and `organizations:ListTagsForResource` policies to benefit from this feature.
In the Kubecost UI, view the Allocations dashboard. If external costs are not shown, open your browser's Developer Tools > Console to see any reported errors.
Query Athena directly to ensure data is available. Note: it can take up to 6 hours for data to be written.
You may need to upgrade your AWS Glue if you are running an old version. See Upgrading to the AWS Glue Data Catalog step-by-step for more info.
Finally, review pod logs from the `cost-model` container in the `cost-analyzer` pod and look for auth errors or Athena query results.
By default, Kubecost pulls on-demand asset prices from the public AWS pricing API. For more accurate pricing, this integration will allow Kubecost to reconcile your current measured Kubernetes spend with your actual AWS bill. This integration also properly accounts for Enterprise Discount Programs, Reserved Instance usage, Savings Plans, Spot usage, and more.
You will need permissions to create the Cost and Usage Report (CUR), and add IAM credentials for Athena and S3. Optional permission is the ability to add and execute CloudFormation templates. Kubecost does not require root access in the AWS account.
This guide contains multiple possible methods for connecting Kubecost to AWS billing, based on user environment and preference. Because of this, there may not be a straightforward approach for new users. To address this, a streamlined guide containing best practices can be found here for IRSA environments. This quick start guide has some assumptions to carefully consider, and may not be applicable for all users. See prerequisites in the linked article.
Integrating your AWS account with Kubecost may be a complicated process if you aren’t deeply familiar with the AWS platform and how it interacts with Kubecost. This section provides an overview of some of the key terminology and AWS services that are involved in the process of integration.
Cost and Usage Report: AWS report which tracks cloud spending and writes to an Amazon Simple Storage Service (Amazon S3) bucket for ingestion and long term historical data. The CUR is originally formatted as a CSV, but when integrated with Athena, is converted to Parquet format.
Amazon Athena: Analytics service which queries the CUR S3 bucket for your AWS cloud spending, then outputs data to a separate S3 bucket. Kubecost uses Athena to query for the bill data to perform reconciliation. Athena is technically optional for AWS cloud integration, but without it, Kubecost will only provide unreconciled costs (on-demand public rates).
S3 bucket: Cloud object storage to which both the CUR and Athena write cost data. Kubecost needs access to these buckets in order to read that data.
For the below guide, a GitHub repository with sample files can be found here.
Follow these steps to set up a Legacy CUR using the settings below.
Select the Legacy CUR export type.
For time granularity, select Daily.
Under 'Additional content', select the Enable resource IDs checkbox.
Under 'Report data integration' select the Amazon Athena checkbox.
For CUR data written to an S3 bucket only accessed by Kubecost, it is safe to expire or delete the objects after seven days of retention.
Remember the name of the bucket you create for CUR data. This will be used in Step 2.
Familiarize yourself with how column name restrictions differ between CURs and Athena tables. AWS may change your CUR name when you upload your CUR to your Athena table in Step 2, documented in AWS' Running Amazon Athena queries. As best practice, use all lowercase letters and only use _ as a special character.
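The naming best practice above can be checked before you create the report. The following is a minimal sketch, not part of Kubecost or AWS tooling: the function names are hypothetical, and it assumes digits are acceptable alongside lowercase letters and underscores.

```python
import re

# Assumption: a "safe" name contains only lowercase letters, digits, and "_".
_SAFE_NAME = re.compile(r"[a-z0-9_]+")


def is_athena_safe_name(name: str) -> bool:
    """Return True if the name uses only lowercase letters, digits, and '_'."""
    return _SAFE_NAME.fullmatch(name) is not None


def to_athena_safe_name(name: str) -> str:
    """Lowercase the name and replace every other character with '_'."""
    return re.sub(r"[^a-z0-9_]", "_", name.lower())
```

For example, a report named Kubecost-CUR would fail the check, and the helper would suggest kubecost_cur instead.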
AWS may take up to 24 hours to publish data. Wait until this is complete before continuing to the next step.
If you believe you have the correct permissions, but cannot access the Billing and Cost Management page, have the owner of your organization's root account follow these instructions.
As part of the CUR creation process, Amazon also creates a CloudFormation template that is used to create the Athena integration. It is created in the CUR S3 bucket, listed in the Objects tab in the path s3-path-prefix/cur-name, and typically has the filename crawler-cfn.yml. This .yml file is the CloudFormation template you need in order to complete the CUR Athena integration. For more information, see the AWS doc Setting up Athena using AWS CloudFormation templates.
Your S3 path prefix can be found by going to your AWS Cost and Usage Reports dashboard and selecting your newly-created CUR. In the 'Report details' tab, you will find the S3 path prefix.
Once Athena is set up with the CUR, you will need to create a new S3 bucket for Athena query results.
Navigate to the S3 Management Console.
Select Create bucket. The Create Bucket page opens.
Use the same region used for the CUR bucket and pick a name that follows the format aws-athena-query-results-.
Select Create bucket at the bottom of the page.
Navigate to the Amazon Athena dashboard.
Select Settings, then select Manage. The Manage settings window opens.
Set Location of query result to the S3 bucket you just created, which will look like s3://aws-athena-query-results..., then select Save.
For Athena query results written to an S3 bucket only accessed by Kubecost, it is safe to expire or delete the objects after 1 day of retention.
Kubecost offers a set of CloudFormation templates to help set your IAM roles up.
If you’re new to provisioning IAM roles, we suggest downloading our templates and using the CloudFormation wizard to set these up. You can learn how to do this in AWS' Creating a stack on the AWS CloudFormation console doc. Open the step below which represents your CUR and management account arrangement, download the .yaml file listed, and upload them as the stack template in the 'Creating a stack' > 'Selecting a stack template' step.
If you are using the alternative multi-cloud integration method, steps 4 and 5 are not required.
Now that the policies have been created, attach those policies to Kubecost. We support the following methods:
These values can either be set from the Kubecost UI or via .Values.kubecostProductConfigs in the Helm chart. Values for all fields must be provided.
To add values in the Kubecost UI, select Settings from the left navigation, then scroll to Cloud Cost Settings. Select Update next to External Cloud Cost Configuration (AWS). The Billing Data Export Configuration window opens. Fill in all the below fields:
Athena Region: The AWS region Athena is running in.
Athena Database: The name of the database created by the Athena setup.
Athena Tablename: The name of the table created by the Athena setup.
Athena Result Bucket: An S3 bucket to store Athena query results that you’ve created and that Kubecost has permission to access.
AWS account ID: The AWS account ID where the Athena CUR is, likely your management account.
When you are done, select Update to confirm.
If you set any kubecostProductConfigs from the Helm chart, all changes via the front end will be overridden on pod restart.
athenaProjectID: The AWS account ID where the Athena CUR is, likely your management account.
athenaBucketName: An S3 bucket to store Athena query results that you’ve created and that Kubecost has permission to access.
The name of the bucket should match s3://aws-athena-query-results-*, so the IAM roles defined above will automatically allow access to it.
The bucket can have a Canned ACL of Private or other permissions as you see fit.
athenaRegion: The AWS region Athena is running in.
athenaDatabase: The name of the database created by the Athena setup.
The Athena database name is available as the value (physical ID) of AWSCURDatabase in the CloudFormation stack created above (in Step 2: Setting up Athena).
athenaTable: The name of the table created by the Athena setup.
The table name is typically the database name with the leading athenacurcfn_ removed (but is not available as a CloudFormation stack resource). Confirm the table name by visiting the Athena dashboard.
athenaWorkgroup: The workgroup assigned to be used with Athena. If not specified, defaults to Primary.
Make sure to use only underscore as a delimiter if needed for tables and views. Using a hyphen/dash will not work even though you might be able to create it. See the AWS docs for more info.
If you are using a multi-account setup, you will also need to set .Values.kubecostProductConfigs.masterPayerARN to the Amazon Resource Name (ARN) of the role in the management account, e.g. arn:aws:iam::530337586275:role/KubecostRole.
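Several of the rules above (all fields provided, the s3:// prefix on the bucket, underscores rather than hyphens) are easy to get wrong when writing Helm values by hand. The sketch below checks a dictionary of kubecostProductConfigs values against those rules. It is an illustration only and not something Kubecost ships: the function names are hypothetical, and it treats athenaWorkgroup as optional because the text above gives it a default.

```python
# Keys taken from the field list above; athenaWorkgroup is treated as optional.
REQUIRED_KEYS = (
    "athenaProjectID",
    "athenaBucketName",
    "athenaRegion",
    "athenaDatabase",
    "athenaTable",
)


def check_product_configs(cfg: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issues found."""
    problems = []
    for key in REQUIRED_KEYS:
        if not cfg.get(key):
            problems.append(f"{key} is missing or empty")
    bucket = cfg.get("athenaBucketName", "")
    if bucket and not bucket.startswith("s3://"):
        problems.append("athenaBucketName should start with s3://")
    for key in ("athenaDatabase", "athenaTable"):
        if "-" in cfg.get(key, ""):
            problems.append(f"{key} should use underscores, not hyphens")
    return problems


def guess_table_name(database: str) -> str:
    """Strip the leading athenacurcfn_ prefix; confirm the result in the Athena dashboard."""
    prefix = "athenacurcfn_"
    return database[len(prefix):] if database.startswith(prefix) else database
```

The table name returned by guess_table_name is only the typical case described above, so it should still be confirmed in the Athena dashboard.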
Once you've integrated with the CUR, you can visit Settings > View Full Diagnostics in the UI to determine if Kubecost has been successfully integrated with your CUR. If any problems are detected, you will see a yellow warning sign under the cloud provider permissions status header.
You can check pod logs for authentication errors by running:
kubectl get pods -n <namespace>
kubectl logs <kubecost-pod-name> -n <namespace> -c cost-model
If you do not see any authentication errors, log in to your AWS console and visit the Athena dashboard. You should be able to find the CUR. Ensure that the database with the CUR matches the athenaDatabase entered in Step 5. It likely has the prefix athenacurcfn_.
You can also check query history to see if any queries are failing:
Symptom: A similar error to this will be shown on the Diagnostics page under Pricing Sources. You can search in the Athena "Recent queries" dashboard to find additional info about the error.
Resolution: This error is typically caused by the incorrect (Athena results) S3 bucket being specified in the CloudFormation template of Step 3 from above. To resolve the issue, ensure the bucket used for storing the AWS CUR report (Step 1) is specified in the S3ReadAccessToAwsBillingData SID of the IAM policy (default: kubecost-athena-access) attached to the user or role used by Kubecost (default: KubecostUser / KubecostRole). See the following example.
This error can also occur when the management account cross-account permissions are incorrect; however, the solution may differ.
Symptom: A similar error to this will be shown on the Diagnostics page under Pricing Sources.
Resolution: Please verify that the prefix s3:// was used when setting the athenaBucketName Helm value or when configuring the bucket name in the Kubecost UI.
Symptom: A similar error to this will be shown on the Diagnostics page under Pricing Sources.
Resolution: While rare, this issue was caused by an Athena instance that failed to provision properly on AWS. The solution was to delete the Athena DB and deploy a new one. To verify this is needed, find the failed query ID in the Athena "Recent queries" dashboard and attempt to manually run the query.
Symptom: A similar error to this will be shown on the Diagnostics page under Pricing Sources.
Resolution: Previously, if you ran a query without specifying a value for query result location, and the query result location setting was not overridden by a workgroup, Athena created a default location for you. Now, before you can run an Athena query in a region in which your account hasn't used Athena previously, you must specify a query result location, or use a workgroup that overrides the query result location setting. While Athena no longer creates a default query results location for you, previously created default aws-athena-query-results-MyAcctID-MyRegion locations remain valid and you can continue to use them. The bucket name should be in the format aws-athena-query-results-MyAcctID-MyRegion.
It may also be necessary to remove and reinstall Kubecost. If you do this, remember to back up ETL files first, or contact support for additional assistance. See also this AWS doc on specifying a query result location.
Symptom: A similar error to this will be shown on the Diagnostics page under Pricing Sources or in the Kubecost cost-model container logs.
Resolution: Verify in AWS' Cost and Usage Reports dashboard that the Resource IDs are enabled as "Report content" for the CUR created in Step 1. If the Resource IDs are not enabled, you will need to re-create the report (this will require redoing Steps 1 and 2 from this doc).
Symptom: A similar error to this will be shown on the Diagnostics page under Pricing Sources or in the Kubecost cost-model container logs.
Resolution: Verify that s3:// was included in the bucket name when setting the .Values.kubecostProductConfigs.athenaBucketName Helm value.
AWS services used here are:
Kubecost's cost-model requires roughly 2 CPU and 10 GB of RAM per 50,000 pods monitored. The backing Prometheus database requires roughly 2 CPU and 25 GB per million metrics ingested per minute. You can pick the EC2 instances necessary to run Kubecost accordingly.
Kubecost can write its cache to disk. Roughly 32 GB per 100,000 pods monitored is sufficient. (Optional: our cache can exist in memory)
CloudFormation (Optional: manual IAM configuration or via Terraform is fine)
EKS (Optional: all K8s flavors are supported)
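The sizing figures above lend themselves to a quick back-of-the-envelope calculation. The sketch below scales them linearly with pod count and metric volume. Linear scaling is an assumption made for illustration, and the class and function names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class SizingEstimate:
    cost_model_cpu: float
    cost_model_ram_gb: float
    prometheus_cpu: float
    prometheus_gb: float
    cache_disk_gb: float


def estimate_sizing(pods: int, metrics_per_minute: int) -> SizingEstimate:
    """Scale the rough figures quoted above linearly (an assumption)."""
    pod_units = pods / 50_000                      # 2 CPU, 10 GB RAM per 50,000 pods
    metric_units = metrics_per_minute / 1_000_000  # 2 CPU, 25 GB per million metrics/min
    return SizingEstimate(
        cost_model_cpu=2 * pod_units,
        cost_model_ram_gb=10 * pod_units,
        prometheus_cpu=2 * metric_units,
        prometheus_gb=25 * metric_units,
        cache_disk_gb=32 * pods / 100_000,         # 32 GB of disk per 100,000 pods
    )
```

Treat the output as a starting point for choosing EC2 instance sizes, not as a guarantee, since the source figures are themselves rough.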
Kubecost is capable of aggregating the costs of EC2 compute resources over a given timeframe with a specified duration step size. To achieve this, Kubecost uses Athena queries to gather usage data points with differing price models. The result of this process is a list of resources with their cost by timeframe.
The reconciliation process makes two queries to Athena, one to gather resources that are paid for with either the on-demand model or a savings plan and one query for resources on the reservation price model. The first query includes resources given at a blended rate, which could be on-demand usage or resources that have exceeded the limits of a savings plan. It will also include resources that are part of a savings plan which will have a savings plan effective cost. The second query only includes reserved resources and the cost which reflects the rate they were reserved at.
The queries make use of the following columns from Athena:
line_item_usage_start_date: The beginning timestamp of the line item usage. Used to filter resource usage within a date range and to aggregate on usage window.
line_item_usage_end_date: The ending timestamp of the line item usage. Used to filter resource usage within a date range and to aggregate on usage window.
line_item_resource_id: An ID, also called the provider ID, given to line items that are instantiated resources.
line_item_line_item_type: The type of a line item, used to determine if the resource usage is covered by a savings plan and has a discounted price.
line_item_usage_type: What is being used in a line item. For a compute resource, this is the type of VM and where it is running.
line_item_product_code: The service that a line item is from. Used to filter out items that are not from EC2.
reservation_reservation_a_r_n: Amazon Resource Name of the reservation for the line item. The presence of this value is used to identify a resource as being part of a reservation plan.
line_item_unblended_cost: The undiscounted cost of a resource.
savings_plan_savings_plan_effective_cost: The cost of a resource discounted by a savings plan.
reservation_effective_cost: The cost of a resource discounted by a reservation.
The on-demand and savings plan query is grouped by six columns:
line_item_usage_start_date
line_item_usage_end_date
line_item_resource_id
line_item_line_item_type
line_item_usage_type
line_item_product_code
The columns line_item_unblended_cost and savings_plan_savings_plan_effective_cost are summed on this grouping. Finally, the query filters out rows that are not within a given date range, rows with a missing line_item_resource_id, and rows with a line_item_product_code not equal to "AmazonEC2". The grouping has three important aspects: the timeframe of the line items, the resource as defined by the resource ID, and the usage type, which is later used to determine the proper cost of the resource as it was used. This means that line items are grouped according to the resource, the timeframe of the usage, and the rate at which the usage was charged.
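To make the grouping and filtering concrete, here is a minimal in-memory sketch of the same aggregation over rows represented as dictionaries. It is not the Athena query Kubecost runs. The function name is hypothetical, timestamps are assumed to be ISO 8601 strings that compare correctly as text, and the exact date-range boundaries are an assumption.

```python
from collections import defaultdict

GROUP_COLUMNS = (
    "line_item_usage_start_date",
    "line_item_usage_end_date",
    "line_item_resource_id",
    "line_item_line_item_type",
    "line_item_usage_type",
    "line_item_product_code",
)
SUM_COLUMNS = (
    "line_item_unblended_cost",
    "savings_plan_savings_plan_effective_cost",
)


def aggregate_on_demand(rows, start, end):
    """Group rows by the six columns and sum the two cost columns."""
    totals = defaultdict(lambda: dict.fromkeys(SUM_COLUMNS, 0.0))
    for row in rows:
        if not row.get("line_item_resource_id"):
            continue  # drop rows with a missing resource ID
        if row["line_item_product_code"] != "AmazonEC2":
            continue  # keep EC2 line items only
        # Assumption: a row is in range when it lies fully inside [start, end].
        if row["line_item_usage_start_date"] < start or row["line_item_usage_end_date"] > end:
            continue
        key = tuple(row[column] for column in GROUP_COLUMNS)
        for column in SUM_COLUMNS:
            totals[key][column] += row.get(column) or 0.0
    return dict(totals)
```

Each key in the result identifies one resource, one usage window, and one charge rate, which matches the three aspects of the grouping described above.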
The reservation query is grouped on five columns:
line_item_usage_start_date
line_item_usage_end_date
reservation_reservation_a_r_n
line_item_resource_id
line_item_product_code
The query sums reservation_effective_cost and filters out rows that fall outside the date window, rows with a missing reservation_reservation_a_r_n value, and rows with a line_item_product_code not equal to "AmazonEC2". The result is grouped on resource ID by timeframe, with all non-reservation line items removed.
The results of the on-demand query are categorized into different resource types: compute, network, storage, and others. Network is identified by the presence of "byte" in the line_item_usage_type. Compute and storage are identified by the presence of "i-" and "vol-" prefixes in line_item_resource_id, respectively. Non-compute values are removed from the results. Of the two costs aggregated by this query, the correct one to use is determined by the line_item_line_item_type: if it has a value of "SavingsPlanCoveredUsage", then the savings_plan_savings_plan_effective_cost is used as the cost; otherwise, the line_item_unblended_cost is used.
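The categorization and cost selection rules above can be expressed as two small functions. This is a sketch of the described logic, not Kubecost's source code. The function names are hypothetical, and both the case-insensitive match on "byte" and the order in which the checks run are assumptions.

```python
def categorize(row: dict) -> str:
    """Classify an aggregated on-demand row as network, compute, storage, or other."""
    # Assumption: the "byte" match is case-insensitive and is checked first.
    if "byte" in row["line_item_usage_type"].lower():
        return "network"
    resource_id = row["line_item_resource_id"]
    if resource_id.startswith("i-"):
        return "compute"
    if resource_id.startswith("vol-"):
        return "storage"
    return "other"


def effective_cost(row: dict) -> float:
    """Pick the savings plan cost for covered usage, else the unblended cost."""
    if row["line_item_line_item_type"] == "SavingsPlanCoveredUsage":
        return row["savings_plan_savings_plan_effective_cost"]
    return row["line_item_unblended_cost"]
```

Only rows for which categorize returns "compute" would be kept, with effective_cost supplying the cost attached to each one.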
In the reservation query, all of the results are of the compute category and there is only the reservation_effective_cost to use as a cost.
These results are then merged into one set, with the provider id used to associate the cost with other information about the resource.
There are several different ways to look at your node cost data. The default for AWS Cost Explorer is Unblended, but it makes the most sense from an allocation perspective to use the amortized rates. Be sure Amortized costs is selected when looking at cost data. Here's an example of how they can vary dramatically on our test cluster.
The t2.medium instances here are covered by a savings plan. Unblended, the cost is only $0.06/day for two.
When Amortized costs is selected, the price jumps to $1.50/day.
This should closely match our data on the Assets page, for days where we have adjustments come in from the pricing CUR.
There are many ways to integrate your AWS Cost and Usage Report (CUR) with Kubecost. This tutorial is intended as the best-practice method for users whose environments meet the following assumptions:
Kubecost will run in a different account than the AWS Payer Account
The IAM permissions will utilize AWS IRSA to avoid shared secrets
The configuration of Kubecost will be done using a cloud-integration.json file, and not via Kubecost UI (following infrastructure as code practices)
If this is not an accurate description of your environment, see our AWS Cloud Integration doc for more options.
This guide is a one-time setup per AWS payer account and is typically one per organization. It can be automated, but may not be worth the effort given that it will not be needed again.
Kubecost supports multiple AWS payer accounts as well as multiple cloud providers from a single Kubecost primary cluster. For multiple payer accounts, create additional entries inside the array below.
Detail for multiple cloud provider setups is here.
To begin, download the recommended configuration template files from our poc-common-config repo. You will need the following files from this folder:
cloud-integration.json
iam-payer-account-cur-athena-glue-s3-access.json
iam-payer-account-trust-primary-account.json
iam-access-cur-in-payer-account.json
Begin by opening cloud-integration.json, which should look like this:
Update athenaWorkgroup to primary, then save the file and close it. The remaining values will be obtained during this tutorial.
Follow the AWS documentation to create a CUR export using the settings below.
For time granularity, select Daily.
Select the checkbox to enable Resource IDs in the report.
Select the checkbox to enable Athena integration with the report.
Select the checkbox to enable the JSON IAM policy to be applied to your bucket.
If this CUR data is only used by Kubecost, it is safe to expire or delete the objects after seven days of retention.
AWS may take up to 24 hours to publish data. Wait until this is complete before continuing to the next step.
While you wait, update the following configuration files:
Update your cloud-integration.json file by providing a projectID value, which will be the AWS payer account number where the CUR is located.
Update your iam-payer-account-cur-athena-glue-s3-access.json file by replacing all instances of CUR_BUCKET_NAME with the name of the bucket you created for CUR data.
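If you prefer to script the placeholder replacement rather than edit the files by hand, a plain string substitution is enough. The sketch below is one way to do it. The function name is hypothetical, and the template fragment in the usage example is a stand-in, not the real contents of the downloaded file.

```python
def fill_placeholders(template_text: str, values: dict) -> str:
    """Replace every occurrence of each placeholder with its value."""
    for placeholder, value in values.items():
        template_text = template_text.replace(placeholder, value)
    return template_text


# Hypothetical fragment used only to illustrate the substitution.
example_template = (
    '{"Resource": ["arn:aws:s3:::CUR_BUCKET_NAME", "arn:aws:s3:::CUR_BUCKET_NAME/*"]}'
)
filled = fill_placeholders(example_template, {"CUR_BUCKET_NAME": "my-cur-bucket"})
```

In practice you would read the JSON file from disk, pass its text through fill_placeholders, and write the result back before creating the policy.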
As part of the CUR creation process, Amazon creates a CloudFormation template that is used to create the Athena integration. It is created in the CUR S3 bucket under s3-path-prefix/cur-name and typically has the filename crawler-cfn.yml. This .yml is your CloudFormation template. You will need it in order to complete the CUR Athena integration. You can read more about this here.
Your S3 path prefix can be found by going to your AWS Cost and Usage Reports dashboard and selecting your bucket's report. In the Report details tab, you will find the S3 path prefix.
Once Athena is set up with the CUR, you will need to create a new S3 bucket for Athena query results. The bucket used for the CUR cannot be used for the Athena output.
Navigate to the S3 Management Console.
Select Create bucket. The Create Bucket page opens.
Provide a name for your bucket. This is the value for athenaBucketName in your cloud-integration.json file. Use the same region used for the CUR bucket.
Select Create bucket at the bottom of the page.
Navigate to the Amazon Athena dashboard.
Select Settings, then select Manage. The Manage settings window opens.
Set Location of query result to the S3 bucket you just created, then select Save.
Navigate to Athena in the AWS Console. Be sure the region matches the one used in the steps above. Update your cloud-integration.json file with the following values. Use the screenshots below for help.
athenaBucketName: the name of the Athena bucket you created in this step
athenaDatabase: the value in the Database dropdown
athenaRegion: the AWS region value where your Athena query is configured
athenaTable: the partitioned value found in the Table list
For Athena query results written to an S3 bucket only accessed by Kubecost, it is safe to expire or delete the objects after one day of retention.
From the AWS payer account
In iam-payer-account-cur-athena-glue-s3-access.json, replace all ATHENA_RESULTS_BUCKET_NAME instances with your Athena S3 bucket name (the default will look like aws-athena-query-results-xxxx).
In iam-payer-account-trust-primary-account.json, replace SUB_ACCOUNT_222222222 with the account number of the account where the Kubecost primary cluster will run.
In the same location as your downloaded configuration files, run the following command to create the appropriate policy (jq is not required):
Now we can obtain the last value, masterPayerARN, for cloud-integration.json as the ARN associated with the newly-created IAM role, as seen below in the AWS console:
By arriving at this step, you should have been able to provide all values to your cloud-integration.json file. If any values are missing, reread the tutorial and follow any steps needed to obtain those values.
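Before moving on, it can help to confirm that no value in cloud-integration.json was left empty. The following sketch does that check. It assumes the file has a top-level "aws" array with one object per payer account, which is not confirmed by the text above, and the function name is hypothetical.

```python
import json

# Keys named throughout this tutorial.
EXPECTED_KEYS = (
    "athenaBucketName",
    "athenaRegion",
    "athenaDatabase",
    "athenaTable",
    "athenaWorkgroup",
    "projectID",
    "masterPayerARN",
)


def missing_values(config_text: str) -> dict:
    """Map each entry index to the keys that are absent or empty."""
    config = json.loads(config_text)
    report = {}
    # Assumption: integrations live in a top-level "aws" array.
    for index, entry in enumerate(config.get("aws", [])):
        missing = [key for key in EXPECTED_KEYS if not entry.get(key)]
        if missing:
            report[index] = missing
    return report
```

An empty dictionary means every entry has all seven values filled in. Any other result tells you which step of the tutorial to revisit.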
From the AWS Account where the Kubecost primary cluster will run
In iam-access-cur-in-payer-account.json, update PAYER_ACCOUNT_11111111111 with the AWS account number of the payer account and create a policy allowing Kubecost to assumeRole in the payer account:
Note the output ARN (used in the iamserviceaccount --attach-policy-arn below):
Create a namespace and set environment variables:
Enable the OIDC-Provider:
Create the Kubernetes service account, attaching the assumeRole policy. Replace SUB_ACCOUNT_222222222 with the AWS account number where the primary Kubecost cluster will run.
Create the secret (in this setup, there are no actual secrets in this file):
Install Kubecost using the service account and cloud-integration secret:
It can take over an hour to process the billing data for large AWS accounts. In the short term, follow the logs and look for a message similar to (7.7 complete), which should grow gradually to (100.0 complete). Some errors (ERR) are expected, as seen below.
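If you save the pod logs to a file, the progress messages can be pulled out with a short script. This is a sketch under one assumption: that progress always appears as a number followed by the word "complete" inside parentheses, as in the messages quoted above. The rest of each log line is not assumed to have any particular format, and the function name is hypothetical.

```python
import re

# Matches fragments such as "(7.7 complete)" or "(100.0 complete)".
_PROGRESS = re.compile(r"\((\d+(?:\.\d+)?) complete\)")


def latest_progress(log_lines):
    """Return the most recent progress percentage, or None if none was logged."""
    latest = None
    for line in log_lines:
        match = _PROGRESS.search(line)
        if match:
            latest = float(match.group(1))
    return latest
```

Lines without a progress fragment, including the expected ERR lines, are ignored.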
For help with troubleshooting, see the section in our original AWS integration guide.
Integration with cloud service providers (CSPs) via their respective billing APIs allows Kubecost to display out-of-cluster (OOC) costs (e.g. AWS S3, Google Cloud Storage, Azure Storage Account). Additionally, it allows Kubecost to reconcile Kubecost's in-cluster predictions with actual billing data to improve accuracy.
If you are using Kubecost Cloud, do not attempt to modify your install using information from this article. You need to consult Kubecost Cloud's specific cloud integration procedures which can be found here.
As indicated above, setting up a cloud integration with your CSP allows Kubecost to pull in additional billing data. The two processes that incorporate this information are reconciliation and CloudCost (formerly known as CloudUsage).
Reconciliation matches in-cluster assets with items found in the billing data pulled from the CSP. This allows Kubecost to display the most accurate depiction of your in-cluster spending. Additionally, the reconciliation process creates Network assets for in-cluster nodes based on the information in the billing data. The main drawback of this process is that the CSPs have a 6 to 24-hour delay in releasing billing data, and reconciliation requires a complete day of cost data to reconcile with the in-cluster assets. This requires a 48-hour window between resource usage and reconciliation. If reconciliation is performed within this window, asset cost is deflated to the partially complete cost shown in the billing data.
Cost-based metrics are based on on-demand pricing unless there is definitive data from a CSP that the node is not on-demand. This way estimates are as accurate as possible. If a new reserved instance is provisioned or a node joins a savings plan:
Kubecost continues to emit on-demand pricing until the node is added to the cloud bill.
Once the node is added to the cloud bill, Kubecost starts emitting something closer to the actual price.
For the time period where Kubecost assumed the node was on-demand but it was actually reserved, reconciliation fixes the price in ETL.
The reconciled assets will inherit the labels from the corresponding items in the billing data. If there exist identical label keys between the original assets and those of the billing data items, the label value of the original asset will take precedence.
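The label precedence rule can be illustrated with a dictionary merge. This is a minimal sketch of the described behavior rather than Kubecost's implementation, and the function name and label keys are hypothetical.

```python
def reconcile_labels(asset_labels: dict, billing_labels: dict) -> dict:
    """Merge labels; on identical keys the original asset's value wins."""
    merged = dict(billing_labels)  # start from the billing data labels
    merged.update(asset_labels)    # original asset labels take precedence
    return merged
```

For example, if both sides define a team label, the reconciled asset keeps the value from the original asset while still gaining any billing-only labels.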
Visit Settings, then toggle on Highlight Unreconciled Costs, then select Save at the bottom of the page to apply changes. Now, when you visit your Allocations or Assets dashboards, the most recent 36 hours of data will display hatching to signify unreconciled costs.
As of v1.106 of Kubecost, CloudCost is enabled by default, and Cloud Usage is disabled. Upgrading Kubecost will not affect the UI or hinder performance relating to this.
CloudCost allows Kubecost to pull in OOC cloud spend from your CSP's billing data, including any services run by the CSP as well as compute resources. By labelling OOC costs, their value can be distributed to your Allocations data as external costs. This allows you to better understand the proportion of OOC cloud spend that your in-cluster usage depends on.
Your cloud billing data is reflected in the aggregate costs of Account, Provider, Invoice Entity, and Service. Aggregating and drilling down into any of these categories will provide a subset of the entire bill, based on the Helm value .Values.cloudCost.topNItems, which will log 1,000 values. This subset is each day's top n items by cost. An optional label list can be used to include or exclude items to be pulled from the bill.
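The idea of keeping each day's top n items by cost can be sketched as follows. The item fields day, name, and cost are hypothetical names chosen for illustration, and this is not how Kubecost stores billing items internally.

```python
import heapq
from collections import defaultdict


def top_n_per_day(items, n):
    """Keep the n most expensive items for each day, most expensive first."""
    by_day = defaultdict(list)
    for item in items:
        by_day[item["day"]].append(item)
    return {
        day: heapq.nlargest(n, day_items, key=lambda item: item["cost"])
        for day, day_items in by_day.items()
    }
```

Items outside the top n for a given day are dropped, which is why the result is a subset of the full bill.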
CloudCost items become available as soon as they appear in the billing data, with the 6 to 24-hour delay mentioned above, and are updated as they become more complete.
You can view your existing cloud integrations and their success status in the Kubecost UI by visiting Settings, then scrolling to Cloud Integrations. To create a new integration or learn more about existing integrations, select View additional details to go to the Cloud Integrations page.
Here, you can view your integrations and filter by successful or failed integrations. For non-successful integrations, Kubecost will display a diagnostic error message in the Status column to contextualize steps toward successful integration.
Select an individual integration to view a side panel that contains the most recent run, next run, refresh rate, and an exportable YAML of Helm configs for its CSP's integration values.
You can add a new cloud integration by selecting Add Integration. For guides on how to set up an integration for a specific CSP, follow these links to helpful Kubecost documentation:
Select an existing cloud integration, then in the slide panel that appears, select Delete.
The Kubecost Helm chart provides values that can enable or disable each cloud process on the deployment once a cloud integration has been set up. Turning off either of these processes will disable all the benefits provided by them.
.Values.kubecostModel.etlAssetReconciliationEnabled (default: true): Enables reconciliation processes and endpoints. This Helm value corresponds to the ETL_ASSET_RECONCILIATION_ENABLED environment variable.
.Values.kubecostModel.etlCloudUsage (default: true): Enables Cloud Usage processes and endpoints. This Helm value corresponds to the ETL_CLOUD_USAGE_ENABLED environment variable.
.Values.kubecostModel.etlCloudRefreshRateHours (default: 6): The interval at which the run loop executes for both reconciliation and Cloud Usage. Increasing this value will decrease resource usage and billing data access costs, but will result in a larger delay in the most current data being displayed. This Helm value corresponds to the ETL_CLOUD_REFRESH_RATE_HOURS environment variable.
.Values.kubecostModel.etlCloudQueryWindowDays (default: 7): The maximum number of days that will be queried from a cloud integration in a single query. Reducing this value can help to reduce memory usage during the build process, but will also result in more queries, which can drive up billing data access costs. This Helm value corresponds to the ETL_CLOUD_QUERY_WINDOW_DAYS environment variable.
.Values.kubecostModel.etlCloudRunWindowDays (default: 3): The number of days into the past each run loop will query. Reducing this value will reduce memory load, but it can cause Kubecost to miss updates to the CUR. If this happens, the affected day will need to be manually repaired. This Helm value corresponds to the ETL_CLOUD_RUN_WINDOW_DAYS environment variable.
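The tradeoff behind etlCloudQueryWindowDays is easier to see with a worked example. The sketch below splits a date range into consecutive windows no longer than a given number of days. It only illustrates why a smaller window means more queries and does not reproduce Kubecost's scheduling logic. The function name is hypothetical.

```python
from datetime import date, timedelta


def query_windows(start: date, end: date, max_days: int):
    """Split [start, end) into consecutive windows of at most max_days days."""
    if max_days < 1:
        raise ValueError("max_days must be at least 1")
    windows = []
    cursor = start
    while cursor < end:
        window_end = min(cursor + timedelta(days=max_days), end)
        windows.append((cursor, window_end))
        cursor = window_end
    return windows
```

Covering 14 days with a 7-day window takes two queries, while a 3-day window takes five smaller ones.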
Often an integrated cloud account name may be a series of random letters and numbers which do not reflect the account's owner, team, or function. Kubecost allows you to rename cloud accounts to create more readable cloud metrics in your Kubecost UI. After you have successfully integrated your cloud account (see above), you need to manually edit your values.yaml and provide the original account name and your intended rename:
You will see these changes reflected in Kubecost's UI on the Overview page under Cloud Costs Breakdown. These example account IDs could benefit from being renamed:
The ETL contains a map of Cloud Stores, each representing an integration with a CSP. Each Cloud Store is responsible for the Cloud Usage and reconciliation pipelines, which respectively add OOC costs and adjust Kubecost's estimated costs using cost and usage data pulled from the CSP. Each Cloud Store has a unique identifier called the ProviderKey, which varies depending on which CSP is being connected to and ensures that duplicate configurations are not introduced into the ETL. The value of the ProviderKey for each CSP, at the scope the billing data covers, is the following:
AWS: Account Id
GCP: Project Id
Azure: Subscription Id
The ProviderKey can be used as an argument for the endpoints for Cloud Usage and Reconciliation repair APIs, to indicate that the specified operation should only be done on a single Cloud Store rather than all of them, which is the default behavior. Additionally, the Cloud Store keeps track of the Status of the Cloud Connection Diagnostics for both Cloud Usage and reconciliation. The Cloud Connection Status is meant to be used as a tool in determining the health of the Cloud Connection that is the basis of each Cloud Store. The Cloud Connection Status has various failure states that are meant to provide actionable information on how to get your Cloud Connection running properly. These are the Cloud Connection Statuses:
INITIAL_STATUS: The zero value of Cloud Connection Status, meaning that the cloud connection is untested. Once the Cloud Connection Status has changed, it should not return to this value. This status is assigned to the Cloud Store on creation.
MISSING_CONFIGURATION: Kubecost has not detected any method of Cloud Configuration. This value is only possible on the first Cloud Store that is created as a wrapper for the open-source CSP. This status is assigned during failures in Configuration Retrieval.
INCOMPLETE_CONFIGURATION: Cloud Configuration is missing the required values to connect to the cloud provider. This status is assigned during failures in Configuration Retrieval.
FAILED_CONNECTION: All required Cloud Configuration values are filled in, but a connection with the CSP cannot be established. This is indicative of a typo in one of the Cloud Configuration values or an issue in how the connection was set up in the CSP's Console. The assignment of this status varies between CSPs but should happen if an error is thrown when an interaction with an object from the CSP's SDK occurs.
MISSING_DATA: The Cloud Integration is properly configured, but the CSP is not returning billing/cost and usage data. This status is indicative of the billing/cost and usage data export of the CSP being incorrectly set up or the export being set up in the last 48 hours and not having started populating data yet. This status is set when a query has been successfully made but the results come back empty. If the CSP already has a SUCCESSFUL_CONNECTION status, then this status should not be set because this indicates that the specific query made may have been empty.
SUCCESSFUL_CONNECTION: The Cloud Integration is properly configured and returning data. This status is set on any successful query where data is returned.
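The two ordering rules stated above (a status never returns to INITIAL_STATUS, and MISSING_DATA does not replace SUCCESSFUL_CONNECTION) can be sketched as a small transition function. This is an illustration only, not Kubecost's implementation; the helper name next_status and the enum's string values are hypothetical.

```python
from enum import Enum


class CloudConnectionStatus(Enum):
    """The statuses listed above; the string values are illustrative."""
    INITIAL_STATUS = "INITIAL_STATUS"
    MISSING_CONFIGURATION = "MISSING_CONFIGURATION"
    INCOMPLETE_CONFIGURATION = "INCOMPLETE_CONFIGURATION"
    FAILED_CONNECTION = "FAILED_CONNECTION"
    MISSING_DATA = "MISSING_DATA"
    SUCCESSFUL_CONNECTION = "SUCCESSFUL_CONNECTION"


def next_status(current, observed):
    """Return the status to store after a check reports `observed`.

    Hypothetical helper: it encodes only the two rules stated in the text.
    """
    status = CloudConnectionStatus
    # Rule 1: once changed, the status never returns to INITIAL_STATUS.
    if observed is status.INITIAL_STATUS:
        return current
    # Rule 2: an empty query result does not downgrade a connection
    # that has already returned data.
    if observed is status.MISSING_DATA and current is status.SUCCESSFUL_CONNECTION:
        return current
    return observed


print(next_status(CloudConnectionStatus.SUCCESSFUL_CONNECTION,
                  CloudConnectionStatus.MISSING_DATA).name)
```

Keeping the rules in one function makes it easy to see that a single empty query cannot hide a previously healthy connection.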
After starting or restarting Cloud Usage or reconciliation, two subprocesses are started: one which fills in historic data over the coverage of the Daily CloudUsage and Asset Store, and one which runs periodically on a predefined interval to collect and process new cost and usage data as it is made available by the CSP. The ETL's status endpoint contains a cloud object that provides information about each Cloud Store including the Cloud Connection Status and diagnostic information about Cloud Usage and Reconciliation. The diagnostic items on the Cloud Usage and Reconciliation are:
Coverage: The window of time that the historical subprocess has covered
LastRun: The last time that the process ran, updates each time the periodic subprocess runs
NextRun: Next scheduled run of the periodic subprocess
Progress: Ratio of Coverage to Total amount of time to be covered
RefreshRate: The interval that the periodic subprocess runs
Resolution: The window size of the process
StartTime: When the Cloud Process was started
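As a worked example of the Progress item, the ratio can be computed from two time windows. This is a minimal sketch assuming both windows are given as start and end datetimes; the function and parameter names are hypothetical and not part of the status endpoint.

```python
from datetime import datetime, timedelta


def progress(coverage_start, coverage_end, target_start, target_end):
    """Ratio of the covered window to the total window, clamped to [0, 1]."""
    total = (target_end - target_start).total_seconds()
    if total <= 0:
        return 0.0
    covered = (coverage_end - coverage_start).total_seconds()
    return max(0.0, min(1.0, covered / total))


# Example: 10 days covered out of a 30-day target window.
end = datetime(2024, 1, 31)
print(progress(end - timedelta(days=10), end, end - timedelta(days=30), end))
```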
For more information on APIs related to rebuilding and repairing Cloud Usage or reconciliation, see the CloudCost Diagnostic APIs doc.
To create a Google service account for use with Thanos, navigate to the Google Cloud console and select IAM & Admin > Service Accounts.
From here, select the option Create Service Account.
Provide a service account name, ID, and description, then select Create and Continue.
You should now be at the Service account permissions (optional) page. Select the first Role dropdown and select Storage Object Creator. Select Add Another Role, then select Storage Object Viewer from the second dropdown. Select Continue.
You should now be prompted to allow specific accounts access to this service account. This should be based on specific internal needs and is not a requirement. You can leave this empty and select Done.
Once back to the Service accounts page, select the Actions icon > Manage keys. Then, select the Add Key dropdown and select Create new key. A Create private key window opens.
Select JSON as the Key type and select Create. This will download a JSON service account key entry for use with the Thanos object-store.yaml mentioned in the initial setup step.
Connecting your Azure account to Kubecost allows you to view Kubernetes metrics side-by-side with out-of-cluster (OOC) costs (e.g. Azure Database Services). Additionally, it allows Kubecost to reconcile measured Kubernetes spend with your actual Azure bill. This gives teams running Kubernetes a complete and accurate picture of costs.
To configure Kubecost's Azure Cloud Integration, you will need to set up daily exports of cost reports to Azure storage. Kubecost will then access your cost reports through the Azure Storage API to display your OOC cost data alongside your in-cluster costs.
A GitHub repository with sample files used in the instructions below is available.
Follow Azure's tutorial to export cost reports. For Metric, make sure you select Amortized cost (Usage and Purchases). For Export type, make sure you select Daily export of month-to-date costs. Do not select File Partitioning. Also, take note of the Account name and Container specified when choosing where to export the data. Note that a successful cost export requires the relevant resource provider to be registered in your subscription.
Alternatively, you can follow this .
It will take a few hours to generate the first report, after which Kubecost can use the Azure Storage API to pull that data.
Once the cost export has successfully executed, verify that a non-empty CSV file has been created at this path: <STORAGE_ACCOUNT>/<CONTAINER_NAME>/<OPTIONAL_CONTAINER_PATH>/<COST_EXPORT_NAME>/<DATE_RANGE>/<CSV_FILE>.
If you have sensitive data in an existing Azure Storage account, it is recommended to create a separate Azure Storage account to store your cost data export.
For more granular billing data, it is possible to scope cost exports to resource groups, management groups, departments, or enrollments. AKS clusters will create their own resource groups which can be used. This functionality can then be combined with Kubecost to ingest multiple scoped billing exports.
Obtain the following values from Azure to provide to Kubecost. These values can be located in the Azure Portal by selecting Storage Accounts, then selecting your specific Storage account for details.
azureSubscriptionID is the "Subscription ID" belonging to the Storage account which stores your exported Azure cost report data.
azureStorageAccount is the name of the Storage account where the exported Azure cost report data is being stored.
azureStorageAccessKey can be found by selecting Access keys in your Storage account left navigation under "Security + networking". Either of the two keys will work.
azureStorageContainer is the name that you chose for the exported cost report when you set it up. This is the name of the container where the CSV cost reports are saved in your Storage account.
azureContainerPath is an optional value which should be used if more than one billing report is exported to the configured container. The path provided should contain only one billing export because Kubecost will retrieve the most recent billing report for a given month found within the path.
azureCloud is an optional value which denotes the cloud where the storage account exists. Possible values are public and gov. The default is public.
Next, create a JSON file which must be named cloud-integration.json with the following format:
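As a rough illustration, the six values listed above could be assembled into the file contents with a short script. The key names come from the list above; wrapping the entry in a top-level "azure" list is an assumption about the file layout and should be checked against the official sample files. The function name is hypothetical.

```python
import json


def azure_cloud_integration(subscription_id, storage_account, access_key,
                            container, container_path="", cloud="public"):
    """Build the contents of cloud-integration.json for one Azure export."""
    if cloud not in ("public", "gov"):
        raise ValueError("azureCloud must be 'public' or 'gov'")
    return {
        # Assumption: entries are wrapped in a top-level "azure" list.
        "azure": [
            {
                "azureSubscriptionID": subscription_id,
                "azureStorageAccount": storage_account,
                "azureStorageAccessKey": access_key,
                "azureStorageContainer": container,
                "azureContainerPath": container_path,
                "azureCloud": cloud,
            }
        ]
    }


# Placeholder values only; substitute the values gathered from the Azure Portal.
config = azure_cloud_integration(
    "<SUBSCRIPTION_ID>", "<STORAGE_ACCOUNT>", "<ACCESS_KEY>", "<CONTAINER_NAME>"
)
print(json.dumps(config, indent=2))
```

Redirecting the printed output to a file named cloud-integration.json would produce the file that the Secret in the next step is created from.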
Next, create the Secret:
Next, ensure the following are set in your Helm values:
Next, upgrade Kubecost via Helm:
You can verify a successful configuration by checking the following in the Kubecost UI:
The Assets dashboard will be broken down by Kubernetes assets.
The Assets dashboard will no longer show a banner that says "External cloud cost not configured".
The Diagnostics page (via Settings > View Full Diagnostics) will show a green checkmark under Cloud Integrations.
If there are no in-cluster costs for a particular day, then there will not be out-of-cluster costs either.
Kubecost utilizes Azure tagging to allocate the costs of Azure resources outside of the Kubernetes cluster to specific Kubernetes concepts, such as namespaces, pods, etc. These costs are then shown in a unified dashboard within the Kubecost interface.
To allocate external Azure resources to a Kubernetes concept, use the following tag naming scheme:
In the kubernetes_label_NAME tag key, the NAME portion should appear exactly as the tag appears inside of Kubernetes. For example, for the tag app.kubernetes.io/name, this tag key would appear as kubernetes_label_app.kubernetes.io/name.
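The naming scheme can be expressed as a small helper that builds the tag set for a resource. This is a sketch only: the function name and the example values are hypothetical, while the tag keys follow the scheme described in this doc.

```python
def external_cost_tags(cluster=None, namespace=None, deployment=None, labels=None):
    """Build tag key/value pairs that map a cloud resource to Kubernetes concepts."""
    tags = {}
    if cluster:
        tags["kubernetes_cluster"] = cluster
    if namespace:
        tags["kubernetes_namespace"] = namespace
    if deployment:
        tags["kubernetes_deployment"] = deployment
    # Label keys keep the Kubernetes label name unchanged after the prefix.
    for name, value in (labels or {}).items():
        tags["kubernetes_label_" + name] = value
    return tags


print(external_cost_tags(namespace="payments",
                         labels={"app.kubernetes.io/name": "api"}))
```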
To use an alternative or existing Azure tag schema, you may supply these in your values.yaml under kubecostProductConfigs.labelMappingConfigs.<aggregation>_external_label. Also be sure to set kubecostProductConfigs.labelMappingConfigs.enabled = true.
To troubleshoot a configuration that is not yet working:
Run kubectl get secrets -n kubecost to verify you've properly configured cloud-integration.json.
Run helm get values kubecost to verify you've properly set .Values.kubecostProductConfigs.cloudIntegrationSecret.
Verify that a non-empty CSV file has been created at this path in your Azure Portal Storage Account: <STORAGE_ACCOUNT>/<CONTAINER_NAME>/<OPTIONAL_CONTAINER_PATH>/<COST_EXPORT_NAME>/<DATE_RANGE>/<CSV_FILE>. Ensure new CSVs are being generated every day.
When opening a cost report CSV, ensure that there are rows in the file that do not have a MeterCategory of “Virtual Machines” or “Storage”, as these items are ignored because they are in-cluster costs. Additionally, make sure that there are items with a UsageDateTime that matches the date you are interested in.
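This check can be scripted against a downloaded export. The sketch below assumes the CSV has MeterCategory and UsageDateTime columns as named above and that the date is passed as a prefix in whatever format the export uses; the sample rows and the Cost column are made up for illustration.

```python
import csv
import io

# Categories the text above says are ignored because they are in-cluster costs.
IN_CLUSTER_CATEGORIES = {"Virtual Machines", "Storage"}


def out_of_cluster_rows(csv_text, date_prefix):
    """Return rows that are not in-cluster costs and match the given date."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [
        row for row in reader
        if row.get("MeterCategory") not in IN_CLUSTER_CATEGORIES
        and (row.get("UsageDateTime") or "").startswith(date_prefix)
    ]


# Made-up sample data, not a real export.
SAMPLE = "\n".join([
    "MeterCategory,UsageDateTime,Cost",
    "Virtual Machines,2024-01-15,10.0",
    "Azure Database for PostgreSQL,2024-01-15,4.2",
    "Storage,2024-01-15,1.1",
    "Azure Database for PostgreSQL,2024-01-16,4.0",
])

rows = out_of_cluster_rows(SAMPLE, "2024-01-15")
print(len(rows), [row["MeterCategory"] for row in rows])
```

An empty result for the date of interest suggests that the export contains no out-of-cluster items for Kubecost to display for that day.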
When reviewing logs:
The following error is reflective of Kubecost's previous Azure Cloud Integration method and can be safely disregarded.
ERR Error, Failed to locate azure storage config file: /var/azure-storage-config/azure-storage-config.json
Kubecost provides the ability to allocate out-of-cluster (OOC) costs, e.g. Cloud SQL instances and Cloud Storage buckets, back to Kubernetes concepts like namespaces and deployments.
Read the doc for more information on how Kubecost connects with cloud service providers.
The following guide provides the steps required for allocating OOC costs in GCP.
A GitHub repository with sample files used in the instructions below is available.
Begin by reviewing Google's documentation on exporting cloud billing data to BigQuery.
GCP users must create a detailed usage cost export to gain access to all Kubecost CloudCost features. Exports of type "Standard usage cost data" and "Pricing Data" do not have the correct information to support CloudCosts.
If you are using the alternative method, Step 2 is not required.
If your BigQuery dataset is in a different project than the one where Kubecost is installed, see the guidance below on running Kubecost in a separate project from the BigQuery dataset.
Add a service account key to allocate OOC resources (e.g. storage buckets and managed databases) back to their Kubernetes owners. The service account needs the following:
If you don't already have a GCP service account with the appropriate rights, you can run the following commands in your command line to generate and export one. Make sure your GCP project is where your external costs are being run.
After creating the GCP service account, you can connect it to Kubecost in one of two ways before configuring:
NAMESPACE is the namespace Kubecost is installed into
KSA_NAME is the name of the service account attributed to the Kubecost deployment
Create a service account key:
Once the GCP service account has been connected, set up the remaining configuration parameters.
You're almost done. Now it's time to configure Kubecost to finalize your connectivity.
If you've connected using Workload Identity Federation, add these configs:
Otherwise, if you've connected using a service account key, create a secret for the GCP service account key you've created and add the following configs:
When managing the service account key as a Kubernetes secret, the secret must reference the service account key JSON file, and that file must be named compute-viewer-kubecost-key.json.
In Kubecost, select Settings from the left navigation, and under Cloud Integrations, select Add Cloud Integration > GCP, then provide the relevant information in the GCP Billing Data Export Configuration window:
GCP Service Key: Optional field. If you've created a service account key, copy the contents of the compute-viewer-kubecost-key.json file and paste them here. If you've connected using Workload Identity federation in Step 3, you should leave this box empty.
GCP Project Id: The ID of your GCP project.
GCP Billing Database: Requires a BigQuery dataset prefix (e.g. billing_data) in addition to the BigQuery table name. A full example is billing_data.gcp_billing_export_resource_v1_XXXXXX_XXXXXX_XXXXX
Be careful when handling your service key! Ensure you have entered it correctly into Kubecost. Don't lose it or let it become publicly available.
Google generates special labels for GKE resources (e.g. "goog-gke-node", "goog-gke-volume"). Values with these labels are excluded from OOC costs because Kubecost already includes them as in-cluster assets. Thus, to make sure all cloud assets are included, we recommend installing Kubecost on each cluster where insights into costs are required.
If a resource has a label with the same name as a project-level label, the resource label value will take precedence.
Modifications incurred on project-level labels may take several hours to update on Kubecost.
Due to organizational constraints, it is common that Kubecost must be run in a separate project from the project containing the billing data BigQuery dataset, which is needed for Cloud Integration. Configuring Kubecost in this scenario is still possible, but some of the values in the above script will need to be changed. First, you will need the project IDs of the project where Kubecost is installed and the project where the BigQuery dataset is located. Additionally, you will need a GCP user with the iam.serviceAccounts.setIamPolicy permission for the Kubecost project and the ability to manage the roles listed above for the BigQuery project. With these, fill in the following script to set the relevant variables:
Once these values have been set, this script can be run and will create the service account needed for this configuration.
Now that your service account is created, follow the normal configuration instructions.
InvalidQuery 400 error for GCP integration
In cases where Kubecost does not detect a connection following GCP integration, revisit Step 1 and ensure you have enabled detailed usage cost, not standard usage cost. Kubecost uses detailed billing cost to display your OOC spend, and if it was not configured correctly during installation, you may receive errors about your integration.
Kubecost needs access to the Microsoft Azure Billing Rate Card API to access accurate pricing data for your Kubernetes resources.
You can also get this functionality plus external costs by completing the full Azure Cloud Integration.
Start by creating an Azure role definition. Below is an example definition; replace YOUR_SUBSCRIPTION_ID with the Subscription ID where your Kubernetes cluster lives:
Save this into a file called myrole.json.
Next, you'll want to register that role with Azure:
Next, create an Azure service principal.
Keep this information which is used in the service-key.json below.
Next, create a Secret for the Azure Service Principal:
When managing the service account key as a Kubernetes Secret, the secret must reference the service account key JSON file, and that file must be named service-key.json.
Finally, set the kubecostProductConfigs.serviceKeySecretName Helm value to the name of the Kubernetes secret you created. We use the value azure-service-key in our examples.
Or at the command line:
Kubecost supports querying the Azure APIs for cost data based on the region, offer durable ID, and currency defined in your Microsoft Azure offer.
Those properties are configured with the following Helm values:
kubecostProductConfigs.azureBillingRegion
kubecostProductConfigs.azureOfferDurableID
kubecostProductConfigs.currencyCode
Be sure to verify your billing information with Microsoft and update the above Helm values to reflect your bill-to country, subscription offer durable ID/number, and currency.
The following Microsoft documents are a helpful reference:
Kubecost uses public pricing from Cloud Service Providers (CSPs) to calculate costs until the actual cloud bill is available, at which point Kubecost will reconcile your Spot prices from your Cost and Usage Report (CUR). This is almost always ready in 48 hours. Most users will likely prefer to configure the AWS Cloud Integration instead of configuring the Spot data feed manually as demonstrated in this article.
However, if the majority of costs are due to Spot nodes, it may be useful to configure the Spot pricing data feed as it will increase accuracy for short-term (<48 hour) node costs until the Spot prices from the CUR are available. Note that all other (non-Spot) costs will still be based on public (on-demand) pricing until CUR billing data is reconciled.
With Kubecost, Spot pricing data can be pulled hourly by integrating directly with the AWS Spot feed.
First, to enable the AWS Spot data feed, follow AWS' doc.
While configuring, note the settings used as these values will be needed for the Kubecost configuration.
There are multiple options: this can either be set from the Kubecost UI or via .Values.kubecostProductConfigs in the Helm chart. If you set any kubecostProductConfigs from the Helm chart, all changes via the front end will be deleted on pod restart.
projectID: The Account ID of the AWS Account on which the Spot nodes are running.
awsSpotDataRegion: The region of your Spot data bucket.
awsSpotDataBucket: The configured bucket for the Spot data feed.
awsSpotDataPrefix: Optional. The configured prefix for your Spot data feed bucket.
spotLabel: Optional. The Kubernetes node label name designating whether a node is a Spot node. Used to provide pricing estimates until exact Spot data becomes available from the CUR.
spotLabelValue: Optional. The Kubernetes node label value designating a Spot node. Used to provide pricing estimates until exact Spot data becomes available from the CUR. For example, if your Spot nodes carry a label lifecycle:spot, then the spotLabel would be lifecycle and the spotLabelValue would be spot.
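To make the relationship between these fields concrete, the sketch below collects them into one mapping and splits a name:value node label into spotLabel and spotLabelValue. The function name and the example account ID, region, and bucket are hypothetical; only the key names come from the list above.

```python
def spot_feed_settings(project_id, region, bucket, prefix=None, node_label=None):
    """Collect the Spot data feed settings under the key names listed above.

    `node_label` is a "name:value" string such as "lifecycle:spot".
    """
    settings = {
        "projectID": project_id,
        "awsSpotDataRegion": region,
        "awsSpotDataBucket": bucket,
    }
    if prefix:
        settings["awsSpotDataPrefix"] = prefix
    if node_label:
        name, sep, value = node_label.partition(":")
        if not (sep and name and value):
            raise ValueError("node_label must look like 'name:value'")
        settings["spotLabel"] = name
        settings["spotLabelValue"] = value
    return settings


# Placeholder account ID, region, and bucket name.
print(spot_feed_settings("123456789012", "us-east-1", "example-spot-feed",
                         node_label="lifecycle:spot"))
```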
In the UI, you can access these fields via the Settings page, then scroll to Cloud Cost Settings. Next to Spot Instance Configuration, select Update, then fill out all fields.
Spot data feeds are an account-level setting, not a payer-level setting. Every AWS account has its own Spot data feed. The Spot data feed is not currently available in AWS GovCloud.
For Spot data written to an S3 bucket only accessed by Kubecost, it is safe to delete objects after three days of retention.
Kubecost requires read access to the Spot data feed bucket. The following IAM policy can be used to grant Kubecost read access to the Spot data feed bucket.
To attach the IAM policy to the Kubecost service account, you can use IRSA or the account's service key.
If your serviceaccount/kubecost-cost-analyzer already has IRSA annotations attached, be sure to include all policies necessary when running this command.
Create a service-key.json as shown:
Create a K8s secret:
Set the following Helm config:
Verify the below points:
Make sure data is present in the Spot data feed bucket.
Make sure Project ID is configured correctly. You can cross-verify the values under Helm values in the bug report.
Check the value of kubecost_node_is_spot in Prometheus:
"1" means the Spot data instance configuration is correct.
"0" means it is not configured properly.
Is there a prefix? If so, is it configured in Kubecost?
Make sure the IAM permissions are aligned with https://github.com/kubecost/cloudformation/blob/7feace26637aa2ece1481fda394927ef8e1e3cad/kubecost-single-account-permissions.yaml#L36
Make sure the Spot data feed bucket grants Kubecost all the access permissions it needs.
The Spot Instance in the Spot data feed bucket should match the instance in the cluster where the Spot data feed is configured. awsSpotDataBucket has to be present in the right cluster.
Additional details about the cloud-integration.json file can be found in our doc.
For more details on what Azure resources support tagging, along with what resource type tags are available in cost reports, please review the official Microsoft documentation.
You can set up an IAM policy binding to bind a Kubernetes service account to your GCP service account as seen below, where:
You will also need to enable the in the GCP project.
It is recommended to provide the GCP details in your Helm values file to ensure they are retained during an upgrade or redeploy. First, set the following configs:
You can now label assets with the following schema to allocate costs back to their appropriate Kubernetes owner. Learn more on updating GCP asset labels.
To use an alternative or existing label schema for GCP cloud assets, you may supply these in your values.yaml under kubecostProductConfigs.labelMappingConfigs.<aggregation>_external_label.
Project-level labels are applied to all the Assets built from resources defined under a given GCP project. You can filter GCP resources in the Kubecost (or ).
There are cases where labels applied at the account level do not show up in the date-partitioned data. If account-level labels are not showing up, you can switch to querying them unpartitioned by setting an extraEnv in Kubecost: GCP_ACCOUNT_LABELS_NOT_PARTITIONED: true.
Create a file called service-key.json and update it with the Service Principal details from the above steps:
In the :
Kubernetes concept    Tag/label key             Tag/label value
Cluster               kubernetes_cluster        cluster-name
Namespace             kubernetes_namespace      namespace-name
Deployment            kubernetes_deployment     deployment-name
Label                 kubernetes_label_NAME*    label-value
DaemonSet             kubernetes_daemonset      daemonset-name
Pod                   kubernetes_pod            pod-name
Container             kubernetes_container      container-name
Certain features of Kubecost, including Savings Insights like Orphaned Resources and Reserved Instances, require access to the cluster's GCP account. Missing access is usually indicated by a 403 error from Google APIs due to 'insufficient authentication scopes'. Viewing this error in the Kubecost UI will display the cause of the error as "ACCESS_TOKEN_SCOPE_INSUFFICIENT".
To obtain access to these features, follow this tutorial which will show you how to configure your Google IAM Service Account and Workload Identity for your application.
Go to your GCP Console and select APIs & Services > Credentials from the left navigation. Select + Create Credentials > API Key.
On the Credentials page, select the icon in the Actions column for your newly-created API key, then select Edit API key. The Edit API key page opens.
Under ‘API restrictions’, select Restrict key, then from the dropdown, select only Cloud Billing API. Select OK to confirm. Then select Save at the bottom of the page.
From here, consult Google Cloud's guide Use Workload Identity to perform the following steps:
Enable Workload Identity on an existing GCP cluster, or spin up a new cluster which will have Workload Identity enabled by default
Migrate any existing workloads to Workload Identity
Configure your applications to use Workload Identity
Create both a Kubernetes service account (KSA) and an IAM service account (GSA).
Annotate the KSA with the email of the GSA.
Update your pod spec to use the annotated KSA, and ensure all nodes on that workload use Workload Identity.
You can stop once you have modified your pod spec (before 'Verify the Workload Identity Setup'). You should now have a GCP cluster with Workload Identity enabled, and both a KSA and a GSA, which are connected via the role roles/iam.workloadIdentityUser.
In the GCP Console, select IAM & Admin > IAM. Find your newly-created GSA and select the Edit Principal pencil icon. You will need to provide the following roles to this service account:
BigQuery Data Viewer
BigQuery Job User
BigQuery User
Compute Viewer
Service Account Token Creator
Select Save.
The following roles need to be added to your IAM service account:
roles/bigquery.user
roles/compute.viewer
roles/bigquery.dataViewer
roles/bigquery.jobUser
roles/iam.serviceAccountTokenCreator
Use this command to add each role individually to the GSA:
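One plausible form of the per-role command, assuming the roles are granted at the project level with gcloud projects add-iam-policy-binding, can be generated as shown below. The project ID and GSA email are placeholders, and the helper name is hypothetical; the script only prints the commands and does not run them.

```python
import shlex

# Roles listed above for the IAM service account (GSA).
ROLES = [
    "roles/bigquery.user",
    "roles/compute.viewer",
    "roles/bigquery.dataViewer",
    "roles/bigquery.jobUser",
    "roles/iam.serviceAccountTokenCreator",
]


def role_binding_commands(project_id, gsa_email):
    """Return one gcloud argument list per role (assumes project-level grants)."""
    return [
        [
            "gcloud", "projects", "add-iam-policy-binding", project_id,
            f"--member=serviceAccount:{gsa_email}",
            f"--role={role}",
        ]
        for role in ROLES
    ]


# Placeholder project ID and service account email.
for command in role_binding_commands(
        "example-project", "kubecost@example-project.iam.gserviceaccount.com"):
    print(shlex.join(command))
```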
From here, restart the pod(s) to confirm your changes. You should now have access to all expected Kubecost functionality through your service account with Workload Identity.
Aggregator is a new backend for Kubecost. It is used in a Federated ETL configuration without Thanos, replacing the Federator component. Aggregator serves a critical subset of Kubecost APIs, but will eventually be the default model for Kubecost and serve all APIs. Currently, Aggregator supports all major monitoring and savings APIs, and also budgets and reporting.
Existing documentation for Kubecost APIs will use endpoints for non-Aggregator environments unless otherwise specified, but will still be compatible after configuring Aggregator.
Aggregator is designed to accommodate queries of large-scale datasets by improving API load times and reducing UI errors. It is not designed to introduce new functionality; it is meant to improve functionality at scale.
Aggregator is currently free for all Enterprise users to configure, and can always be rolled back.
Aggregator can only be configured in a Federated ETL environment.
You must be using Kubecost v1.107.0 or newer.
Your values.yaml file must have kubecostDeployment.queryServiceReplicas set to its default value of 0.
You must have your context set to your primary cluster. Kubecost Aggregator cannot be deployed on secondary clusters.
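Two of these prerequisites are easy to check mechanically. The sketch below assumes plain three-part version strings such as v1.107.0 and a values mapping shaped like the parsed values.yaml; the function names are hypothetical, and the Federated ETL and primary-cluster prerequisites are not checked here.

```python
MIN_VERSION = (1, 107, 0)


def parse_version(text):
    """Turn 'v1.107.0' or '1.107.0' into a tuple of integers."""
    return tuple(int(part) for part in text.lstrip("v").split("."))


def aggregator_prerequisite_problems(kubecost_version, values):
    """Return a list of unmet prerequisites (empty when both checks pass)."""
    problems = []
    if parse_version(kubecost_version) < MIN_VERSION:
        problems.append("Kubecost must be v1.107.0 or newer")
    replicas = values.get("kubecostDeployment", {}).get("queryServiceReplicas", 0)
    if replicas != 0:
        problems.append("kubecostDeployment.queryServiceReplicas must be 0")
    return problems


print(aggregator_prerequisite_problems(
    "v1.106.2", {"kubecostDeployment": {"queryServiceReplicas": 2}}))
```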
Select from one of the two templates below and save the content as federated-store.yaml. This will be your configuration template required to set up Aggregator.
The .yaml file used to create the secret must be named federated-store.yaml or Aggregator will not start.
Basic configuration:
Advanced configuration (for larger deployments):
There is no baseline for what is considered a larger deployment, which will be dependent on load times in your Kubecost environment.
Once you’ve configured your federated-store.yaml, create a secret using the following command:
Next, you will need to create an additional cloud-integration secret. Follow this tutorial on creating cloud integration secrets to generate your cloud-integration.json file, then run the following command:
Finally, upgrade your existing Kubecost installation. This command will install Kubecost if it does not already exist:
Upgrading your existing Kubecost using your configured federated-store.yaml file above will reset all existing Helm values configured in your values.yaml. If you wish to preserve any of those changes, append the contents of your federated-store.yaml file to your values.yaml, then replace federated-store.yaml with values.yaml in the upgrade command below:
When first enabled, the aggregator pod will ingest the last three years (if applicable) of ETL data from the federated-store. This may take several hours. Because the combined folder is ignored, the federator pod is not used here, but can still run if needed. You can run kubectl get pods and ensure the aggregator pod is running, but you should still wait for all data to be ingested.
Federated ETL Architecture is only officially supported on Kubecost Enterprise plans.
This doc provides recommendations to improve the stability and recoverability of your Kubecost data when deploying in a Federated ETL architecture.
Kubecost can rebuild its extract, transform, load (ETL) data using Prometheus metrics from each cluster. It is strongly recommended to retain local cluster Prometheus metrics that meet an organization's disaster recovery requirements.
For long term storage of Prometheus metrics, we recommend setting up a Thanos sidecar container to push Prometheus metrics to a cloud storage bucket.
You can configure the Thanos sidecar following this example or this example. Additionally, ensure you configure the following:
object-store.yaml, so the Thanos sidecar has permissions to read/write to the cloud storage bucket.
.Values.prometheus.server.global.external_labels.cluster_id, so Kubecost is able to distinguish which metric belongs to which cluster in the Thanos bucket.
Use your cloud service provider's bucket versioning feature to take frequent snapshots of the bucket holding your Kubecost data (ETL files and Prometheus metrics).
Configure Prometheus Alerting rules or Alertmanager to get notified when you are losing metrics or when metrics deviate beyond a known standard.
This feature is only officially supported on Kubecost Enterprise plans.
Thanos is a tool to aggregate Prometheus metrics to a central object storage (S3 compatible) bucket. Thanos is implemented as a sidecar on the Prometheus pod on all clusters. Thanos Federation is one of two primary methods to aggregate all cluster information back to a single view as described in our Multi-Cluster article.
The preferred method for multi-cluster is ETL Federation. The configuration guide below is for Kubecost Thanos Federation, which may not scale as well as ETL Federation in large environments.
This guide will cover how to enable Thanos on your primary cluster, and on any additional secondary clusters.
Follow steps here to enable all required Thanos components on a Kubecost primary cluster, including the Prometheus sidecar.
For each additional cluster, only the Thanos sidecar is needed.
Consider the following Thanos recommendations for secondaries:
Ensure you provide a unique identifier for prometheus.server.global.external_labels.cluster_id to have additional clusters be visible in the Kubecost product, e.g. cluster-two.
cluster_id can be replaced with another label (e.g. cluster) by modifying .Values.kubecostModel.promClusterIDLabel.
Follow the same verification steps available here.
Sample configurations for each cloud provider can be found here.
Federated ETL is only officially supported for Kubecost Enterprise plans.
Federated extract, transform, load (ETL) is one of two methods to aggregate all cluster information back to a single display described in our Multi-Cluster doc. Federated ETL gives teams the benefit of combining multiple Kubecost installations into one view without dependency on Thanos.
There are two primary advantages for using ETL Federation:
For environments that already have a Prometheus instance, Kubecost only requires a single pod per monitored cluster
Many solutions that aggregate Prometheus metrics (like Thanos) are often expensive to scale in large environments
This guide has specific detail on how ETL Configuration works and deployment options.
Alternatively, the most common configurations can be found in our poc-common-configurations repo.
The federated ETL is composed of three types of clusters.
Federated Clusters: The clusters which are being federated (clusters whose data will be combined and viewable at the end of the federated ETL pipeline). These clusters upload their ETL files after they have built them to Federated Storage.
Federator Clusters: The cluster on which the Federator (described under Other components) is set to run within the core cost-analyzer container. This cluster combines the Federated Cluster data uploaded to federated storage into combined storage.
Primary Cluster: A cluster where you can see the total Federated data that was combined from your Federated Clusters. These clusters read from combined storage.
These cluster designations can overlap, in that some clusters may be several types at once. A cluster that is a Federated Cluster, Federator Cluster, and Primary Cluster will perform the following functions:
As a Federated Cluster, push local cluster cost data to be combined from its local ETL build pipeline.
As a Federator Cluster, run the Federator inside the cost-analyzer, which pulls this local cluster data from S3, combines them, then pushes them back to combined storage.
As a Primary Cluster, pull back this combined data from combined storage to serve it on Kubecost APIs and/or the Kubecost frontend.
The storage referred to here is an S3 (or GCP/Azure equivalent) bucket which acts as remote storage for the Federated ETL Pipeline.
Federated Storage: A set of folders on paths <bucket>/federated/<cluster id> which are essentially ETL backup data, holding a “copy” of Federated Cluster data. Federated Clusters push this data to Federated Storage to be combined by the Federator. Federated Clusters write this data, and the Federator reads this data.
Combined Storage: A folder on S3 on the path <bucket>/federated/combined which holds one set of ETL data containing all the allocations/assets in all the ETL data from Federated Storage. The Federator takes files from Federated Storage and combines them, adding a single set of combined ETL files to Combined Storage to be read by the Primary Cluster. The Federator writes this data, and the Primary Cluster reads this data.
The Federator: A component of the cost-model which is run on the Federator Cluster, which can be a Federated Cluster, a Primary Cluster, or neither. The Federator takes the ETL binaries from Federated Storage and merges them, adding them to Combined Storage.
Federated ETL: The pipeline containing the above components.
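The read and write relationships described above can be summarized in a small sketch that, given a cluster's roles, lists which storage paths it reads and writes. The function name and bucket name are hypothetical; the path layout follows the Federated Storage and Combined Storage definitions above.

```python
def storage_access(bucket, cluster_id, federated=False, federator=False, primary=False):
    """List the storage paths a cluster reads and writes, based on its roles."""
    federated_root = f"{bucket}/federated"
    combined = f"{federated_root}/combined"
    reads, writes = [], []
    if federated:
        # Federated Clusters push their own ETL files to Federated Storage.
        writes.append(f"{federated_root}/{cluster_id}")
    if federator:
        # The Federator reads every cluster folder and writes the combined set.
        reads.append(f"{federated_root}/<cluster id>")
        writes.append(combined)
    if primary:
        # The Primary Cluster serves data read from Combined Storage.
        reads.append(combined)
    return {"reads": reads, "writes": writes}


# A cluster that is both a Federated Cluster and the Federator Cluster.
print(storage_access("example-bucket", "cluster-4", federated=True, federator=True))
```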
This diagram shows an example setup of the Federated ETL with:
Three pure Federated Clusters (not classified as any other cluster type): Cluster 1, Cluster 2, and Cluster 3
One Federator Cluster that is also a Federated Cluster: Cluster 4
One Primary Cluster that is also a Federated Cluster: Cluster 5
The result is 5 clusters federated together.
Ensure each federated cluster has a unique clusterName and cluster_id:
Add a secret using that file: kubectl create secret generic <secret_name> -n kubecost --from-file=federated-store.yaml. Then set .Values.kubecostModel.federatedStorageConfigSecret to the Kubernetes secret name.
For all clusters you want to federate together (i.e. see their data on the Primary Cluster), set .Values.federatedETL.federatedCluster to true. This cluster is now a Federated Cluster, and can also be a Federator or Primary Cluster.
For the cluster “hosting” the Federator, set .Values.federatedETL.federator.enabled to true. This cluster is now a Federator Cluster, and can also be a Federated or Primary Cluster.
Optional: If you have any Federated Clusters pushing to a store that you do not want a Federator Cluster to federate, add the cluster id under the Federator config section .Values.federatedETL.federator.clusters. If this parameter is empty or not set, the Federator will take all ETL files in the /federated directory and federate them automatically.
Multiple Federators federating from the same source will not break, but it’s not recommended.
In Kubecost, the Primary Cluster serves the UI and API endpoints as well as reconciling cloud billing (cloud-integration).
For the cluster that will be the Primary Cluster, set .Values.federatedETL.primaryCluster to true. This cluster is now a Primary Cluster, and can also be a Federator or Federated Cluster.
Cloud-integration requires .Values.federatedETL.federator.primaryClusterID to be set to the same value used for .Values.kubecostProductConfigs.clusterName.
Important: If the Primary Cluster is also to be federated, please wait 2-3 hours for data to populate Federated Storage before setting a Federated Cluster to primary (i.e. set .Values.federatedETL.federatedCluster to true, then wait to set .Values.federatedETL.primaryCluster to true). This allows for maximum certainty of data consistency.
If you do not set this cluster to be federated as well as primary, you will not see local data for this cluster.
The Primary Cluster’s local ETL will be overwritten with combined federated data.
This can be undone by unsetting it as a Primary Cluster and rebuilding ETL.
Setting a Primary Cluster may result in a loss of the cluster’s local ETL data, so it is recommended to back up any filestore data that one would want to save to S3 before designating the cluster as primary.
Alternatively, a fresh Kubecost install can be used as a consumer of combined federated data by setting it as the Primary but not a Federated Cluster.
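As a rough illustration of how the three role flags combine, the sketch below builds the federatedETL portion of a Helm values structure for each role mix. The key names (federatedCluster, federator.enabled, primaryCluster) come from the settings above; the helper function and the example cluster labels are hypothetical, and the output is printed as JSON (which is also valid YAML) purely for inspection.

```python
import json


def federated_etl_values(federated: bool, federator: bool, primary: bool) -> dict:
    """Build the federatedETL values for one cluster from its role flags."""
    return {
        "federatedETL": {
            "federatedCluster": federated,
            "primaryCluster": primary,
            "federator": {"enabled": federator},
        }
    }


if __name__ == "__main__":
    # Hypothetical role mixes: a pure Federated Cluster, a Federated Cluster
    # that also hosts the Federator, and a Federated Cluster that is also Primary.
    examples = {
        "pure-federated": federated_etl_values(True, False, False),
        "federated-and-federator": federated_etl_values(True, True, False),
        "federated-and-primary": federated_etl_values(True, False, True),
    }
    print(json.dumps(examples, indent=2))
```

A fresh install used only as a consumer of combined data would correspond to federated=False with primary=True.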
The Federated ETL should begin functioning. On any ETL action on a Federated Cluster (Load/Put into local ETL store) the Federated Clusters will add data to Federated Storage. The Federator will run 5 minutes after the Federator Cluster startup, and then every 30 minutes after that. The data is merged into the Combined Storage, where it can be read by the Primary.
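To make the timing concrete, here is a small sketch that lists when the Federator would be expected to run, assuming exactly the schedule stated above (first run 5 minutes after startup, then every 30 minutes). The function name and the startup time are hypothetical; actual run times may drift depending on how long each run takes.

```python
from datetime import datetime, timedelta
from typing import List


def federator_run_times(startup: datetime, count: int) -> List[datetime]:
    """Return the first `count` expected Federator run times after startup."""
    first_run = startup + timedelta(minutes=5)
    return [first_run + i * timedelta(minutes=30) for i in range(count)]


if __name__ == "__main__":
    startup = datetime(2024, 1, 1, 12, 0)  # hypothetical startup time
    for run in federator_run_times(startup, 3):
        print(run.isoformat())
```

This is useful when deciding how long to wait before checking Combined Storage for new data.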
To verify Federated Clusters are uploading their data correctly, check the container logs on a Federated Cluster. It should log federated uploads when ETL build steps run. The S3 bucket can also be checked to see if data is being written to the /federated/<cluster_id> path.
To verify the Federator is functioning, check the container logs on the Federator Cluster. The S3 bucket can also be checked to verify that data is being written to /federated/combined.
To verify the entire pipeline is working, either query Allocations/Assets or view the respective views on the frontend. Multi-cluster data should appear after:
The Federator has run at least once.
There was data in the Federated Storage for the Federator to have combined.
If you are using an internal certificate authority (CA), follow this tutorial instead of the above Setup section.
Begin by creating a ConfigMap with the certificate provided by the CA on every agent, including the Federator and any federated clusters, and name the file kubecost-federator-certs.yaml.
Now run the following command, making sure you specify the location for the ConfigMap you created:
kubectl create cm kubecost-federator-certs --from-file=/path/to/kubecost-federator-certs.yaml
Mount the certificate on the Federator and any federated clusters by passing these Helm flags to your values.yaml/manifest:
Create a file federated-store.yaml, which will go on all clusters:
Now run the following command (omit kubectl create namespace kubecost if your kubecost namespace already exists, or this command will fail):
When using ETL Federation, there are several methods to recover Kubecost data in the event of data loss. See our Backups and Alerting doc for more details regarding these methods.
In the event of missing or inaccurate data, you may need to rebuild your ETL pipelines. This is a documented procedure. See the Repair Kubecost ETLs doc for information and troubleshooting steps.
Kubecost Free can now be installed on an unlimited number of individual clusters. Larger teams will benefit from using Kubecost Enterprise to better manage many clusters. See pricing for more details.
In an Enterprise multi-cluster setup, the UI is accessed through a designated primary cluster. All other clusters in the environment (aka secondary clusters) send metrics to a central object-store with a lightweight agent. The primary cluster is designated by setting the Helm flag .Values.federatedETL.primaryCluster=true, which instructs this cluster to read from the combined folder that was processed by the federator. This cluster will consume additional resources to run the Kubecost UI and backend.
As of Kubecost 1.108, agent health is monitored by a diagnostic pod that collects information from the local cluster and sends it to an object-store. This data is then processed by the Primary cluster and accessed via the UI and API.
Because the UI is only accessible through the primary cluster, Helm flags related to UI display are not applied to secondary clusters.
This feature is only supported for Kubecost Enterprise.
There are two primary methods to aggregate all cluster information back to a single Kubecost UI:
Both methods allow for greater compute efficiency by running the most resource-intensive workloads on a single primary cluster.
For environments that already have a Prometheus instance, ETL Federation may be preferred because only a single Kubecost pod is required.
The below diagrams highlight the two architectures:
Kubecost ETL Federation (Preferred)
Kubecost Thanos Federation
Kubecost uses a shared storage bucket to store metrics from clusters, known as durable storage, in order to provide a single-pane-of-glass for viewing cost across many clusters. Multi-cluster is an enterprise feature of Kubecost.
There are multiple methods to provide Kubecost access to an S3 bucket. This guide has two examples:
Using a Kubernetes secret
Attaching an AWS Identity and Access Management (IAM) role to the service account used by Prometheus
Both methods require an S3 bucket. Our example bucket is named kc-thanos-store.
This is a simple S3 bucket with all public access blocked. No other bucket configuration changes should be required.
Once created, add an IAM policy to access this bucket. See our doc for instructions.
To use the Kubernetes secret method for allowing access, create a YAML file named object-store.yaml with contents similar to the following example. See region to endpoint mappings.
Instead of using a secret key in a file, many will want to use this method.
Attach the policy to the Thanos pods' service accounts. Your object-store.yaml should follow the format below when using this option, which does not contain the secret_key and access_key fields.
Once that annotation has been created, configure the following:
You can encrypt the S3 bucket where Kubecost data is stored in AWS via S3 and KMS. However, because Thanos can store potentially millions of objects, it is suggested that you use bucket-level encryption instead of object-level encryption. More details available in these external docs:
To use Azure Storage as the Thanos object store, you need to pre-create a storage account from the Azure portal or using the Azure CLI. Follow the instructions from the .
Now create a YAML file named object-store.yaml with the following format:
Kubecost v1.67.0+ uses Thanos 0.15.0. If you're upgrading to Kubecost v1.67.0+ from an older version and using Thanos, with AWS S3 as your backing storage for Thanos, you'll need to make a small change to your Thanos Secret in order to bump the Thanos version to 0.15.0 before you upgrade Kubecost.
Thanos 0.15.0 has over 10x performance improvements, so this is recommended.
Your values-thanos.yaml needs to be updated to the new defaults. The PR bumps the image version, adds the component, and increases concurrency.
This is simplified if you're using our default values-thanos.yaml, which has the new configs already.
For the Thanos Secret you're using, the encrypt-sse line needs to be removed. Everything else should stay the same.
For example, view this sample config:
The easiest way to do this is to delete the existing secret and upload a new one:
kubectl delete secret -n kubecost kubecost-thanos
Update your secret YAML file as above, and save it as object-store.yaml.
kubectl create secret generic kubecost-thanos -n kubecost --from-file=./object-store.yaml
Once this is done, you're ready to upgrade!
Start by . The following example uses a bucket named thanos-bucket
. Next, download a service account JSON file from Google's service account manager ().
Now create a YAML file named object-store.yaml in the following format, using your bucket name and service account details:
Note: Because this is a YAML file, it requires this specific indentation.
Warning: Do not apply a retention policy to your Thanos bucket, as it will prevent Thanos compaction from completing.
In order to create an AWS IAM policy for use with Thanos:
1. Navigate to the AWS console and select IAM.
2. Select Policies in the Navigation menu, then select Create Policy.
3. Add the following JSON in the policy editor:
Make sure to replace <your-bucket-name> with the name of your newly created S3 bucket.
4. Select Review policy and name this policy, e.g. kc-thanos-store-policy.
5. Navigate to Users in IAM control panel, then select Add user.
6. Provide a username (e.g. kubecost-thanos-service-account) and select Programmatic access.
7. Select Attach existing policies directly, search for the policy name provided in Step 4, then create the user.
Capture your Access Key ID and secret in the view below:
This feature is only officially supported on .
Kubecost leverages Thanos and durable storage for three different purposes:
Centralize metric data for a global multi-cluster view into Kubernetes costs via a Prometheus sidecar
Allow for unlimited data retention
Backup Kubecost
To enable Thanos, follow these steps:
This step creates the object-store.yaml file that contains your durable storage target (e.g. GCS, S3, etc.) configuration and access credentials. The details of this file are documented thoroughly in .
We have guides for using cloud-native storage for the largest cloud providers. Other providers can be similarly configured.
Use the appropriate guide for your cloud provider:
Create a secret with the .yaml file generated in the previous step:
Each cluster needs to be labelled with a unique Cluster ID, which is done in two places.
values-clusterName.yaml
The Thanos subchart includes thanos-bucket, thanos-query, thanos-store, thanos-compact, and service discovery for thanos-sidecar. These components are recommended when deploying Thanos on the primary cluster.
The thanos-store container is configured to request 2.5GB of memory; this may be reduced for smaller deployments. thanos-store is only used on the primary Kubecost cluster.
To verify installation, check to see all Pods are in a READY state. View Pod logs for more detail and see common troubleshooting steps below.
Thanos sends data to the bucket every 2 hours. Once 2 hours have passed, logs should indicate if data has been sent successfully or not.
You can monitor the logs with:
Monitoring logs this way should return results like this:
As an aside, you can validate the Prometheus metrics are all configured with correct cluster names with:
To troubleshoot the IAM Role attached to the service account, you can create a Pod using the same service account used by the thanos-sidecar (default is kubecost-prometheus-server):
s3-pod.yaml
This should return a list of objects (or at least not give a permission error).
If a cluster is not successfully writing data to the bucket, review thanos-sidecar logs with the following command:
Logs in the following format are evidence of a successful bucket write:
/stores endpoint
If thanos-query can't connect to both the sidecar and the store, you may want to directly specify the store gRPC service address instead of using DNS discovery (the default). You can quickly test if this is the issue by running:
kubectl edit deployment kubecost-thanos-query -n kubecost
and adding
--store=kubecost-thanos-store-grpc.kubecost:10901
to the container args. This will cause a query restart and you can visit /stores again to see if the store has been added.
If it has, you'll want to use these addresses instead of DNS more permanently by setting .Values.thanos.query.stores in values-thanos.yaml.
A common error is as follows, which means you do not have the correct access to the supplied bucket:
Assuming pods are running, use port forwarding to connect to the thanos-query-http endpoint:
If you navigate to Stores using the top navigation bar, you should be able to see the status of both the thanos-store and thanos-sidecar which accompanied the Prometheus server:
Also note that the sidecar should identify with the unique cluster_id provided in your values.yaml in the previous step. Default value is cluster-one.
The default retention period for when data is moved into the object storage is currently 2h. This configuration is based on Thanos suggested values. By default, it will be 2 hours before data is written to the provided bucket.
Then, follow to enable attaching IAM roles to pods.
You can define the IAM role to associate with a service account in your cluster by creating a service account in the same namespace as Kubecost and adding an annotation to it of the form eks.amazonaws.com/role-arn: arn:aws:iam::<AWS_ACCOUNT_ID>:role/<IAM_ROLE_NAME>
as described .
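The annotation format above can be illustrated with a short Python sketch that assembles the key and role ARN from an account id and role name. The helper name and the sample account id and role name are hypothetical; only the eks.amazonaws.com/role-arn key and the ARN shape come from the text above.

```python
def role_arn_annotation(account_id: str, role_name: str) -> dict:
    """Build the service account annotation that associates an IAM role."""
    arn = f"arn:aws:iam::{account_id}:role/{role_name}"
    return {"eks.amazonaws.com/role-arn": arn}


if __name__ == "__main__":
    # Hypothetical account id and role name, for illustration only.
    print(role_arn_annotation("111122223333", "kubecost-thanos-role"))
```

The resulting mapping is what would appear under the service account's annotations.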
Visit the doc for troubleshooting help.
If you don’t want to use a service account, IAM credentials retrieved from an instance profile are also supported. You must get both access key and secret key from the same method (i.e. both from service or instance profile). More info on retrieving credentials .
These values can be adjusted under the thanos block in values-thanos.yaml. Available options are here:
Then navigate to in your browser. This page should look very similar to the Prometheus console.
Rather than waiting 2h to confirm that Thanos was configured correctly, you can check the logs right away: the default log level for the Thanos workloads is debug (it's very light logging even on debug). You can get logs for the thanos-sidecar, which is part of the prometheus-server Pod, and thanos-store. The logs should give you a clear indication of whether or not there was a problem consuming the secret and what the issue is. For more on Thanos architecture, view .
Secondary clusters use a minimal Kubecost deployment to send their metrics to a central storage-bucket (aka durable storage) that is accessed by the primary cluster to provide a single-pane-of-glass view into all aggregated cluster costs globally. This aggregated cluster view is exclusive to Kubecost Enterprise.
Kubecost's UI will appear broken when set to a secondary cluster. It should only be used for troubleshooting.
This guide explains settings that can be tuned to run only the minimum Kubecost components, so that Kubecost runs more efficiently.
See the Additional resources section below for complete examples in our GitHub repo.
Disable product caching and reduce query concurrency with the following parameters:
Grafana is not needed on secondary clusters.
Kubecost and its accompanying Prometheus collect a reduced set of metrics that allow for lower resource/storage usage than a standard Prometheus deployment.
The following configuration options further reduce resource consumption when not using the Kubecost frontend:
Retention can potentially be reduced even further, since metrics are sent to the storage-bucket every 2 hours.
You can tune prometheus.server.persistentVolume.size depending on scale, or outright disable persistent storage.
Disable Thanos components. These are only used for troubleshooting on secondary clusters. See this guide for troubleshooting via kubectl logs.
Secondary clusters write to the global storage-bucket via the thanos-sidecar on the prometheus-server pod.
You can disable node-exporter and the service account if cluster/node rightsizing recommendations are not required.
node-exporter must be disabled if there is an existing DaemonSet. More info here.
For reference, this secondary-clusters.yaml snippet is a list of the most common settings for efficient secondary clusters:
You can find complete installation guides and sample files on our repo.
Additional considerations for properly tuning resource consumption are here.
This document will describe why your Kubecost instance’s data can be useful to share with us, what content is in the data, and how to share it.
Kubecost product releases are tested and verified against a combination of generated/synthetic Kubernetes cluster data and examples of customer data that have been shared with us. Customers who share snapshots of their data with us help to ensure that product changes handle their specific use cases and scales. Because the Kubecost product for many customers is run as an on-prem service, with no data sharing back to us, we do not inherently have this data for many of our customers.
Sharing data with us requires an ETL backup executed by the customer in their own environment before the resulting data can be sent out. Kubecost's ETL is a computed cache built upon Prometheus metrics and cloud billing data, on which nearly all API requests made by the user and the Kubecost frontend currently rely. Therefore, the ETL data will contain metric data and identifying information for that metric (e.g. a container name, pod name, namespace, and cluster name) during a time window, but will not contain other information about containers, pods, clusters, cloud resources, etc. You can read more about these metric details in our Kubecost Metrics doc.
The full methodology for creating the ETL backup can be found in our ETL Backup doc. Once these files have been backed up, the content will look as follows before compressing the data:
Once the data is downloaded to the local disk from either the automated or manual ETL backup methods, the data must be converted to a gzip file. A suggested method for downloading the ETL backup and compressing it quickly is to use this script. Check out the tar syntax in that script if doing this manually without the script. When the compressed ETL backup is ready to share, please work with a Kubecost support engineer on sharing the file with us. Our most common approach is to use a Google Drive folder with access limited to you and the support engineer, but we recognize not all companies are open to this and will work with you to determine the most business-appropriate method.
If you are interested in reviewing the contents of the data, either before or after sending the ETL backup to us, you can find an example Golang implementation on how to read the raw ETL data.
We do not recommend enabling ETL Backup in conjunction with Federated ETL.
Kubecost's extract, transform, load (ETL) data is a computed cache based on Prometheus's metrics, from which the user can perform all possible Kubecost queries. The ETL data is stored in a persistent volume mounted to the kubecost-cost-analyzer pod.
There are a number of reasons why you may want to back up this ETL data:
To ensure a copy of your Kubecost data exists, so you can restore the data if needed
To reduce the amount of historical data stored in Prometheus/Thanos, and instead retain historical ETL data
Beginning in v1.100, this feature is enabled by default if you have Thanos enabled. To opt out, set .Values.kubecostModel.etlBucketConfigSecret="".
Kubecost provides cloud storage backups for ETL backing storage. Backups are not the typical approach of "halt all reads/writes and dump the database." Instead, the backup system is a transparent feature that will always ensure that local ETL data is backed up, and if local data is missing, it can be retrieved from backup storage. This feature protects users from accidental data loss by ensuring that previously backed-up data can be restored at runtime.
Durable backup storage functionality is supported with a Kubecost Enterprise plan.
When the ETL pipeline collects data, it stores daily and hourly (if configured) cost metrics on a configured storage. This defaults to a PV-based disk storage, but can be configured to use external durable storage on the following providers:
AWS S3
Azure Blob Storage
Google Cloud Storage
This configuration secret follows the same layout documented for Thanos here.
You will need to create a file named object-store.yaml using the chosen storage provider configuration (documented below), and run the following command to create the secret from this file:
The file must be named object-store.yaml.
If Kubecost was installed via Helm, ensure the following value is set.
If you are using an existing disk storage option for your ETL data, enabling the durable backup feature will retroactively back up all previously stored data*. This feature is also fully compatible with the existing S3 backup feature.
If you are using a memory store for your ETL data with a local disk backup (kubecostModel.etlFileStoreEnabled: false), the backup feature will simply replace the local backup. In order to take advantage of the retroactive backup feature, you will need to update to file store (kubecostModel.etlFileStoreEnabled: true). This option is now enabled by default in the Helm chart.
The simplest way to back up Kubecost's ETL is to copy the pod's ETL store to your local disk. You can then send that file to any other storage system of your choice. We provide a script to do that.
To restore the backup, untar the results of the ETL backup script into the ETL directory on the pod.
There is also a Bash script available to restore the backup in Kubecost's etl-backup repo.
This feature is still in development, but there is a status card available on the Diagnostics page that will eventually show the status of the backup system:
In some scenarios, such as when using Memory store, setting kubecostModel.etlHourlyStoreDurationHours to a value of 48 hours or less will cause ETL backup files to become truncated. The current recommendation is to keep etlHourlyStoreDurationHours at its default of 49 hours.
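A simple pre-deployment check for this setting could look like the sketch below. It encodes only the rule stated above (values of 48 hours or less risk truncated backups; the default is 49); the function and constant names are hypothetical and this is not a Kubecost API.

```python
DEFAULT_HOURLY_STORE_DURATION_HOURS = 49  # default stated in the text above


def may_truncate_backups(etl_hourly_store_duration_hours: int) -> bool:
    """Return True if the configured duration risks truncated ETL backup files."""
    return etl_hourly_store_duration_hours <= 48


if __name__ == "__main__":
    for hours in (24, 48, DEFAULT_HOURLY_STORE_DURATION_HOURS):
        status = "risk of truncation" if may_truncate_backups(hours) else "ok"
        print(f"{hours}h: {status}")
```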
This feature is currently in beta. It is enabled by default.
Multi-Cluster Diagnostics offers a single view into the health of all the clusters you currently monitor with Kubecost.
Health checks include, but are not limited to:
Whether Kubecost is correctly emitting metrics
Whether Kubecost is being scraped by Prometheus
Whether Prometheus has scraped the required metrics
Whether Kubecost's ETL files are healthy
Additional configuration options can be found in the values.yaml under diagnostics:.
The multi-cluster diagnostics feature is run as an independent deployment (i.e. deployment/kubecost-diagnostics). Each diagnostics deployment monitors the health of Kubecost and sends that health data to the central object store at the /diagnostics filepath.
The below diagram depicts these interactions. This diagram is specific to the requests required for diagnostics only. For additional diagrams, see our multi-cluster guide.
The diagnostics API can be accessed through /model/multi-cluster-diagnostics?window=2d (or /model/mcd for short).
The window query parameter is required; the API returns all diagnostics within the specified time window.
GET http://<your-kubecost-address>/model/multi-cluster-diagnostics
The Multi-cluster Diagnostics API provides a single view into the health of all the clusters you currently monitor with Kubecost.
window* (string): Duration of time over which to query. Accepts words like today, week, month, yesterday, lastweek, lastmonth; durations like 30m, 12h, 7d; comma-separated RFC3339 date pairs like 2021-01-02T15:04:05Z,2021-02-02T15:04:05Z; comma-separated Unix timestamp (seconds) pairs like 1578002645,1580681045.
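As a hedged illustration of the window parameter, the sketch below checks a value against the four forms listed above and builds the request URL. It covers only the example forms given here (the real API may accept more), and the function names and base address are hypothetical.

```python
import re
from urllib.parse import urlencode

_WORDS = {"today", "week", "month", "yesterday", "lastweek", "lastmonth"}
_DURATION = re.compile(r"\d+[mhd]")  # e.g. 30m, 12h, 7d
_RFC3339 = r"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z"
_RFC3339_PAIR = re.compile(_RFC3339 + "," + _RFC3339)
_UNIX_PAIR = re.compile(r"\d+,\d+")  # start,end in seconds


def is_valid_window(window: str) -> bool:
    """Check a window value against the forms listed in the text above."""
    if window in _WORDS:
        return True
    return any(
        pattern.fullmatch(window)
        for pattern in (_DURATION, _RFC3339_PAIR, _UNIX_PAIR)
    )


def diagnostics_url(base: str, window: str) -> str:
    """Build the multi-cluster diagnostics URL for a given window."""
    if not is_valid_window(window):
        raise ValueError(f"unsupported window value: {window!r}")
    query = urlencode({"window": window})
    return f"{base.rstrip('/')}/model/multi-cluster-diagnostics?{query}"


if __name__ == "__main__":
    # Hypothetical Kubecost address.
    print(diagnostics_url("http://localhost:9090", "2d"))
```

Validating the window on the client side is optional; it simply surfaces malformed values before a request is sent.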
Amazon Elastic Kubernetes Service (Amazon EKS) is a managed container service to run and scale Kubernetes applications in the AWS cloud. In collaboration with Amazon EKS, Kubecost provides an optimized bundle for Amazon EKS cluster cost visibility that enables customers to accurately track costs by namespace, cluster, pod or organizational concepts such as team or application. Customers can use their existing AWS support agreements to obtain support. Kubernetes platform administrators and finance leaders can use Kubecost to visualize a breakdown of their Amazon EKS cluster charges, allocate costs, and chargeback organizational units such as application teams.
In this article, you will learn more about how the Amazon EKS architecture interacts with Kubecost. You will also learn to deploy Kubecost on EKS using one of three different methods:
Deploy Kubecost on an Amazon EKS cluster using Amazon EKS add-on
Deploy Kubecost on an Amazon EKS cluster via Helm
Deploy Kubecost on an Amazon EKS Anywhere cluster using Helm
User experience diagram:
Amazon EKS cost monitoring with Kubecost architecture:
Subscribe to Kubecost on AWS Marketplace here.
You have access to an Amazon EKS cluster.
After subscribing to Kubecost on AWS Marketplace and following the on-screen instructions successfully, you are redirected to Amazon EKS console. To get started in the Amazon EKS console, go to your EKS clusters, and in the Add-ons tab, select Get more add-ons to find Kubecost EKS add-ons in the cluster setting of your existing EKS clusters. You can use the search bar to find "Kubecost - Amazon EKS cost monitoring" and follow the on-screen instructions to enable Kubecost add-on for your Amazon EKS cluster. You can learn more about direct deployment to Amazon EKS clusters from this AWS blog post.
On your workspace, run the following command to enable the Kubecost add-on for your Amazon EKS cluster:
You need to replace $YOUR_CLUSTER_NAME and $AWS_REGION with your actual Amazon EKS cluster name and AWS region.
To monitor the installation status, you can run the following command:
The Kubecost add-on should be available in a few minutes. Run the following command to enable port-forwarding to expose the Kubecost dashboard:
To disable Kubecost add-on, you can run the following command:
To get started, you can follow these steps to deploy Kubecost into your Amazon EKS cluster in a few minutes using Helm.
You have access to an Amazon EKS cluster.
If your cluster is version 1.23 or later, you must have the Amazon EBS CSI driver installed on your cluster. You can also follow these instructions to install Amazon EBS CSI driver:
Run the following command to create an IAM service account with the policies needed to use the Amazon EBS CSI Driver.
Remember to replace $CLUSTER_NAME with your actual cluster name.
Install the Amazon EBS CSI add-on for EKS using the AmazonEKS_EBS_CSI_DriverRole by issuing the following command:
After completing these prerequisite steps, you're ready to begin EKS integration.
In your environment, run the following command from your terminal to install Kubecost on your existing Amazon EKS cluster:
To install Kubecost on an Amazon EKS cluster on AWS Graviton2 (ARM-based processor), you can run the following command:
On the Amazon EKS cluster with mixed processor architecture worker nodes (AMD64, ARM64), this parameter can be used to schedule Kubecost deployment on ARM-based worker nodes: --set nodeSelector."beta\\.kubernetes\\.io/arch"=arm64
Remember to replace $VERSION with the actual version number. You can find all available versions via the Amazon ECR public gallery here.
By default, the installation will include certain prerequisite software including Prometheus and kube-state-metrics. To customize your deployment, such as skipping these prerequisites if you already have them running in your cluster, you can configure any of the available values to modify storage, network configuration, and more.
Run the following command to enable port-forwarding to expose the Kubecost dashboard:
You can now access Kubecost's UI by visiting http://localhost:9090 in your local web browser. Here, you can monitor your Amazon EKS cluster cost and efficiency. Depending on your organization’s requirements and setup, you may have different options to expose Kubecost for internal access. Here are a few examples that you can use for reference:
See Kubecost's Ingress Examples doc as a reference for using Nginx ingress controller with basic auth.
You can also consider using AWS LoadBalancer controller to expose Kubecost and use Amazon Cognito for authentication, authorization, and user management. You can learn more via the AWS blog post Authenticate Kubecost Users with Application Load Balancer and Amazon Cognito.
Deploying Kubecost on EKS Anywhere via Helm is not the officially recommended method by Kubecost or AWS. The recommended method is via EKS add-on (see above).
Amazon EKS Anywhere (EKS-A) is an alternate deployment of EKS which allows you to create and configure on-premises clusters, including on your own virtual machines. It is possible to deploy Kubecost on EKS-A clusters to monitor spend data.
Deploying Kubecost on an EKS-A cluster should function similarly at the cluster level, such as when retrieving Allocations or Assets data. However, because on-prem servers wouldn't be visible in a CUR (as the billing source is managed outside AWS), certain features like the Cloud Cost Explorer will not be accessible.
You have installed the EKS-A installer and have access to an Amazon EKS-A cluster.
In your environment, run the following command from your terminal to install Kubecost on your existing Amazon EKS cluster:
To install Kubecost on an EKS-A cluster on AWS Graviton2 (ARM-based processor), you can run the following command:
On the Amazon EKS cluster with mixed processor architecture worker nodes (AMD64, ARM64), this parameter can be used to schedule Kubecost deployment on ARM-based worker nodes: --set nodeSelector."beta\\.kubernetes\\.io/arch"=arm64
Remember to replace $VERSION with the actual version number. You can find all available versions via the Amazon ECR public gallery here.
By default, the installation will include certain prerequisite software including Prometheus and kube-state-metrics. To customize your deployment, such as skipping these prerequisites if you already have them running in your cluster, you can configure any of the available values to modify storage, network configuration, and more.
Run the following command to enable port-forwarding to expose the Kubecost dashboard:
You can now access Kubecost's UI by visiting http://localhost:9090 in your local web browser. Here, you can monitor your Amazon EKS cluster cost and efficiency through the Allocations and Assets pages.
Amazon EKS documentation:
AWS blog content:
This feature is only supported on Kubecost Enterprise plans.
The query service replica (QSR) is a scale-out query service that reduces load on the cost-model pod. It allows for improved horizontal scaling by being able to handle queries for larger intervals, and multiple simultaneous queries.
The query service will forward /model/allocation and /model/assets requests to the Query Services StatefulSet.
The diagram below demonstrates the backing architecture of this query service and its functionality.
There are three options that can be used for the source ETL Files:
For environments that have Kubecost Federated ETL enabled, this store will be used; no additional configuration is required.
For single cluster environments, QSR can target the ETL backup store. To learn more about ETL backups, see the ETL Backup doc.
Alternatively, an object-store containing the ETL dataset to be queried can be configured using a secret kubecostDeployment.queryServiceConfigSecret. The file name of the secret must be object-store.yaml. Examples can be found in our Configuring Thanos doc.
QSR uses persistent volume storage to avoid excessive S3 transfers. Data is retrieved from S3 hourly as new ETL files are created and stored in these PVs. The databaseVolumeSize should be larger than the size of the data in the S3 bucket.
When the pods start, data from the object-store is synced and this can take a significant time in large environments. During the sync, parts of the Kubecost UI will appear broken or have missing data. You can follow the pod logs to see when the sync is complete.
The default of 100Gi is enough storage for 1M pods and 90 days of retention. This can be adjusted:
Once the data store is configured, set kubecostDeployment.queryServiceReplicas to a non-zero value and perform a Helm upgrade.
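As a sketch of that upgrade, assuming the release is named `kubecost`, the chart reference is `kubecost/cost-analyzer`, and the install lives in the `kubecost` namespace (all three are assumptions about your environment). The helper name and the replica count of 2 are illustrative only.

```shell
# Hypothetical helper: enable the query service replicas on an existing release.
# --reuse-values keeps the values already applied to the release.
enable_query_service_replicas() {
  helm upgrade kubecost kubecost/cost-analyzer \
    --namespace kubecost \
    --reuse-values \
    --set kubecostDeployment.queryServiceReplicas="${1:-2}"
}
```

For example, `enable_query_service_replicas 3` would request three replicas.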
Once QSR has been enabled, the new pods will automatically handle all API requests to /model/allocation and /model/assets.
This document provides the steps for installing the Kubecost product from the AWS Marketplace.
To deploy Kubecost from AWS Marketplace, you need to assign an IAM policy with the appropriate IAM permissions to a Kubernetes service account before starting the deployment. You can either use the AWS managed policy arn:aws:iam::aws:policy/AWSMarketplaceMeteringRegisterUsage or create your own IAM policy. You can learn more with AWS' tutorial.
Here's an example IAM policy:
Create an IAM role with AWS-managed IAM policy.
Create a K8s service account named awsstore-serviceaccount in your Amazon EKS cluster.
Set up a trust relationship between the created IAM role and awsstore-serviceaccount.
Modify the awsstore-serviceaccount annotation to associate it with the created IAM role.
Remember to replace CLUSTER_NAME with your actual Amazon EKS cluster name.
Define which available version you would like to install using the following command. You can check available version titles from the AWS Marketplace product, e.g., prod-1.95.0:
export IMAGETAG=<VERSION-TITLE>
Deploy Kubecost with Helm using the following command:
Run this command to enable port-forwarding and access the Kubecost UI:
You can now start monitoring your Amazon EKS cluster cost with Kubecost by visiting http://localhost:9090.
Plural is a free, open-source tool that enables you to deploy Kubecost on Kubernetes with the cloud provider of your choice. Plural is an open-source DevOps platform for self-hosting applications on Kubernetes without the management overhead. With baked-in SSO, automated upgrades, and secret encryption, you get all the benefits of a managed service with none of the lock-in or cost.
Kubecost is available as a direct install with Plural, and it fits well into the ecosystem, providing cost monitoring out of the box to users that deploy their Kubernetes clusters with Plural.
First, create an account on . This is only to track your installations and allow for the delivery of automated upgrades. You will not be asked to provide any infrastructure credentials or sensitive information.
Next, install the Plural CLI by following steps 1-3 of .
You'll need a Git repository to store your Plural configuration. This will contain the Helm charts, Terraform config, and Kubernetes manifests that Plural will autogenerate for you.
You have two options:
Run plural init in any directory to let Plural initiate an OAuth workflow to create a Git repo for you.
Create a Git repo manually, clone it down, and run plural init inside it.
Running plural init will start a configuration wizard to configure your Git repo and cloud provider for use with Plural. You're now ready to install Kubecost on your Plural repo.
To find the console bundle name for your cloud provider, run:
Now, to add it to your workspace, run the install command. If you're on AWS, this is what the command would look like:
Plural's Kubecost distribution has support for AWS, GCP, and Azure, so feel free to pick whichever best fits your infrastructure.
To generate the configuration and deploy your infrastructure, run:
Note: Deploys will generally take 10-20 minutes, based on your cloud provider.
To make management of your installation as simple as possible, we recommend installing the Plural Console. The console provides tools for managing resource scaling, receiving automated upgrades, creating dashboards tailored to your Kubecost installation, and aggregating logs. This can be done using the exact same process as above, using AWS as an example:
Now, head over to kubecost.YOUR_SUBDOMAIN.onplural.sh to access the Kubecost UI. If you set up a different subdomain for Kubecost during installation, make sure to use that instead.
To monitor and manage your Kubecost installation, head over to the Plural Console at console.YOUR_SUBDOMAIN.onplural.sh.
To bring down your Plural installation of Kubecost at any time, run:
To bring your entire Plural deployment down, run:
Note: Only do this if you're absolutely sure you want to bring down all associated resources with this repository.
Rafay is a SaaS-first Kubernetes Operations Platform (KOP) with enterprise-class scalability, zero-trust security, and interoperability for managing applications across public clouds, data centers, and edge.
See to learn more about the platform and how to use it.
This document will walk you through installing Kubecost on a cluster that has been provisioned or imported using the Rafay controller. The steps below describe how to create and use a custom cluster blueprint via the . The entire workflow can also be fully automated and embedded into an automation pipeline using the or .
You have already provisioned or imported one or more Kubernetes clusters using the Rafay controller.
Under Integrations:
Select Repositories and create a new repository named kubecost of type Helm.
Select Create.
Enter the endpoint value of https://kubecost.github.io/cost-analyzer/.
Select Save.
You'll need to override the default values.yaml file. Create a new file called kubecost-custom-values.yaml with the following content:
Under Infrastructure, select Namespaces and create a new namespace called kubecost, and select type Wizard.
Select Save & Go to Placement.
Select the cluster(s) that the namespace will be added to. Select Save & Go To Publish.
Select Publish to publish the namespace to the selected cluster(s).
Once the namespace has been published, select Exit.
Under Infrastructure, select Clusters.
Select the kubectl button on the cluster to open a virtual terminal.
Verify that the kubecost namespace has been created by running the following command:
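A minimal sketch of that check using standard kubectl, wrapped in a hypothetical helper so the namespace name can be overridden:

```shell
# Hypothetical helper: exits non-zero if the namespace does not exist.
verify_kubecost_namespace() {
  kubectl get namespace "${1:-kubecost}"
}
```

Run `verify_kubecost_namespace` in the virtual terminal; a missing namespace makes kubectl report a NotFound error.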
Select Add-ons and Create a new add-on called kubecost.
Select Bring your own.
Select Helm 3 for type.
Select Pull files from repository.
Select Helm for the repository type.
Select kubecost for the namespace.
Select Select.
Create a new version of the add-on.
Select New Version.
Provide a version name such as v1.
Select kubecost for the repository.
Enter cost-analyzer for the chart name.
Upload the kubecost-custom-values.yaml file that was previously created.
Select Save Changes.
Once you've created the Kubecost add-on, use it in assembling a custom cluster blueprint. You can add other add-ons to the same custom blueprint.
Under Infrastructure, select Blueprints.
Create a new blueprint and give it a name such as kubecost.
Select Save.
Create a new version of the blueprint.
Select New Version.
Provide a version name such as v1.
Under Add-Ons, select the kubecost Add-on and the version that was previously created.
Select Save Changes.
You may now apply this custom blueprint to a cluster.
Select Options for the target cluster in the Web Console.
Select Update Blueprint and select the kubecost blueprint and version you created previously.
Select Save and Publish.
This will start the deployment of the add-ons configured in the kubecost blueprint to the targeted cluster. The blueprint sync process can take a few minutes. Once complete, the cluster will display the current cluster blueprint details and whether the sync was successful or not.
You can optionally verify whether the correct resources have been created on the cluster. Select the kubectl button on the cluster to open a virtual terminal.
Then, verify the pods in the kubecost namespace. Run kubectl get pod -n kubecost, and check that the output is similar to the example below.
You can now access the Kubecost UI by visiting http://localhost:9090 in your browser.
You have now successfully created a custom cluster blueprint with the kubecost add-on and applied it to a cluster. Use this blueprint on as many clusters as you require.
The following requirements are given:
Rancher with default monitoring
Use of an existing Prometheus and Grafana (Kubecost will be installed without Prometheus and Grafana)
Istio with gateway and sidecar for deployments
Kubecost v1.85.0+ includes changes to support cAdvisor metrics without the container_name rewrite rule.
Istio is activated by editing the namespace. To do this, execute the command kubectl edit namespace kubecost and insert the label istio-injection: enabled.
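As a non-interactive alternative to editing the namespace by hand, the same label can be applied with `kubectl label`. This is a sketch with a hypothetical helper name, assuming the namespace is `kubecost`:

```shell
# Hypothetical helper: turn on Istio sidecar injection for a namespace.
# --overwrite makes the command safe to re-run if the label already exists.
enable_istio_injection() {
  kubectl label namespace "${1:-kubecost}" istio-injection=enabled --overwrite
}
```

Pods that were already running only receive the sidecar after they are restarted.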
After Istio has been activated, some adjustments must be made to the deployment with kubectl -n kubecost edit deployment kubecost-cost-analyzer to allow communication within the namespace, so that, for example, the health check completes successfully. When editing the deployment, add the following two annotations:
An authorization policy governs access restrictions in namespaces and specifies how resources within a namespace are allowed to access it.
Peer authentication is used to set how traffic is tunneled to the Istio sidecar. In the example, enforcing TLS is disabled so that Prometheus can grab the metrics from Kubecost (if this action is not performed, it returns an HTTP 503 error).
A destination rule is used to specify how traffic should be handled after routing to a service. In this example, TLS is disabled for connections from Kubecost to Prometheus and Grafana (namespace "cattle-monitoring-system").
A virtual service is used to direct data traffic specifically to individual services within the service mesh. The virtual service defines how the routing should run. A gateway is required for a virtual service.
After creating the virtual service, Kubecost should be accessible at the URL http(s)://${gateway}/kubecost/.
Installing Kubecost on an Alibaba cluster is the same as other cloud providers with Helm v3.1+:
helm install kubecost kubecost/cost-analyzer -n kubecost -f values.yaml
Your values.yaml file must contain the parameters below:
The alibaba-service-key can be created using the following command:
The path you provide must point to a file containing your Alibaba Cloud secrets. The secrets are passed as a JSON file in the following format:
These two can be generated in the Alibaba Cloud portal. Hover over your user account icon, then select AccessKey Management. A new window opens. Select Create AccessKey to generate a unique access token that will be used for all activities related to Kubecost.
When you list the Storage Classes that the Alibaba K8s cluster comes with, there may not be a default storage class. In that case, the Kubecost installation may fail because the cost-model pod and Prometheus server pod remain in a Pending state.
To fix this issue, make one of the Storage Classes in the Alibaba K8s cluster the default using the command below:
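A sketch of that step, using the standard Kubernetes `storageclass.kubernetes.io/is-default-class` annotation. The helper name is hypothetical, and the storage class name is whichever one you pick from `kubectl get storageclass`:

```shell
# Hypothetical helper: mark the named StorageClass as the cluster default.
set_default_storage_class() {
  kubectl patch storageclass "$1" \
    -p '{"metadata":{"annotations":{"storageclass.kubernetes.io/is-default-class":"true"}}}'
}
```

Usage: `set_default_storage_class <storage-class-name>`.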
Following this, installation should proceed as normal.
Installing Kubecost on a GKE Autopilot cluster is similar to other cloud providers with Helm v3.1+, with a few changes. Autopilot requires the use of the Managed Service for Prometheus, which generates additional costs within your Google Cloud account.
helm install kubecost kubecost/cost-analyzer -n kubecost -f values.yaml
Your values.yaml file must contain the parameters below. Resources are specified for each section of the Kubecost deployment, and Pod Security Policies are disabled.
Open the OperatorConfig on your Autopilot Cluster resource for editing:
Add the following collection section to the resource:
Save the file and close the editor. After a short time, the Kubelet metric endpoints will be scraped and the metrics become available for querying in Managed Service for Prometheus.
We recommend doing this via . The command below helps to automate these manual steps:
More details and how to set up the appropriate trust relationships is available .
Your Amazon EKS cluster needs to have IAM OIDC provider enabled to set up IRSA. Learn more with AWS' doc.
The CLI will prompt you to choose whether you want to use Plural OIDC. allows you to log in to the applications you host on Plural with your acting as an SSO provider.
If you have any issues with installing Kubecost on Plural, feel free to join the Plural and we can help you out.
If you'd like to request any new features for our Kubecost installation, feel free to open an issue or PR .
To learn more about what you can do with Plural and more advanced uses of the platform, feel free to dive deeper into .
Login to the and navigate to your Project as an Org Admin or Infrastructure Admin.
From the :
In order to access the Kubecost UI, you'll need to enable access to the frontend application using . To do this, download and use the Kubeconfig with the kubectl CLI.
You can find on Kubecost as well as guides for how to create or import a cluster using the Rafay controller on the site.
Currently, Kubecost does not support complete integration of your Alibaba billing data like for other major cloud providers. Instead, Kubecost will only support public pricing integration, which will provide proper list prices for all cloud-based resources. Features like reconciliation and savings insights are not available for Alibaba. For more information on setting up a public pricing integration, see our doc.
Grafana Cloud is a composable observability platform, integrating metrics, traces and logs with Grafana. Customers can leverage the best open source observability software without the overhead of installing, maintaining, and scaling your observability stack.
This document will show you how to integrate the Grafana Cloud Prometheus metrics service with Kubecost.
You have access to a running Kubernetes cluster
You have created a Grafana Cloud account
You have permissions to create Grafana Cloud API keys
Install the Grafana Agent for Kubernetes on your cluster. On the existing K8s cluster where you intend to install Kubecost, run the following commands to install the Grafana Agent to scrape the metrics from the Kubecost /metrics endpoint. The script below installs the Grafana Agent with the necessary scraping configuration for Kubecost; you may want to add additional scrape configuration for your setup. Remember to replace the following values with your actual Grafana Cloud values:
REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-ENDPOINT
REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-USERNAME
REPLACE-WITH-GRAFANA-PROM-REMOTE-WRITE-API-KEY
REPLACE-WITH-YOUR-CLUSTER-NAME
You can also verify if grafana-agent is scraping data with the following command (optional):
To learn more about how to install and configure the Grafana Agent, as well as additional scrape configuration, refer to the Grafana Agent documentation, or check the Kubecost Prometheus scrape config at this GitHub repository.
Set up dbsecret to allow Kubecost to query the metrics from Grafana Cloud Prometheus:
Create two files in your working directory, called USERNAME and PASSWORD respectively.
Verify that you can run queries against your Grafana Cloud Prometheus query endpoint (optional):
Create a K8s secret named dbsecret:
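A sketch of the secret creation, assuming the USERNAME and PASSWORD files from the earlier step are in the current directory and that Kubecost will run in the `kubecost` namespace (an assumption; pass another namespace as the first argument). The helper name is hypothetical.

```shell
# Hypothetical helper: create the dbsecret from the two credential files.
# Each --from-file flag stores the file's content under a key named after the file.
create_dbsecret() {
  kubectl create secret generic dbsecret \
    --namespace "${1:-kubecost}" \
    --from-file=USERNAME \
    --from-file=PASSWORD
}
```

The namespace must already exist before the secret is created.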
Verify if the credentials appear correctly (optional):
To set up recording rules in Grafana Cloud, download the Cortextool CLI utility. While they are optional, they offer improved performance.
After installing the tool, create a file called kubecost_rules.yaml with the following command:
Then, make sure you are in the same directory as your kubecost_rules.yaml, and load the rules using Cortextool. Replace the address with your Grafana Cloud’s Prometheus endpoint (remember to omit the /api/prom path from the endpoint URL).
Print out the rules to verify that they’ve been loaded correctly:
Install Kubecost on your K8s cluster with the Grafana Cloud Prometheus query endpoint and the dbsecret you created in Step 2.
The process is complete: Kubecost is now integrated with Grafana Cloud.
Optionally, you can also add our Kubecost Dashboard for Grafana Cloud to your organization to visualize your cloud costs in Grafana.
There are several considerations when disabling the Kubecost included Prometheus deployment. Kubecost strongly recommends installing Kubecost with the bundled Prometheus in most environments.
The Kubecost Prometheus deployment is optimized to not interfere with other observability instrumentation and by default only contains metrics that are useful to the Kubecost product. This results in 70-90% fewer metrics than a Prometheus deployment using default settings.
Additionally, if multi-cluster metric aggregation is required, Kubecost provides a turnkey solution that is highly tuned and simple to support using the included Prometheus deployment.
This feature is accessible to all users. However, please note that comprehensive support is provided with a paid support plan.
Kubecost requires the following minimum versions:
Prometheus: v2.18 (v2.13-2.17 supported with limited functionality)
kube-state-metrics: v1.6.0+
cAdvisor: kubelet v1.11.0+
node-exporter: v0.16+ (Optional)
If you have node-exporter and/or KSM running on your cluster, follow this step to disable the Kubecost included versions. Additional detail on KSM requirements.
In contrast to our recommendation above, we do recommend disabling Kubecost's node-exporter and kube-state-metrics if you already have them running in your cluster.
This process is not recommended. Before continuing, review the Bring your own Prometheus section if you haven't already.
Pass the following parameters in your Helm install:
The FQDN can be a full path via https://prometheus-prod-us-central-x.grafana.net/api/prom/ if you use Grafana Cloud-managed Prometheus. Learn more in the Grafana Cloud Integration for Kubecost doc.
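A sketch of what passing these parameters could look like. The value keys `global.prometheus.enabled` and `global.prometheus.fqdn` are assumptions about the chart and should be checked against the values.yaml of the chart version you install. The release name, namespace, and helper name are also illustrative.

```shell
# Hypothetical helper: install Kubecost against an existing Prometheus.
# $1 is the Prometheus address, e.g. http://<prometheus-service>.<namespace>.svc
install_kubecost_with_existing_prometheus() {
  helm upgrade --install kubecost kubecost/cost-analyzer \
    --namespace kubecost --create-namespace \
    --set global.prometheus.enabled=false \
    --set global.prometheus.fqdn="$1"
}
```

`helm upgrade --install` is used so that the same call works for a first install and for later changes.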
Have your Prometheus scrape the cost-model /metrics endpoint. These metrics are needed for reporting accurate pricing data. Here is an example scrape config:
This config needs to be added to extraScrapeConfigs in the Prometheus configuration. See the example extraScrapeConfigs.yaml.
By default, the Prometheus chart included with Kubecost (bundled-Prometheus) contains scrape configs optimized for Kubecost-required metrics. You need to add those scrape config jobs to your existing Prometheus setup to allow Kubecost to provide more accurate cost data and optimize the required resources for your existing Prometheus.
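As an illustration, the following writes a scrape job to a local file that can then be merged into extraScrapeConfigs. It assumes Kubecost runs in the `kubecost` namespace with a service named `kubecost-cost-analyzer`, and that the cost-model serves metrics on port 9003. The port, intervals, and output file name are assumptions to verify against your install.

```shell
# Write an example scrape job for the cost-model /metrics endpoint.
# Assumptions: service kubecost-cost-analyzer in namespace kubecost, port 9003.
cat > kubecost-scrape-job.yaml <<'EOF'
- job_name: kubecost
  honor_labels: true
  scrape_interval: 1m
  scrape_timeout: 10s
  metrics_path: /metrics
  scheme: http
  dns_sd_configs:
    - names:
        - kubecost-cost-analyzer.kubecost
      type: A
      port: 9003
EOF
```

The job is named `kubecost` and sets `honor_labels: true` so that the labels Kubecost emits are kept as-is.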
You can find the full scrape configs of our bundled-Prometheus here. You can check Prometheus documentation for more information about the scrape config, or read this documentation if you are using Prometheus Operator.
This step is optional. If you do not set up Kubecost's CPU usage recording rule, Kubecost will fall back to a PromQL subquery which may put unnecessary load on your Prometheus.
Kubecost-bundled Prometheus includes a recording rule used to calculate CPU usage max, a critical component of the request right-sizing recommendation functionality. Add the recording rules to reduce query load here.
Alternatively, if your environment supports serviceMonitors and prometheusRules, pass these values to your Helm install:
To confirm this job is successfully scraped by Prometheus, you can view the Targets page in Prometheus and look for a job named kubecost.
This step is optional, and only impacts certain efficiency metrics. View issue/556 for a description of what will be missing if this step is skipped.
You'll need to add the following relabel config to the job that scrapes the node exporter DaemonSet.
This does not override the source label. It creates a new label called kubernetes_node and copies the value of pod into it.
In order to distinguish between multiple clusters, Kubecost needs to know the label used in Prometheus to identify the cluster name. Set it with .Values.kubecostModel.promClusterIDLabel. The default cluster label is cluster_id, though many environments use the key of cluster.
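A sketch of setting this value on an existing install, assuming the release name `kubecost`, chart reference `kubecost/cost-analyzer`, and namespace `kubecost` (all assumptions about your environment; the helper name is hypothetical):

```shell
# Hypothetical helper: tell Kubecost which Prometheus label carries the cluster name.
# $1 is the label key used in your Prometheus, for example "cluster".
set_prom_cluster_id_label() {
  helm upgrade kubecost kubecost/cost-analyzer \
    --namespace kubecost --reuse-values \
    --set kubecostModel.promClusterIDLabel="$1"
}
```

Usage: `set_prom_cluster_id_label cluster`.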
By default, metric retention is 91 days. Retention can be increased further through the configurable property etlDailyStoreDurationDays. You can find this value here.
Increasing the default etlDailyStoreDurationDays value will naturally result in greater memory usage. At higher values, this can cause errors when trying to display this information in the Kubecost UI. You can remedy this by increasing the Step size when using the Allocations dashboard.
The Diagnostics page (Settings > View Full Diagnostics) provides diagnostic info on your integration. Scroll down to Prometheus Status to verify that your configuration is successful.
Below you can find solutions to common Prometheus configuration problems. View the Kubecost Diagnostics doc for more information.
This issue is evidenced by the pod error message No valid prometheus config file at ... and by the init pods hanging. We recommend running curl <your_prometheus_url>/api/v1/status/config from a pod in the cluster to confirm that your Prometheus config is returned. Here is an example, but this needs to be updated based on your pod name and Prometheus address:
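A sketch of such a check, wrapped in a hypothetical helper. The pod name, container name, and Prometheus URL are parameters you supply; the `kubecost` namespace and the availability of `curl` inside the chosen container are assumptions.

```shell
# Hypothetical helper: fetch the Prometheus config from inside a pod.
# $1 = pod name, $2 = container name, $3 = Prometheus base URL
check_prometheus_config() {
  kubectl exec "$1" --namespace kubecost -c "$2" -- \
    curl -s "$3/api/v1/status/config"
}
```

Usage: `check_prometheus_config <kubecost-pod-name> <container-name> <your_prometheus_url>`.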
In the above example, <your_prometheus_url> may include a port number and/or namespace, for example: http://prometheus-operator-kube-p-prometheus.monitoring:9090/api/v1/status/config
If the config file is not returned, this is an indication that an incorrect Prometheus address has been provided. If a config file is returned from one pod in the cluster but not the Kubecost pod, then the Kubecost pod likely has its access restricted by a network policy, service mesh, etc.
Network policies, mesh networks, or other security-related tooling can block network traffic between Prometheus and Kubecost, which will result in the Kubecost scrape target being shown as down in the Prometheus targets UI. To assist in troubleshooting this type of error, you can use the curl command from within the cost-analyzer container to try to reach the Prometheus target. Note that the "namespace" and "deployment" names in this command may need to be updated to match your environment; this example uses the default Kubecost Prometheus deployment.
When successful, this command should return all of the metrics that Kubecost uses. Failures may be indicative of the network traffic being blocked.
Ensure Prometheus isn't being CPU throttled due to a low resource request.
Review the Dependency Requirements section above
Visit Prometheus Targets page (screenshot above)
Make sure that honor_labels is enabled
Ensure results are not null for both queries below.
Make sure Prometheus is scraping Kubecost search metrics for: node_total_hourly_cost
Ensure kube-state-metrics are available: kube_node_status_capacity
For both queries, verify nodes are returned. A successful response should look like:
An error will look like:
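Both checks are instant queries against the standard Prometheus HTTP API endpoint `/api/v1/query`. The sketch below uses hypothetical helper names and assumes the compact JSON that Prometheus emits; for anything beyond a quick check, a JSON parser such as jq is more robust than string matching.

```shell
# Hypothetical helper: run an instant query. $1 = Prometheus base URL, $2 = PromQL.
prom_instant_query() {
  curl -s -G "$1/api/v1/query" --data-urlencode "query=$2"
}

# Hypothetical helper: succeed only if the response body reports success
# and its "result" array is not empty.
has_results() {
  case "$1" in
    *'"status":"success"'*) ;;
    *) return 1 ;;
  esac
  case "$1" in
    *'"result":[]'*) return 1 ;;
    *) return 0 ;;
  esac
}
```

For example: `has_results "$(prom_instant_query <your_prometheus_url> node_total_hourly_cost)" || echo "no data returned"`.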
Ensure that all clusters and nodes have values. The output should be similar to the Single Cluster Tests above.
Make sure Prometheus is scraping Kubecost search metrics for: node_total_hourly_cost
On macOS, change date -d '1 day ago' to date -v '-1d'
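A portable way to compute that timestamp is to try the GNU form first and fall back to the BSD form. This is a sketch with a hypothetical helper name; the RFC 3339 output format is an assumption, chosen because the Prometheus HTTP API accepts RFC 3339 or Unix timestamps.

```shell
# Hypothetical helper: print the UTC time 24 hours ago in RFC 3339 format.
# GNU date (Linux) understands -d; BSD date (macOS) understands -v.
one_day_ago() {
  date -u -d '1 day ago' '+%Y-%m-%dT%H:%M:%SZ' 2>/dev/null ||
    date -u -v -1d '+%Y-%m-%dT%H:%M:%SZ'
}
```

The result can be captured with `$(one_day_ago)` and passed as the time parameter of a query.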
Ensure kube-state-metrics are available: kube_node_status_capacity
For both queries, verify nodes are returned. A successful response should look like:
An error will look like:
Kubecost leverages the open-source Prometheus project as a time series database and post-processes the data in Prometheus to perform cost allocation calculations and provide optimization insights for your Kubernetes clusters such as Amazon Elastic Kubernetes Service (Amazon EKS). Prometheus is a single machine statically-resourced container, so depending on your cluster size or when your cluster scales out, it could exceed the scraping capabilities of a single Prometheus server. In collaboration with Amazon Web Services (AWS), Kubecost integrates with Amazon Managed Service for Prometheus (AMP), a managed Prometheus-compatible monitoring service, to enable the customer to easily monitor Kubernetes cost at scale.
The architecture of this integration is similar to Amazon EKS cost monitoring with Kubecost, which is described in the previous blog post, with some enhancements as follows:
In this integration, an additional AWS SigV4 container is added to the cost-analyzer pod, acting as a proxy to help query metrics from Amazon Managed Service for Prometheus using the AWS SigV4 signing process. It enables passwordless authentication to reduce the risk of exposing your AWS credentials.
When the Amazon Managed Service for Prometheus integration is enabled, the bundled Prometheus server in the Kubecost Helm Chart is configured in the remote_write mode. The bundled Prometheus server sends the collected metrics to Amazon Managed Service for Prometheus using the AWS SigV4 signing process. All metrics and data are stored in Amazon Managed Service for Prometheus, and Kubecost queries the metrics directly from Amazon Managed Service for Prometheus instead of the bundled Prometheus. It helps customers not worry about maintaining and scaling the local Prometheus instance.
There are two architectures you can deploy:
The Quick-Start architecture supports a small multi-cluster setup of up to 100 clusters.
The Federated architecture supports a large multi-cluster setup for over 100 clusters.
The infrastructure can manage up to 100 clusters. The following architecture diagram illustrates the small-scale infrastructure setup:
To support the large-scale infrastructure of over 100 clusters, Kubecost leverages a Federated ETL architecture. In addition to Amazon Prometheus Workspace, Kubecost stores its extract, transform, and load (ETL) data in a central S3 bucket. Kubecost's ETL data is a computed cache based on Prometheus's metrics, from which users can perform all possible Kubecost queries. By storing the ETL data on an S3 bucket, this integration offers resiliency to your cost allocation data, improves the performance and enables high availability architecture for your Kubecost setup.
The following architecture diagram illustrates the large-scale infrastructure setup:
You have an existing AWS account.
You have IAM credentials to create Amazon Managed Service for Prometheus and IAM roles programmatically.
You have an existing Amazon EKS cluster with OIDC enabled.
Your Amazon EKS clusters have the Amazon EBS CSI driver installed.
The example output should be in this format:
The Amazon Managed Service for Prometheus workspace should be created in a few seconds. Run the following command to get the workspace ID:
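A sketch of the lookup using the AWS CLI, assuming the workspace was created with an alias that you pass as the first argument (the helper name is hypothetical):

```shell
# Hypothetical helper: print the ID of the first workspace matching an alias.
# --query applies a JMESPath expression to the CLI's JSON response.
get_amp_workspace_id() {
  aws amp list-workspaces --alias "$1" \
    --query 'workspaces[0].workspaceId' --output text
}
```

Usage: `export AMP_WORKSPACE_ID=$(get_amp_workspace_id <your-workspace-alias>)`.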
Run the following command to set environment variables for integrating Kubecost with Amazon Managed Service for Prometheus:
Note: You can ignore Step 2 for the small-scale infrastructure setup.
a. Create Object store S3 bucket to store Kubecost ETL metrics. Run the following command in your workspace:
b. Create IAM Policy to grant access to the S3 bucket. The following policy is for demo purposes only. You may need to consult your security team and make appropriate changes depending on your organization's requirements.
Run the following command in your workspace:
c. Create Kubernetes secret to allow Kubecost to write ETL files to the S3 bucket. Run the following command in your workspace:
The following commands automate these tasks:
Create an IAM role with the AWS-managed IAM policy and trusted policy for the following service accounts: kubecost-cost-analyzer-amp, kubecost-prometheus-server-amp.
Modify current K8s service accounts with annotation to attach a new IAM role.
Run the following command in your workspace:
For more information, see the AWS documentation on IAM roles for service accounts, and learn more about the Amazon Managed Service for Prometheus managed policy at Identity-based policy examples for Amazon Managed Service for Prometheus.
Run the following command to create a file called config-values.yaml, which contains the defaults that Kubecost will use for connecting to your Amazon Managed Service for Prometheus workspace.
Run this command to install Kubecost and integrate it with the Amazon Managed Service for Prometheus workspace as the primary:
These installation steps are similar to those for a primary cluster setup, except you do not need to follow the steps in the section "Create Amazon Managed Service for Prometheus workspace", and you need to update the environment variables below to match your additional clusters. Please note that the AMP_WORKSPACE_ID and KC_BUCKET are the same as the primary cluster.
Run this command to install Kubecost and integrate it with the Amazon Managed Service for Prometheus workspace as the additional cluster:
Your Kubecost setup is now writing and collecting data from AMP. Data should be ready for viewing within 15 minutes.
To verify that the integration is set up, go to Settings in the Kubecost UI, and check the Prometheus Status section.
Read our Custom Prometheus integration troubleshooting guide if you run into any errors while setting up the integration. For support from AWS, you can submit a support request through your existing AWS support contract.
You can add these recording rules to improve the performance. Recording rules allow you to precompute frequently needed or computationally expensive expressions and save their results as a new set of time series. Querying the precomputed result is often much faster than running the original expression every time it is needed. Follow these instructions to add the following rules:
The below queries must return data for Kubecost to calculate costs correctly.
For the queries below to work, set the environment variables:
Verify connection to AMP and that the metric for container_memory_working_set_bytes is available:
If you have set kubecostModel.promClusterIDLabel, you will need to change the query (CLUSTER_ID) to match the label (typically cluster or alpha_eksctl_io_cluster_name).
The output should contain a JSON entry similar to the following.
The value of cluster_id should match the value of kubecostProductConfigs.clusterName.
Verify Kubecost metrics are available in AMP:
The output should contain a JSON entry similar to:
If the above queries fail, check the following:
Check logs of the sigv4proxy container (may be in the Kubecost deployment or the Prometheus Server deployment depending on your setup):
In a working sigv4proxy, there will be very few logs.
Correctly working log output:
Check logs in the cost-model container for Prometheus connection issues:
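A sketch of pulling logs for either container with standard kubectl, assuming the release name `kubecost` (Deployment `kubecost-cost-analyzer`) in the `kubecost` namespace. The helper name and the 100-line tail are illustrative.

```shell
# Hypothetical helper: show recent logs for one container of the Kubecost deployment.
# $1 = container name, e.g. sigv4proxy or cost-model
kubecost_container_logs() {
  kubectl logs deployment/kubecost-cost-analyzer \
    --namespace kubecost -c "$1" --tail=100
}
```

Usage: `kubecost_container_logs sigv4proxy` or `kubecost_container_logs cost-model`. If the proxy runs in the Prometheus Server deployment instead, the deployment name needs to change accordingly.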
Example errors:
Using an existing Grafana deployment can be accomplished through one of two options:
Linking to an external Grafana directly
Deploying with Grafana sidecar enabled
After installing Kubecost, select Settings from the left navigation and update Grafana Address to a URL that is visible to users accessing Grafana dashboards. This variable can alternatively be passed at the time you deploy Kubecost via the kubecostProductConfigs.grafanaURL parameter in values.yaml. Next, import Kubecost Grafana dashboards as JSON from this folder.
Passing the Grafana parameters below in your values.yaml will install ConfigMaps for Grafana dashboards that will be picked up by the Grafana sidecar if you have Grafana with the dashboard sidecar already installed.
Ensure that the following flags are set in your Operator deployment:
sidecar.dashboards.enabled=true
sidecar.dashboards.searchNamespace isn't restrictive. Use ALL if Kubecost runs in another namespace.
The Kubecost UI cannot link to the Grafana dashboards unless kubecostProductConfigs.grafanaURL is set, either via the Helm chart or via the Settings page, as described in Option 1.
When using Kubecost on a custom ingress path, you must add this path to the Grafana root_url:
If you choose to disable Grafana, set the following Helm values to ensure successful pod startup:
Kubecost supports deploying to Red Hat OpenShift (OCP) and includes options and features which assist in getting Kubecost running quickly and easily with OpenShift-specific resources.
There are two main options to deploy Kubecost on OpenShift.
More details and instructions on both deployment options are covered in the sections below.
A standard deployment of Kubecost to OpenShift is no different from deployments to other platforms with the exception of additional settings which may be required to successfully deploy to OpenShift.
Kubecost is installed with Cost Analyzer and Prometheus as a time-series database. Data is gathered by the Prometheus instance bundled with Kubecost. Kubecost then pushes and queries metrics to and from Prometheus.
The standard deployment is illustrated in the following diagram.
An existing OpenShift or OpenShift-compatible cluster (e.g., OKD).
Access to the cluster to create a new project and deploy new workloads.
helm CLI installed locally.
Add the Kubecost Helm chart repository and scan for new charts.
Install Kubecost using OpenShift specific values. Note that the below command fetches the OpenShift values from the development branch which may not reflect the state of the release which was just installed. We recommend using the corresponding values file from the chart release.
Because OpenShift disallows defining certain fields in a pod's securityContext configuration, values specific to OpenShift must be used. The necessary values have already been defined in the OpenShift values file but may be customized to your specific needs.
If you want to install Kubecost with your desired cluster name, provide the following values either in your values override file or via the --set flag. Remember to replace the cluster name/id with the value you wish to use for this installation.
Other OpenShift-specific values include the ability to deploy a Route and SecurityContextConstraints for optional components requiring more privileges such as Kubecost network costs and Prometheus node exporter. To view all the latest OpenShift-specific values and their use, please see the OpenShift values file.
If you have not opted to do so during installation, it may be necessary to create a Route to the service kubecost-cost-analyzer on port 9090 of the kubecost project (if using default values). For more information on Routes, see the OpenShift documentation here.
After installation, wait for all pods to be ready. Kubecost will begin collecting data, and it may take up to 15 minutes for the UI to reflect the resources in the local cluster.
Kubecost offers a Red Hat community operator which can be found in the Operator Hub catalog of the OpenShift web console. When using this deployment method, the operator is installed and a Kubernetes Custom Resource is created which then triggers the operator to deploy the Helm chart. The chart deployed by the community operator is the same chart which is referenced in the standard deployment.
An existing OpenShift cluster.
Access to the cluster to create a new project and deploy new workloads.
Log in to your OCP cluster web console and select Operators > OperatorHub, then enter "Kubecost" in the search box.
Click the Install button to be taken to the operator installation page.
On the installation page, select the update approval method and then click Install.
Once the operator has been installed, create a namespace in which to deploy a Kubecost installation.
You can also select Operators > Installed Operators to review the details as shown below.
Once the namespace has been created, create the CostAnalyzer Custom Resource (CR) with the desired values for your installation. The CostAnalyzer CR represents the total Helm values used to deploy Kubecost and any of its components. This may either be created in the OperatorHub portal or via the oc CLI. The default CostAnalyzer sample provided is pre-configured for a basic installation of Kubecost.
To create the CostAnalyzer resource from OperatorHub, from the installed Kubecost operator page, click on the CostAnalyzer tab and click the Create CostAnalyzer button.
Click on the radio button YAML view to see a full example of a CostAnalyzer CR. Here, you can either create a CostAnalyzer directly or download the Custom Resource for later use.
Change the namespace field to kubecost if this was the name of your namespace created previously.
Click the Create button to create the CostAnalyzer based on the current YAML.
After about a minute, Kubecost should be up and running based upon the configuration defined in the CostAnalyzer CR. You can get details on this installation by clicking on the instance which was just deployed.
If you have not opted to do so during installation, it may be necessary to create a Route to the service kubecost-cost-analyzer on port 9090 of the kubecost project (if using default values). For more information on Routes, see the OpenShift documentation here.
You can apply your product key at any time within the product UI or during an install or upgrade process. More details on both options are provided below.
If you have a multi-cluster setup, you only need to apply your product key on the Kubecost primary cluster, and not on any of the Kubecost secondary clusters.
kubecostToken is a different concept from your product key and is used for managing trial access.
Many Kubecost product configuration options can be specified at install-time, including your product key.
To create a secret you will need to create a JSON file called productkey.json with the following format. Be sure to replace <YOUR_PRODUCT_KEY> with your Kubecost product key.
Run the following command to create the secret. Replace <SECRET_NAME> with a name for the secret (example: productkeysecret):
Update your values.yaml to enable the product key and specify the secret name:
kubecostProductConfigs.productKey.enabled=true
kubecostProductConfigs.productKey.secretname=<SECRET_NAME>
Run a helm upgrade command to start using your product key.
You must also set kubecostProductConfigs.productKey.enabled=true when using this option. Note that this will leave your secrets unencrypted in values.yaml. Use a Kubernetes secret as in the previous method to avoid this.
To apply your license key within the Kubecost UI, visit the Overview page, then select Upgrade in the page header.
Next, select Add Key in the dialog menu shown below.
You can then supply your Kubecost provided license key in the input box that is now visible.
To verify that your key has been applied successfully, visit Settings to confirm the final digits are as expected:
If you are interested in filtering or aggregating by annotations, you will need to enable annotation emission. This will configure your Kubecost installation to generate the kube_pod_annotations and kube_namespace_annotations metrics as listed in our doc.
You can enable it in your values.yaml:
You can also enable it via your helm install or helm upgrade command:
These flags can be set independently. Setting one of them to true and the other to false will emit one metric and omit the other.
Kubecost leverages the open-source Prometheus project as a time series database and post-processes the data in Prometheus to perform cost allocation calculations and provide optimization insights for your Kubernetes clusters. Prometheus runs as a single, statically resourced container, so as your cluster grows or scales out, it could exceed the scraping capabilities of a single Prometheus server. In this doc, you will learn how Kubecost integrates with Google Cloud Managed Service for Prometheus (GMP), a managed Prometheus-compatible monitoring service, so that you can monitor Kubernetes costs at scale.
For this integration, GMP must be enabled for your GKE cluster with managed collection. Kubecost is then installed in your GKE cluster and uses the GMP Prometheus binary to ingest metrics into the GMP database. In this setup, the Kubecost deployment also automatically creates a Prometheus proxy that allows Kubecost to query metrics from the GMP database for cost allocation calculations.
This integration is currently in beta.
You have a GCP account/subscription.
You have permission to manage GKE clusters and GCP monitoring services.
You have an existing GKE cluster with GMP enabled.
You can use the following command to install Kubecost on your GKE cluster and integrate with GMP:
In this installation command, these additional flags are added to have Kubecost work with GMP:
prometheus.server.image.repository and prometheus.server.image.tag replace the standard Prometheus image with the GMP-specific image.
global.gmp.enabled and global.gmp.gmpProxy.projectId enable the GMP integration.
prometheus.server.global.external_labels.cluster_id and kubecostProductConfigs.clusterName set the name for your Kubecost setup.
Your Kubecost setup now writes data to and reads data from GMP. Data should be ready for viewing within 15 minutes.
Run the following command to enable port-forwarding to expose the Kubecost dashboard:
To verify that the integration is set up, go to Settings in the Kubecost UI, and check the Prometheus Status section.
The below queries must return data for Kubecost to calculate costs correctly. For the queries to work, set the environment variables:
Verify connection to GMP and that the metric for container_memory_working_set_bytes is available:
If you have set kubecostModel.promClusterIDLabel in the Helm chart, you will need to change the query (CLUSTER_ID) to match the label.
Verify Kubecost metrics are available in GMP:
You should receive an output similar to:
If id returns as a blank value, you can set the following Helm value to force set cluster as the Prometheus cluster ID label:
If the above queries fail, check the following:
Check logs of the sigv4proxy container (may be the Kubecost deployment or Prometheus Server deployment depending on your setup):
In a working sigv4proxy, there will be very few logs.
Correctly working log output:
Check logs in the cost-model container for Prometheus connection issues:
Example errors:
The network costs DaemonSet is an optional utility that gives Kubecost more detail to attribute costs to the correct pods.
When networkCost is enabled, Kubecost gathers pod-level network traffic metrics to allocate network transfer costs to the pod responsible for the traffic.
See this doc for more detail on .
The network costs metrics are collected using a DaemonSet (one pod per node) that uses source and destination detail to determine egress and ingress data transfers by pod. Transfers are classified as internet, cross-region, or cross-zone.
With the network costs DaemonSet enabled, the Network column on the Allocations page will reflect the portion of network transfer costs based on the chart-level aggregation.
When using Kubecost version 1.99 and above: Greater detail can be accessed through Allocations UI only when aggregating by namespace and selecting the link on that namespace. This opens the namespace detail page where there is a card at the bottom.
If you are using the Kubecost-bundled Prometheus instance, the scrape is automatically configured.
If you are integrating with an existing Prometheus, you can set networkCosts.prometheusScrape=true and the network costs service should be auto-discovered.
You can adjust the log level using the extraArgs config:
The levels range from 0 to 5, with 0 being the least verbose (only showing panics) and 5 being the most verbose (showing trace-level information).
Service tagging allows Kubecost to identify network activity between the pods and various cloud services (e.g. AWS S3, EC2, RDS, Azure Storage, Google Cloud Storage).
To enable this, set the following Helm values:
In order to reduce resource usage, Kubecost recommends setting a CPU limit on the network costs DaemonSet. This will cause a few seconds of delay during peak usage and does not affect overall accuracy. This is done by default in Kubecost 1.99+.
For existing deployments, these are the recommended values:
The network-simulator was used to simulate ConnTrack entry updates in real time while running a network costs instance against a simulated cluster. To profile the heap, after a warmup of roughly five minutes, a heap profile of 1,000,000 ConnTrack entries was gathered and examined.
Each ConnTrack entry is equivalent to two transport directions, so every ConnTrack entry is two map entries (connections).
After modifications were made to the network costs to parallelize the delta and dispatch, large map comparisons were significantly lighter in memory. The same tests were performed against simulated data with the following footprint results.
The primary source of network metrics is a DaemonSet Pod hosted on each of the nodes in a cluster. Each DaemonSet pod uses hostNetwork: true such that it can leverage an underlying kernel module to capture network data. Network traffic data is gathered and the destination of any outbound networking is labeled as:
Internet Egress: Network target destination was not identified within the cluster.
Cross Region Egress: Network target destination was identified, but not in the same provider region.
Cross Zone Egress: Network target destination was identified, and was part of the same region but not the same zone.
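The three labels above amount to a small decision procedure. The following Python sketch restates that logic for illustration only. The Location type and classify_egress function are hypothetical names, and the "in-zone" fallback for same-zone traffic is an assumption that the list above does not spell out. This is not the DaemonSet's actual implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Location:
    """Region and zone of an endpoint (hypothetical helper type)."""
    region: str
    zone: str


def classify_egress(source: Location, destination: Optional[Location]) -> str:
    """Label outbound traffic following the three categories described above."""
    if destination is None:
        # Destination was not identified within the cluster.
        return "internet"
    if destination.region != source.region:
        return "cross-region"
    if destination.zone != source.zone:
        return "cross-zone"
    # Same region and zone: assumed here to be non-billable in-zone traffic.
    return "in-zone"
```

For example, traffic from us-east-1a to an identified endpoint in us-east-1b is labeled cross-zone, while traffic to an unidentified address is labeled internet.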
These classifications are important because they correlate with network costing models for most cloud providers. To see more detail on these metric classifications, you can view pod logs with the following command:
This will show you the top source and destination IP addresses and bytes transferred on the node where this Pod is running. To disable logs, you can set the helm value networkCosts.trafficLogging to false.
As of Kubecost 1.101, LoadBalancers that proxy traffic to the Internet (ingresses and gateways) can be specifically classified.
In-zone: A list of destination addresses/ranges that will be classified as in-zone traffic, which is free for most providers.
In-region: A list of addresses/ranges that will be classified as the same region between source and destinations but different zones.
Cross-region: A list of addresses/ranges that will be classified as different regions from the source regions.
Internet: By design, all IP addresses not in a specific list are considered internet. This list can include IPs that would otherwise be "in-zone" or local to be classified as Internet traffic.
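To make the override behavior concrete, here is a minimal Python sketch that matches a destination IP against per-class CIDR lists using the standard ipaddress module. The function name, the dictionary layout, and the lookup order (explicit internet entries first, then in-zone, in-region, cross-region) are assumptions for illustration. The real networkCosts.config schema and precedence may differ.

```python
import ipaddress
from typing import Dict, List

# Assumed precedence: explicit "internet" entries win, since the text above says
# that list can reclassify addresses that would otherwise be in-zone or local.
LOOKUP_ORDER = ("internet", "in-zone", "in-region", "cross-region")


def classify_destination(ip: str, overrides: Dict[str, List[str]]) -> str:
    """Return the traffic class for ip, defaulting to internet when nothing matches."""
    addr = ipaddress.ip_address(ip)
    for label in LOOKUP_ORDER:
        for cidr in overrides.get(label, []):
            # strict=False accepts entries such as "10.1.2.3/8" with host bits set.
            if addr in ipaddress.ip_network(cidr, strict=False):
                return label
    return "internet"
```

With {"in-zone": ["10.0.0.0/8"]} as the override table, 10.4.5.6 is classified as in-zone, and any address outside every list falls through to internet.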
/proc/net/
/proc/sys/net/netfilter
To verify this feature is functioning properly, you can complete the following steps:
Confirm the kubecost-network-costs pods are Running. If these Pods are not in a Running state, kubectl describe them and/or view their logs for errors.
Ensure the kubecost-networking target is Up in your Prometheus Targets list. View any visible errors if this target is not Up. You can further verify data is being scraped by the presence of the kubecost_pod_network_egress_bytes_total metric in Prometheus.
Verify Network Costs are available in your Kubecost Allocation view. View your browser's Developer Console on this page for any access/permissions errors if costs are not shown.
Failed to locate network pods: This error message is displayed when the Kubecost app is unable to locate the network pods, which we search for by a label that includes our release name. In particular, we depend on the label app=<release-name>-network-costs to locate the pods. If the app has a blank release name, this issue may happen.
Resource usage is a function of unique source and destination IP/port combinations. Most deployments use a small fraction of a CPU, and it is also OK for this Pod to be CPU throttled. Throttling should increase parse times but should not have other impacts. The following Prometheus metrics are available in v15.3 for determining the scale and the impact of throttling:
kubecost_network_costs_parsed_entries is the last number of ConnTrack entries parsed
kubecost_network_costs_parse_time is the last recorded parse time
Today this feature is supported on Unix-based images with ConnTrack
Actively tested against GCP, AWS, and Azure
Pods that use hostNetwork share the host IP address
As of v1.67, the persistent volume attached to Kubecost's primary pod (cost-analyzer) contains as well as product configuration data. While it's technically optional (because all configurations can be set via ConfigMap), it dramatically reduces the load against your Prometheus/Thanos installations on pod restart/redeploy. For this reason, it's strongly encouraged on larger clusters.
If you are creating a new installation of Kubecost:
We recommend that you back Kubecost with at least a 32GB disk. This is the default as of 1.72.0.
If you are upgrading an existing version of Kubecost:
If your provisioner supports volume expansion, Kubecost will automatically resize the volume to a 32GB disk when upgrading to 1.72.0.
If your provisioner does not support volume expansion:
If all your configs are supplied via values.yaml in Helm or via ConfigMap and have not been added from the front end, you can safely delete the PV and upgrade.
We suggest you delete the old PV, then run Kubecost with a 32GB disk. This is the default in 1.72.0.
If you cannot safely delete the PV storing your configs and configure them on a new PV:
If you are not on a regional cluster, provision a second PV by setting persistentVolume.dbPVEnabled=true
If you are on a regional cluster, provision a second PV using a topology-aware storage class. You can set this disk’s storage class by setting persistentVolume.dbStorageClass=your-topology-aware-storage-class-name
If you're using just one PV and still see issues with Kubecost being rescheduled on zones outside of your disk, consider using a topology-aware storage class. You can set the Kubecost disk’s storage class by setting persistentVolume.storageClass to your topology-aware storage class name.
In the standard deployment of Kubecost, a bundled Prometheus instance collects and stores metrics of your Kubernetes cluster. Kubecost also provides the flexibility to connect with your own time series database or storage. Grafana Mimir is an open-source, horizontally scalable, highly available, multi-tenant TSDB for long-term storage for Prometheus.
This document will show you how to integrate Grafana Mimir with Kubecost for long-term metrics retention. In this setup, you need to use Grafana Agent to collect metrics from Kubecost and your Kubernetes cluster. The metrics will be re-written to your existing Grafana Mimir setup.
You have access to a running Kubernetes cluster
You have an existing Grafana Mimir setup
Install the Grafana Agent for Kubernetes on your cluster. On the existing K8s cluster where you intend to install Kubecost, run the following commands to install the Grafana Agent to scrape the metrics from the Kubecost /metrics endpoint. The script below installs the Grafana Agent with the necessary scraping configuration for Kubecost; you may want to add additional scrape configuration for your setup.
You can also verify that grafana-agent is scraping data with the following command (optional):
Run the following command to deploy Kubecost. Remember to update the environment variable values with your Mimir setup information.
The process is complete. By now, you should have successfully integrated Kubecost with your Grafana Mimir setup.
This specific parameter can be configured under kubecostProductConfigs.productKey.key
in your .
You can find additional configurations at our main file.
From your , You can run the following query to verify if Kubecost metrics are collected:
Additionally, read our if you run into any other errors while setting up the integration. For support from GCP, you can submit a support request at the .
A Grafana dashboard is included with the Kubecost installation, but you can also find it in our .
To enable this feature, set the following parameter in values.yaml during or after :
You can view a list of common config options in this template.
Alternatively, a serviceMonitor is also .
Ref:
AWS
Azure
GCP
For traffic routed to addresses outside of your cluster but inside your VPC, Kubecost supports the ability to directly classify network traffic to a particular IP address or CIDR block. This feature can be configured in values.yaml under networkCosts.config. Classifications are defined as follows:
The network costs DaemonSet requires a privileged security context and hostNetwork: true in order to leverage an underlying kernel module to capture network data.
Additionally, the network costs DaemonSet mounts the following directories on the host filesystem. It needs both read and write access. The network costs DaemonSet will only write to the filesystem to enable conntrack.
To learn more about how to install and configure the Grafana agent, as well as additional scrape configuration, please refer to documentation, or you can view the Kubecost Prometheus scrape config at this .
OIDC and RBAC are only officially supported on Kubecost Enterprise plans.
The OIDC integration in Kubecost is fulfilled via the .Values.oidc configuration parameters in the Helm chart.
authURL may require additional request parameters depending on the provider. Some commonly required parameters are client_id=*** and response_type=code. Please check the provider documentation for more information.
Please refer to the following references to find out more about how to configure the Helm parameters to suit each OIDC identity provider integration.
Auth0 does not support Introspection; therefore, we can only validate the access token by calling /userinfo within our current remote token validation flow. This will cause the Kubecost UI to not function under an Auth0 integration, as it makes a large number of continuous calls to load the various components on the page and the Auth0 /userinfo endpoint is rate limited. Independent calls against Kubecost endpoints (e.g., via cURL or Postman) should still be supported.
Once the Kubecost application has been successfully integrated with OIDC, we will expect requests to Kubecost endpoints to contain the JWT access token, either:
As a cookie named token,
As a cookie named id_token (set .Values.oidc.useIDToken = true),
Or as part of the Authorization header as a Bearer token
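A client or proxy that needs to reason about where the token is read from might model the lookup as in the following Python sketch. The function name find_token and the precedence (cookie first, then the Authorization header) are assumptions made for illustration. The source lists the accepted locations but does not state the order in which Kubecost checks them.

```python
from typing import Mapping, Optional


def find_token(
    cookies: Mapping[str, str],
    headers: Mapping[str, str],
    use_id_token: bool = False,
) -> Optional[str]:
    """Return the JWT from the cookie or the Authorization header, if present."""
    # With oidc.useIDToken enabled, the cookie is named id_token instead of token.
    cookie_name = "id_token" if use_id_token else "token"
    if cookies.get(cookie_name):
        return cookies[cookie_name]

    # Fall back to "Authorization: Bearer <jwt>". Header names are assumed to be
    # normalized to this exact casing by the caller.
    scheme, _, value = headers.get("Authorization", "").partition(" ")
    if scheme.lower() == "bearer" and value.strip():
        return value.strip()
    return None
```

A request carrying neither the cookie nor a Bearer header yields None, which corresponds to an unauthenticated request.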
The token is then validated remotely in one of two ways:
POST request to Introspect URL configured by identity provider
If no Introspect URL is configured, GET request to /userinfo configured by identity provider
If skipOnlineTokenValidation is set to true, Kubecost will skip accessing the OIDC introspection endpoint for online token validation and will instead attempt to locally validate the JWT claims.
Setting skipOnlineTokenValidation to true will prevent tokens from being manually revoked.
This parameter is only supported if using the Google OAuth 2.0 identity provider.
If the hostedDomain parameter is configured in the Helm chart, the application will deny access to users for which the identified domain is not equal to the specified domain. The domain is read from the hd claim in the ID token commonly returned alongside the access token.
If the domain is configured alongside the access token, then requests should contain the JWT ID token, either:
As a cookie named id_token
As part of an Identification header
The JWT ID token must contain a field (claim) named hd with the desired domain value. We verify that the token has been properly signed (using provider certificates) and has not expired before processing the claim.
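When debugging a hostedDomain rejection, it can help to inspect the hd claim yourself. The Python sketch below decodes the payload segment of a JWT and compares the claim. The function names are hypothetical, and the sketch deliberately does NOT verify the signature or expiry, so it is suitable only for inspection and never for making an access decision.

```python
import base64
import json


def read_claims(jwt: str) -> dict:
    """Decode the payload of a JWT WITHOUT verifying signature or expiry."""
    parts = jwt.split(".")
    if len(parts) != 3:
        raise ValueError("expected a JWT with three dot-separated segments")
    payload = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload += "=" * (-len(payload) % 4)
    return json.loads(base64.urlsafe_b64decode(payload))


def domain_matches(jwt: str, hosted_domain: str) -> bool:
    """True if the token's hd claim equals the configured hosted domain."""
    return read_claims(jwt).get("hd") == hosted_domain
```

If domain_matches returns False for a token you expected to pass, check whether the provider includes the hd claim in the ID token at all.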
To remove a previously set Helm value, you will need to set the value to an empty string: .Values.oidc.hostedDomain = "". To validate that the config has been removed, you can check the /var/configs/oidc/oidc.json inside the cost-model container.
Kubecost's OIDC supports read-only mode. This leverages OIDC for authentication, then assigns all authenticated users as read-only users.
Use your browser's devtools to observe network requests made between you, your Identity Provider, and Kubecost. Pay close attention to cookies and headers.
Search for oidc in your logs to follow events
Pay attention to any WRN related to OIDC
Search for Token Response, and try decoding both the access_token and id_token to ensure they are well formed (https://jwt.io/)
Code reference for the below example can be found here.
For further assistance, reach out to support@kubecost.com and provide both logs and a HAR file.
Create a new Keycloak Realm.
Navigate to Realm Settings > General > Endpoints > OpenID Endpoint Configuration > Clients.
Select Create to add Kubecost to the list of clients. Define a clientID. Ensure the Client Protocol is set to openid-connect.
Select your newly created client, then go to Settings.
Set Access Type to confidential.
Set Valid Redirect URIs to http://YOUR_KUBECOST_ADDRESS/model/oidc/authorize.
Set Base URL to http://YOUR_KUBECOST_ADDRESS.
The .Values.oidc for Keycloak should be as follows:
OIDC is only officially supported on Kubecost Enterprise plans.
This guide will take you through configuring OIDC for Kubecost using a Microsoft Entra ID (formerly Azure AD) integration for SSO and RBAC.
Before following this guide, ensure that:
Kubecost is already installed
Kubecost is accessible via a TLS-enabled ingress
You are established as a Cloud Application Administrator in Microsoft. This may otherwise prevent you from accessing certain features required in this tutorial.
In the Microsoft Entra admin center, select Microsoft Entra ID (Azure AD).
In the left navigation, select Applications > App registrations. Then, on the App registrations page, select New registration.
Select an appropriate name, and provide supported account types for your app.
To configure Redirect URI, select Web from the dropdown, then provide the URI as https://{your-kubecost-address}/model/oidc/authorize.
Select Register at the bottom of the page to finalize your changes.
After creating your application, you should be taken directly to the app's Overview page. If not, return to the App registrations page, then select the application you just created.
On the Overview page for your application, obtain the Application (client) ID and the Directory (tenant) ID. These will be needed in a later step.
Next to 'Client credentials', select Add a certificate or secret. The 'Certificates & secrets' page opens.
Select New client secret. Provide a description and expiration time, then select Add.
Obtain the value created with your secret.
Add the three saved values, as well as any other values required relating to your Kubecost/Microsoft account details, into the following values.yaml template:
If you are using one Entra ID app to authenticate multiple Kubecost endpoints, you must pass an additional redirect_uri parameter in your authURL, which will include the URI you configured in Step 1.4. Otherwise, Entra ID may redirect to an incorrect endpoint. You can read more about this in Microsoft Entra ID's troubleshooting docs. View the example below to see how you should format your URI:
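One way to avoid encoding mistakes in the redirect_uri parameter is to assemble the URL programmatically. The following Python sketch builds an authorization URL against the Microsoft identity platform v2.0 authorize endpoint. The function name, the openid scope default, and the example host are assumptions for illustration. Compare the output against your provider's documentation and the values.yaml template above before using it.

```python
from urllib.parse import urlencode


def build_auth_url(
    tenant_id: str, client_id: str, redirect_uri: str, scope: str = "openid"
) -> str:
    """Return an authURL that pins the redirect to a specific Kubecost endpoint."""
    base = f"https://login.microsoftonline.com/{tenant_id}/oauth2/v2.0/authorize"
    params = {
        "client_id": client_id,
        "response_type": "code",
        "scope": scope,
        # Must match a Redirect URI registered on the Entra ID application.
        "redirect_uri": redirect_uri,
    }
    # urlencode percent-encodes the redirect URI so it survives as one parameter.
    return f"{base}?{urlencode(params)}"
```

Calling build_auth_url once per Kubecost endpoint, each with its own redirect_uri, gives every deployment an authURL that returns users to the correct address.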
First, you need to configure an admin role for your app. For more information on this step, see Microsoft's documentation.
Return to the Overview page for the application you created in Step 1.
Select App roles > Create app role. Provide the following values:
Display name: admin
Allowed member types: Users/Groups
Value: admin
Description: Admins have read/write permissions via the Kubecost frontend (or provide a custom description as needed)
Do you want to enable this app role?: Select the checkbox
Select Apply.
Then, you need to attach the role you just created to users and groups.
In the Azure AD left navigation, select Applications > Enterprise applications. Select the application you created in Step 1.
Select Users & groups.
Select Add user/group. Select the desired group. Select the admin role you created, or another relevant role. Then, select Assign to finalize changes.
Update your existing values.yaml with this template:
Use your browser's devtools to observe network requests made between you, your Identity Provider, and Kubecost. Pay close attention to cookies and headers.
Run the following command:
Search for oidc in your logs to follow events. Pay attention to any WRN related to OIDC. Search for Token Response, and try decoding both the access_token and id_token to ensure they are well formed. Learn more about JSON web tokens.
You can find more details on these flags in Kubecost's cost-analyzer-helm-chart repo README.
SSO and RBAC are only officially supported on Kubecost Enterprise plans.
Kubecost supports single sign-on (SSO) and role-based access control (RBAC) with SAML 2.0. Kubecost works with most identity providers including Okta, Auth0, Microsoft Entra ID (formerly Azure AD), PingID, and KeyCloak.
User authentication (.Values.saml): SSO provides a simple mechanism to restrict application access internally and externally
Pre-defined user roles (.Values.saml.rbac):
admin: Full control with permissions to manage users, configure model inputs, and application settings.
readonly: User role with read-only permission.
editor: Role can change and build alerts and reports, but cannot edit application settings and otherwise functions as read-only.
Custom access roles (filters.json): Limit users based on attributes or group membership to view a set of namespaces, clusters, or other aggregations
All SAML 2.0 providers also work. The above guides can be used as templates for what is required.
When SAML SSO is enabled in Kubecost, ports 9090 and 9003 of service/kubecost-cost-analyzer will require authentication. Therefore, user API requests will need to be authenticated with a token. The token can be obtained by logging into the Kubecost UI and copying the token from the browser’s local storage. Alternatively, a long-term token can be issued to users from your identity provider.
For admins, Kubecost additionally exposes an unauthenticated API on port 9004 of service/kubecost-cost-analyzer.
You will be able to view your current SAML Group in the Kubecost UI by selecting Settings from the left navigation, then scrolling to 'SAML Group'. Your access level will be displayed in the 'Current SAML Group' box.
Disable SAML and confirm that the cost-analyzer pod starts.
If step 1 is successful, but the pod is crashing or never enters the ready state when SAML is added, it is likely that there is a panic while loading or parsing SAML data.
kubectl logs deployment/kubecost-cost-analyzer -c cost-model -n kubecost
If you’re supplying the SAML metadata from the address of an Identity Provider Server, curl the SAML metadata endpoint from within the Kubecost pod and ensure that a valid XML EntityDescriptor is being returned and downloaded. The response should be in this format:
The URL returns a 404 error or returns HTML
Contact your SAML admin to find the URL on your identity provider that serves the raw XML file.
Returning an EntitiesDescriptor instead of an EntityDescriptor
Certain metadata URLs could potentially return an EntitiesDescriptor, instead of an EntityDescriptor. While Kubecost does not currently support using an EntitiesDescriptor, you can instead copy the EntityDescriptor into a new file you create called metadata.xml:
Download the XML from the metadata URL into a file called metadata.xml
Copy all attributes from EntitiesDescriptor that are not already present on the EntityDescriptor.
Remove the <EntitiesDescriptor> tag from the beginning.
Remove the </EntitiesDescriptor> tag from the end of the XML file.
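The manual steps above can also be scripted. The following Python sketch uses the standard library's xml.etree.ElementTree to pull the first EntityDescriptor out of an EntitiesDescriptor wrapper and carry over any attributes it is missing. The function name is hypothetical, and the sketch assumes the standard SAML 2.0 metadata namespace and a single relevant EntityDescriptor. Review the output by hand before uploading it as a secret.

```python
import xml.etree.ElementTree as ET

MD_NS = "urn:oasis:names:tc:SAML:2.0:metadata"  # SAML 2.0 metadata namespace


def extract_entity_descriptor(xml_text: str) -> str:
    """Return a standalone EntityDescriptor document from SAML metadata."""
    # Keep a readable "md" prefix in the serialized output (process-wide setting).
    ET.register_namespace("md", MD_NS)
    root = ET.fromstring(xml_text)
    if root.tag == f"{{{MD_NS}}}EntityDescriptor":
        return ET.tostring(root, encoding="unicode")
    if root.tag != f"{{{MD_NS}}}EntitiesDescriptor":
        raise ValueError(f"unexpected root element: {root.tag}")

    entity = root.find(f"{{{MD_NS}}}EntityDescriptor")
    if entity is None:
        raise ValueError("no EntityDescriptor found inside EntitiesDescriptor")

    # Copy wrapper attributes that the EntityDescriptor does not already define.
    # Namespace declarations are re-emitted automatically on serialization.
    for name, value in root.attrib.items():
        entity.attrib.setdefault(name, value)
    return ET.tostring(entity, encoding="unicode")
```

Writing the returned string to metadata.xml produces a file whose root element is the EntityDescriptor.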
You are left with data in a similar format to the example below:
Then, you can upload the EntityDescriptor to a secret in the same namespace as kubecost and use that directly.
kubectl create secret generic metadata-secret --from-file=./metadata.xml --namespace kubecost
To use this secret, in your helm values set metadataSecretName to the name of the secret created above, and set idpMetadataURL to the empty string:
Invalid NameID format
On Keycloak, if you receive an “Invalid NameID format” error, you should set the option “force nameid format” in Keycloak. See Keycloak docs for more details.
Users of CSI driver for storing SAML secret
For users who want to use CSI driver for storing SAML secret, we suggest this guide.
InvalidNameIDPolicy format
From a PingIdentity article:
An alternative solution is to add an attribute called "SAML_SP_NAME_QUALIFIER" to the connection's attribute contract with a TEXT value of the requested SPNameQualifier. When you do this, select the following for attribute name format:
urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified
On the PingID side: specify an attribute contract “SAML_SP_NAME_QUALIFIER” with the format urn:oasis:names:tc:SAML:1.1:nameid-format:unspecified.
On the Kubecost side: in your Helm values, set saml.nameIDFormat to the same format set by PingID:
Make sure audienceURI and appRootURL match the entityID configured within PingFed.
SSO and RBAC are only officially supported on Kubecost Enterprise plans.
This guide will show you how to configure Kubecost integrations for SAML and RBAC with Microsoft Entra ID.
In the Azure Portal, go to the Microsoft Entra ID Overview page and select Enterprise applications in the left navigation underneath Manage.
On the Enterprise applications page, select New application.
On the Browse Microsoft Entra ID Gallery page, select Create your own application and select Create. The 'Create your own application window' opens.
Provide a custom name for your app. Then, select Integrate any other application you don't find in the gallery. Select Create.
Return to the Enterprise applications page from Step 1.2. Find and select your Enterprise application from the table.
Select Properties in the left navigation under Manage to begin editing the application. Start by updating the logo, then select Save. Feel free to use an official Kubecost logo.
Select Users and groups in the left navigation. Assign any users or groups you want to have access to Kubecost, then select Assign.
Select Single sign-on from the left navigation. In the 'Basic SAML Configuration' box, select Edit. Populate both the Identifier and Reply URL with the URL of your Kubecost environment without a trailing slash (ex: http://localhost:9090), then select Save. If your application is using OpenId Connect and OAuth, most of the SSO configuration will have already been completed.
(Optional) If you intend to use RBAC, you also need to add a group claim. Without leaving the SAML-based Sign-on page, select Edit next to Attributes & Claims. Select Add a group claim. Configure your group association, then select Save. The claim name will be used as the assertionName value in the values-saml.yaml file.
On the SAML-based Sign-on page, in the SAML Certificates box, copy the value of 'App Federation Metadata Url' and add it to your values-saml.yaml as the value of idpMetadataURL.
In the SAML Certificates box, select the Download link next to Certificate (Base64) to download the X.509 cert. Name the file myservice.cert.
Create a secret using the cert with the following command:
With your existing Helm install command, append -f values-saml.yaml to the end.
At this point, test your SSO configuration to make sure it works before moving on to the next section. There is a Troubleshooting section at the end of this doc for help if you are experiencing problems.
The simplest form of RBAC in Kubecost is to have two groups: admin and read only. If your goal is to simply have these two groups, you do not need to configure filters. If you do not configure filters, this message in the logs is expected: file corruption: '%!s(MISSING)'
The values-saml.yaml file contains the admin and readonly groups in the RBAC section:
Remember the value of assertionName needs to match the claim name given in Step 2.5 above.
Filters are used to give visibility to a subset of objects in Kubecost. RBAC filtering can filter on the same types as the Allocation API. Examples of the available filters are in these files:
These filters can be configured using groups or user attributes in your Entra ID directory. It is also possible to assign filters to specific users. The example below is using groups.
You can combine filtering with admin/read only rights, and it can be configured the same way. The same assertionName and values will be used, as is the case in this example.
The values-saml.yaml file contains this customGroups section for filtering:
The array of groups obtained during the authentication request will be matched to the subject key in the filters.yaml. See this example filters.json (linked above) to understand how your created groups will be formatted:
As an example, we will configure the following:
Admins will have full access to the Kubecost UI and have visibility to all resources
Kubecost users, by default, will not have visibility to any namespace and will be read only. If a group doesn't have access to any resources, the Kubecost UI may appear to be broken.
The dev-namespaces group will have read only access to the Kubecost UI and only have visibility to namespaces that are prefixed with dev- or are exactly nginx-ingress
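To make the intended visibility rule for the dev-namespaces group concrete, here is a small sketch of the matching semantics described above. It is an illustration only: the pattern list and function name are hypothetical, and this is not how Kubecost evaluates filters internally.

```python
from fnmatch import fnmatchcase
from typing import Iterable

# Hypothetical allow-list for the dev-namespaces group in this example:
# anything prefixed with "dev-", or exactly "nginx-ingress".
DEV_NAMESPACE_PATTERNS = ["dev-*", "nginx-ingress"]


def namespace_visible(namespace: str, patterns: Iterable[str]) -> bool:
    """Return True if the namespace matches at least one allowed pattern."""
    return any(fnmatchcase(namespace, pattern) for pattern in patterns)


for ns in ["dev-payments", "nginx-ingress", "nginx-ingress-internal", "prod"]:
    print(ns, namespace_visible(ns, DEV_NAMESPACE_PATTERNS))
```

Checking a few namespace names this way before writing the real filter can help confirm that a group will not end up with an empty view, which the list above warns can make the UI appear broken.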
In the Entra ID left navigation, select Groups. Select New group to create a new group.
For Group type, select Security. Enter a name for your group. For this demonstration, create groups for kubecost_users, kubecost_admin and kubecost_dev-namespaces. By selecting No members selected, Azure will pull up a list of all users in your organization for you to add (you can also add or remove members after creating the group). Add all users to the kubecost_users group, and the appropriate users to each of the other groups for testing. Kubecost admins will be part of both the read only kubecost_users and kubecost_admin groups. Kubecost will assign the most rights/least restrictions when there are conflicts.
When you are done, select Create at the bottom of the page. Repeat Steps 1-2 as needed for all groups.
Return to your created Enterprise application and select Users and groups from the left navigation. Select Add user/group. Select and add all relevant groups you created. Then select Assign at the bottom of the page to confirm.
Modify filters.json as depicted above.
Replace {group-object-id-a} with the Object Id for kubecost_admin
Replace {group-object-id-b} with the Object Id for kubecost_users
Replace {group-object-id-c} with the Object Id for kubecost_dev-namespaces
Create the ConfigMap:
You can modify the ConfigMap without restarting any pods.
You can look at the logs on the cost-model container. This script is currently a work in progress.
When the group has been matched, you will see:
This is what a normal output looks like:
Kubecost can run on clusters with thousands of nodes when resource consumption is properly tuned. Here's a chart with some of the steps you can take to tune Kubecost, along with descriptions of each.
Cloud cost metrics for all accounts can be pulled in on your primary cluster by pointing Kubecost to one or more management accounts. Therefore, you can disable CloudCost on secondary clusters by setting the following Helm value:
--set cloudCost.enabled=false
This method is only available for AWS cloud billing integrations. Kubecost is capable of tracking each individual cloud billing line item. However, on certain accounts this data can be quite large. If provider IDs are excluded, Kubecost won't cache granular data. Instead, Kubecost caches aggregate data and makes an ad-hoc query to the AWS Cost and Usage Report to get granular data, resulting in slower load times but lower memory consumption.
--set kubecostModel.maxQueryConcurrency=1
--set kubecostModel.maxPrometheusQueryDurationMinutes=300
Lowering query resolution will reduce memory consumption but will cause short running pods to be sampled and rounded to the nearest interval for their runtime. The default value is: 300s. This can be tuned with the Helm value:
--set kubecostModel.etlResolutionSeconds=600
--set prometheus.server.nodeExporter.enabled=false
--set prometheus.serviceAccounts.nodeExporter.create=false
Optionally enabling memory thresholds can ensure the Go runtime garbage collector runs more aggressively at or approaching the soft limit. There is no one-size-fits-all value here, and users tuning this parameter should be aware that setting the value too low may reduce overall performance. If the resources.requests memory values are set appropriately, using the same value for softMemoryLimit will instruct the Go runtime to keep its heap acquisition and release within the same bounds as the expected pod memory use. This can be tuned with the Helm value:
--set kubecostModel.softMemoryLimit=<Units><B, KiB, MiB, GiB>
Staging builds for the Kubecost Helm Chart are produced at least daily before changes are moved to production. To upgrade an existing Kubecost Helm Chart deployment to the latest staging build, follow these quick steps:
Add the staging repo with the following command:
Upgrade Kubecost to use the staging repo:
Gluu is an open-source Identity and Access Management (IAM) platform that can be used to authenticate and authorize users for applications and services. It can be configured to use the OpenID Connect (OIDC) protocol, which is an authentication layer built on top of OAuth 2.0 that allows applications to verify the identity of users and obtain basic profile information about them.
To configure a Gluu server with OIDC, you will need to install and set up the Gluu server software on a suitable host machine. This will typically involve performing the following steps:
Install the necessary dependencies and packages.
Download and extract the Gluu server software package.
Run the installation script to set up the Gluu server.
Configure the Gluu server by modifying the /etc/gluu/conf/gluu.properties file and setting the values for various properties, such as the hostname, LDAP bind password, and OAuth keys.
Start the Gluu server by running the /etc/init.d/gluu-serverd start command.
You can read for more detailed help with these steps.
Note: Later versions of Gluu Server also support deployment to Kubernetes environments. You can read more about their Kubernetes support .
Once the Gluu server is up and running, you can connect it to a Kubecost cluster by performing the following steps:
Obtain the OIDC client ID and client secret for the Gluu server. These can be found in the /etc/gluu/conf/gluu.properties file under the oxAuthClientId and oxAuthClientPassword properties, respectively.
In the Kubecost cluster, create a new OIDC identity provider by running the kubectl apply -f oidc-provider.yaml command, where oidc-provider.yaml is a configuration file that specifies the OIDC client ID and client secret, as well as the issuer URL and authorization and token endpoints for the Gluu server.
In this file, you will need to replace the following placeholders with the appropriate values:
<OIDC_CLIENT_ID>: The OIDC client ID for the Gluu server. This can be found in the /etc/gluu/conf/gluu.properties file under the oxAuthClientId property.
<OIDC_CLIENT_SECRET>: The OIDC client secret for the Gluu server. This can be found in the /etc/gluu/conf/gluu.properties file under the oxAuthClientPassword property.
<GLUU_SERVER_HOSTNAME>: The hostname of the Gluu server.
<BASE64_ENCODED_OIDC_CLIENT_ID>: The OIDC client ID, encoded in base64.
<BASE64_ENCODED_OIDC_CLIENT_SECRET>: The OIDC client secret, encoded in base64.
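The two base64 placeholders above can be produced with a few lines of Python. This is a minimal sketch; the client ID and secret shown are hypothetical stand-ins for the values read from your Gluu server. Encoding the exact string matters: a stray trailing newline (a common result of piping `echo` without `-n` into a base64 tool) yields a different encoded value.

```python
import base64


def b64(value: str) -> str:
    """Base64-encode a UTF-8 string without adding a trailing newline."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


# Hypothetical values; substitute the ones from your Gluu server.
client_id = "my-client-id"
client_secret = "my-client-secret"

print(b64(client_id))
print(b64(client_secret))
```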
Set up a Kubernetes service account and bind it to the OIDC identity provider. This can be done by running the kubectl apply -f service-account.yaml command, where service-account.yaml is a configuration file that specifies the name of the service account and the OIDC identity provider.
In this file, you will need to replace the following placeholders with the appropriate values:
<SERVICE_ACCOUNT_NAME>: The name of the service account. This can be any name that you choose.
<GLUU_SERVER_HOSTNAME>: The hostname of the Gluu server.
<OIDC_CLIENT_ID>: The OIDC client ID for the Gluu server. This can be found in the /etc/gluu/conf/gluu.properties file under the oxAuthClientId property.
Note: You should also ensure that the kubernetes.io/oidc-issuer-url, kubernetes.io/oidc-client-id, kubernetes.io/oidc-username-claim, and kubernetes.io/oidc-groups-claim annotations are set to the correct values for your Gluu server and configuration. These annotations specify the issuer URL and client ID for the OIDC identity provider, as well as the claims to use for the username and group membership of authenticated users.
Once these steps are completed, the Gluu server should be configured to use OIDC and connected to the Kubecost cluster, allowing users to authenticate and authorize themselves using their Gluu credentials.
This feature is currently in alpha. Please read the documentation carefully.
Kubecost's Kubescaler implements continuous request right-sizing: the automatic application of Kubecost's high-fidelity recommendations to your containers' resource requests. This provides an easy way to automatically improve the efficiency of your cluster resource allocation.
Kubescaler can be enabled and configured on a per-workload basis so that only the workloads you want edited will be edited.
Kubescaler is part of , and should be configured after the Cluster Controller is enabled.
Kubescaler is configured on a workload-by-workload basis via annotations. Currently, only deployment workloads are supported.
Notable Helm values:
Kubescaler supports:
apps/v1 Deployments
apps/v1 DaemonSets
batch/v1 CronJobs (K8s v1.21+). No attempt will be made to autoscale a CronJob until it has run at least once.
Kubescaler cannot support:
Kubescaler will take care of the rest. It will apply the best-available recommended requests to the annotated controller every 11 hours. If the recommended requests exceed the current limits, the update is currently configured to set the request to the current limit.
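The limit behavior described above amounts to capping the recommendation at the current limit. The sketch below illustrates that rule only; the function name is hypothetical and it is not Kubescaler's actual code. Values can be in any consistent unit, for example CPU millicores or memory bytes.

```python
from typing import Optional


def request_to_apply(recommended: float, current_limit: Optional[float]) -> float:
    """Return the request that would be applied for one resource.

    If a limit is set and the recommendation exceeds it, the request is
    set to the limit; otherwise the recommendation is used unchanged.
    """
    if current_limit is not None and recommended > current_limit:
        return current_limit
    return recommended


# Recommendation of 750m CPU against a 500m limit is capped at 500m.
print(request_to_apply(750, 500))
# No limit set: the recommendation is applied as-is.
print(request_to_apply(750, None))
```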
To check current requests for your Deployments, use the following command:
Cluster turndown is currently in beta. Please read the documentation carefully.
Cluster turndown is an automated scale-down and scale-up of a Kubernetes cluster's backing nodes based on a custom schedule and turndown criteria. This feature can be used to reduce spend during off hours and/or reduce surface area for security reasons. The most common use case is to scale non-production environments (e.g. development clusters) to zero during off hours.
If you are upgrading from a pre-1.94 version of the Kubecost Helm chart, you will have to migrate your custom resources. turndownschedules.kubecost.k8s.io
has been changed to turndownschedules.kubecost.com
and finalizers.kubecost.k8s.io
has been changed to finalizers.kubecost.com
. See the for an explanation.
Cluster turndown is only available for clusters on GKE, EKS, or Kops-on-AWS.
You can verify that the cluster-turndown pod is running with the following command:
Turndown uses a Kubernetes Custom Resource Definition to create schedules. Here is an example resource located at artifacts/example-schedule.yaml:
none: Single schedule turndown and turnup.
daily: Start and end times will reschedule every 24 hours.
weekly: Start and end times will reschedule every 7 days.
To create this schedule, you may modify example-schedule.yaml to your desired schedule and run:
Currently, updating a resource is not supported, so if the scheduling of the example-schedule.yaml fails, you will need to delete the resource via:
Then make the modifications to the schedule and re-apply.
The turndownschedule resource can be listed via kubectl as well:
or using the shorthand:
Details regarding the status of the turndown schedule can be found by outputting as a JSON or YAML:
The status field displays the current status of the schedule, including next schedule times, specific schedule identifiers, and the overall state of the schedule.
state: The state of the turndown schedule. This can be:
ScheduleSuccess: The schedule has been set and is waiting to run.
ScheduleFailed: The scheduling failed because a schedule already exists or because it was scheduled for a date-time in the past.
ScheduleCompleted: For schedules with repeat: none, the schedule will move to a completed state after turn up.
current: The next action to run.
lastUpdated: The last time the status was updated on the schedule.
nextScaleDownTime: The next time a turndown will be executed.
nextScaleUpTime: The next time a turn up will be executed.
scaleDownId: Specific identifier assigned by the internal scheduler for turndown.
scaleUpId: Specific identifier assigned by the internal scheduler for turn up.
scaleDownMetadata: Metadata attached to the scale down job, assigned by the turndown scheduler.
scaleUpMetadata: Metadata attached to the scale up job, assigned by the turndown scheduler.
A turndown can be canceled before turndown actually happens or after. This is performed by deleting the resource:
Canceling while turndown is currently scaling down or scaling up will result in a delayed cancellation, as the schedule must complete its operation before processing the deletion/cancellation.
If the turndown schedule is canceled between a turndown and turn up, the turn up will occur automatically upon cancellation.
The internal scheduler only allows one schedule at a time to be used. Any additional schedule resources created will fail (kubectl get tds -o yaml will display the status).
Do not attempt to kubectl edit a turndown schedule. This is currently not supported. The recommended approach for modifying a schedule is to delete it and then create a new one.
There is a 20-minute minimum time window between start and end of turndown schedule.
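Two of the failure conditions mentioned in this section (a start time in the past and a window shorter than 20 minutes) are easy to check before applying a schedule. The following is a hedged sketch of such a pre-check, not part of Kubecost: the function names are hypothetical, and it assumes start and end are timestamps that carry a UTC offset, such as 2024-01-02T00:00:00Z.

```python
from datetime import datetime, timedelta, timezone
from typing import List

MIN_WINDOW = timedelta(minutes=20)


def parse_timestamp(value: str) -> datetime:
    """Parse a timestamp with a UTC offset, e.g. 2024-01-02T00:00:00Z."""
    # Older Python versions reject a trailing "Z", so normalize it first.
    return datetime.fromisoformat(value.replace("Z", "+00:00"))


def schedule_problems(start: str, end: str, now: datetime) -> List[str]:
    """Return human-readable problems found in a proposed schedule."""
    problems = []
    start_dt, end_dt = parse_timestamp(start), parse_timestamp(end)
    if start_dt <= now:
        problems.append("start is in the past")
    if end_dt - start_dt < MIN_WINDOW:
        problems.append("window between start and end is shorter than 20 minutes")
    return problems


now = datetime(2024, 1, 1, tzinfo=timezone.utc)
print(schedule_problems("2024-01-02T00:00:00Z", "2024-01-02T00:10:00Z", now))
```

An empty list means neither of these two checks failed; it does not guarantee the scheduler will accept the resource, since other conditions (such as an existing schedule) are only known to the cluster.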
The Cluster Controller is currently in beta. Please read the documentation carefully.
Kubecost's Cluster Controller allows you to access additional Savings features through automated processes. To function, the Cluster Controller requires write permission to certain resources on your cluster, and for this reason, the Cluster Controller is disabled by default.
The Cluster Controller enables features like:
The Cluster Controller can be enabled on any cluster type, but certain functionality will only be enabled based on the cloud service provider (CSP) of the cluster and its type:
The Cluster Controller can only be enabled on your primary cluster.
The Controller itself and container RRS are available for all cluster types and configurations.
Cluster turndown, cluster right-sizing, and Kubecost Actions are only available for GKE, EKS, and Kops-on-AWS clusters, after setting up a provider service key.
Therefore, the 'Provider service key setup' section below is optional depending on your cluster environment, but will limit functionality if you choose to skip it. Read the caution banner in the below section for more details.
If you are enabling the Cluster Controller for a GKE, EKS, or Kops-on-AWS cluster, follow the specialized instructions for your CSP(s) below. If you aren't using a GKE, EKS, or Kops-on-AWS cluster, skip ahead to the section below.
You can now enable the Cluster Controller in the Helm chart by finding the clusterController Helm flag and setting enabled: true
You may also enable via --set when running Helm install:
You can verify that the Cluster Controller is running by issuing the following:
Once the Cluster Controller has been enabled successfully, you should automatically have access to the listed Savings features.
allows Kubecost to pull in spend data from your integrated cloud service providers.
Secondary clusters can be configured strictly as metric emitters to save memory. Learn more about how to best configure them in our .
Lowering query concurrency for the Kubecost ETL build will mean ETL takes longer to build, but consumes less memory. The default value is: 5
. This can be adjusted with the :
Lowering query duration results in Kubecost querying for smaller windows when building ETL data. This can lead to slower ETL build times, but lower memory peaks because of the smaller datasets. The default value is: 1440
This can be tuned with the :
Fewer data points scraped from Prometheus means less data to collect and store, at the cost of Kubecost making estimations that possibly miss spikes of usage or short running pods. The default value is: 60s
. This can be tuned in our for the Prometheus scrape job.
Node-exporter is optional. Some health alerts will be disabled if node-exporter is disabled, but savings recommendations and core cost allocation will function normally. This can be disabled with the following :
More info on this environment variable can be found in .
"Uncontrolled" Pods. Learn more .
Enable the
You will receive full turndown functionality once the Cluster Controller is enabled via a provider service key setup and Helm upgrade. Review the Cluster Controller doc linked above under Prerequisites for more information, then return here when you've .
This definition will create a schedule that starts by turning down at the designated start date-time and turning back up at the designated end date-time. Both the start and end times should be in RFC3339 format, i.e. times based on offsets to UTC. There are three possible values for repeat:
Cluster turndown has limited functionality via the Kubecost UI. To access cluster turndown in the UI, you must first enable . Once this is completed, you will be able to create and delete turndown schedules instantaneously for your supported clusters. Read more about turndown's UI functionality in of the above Kubecost Actions doc. Review the entire doc for more information on Kubecost Actions functionality and limitations.
The following command performs the steps required to set up a service account. .
To use , provide the following required parameters:
For EKS cluster provisioning, if using eksctl, make sure that you use the --managed option when creating the cluster. Unmanaged node groups should be upgraded to managed.
request.autoscaling.kubecost.com/enabled: Whether to autoscale the workload. See note on KUBESCALER_RESIZE_ALL_DEFAULT. Example values: true, false
request.autoscaling.kubecost.com/frequencyMinutes: How often to autoscale the workload, in minutes. If unset, a conservative default is used. Example value: 73
request.autoscaling.kubecost.com/scheduleStart: Optional augmentation to the frequency parameter. If both are set, the workload will be resized on the scheduled frequency, aligned to the start. If frequency is 24h and the start is midnight, the workload will be rescheduled at (about) midnight every day. Formatted as RFC3339. Example value: 2022-11-28T00:00:00Z
cpu.request.autoscaling.kubecost.com/targetUtilization: Target utilization (CPU) for the recommendation algorithm. If unset, the backing recommendation service's default is used. Example value: 0.8
memory.request.autoscaling.kubecost.com/targetUtilization: Target utilization (Memory/RAM) for the recommendation algorithm. If unset, the backing recommendation service's default is used. Example value: 0.8
request.autoscaling.kubecost.com/recommendationQueryWindow: Value of the window parameter to be used when acquiring recommendations. See Request sizing API for explanation of window parameter. If setting up autoscaling for a CronJob, it is strongly recommended to set this to a value greater than the duration between Job runs. For example, if you have a weekly CronJob, this parameter should be set to a value greater than 7d to ensure a recommendation is available. Example value: 2d
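The interaction between frequencyMinutes and scheduleStart can be pictured as a fixed grid of times: the start, then the start plus one period, plus two periods, and so on. The sketch below computes the next grid point for a given moment. It only illustrates the alignment idea described above; the function name is hypothetical and Kubescaler's real scheduling may differ in detail (the text itself says "about" midnight).

```python
import math
from datetime import datetime, timedelta, timezone


def next_resize_time(schedule_start: datetime, frequency_minutes: int, now: datetime) -> datetime:
    """Return the first slot at or after `now` on the grid start + k * frequency."""
    if now <= schedule_start:
        return schedule_start
    period = timedelta(minutes=frequency_minutes)
    # Number of whole periods needed to reach or pass `now`.
    periods = math.ceil((now - schedule_start) / period)
    return schedule_start + periods * period


start = datetime(2022, 11, 28, 0, 0, tzinfo=timezone.utc)  # scheduleStart
now = datetime(2022, 11, 30, 13, 0, tzinfo=timezone.utc)
# With a 24h frequency (1440 minutes), the next slot is the following midnight.
print(next_resize_time(start, 1440, now).isoformat())
```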
clusterController.kubescaler.resizeAllDefault: If true, Kubescaler will switch to default-enabled for all workloads unless they are annotated with request.autoscaling.kubecost.com/enabled=false. This is recommended for low-stakes clusters where you want to prioritize workload efficiency without reworking deployment specs for all workloads. Example value: true
Availability Tiers impact capacity recommendations, health ratings and more in the Kubecost product. As an example, production jobs receive higher resource request recommendations than dev workloads. Another example is health scores for high availability workloads are heavily penalized for not having multiple replicas available.
Today our product supports the following tiers:
Highly Available or Critical (0): If true, recommendations and health scores heavily prioritize availability. This is the default tier if none is supplied.
Production (1): Intended for production jobs that are not necessarily mission-critical.
Staging or Dev (2): Meant for experimental or development resources. Redundancy or availability is not a high priority.
To apply a namespace tier, add a tier namespace label to reflect the desired value.
In v1.94 of Kubecost, the turndownschedules.kubecost.k8s.io/v1alpha1 Custom Resource Definition (CRD) was moved to turndownschedules.kubecost.com/v1alpha1 to adhere to Kubernetes policy for CRD domain namespacing. This is a breaking change for users of Cluster Controller's turndown functionality. Please follow this guide for a successful migration of your turndown schedule resources.
Note: As part of this change, the CRD was updated to use apiextensions.k8s.io/v1 because v1beta1 was removed in K8s v1.22. If using Kubecost v1.94+, Cluster Controller's turndown functionality will not work on K8s versions before the introduction of apiextensions.k8s.io/v1.
In this situation, you've deployed Kubecost's Cluster Controller at some point using --set clusterController.enabled=true, but you don't use the turndown functionality.
That means that this command should return one line:
And this command should return no resources:
This situation is easy! You can do nothing, and turndown should continue to behave correctly because kubectl get turndownschedule and related commands will correctly default to the new turndownschedules.kubecost.com/v1alpha1 CRD after you upgrade to Kubecost v1.94 or higher.
If you would like to be fastidious and clean up the old CRD, simply run kubectl delete crd turndownschedules.kubecost.k8s.io after upgrading Kubecost to v1.94 or higher.
In this situation, you've deployed Kubecost's Cluster Controller at some point using --set clusterController.enabled=true and you have at least one turndownschedule.kubecost.k8s.io resource currently present in your cluster.
That means that this command should return one line:
And this command should return at least one resource:
We have a few steps to perform if you want Cluster Controller's turndown functionality to continue to behave according to your already-defined turndown schedules.
Upgrade Kubecost to v1.94 or higher with --set clusterController.enabled=true
Make sure the new CRD has been defined after your Kubecost upgrade
This command should return a line:
Copy your existing turndownschedules.kubecost.k8s.io resources into the new CRD
(optional) Delete the old turndownschedules.kubecost.k8s.io CRD
Because the CRDs have a finalizer on them, we have to follow this workaround to remove the finalizer from our old resources. This lets us clean up without locking up.
Note: The following command may be unnecessary because Helm should automatically remove the turndownschedules.kubecost.k8s.io resource during the upgrade. The removal will remain in a pending state until the finalizer patch above is implemented.
Kubecost can run on clusters with mixed Linux and Windows nodes. The Kubecost pods must run on a Linux node.
When using a Helm install, this can be done simply with:
The cluster must have at least one Linux node for the Kubecost pods to run on:
Use a nodeSelector for all Kubecost deployments:
For DaemonSets, set the affinity to only allow scheduling on Linux nodes:
See the list of all deployments and DaemonSets in this values-windows-node-affinity.yaml file.
Collecting data about Windows nodes is supported by Kubecost as of v1.93.0.
Accurate node and pod data exists by default, since they come from the Kubernetes API.
Kubecost requires cAdvisor for pod utilization data to determine costs at the container level.
Currently, for pods on Windows nodes: pods will be billed based on request size.
High availability mode is only officially supported on Kubecost Enterprise plans.
Running Kubecost in high availability (HA) mode is a feature that relies on multiple Kubecost replica pods implementing the ETL Bucket Backup feature combined with a Leader/Follower implementation which ensures that there always exists exactly one leader across all replicas.
The Leader/Follower implementation leverages a coordination.k8s.io/v1 Lease resource to manage the election of a leader when necessary. To control access of the backup from the ETL pipelines, a RWStorageController is implemented to ensure the following:
Followers block on all backup reads, and poll bucket storage for any backup reads every 30 seconds.
Followers no-op on any backup writes.
Followers who receive Queries in a backup store will not stack on pending reads, preventing external queries from blocking.
Followers promoted to Leader will drop all locks and receive write privileges.
Leaders behave identically to a single Kubecost install.
In order to enable the leader/follower and HA features, the following must also be configured:
Replicas are set to a value greater than 1
ETL FileStore is Enabled (enabled by default)
ETL Bucket Backup is configured
For example, using our Helm chart, the following is an acceptable configuration:
This can also be done in the values.yaml file within the chart:
This feature is only officially supported on Kubecost Enterprise plans.
The following steps allow Kubecost to use custom prices with a CSV pipeline. This feature allows for individual assets (e.g. nodes) to be supplied at unique prices. Common uses are for on-premise clusters, service-providers, or for external enterprise discounts.
Create a CSV file in this format (also in the below table). CSV changes are picked up hourly by default.
EndTimeStamp: currently unused
InstanceID: identifier used to match asset
Region: filter match based on topology.kubernetes.io/region
AssetClass: node, pv, and gpu are supported
InstanceIDField: field in spec or metadata that will contain the relevant InstanceID. For nodes, this is often spec.providerID; for PVs, it is often metadata.name
InstanceType: optional field to define the asset type, e.g. m5.12xlarge
MarketPriceHourly: hourly price to charge this asset
Version: field for schema version, currently unused
If the node label topology.kubernetes.io/region is present, it must also be in the Region column.
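As an illustration of the columns listed above, the following sketch writes a one-row pricing CSV with Python's standard csv module. The row values (instance ID, price) are hypothetical, and the header spelling and column order are taken from the list above; confirm them against the linked format before relying on the output.

```python
import csv
import io
from typing import Dict, Iterable

# Column names as listed above.
COLUMNS = [
    "EndTimeStamp", "InstanceID", "Region", "AssetClass",
    "InstanceIDField", "InstanceType", "MarketPriceHourly", "Version",
]


def build_pricing_csv(rows: Iterable[Dict[str, str]]) -> str:
    """Render rows as CSV text; columns missing from a row are left empty."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=COLUMNS, lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow(row)
    return buf.getvalue()


# Hypothetical node priced at $0.25 per hour.
example_rows = [{
    "InstanceID": "example-node-id",
    "Region": "us-east-2",
    "AssetClass": "node",
    "InstanceIDField": "spec.providerID",
    "InstanceType": "test.xlarge",
    "MarketPriceHourly": "0.25",
}]
print(build_pricing_csv(example_rows))
```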
This section is only required for nodes with GPUs.
The node the GPU is attached to must be matched by a CSV node price. Typically this will be matched on instance type (node.kubernetes.io/instance-type)
Supported GPU labels are currently:
gpu.nvidia.com/class
nvidia.com/gpu_type
Verification:
Connect to the Kubecost Prometheus: kubectl port-forward --namespace kubecost services/kubecost-cost-analyzer 9090:9090
Run the following query: curl localhost:9090/model/prometheusQuery?query=node_gpu_hourly_cost
You should see output similar to this: {instance="ip-192-168-34-166.us-east-2.compute.internal",instance_type="test.xlarge",node="ip-192-168-34-166.us-east-2.compute.internal",provider_id="aws:///us-east-2b/i-055274d3576800444",region="us-east-2"} 10 | YOUR_HOURLY_COST
Provide a file path for your CSV pricing data in your values.yaml. This path can reference a local PV or an S3 bucket.
Alternatively, mount a ConfigMap with the CSV:
Then set the following Helm values:
For S3 locations, provide file access. Required IAM permissions:
There are two options for adding the credentials to the Kubecost pod:
Service key: Create an S3 service key with the permissions above, then add its ID and access key as a K8s secret:
kubectl create secret generic pricing-schema-access-secret -n kubecost --from-literal=AWS_ACCESS_KEY_ID=id --from-literal=AWS_SECRET_ACCESS_KEY=key
The name of this secret should be the same as csvAccessCredentials in values.yaml above
AWS IAM (IRSA) service account annotation
Negotiated discounts are applied after cost metrics are written to Prometheus. Discounts will apply to all node pricing data, including pricing data read directly from the custom provider CSV pipeline. Additionally, all discounts can be updated at any time and changes are applied retroactively.
The following logic is used to match node prices accurately:
First, search for an exact match in the CSV pipeline
If an exact match is not available, search for an existing CSV data point that matches region, instanceType, and AssetClass
If neither is available, fall back to pricing estimates
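The three-step matching order above can be sketched as follows. This is only an illustration of the described precedence, not Kubecost's implementation; the `match_node_price` function, the node dictionary keys, and the returned source labels are hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def match_node_price(node: Dict[str, str], csv_rows: List[Dict[str, str]]) -> Tuple[Optional[float], str]:
    """Return (hourly_price, source) using the precedence described above."""
    node_rows = [r for r in csv_rows if r["AssetClass"] == "node"]

    # Step 1: exact match on the instance identifier.
    for row in node_rows:
        if row["InstanceID"] == node["instance_id"]:
            return float(row["MarketPriceHourly"]), "csv_exact"

    # Step 2: match on region and instance type within the same asset class.
    for row in node_rows:
        if row["Region"] == node["region"] and row["InstanceType"] == node["instance_type"]:
            return float(row["MarketPriceHourly"]), "csv_class"

    # Step 3: no CSV match; the caller falls back to pricing estimates.
    return None, "estimate"


# Hypothetical CSV rows and node used to exercise the function.
rows = [
    {"AssetClass": "node", "InstanceID": "id-1", "Region": "us-east-2",
     "InstanceType": "test.xlarge", "MarketPriceHourly": "10"},
]
exact = match_node_price({"instance_id": "id-1", "region": "us-east-2", "instance_type": "test.xlarge"}, rows)
by_class = match_node_price({"instance_id": "id-2", "region": "us-east-2", "instance_type": "test.xlarge"}, rows)
fallback = match_node_price({"instance_id": "id-3", "region": "eu-west-1", "instance_type": "other"}, rows)
```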
You can check a summary of the number of nodes that have matched with the CSV by visiting /model/pricingSourceCounts. The response is a JSON object of the form:
For teams interested in reducing their Kubernetes costs, it's beneficial to first understand how provisioned resources have been used. There are two major concepts to start with: pod resource efficiency and cluster idle costs.
Pod resource efficiency is defined as the resource utilization versus the resource request over a given time window. It is cost-weighted and can be expressed as follows:
(((CPU Usage / CPU Requested) * CPU Cost) + ((RAM Usage / RAM Requested) * RAM Cost)) / (RAM Cost + CPU Cost)
where CPU Usage = rate(container_cpu_usage_seconds_total) over the time window and RAM Usage = avg(container_memory_working_set_bytes) over the time window
For example, suppose a pod is requesting 2 CPU and 1 GB of RAM and using 500 mCPU and 500 MB, CPU on the node costs $10/CPU, and RAM on the node costs $1/GB. The requested CPU then costs $20 and the requested RAM costs $1, so we have ((0.5/2) * 20 + (0.5/1) * 1) / (20 + 1) = 5.5 / 21 = 26%
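The cost-weighted formula can be checked with a few lines of Python. This is a minimal sketch of the arithmetic only; the `pod_efficiency` function name and its parameters are chosen for this example.

```python
def pod_efficiency(cpu_usage, cpu_request, cpu_cost, ram_usage, ram_request, ram_cost):
    """Cost-weighted efficiency: utilization of each resource weighted by its cost."""
    weighted = (cpu_usage / cpu_request) * cpu_cost + (ram_usage / ram_request) * ram_cost
    return weighted / (cpu_cost + ram_cost)


# Values from the example above: 2 CPU requested at $10/CPU is $20,
# 1 GB requested at $1/GB is $1; usage is 0.5 CPU and 0.5 GB.
efficiency = pod_efficiency(
    cpu_usage=0.5, cpu_request=2, cpu_cost=20.0,
    ram_usage=0.5, ram_request=1, ram_cost=1.0,
)

if __name__ == "__main__":
    print(f"{efficiency:.0%}")
```

The sketch assumes non-zero requests; a pod with no requests would need separate handling.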
Cluster idle cost is defined as the difference between the cost of allocated resources and the cost of the hardware they run on. Allocation is defined as the max of usage and requests. It can also be expressed as follows:
idle_cost = sum(cluster_cost) - (cpu_allocation_cost + ram_allocation_cost + gpu_allocation_cost)
where allocation = max(request, usage)
Node idle cost can be expressed as:
idle_cost = sum(node_cost) - (cpu_allocation_cost + ram_allocation_cost + gpu_allocation_cost)
where allocation = max(request, usage)
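A minimal sketch of the node idle calculation follows, assuming hypothetical prices and a single node. The function names and the numbers are illustrative only.

```python
def allocated(request: float, usage: float) -> float:
    """Allocation is the greater of what was requested and what was used."""
    return max(request, usage)


def idle_cost(total_cost: float, cpu_allocation_cost: float,
              ram_allocation_cost: float, gpu_allocation_cost: float = 0.0) -> float:
    """Cost of the hardware minus the cost of what is allocated on it."""
    return total_cost - (cpu_allocation_cost + ram_allocation_cost + gpu_allocation_cost)


# Hypothetical node: 4 CPU at $10/CPU plus 16 GB at $1/GB, so $56 in total.
cpu_alloc_cost = allocated(request=2.0, usage=1.5) * 10.0  # requests dominate
ram_alloc_cost = allocated(request=4.0, usage=6.0) * 1.0   # usage dominates
node_idle = idle_cost(56.0, cpu_alloc_cost, ram_alloc_cost)
```

Summing the same quantities over every node gives the cluster-level figure.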
Idle costs can therefore be thought of as the cost of capacity where the Kubernetes scheduler could place additional pods without disrupting any existing workloads, but which is currently unused.
Idle can be charged back to pods on a cost-weighted basis or viewed as a separate line item. As an example, consider the following representations:
[ ... ] = cluster
( ... ) = node
wN = workload
-- = idle capacity
Then, a cluster might look like:
[ ( w1, w2, w3, w4, --, --), (w5, --, --, --, --, --) ]
In total, there are 12 units of resources, and idle can be shared as follows:
Separate: In this single cluster across two nodes, there are 7 total idles.
Share By Node: The first node has 4 resources used and 2 idle. The second node has 1 resource used and 5 idle. If you share idle by node, then w1-4 will share 2 idles, and w5 will get 5 idles.
Share By Cluster: The single cluster has 5 resources used and 7 idle. If you share idle by cluster, then w1-5 will share the 7 idles.
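The two sharing modes can be reproduced for the example cluster above. This sketch assumes each workload occupies exactly one unit and that idle is split in proportion to the units used; the data structure and function names are hypothetical.

```python
# The example cluster: two nodes with six units of capacity each.
CLUSTER = {
    "node-1": {"capacity": 6, "workloads": {"w1": 1, "w2": 1, "w3": 1, "w4": 1}},
    "node-2": {"capacity": 6, "workloads": {"w5": 1}},
}


def share_by_node(cluster):
    """Each workload receives a share of the idle units on its own node."""
    shares = {}
    for node in cluster.values():
        used = sum(node["workloads"].values())
        idle = node["capacity"] - used
        for name, units in node["workloads"].items():
            shares[name] = idle * units / used
    return shares


def share_by_cluster(cluster):
    """Each workload receives a share of the idle units across the whole cluster."""
    used = sum(sum(n["workloads"].values()) for n in cluster.values())
    idle = sum(n["capacity"] for n in cluster.values()) - used
    return {
        name: idle * units / used
        for n in cluster.values()
        for name, units in n["workloads"].items()
    }
```

With this data, sharing by node gives w1 through w4 half an idle unit each and w5 all five of its node's idle units, while sharing by cluster gives every workload 7/5 of an idle unit.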
If for example you are aggregating by namespace, idle costs will be distributed to each namespace proportional to how much that namespace costs. Specifically:
namespace_cpu_idle_cost = (namespace_cpu_cost / (total_cpu_cost - idle_cpu_cost)) * idle_cpu_cost
This same principle applies for ram, and also applies to any aggregation that is used (e.g. Deployment, Label, Service, Team).
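The proportional distribution can be sketched as below. The namespace names and dollar amounts are hypothetical; only the formula comes from the text above.

```python
def namespace_idle_share(namespace_cost: float, total_cost: float, idle_cost: float) -> float:
    """Distribute idle cost in proportion to a namespace's share of non-idle cost."""
    return (namespace_cost / (total_cost - idle_cost)) * idle_cost


# Hypothetical CPU costs: $100 total, of which $40 is idle and $60 is allocated.
total_cpu_cost = 100.0
idle_cpu_cost = 40.0
namespace_cpu_costs = {"team-a": 45.0, "team-b": 15.0}

idle_shares = {
    ns: namespace_idle_share(cost, total_cpu_cost, idle_cpu_cost)
    for ns, cost in namespace_cpu_costs.items()
}
```

The shares sum to the full idle cost, so no idle spend is left unassigned.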
The most common pattern for cost reduction is to ensure service owners tune the efficiency of their pods, and ensure cluster owners scale resources to appropriately minimize idle.
It's recommended to target utilization in the following ranges:
CPU: 50%-65%
Memory: 45%-60%
Storage: 65%-80%
Target figures are highly dependent on the predictability and distribution of your resource usage (e.g. P99 vs median), the impact of high utilization on your core product/business metrics, and more. While too low resource utilization is wasteful, too high utilization can lead to latency increases, reliability issues, and other negative behavior.
As of v1.104, cloud data is parsed through the dashboard instead of through Assets. Read our announcement for more information.
The Kubecost Assets dashboard shows Kubernetes cluster costs broken down by the individual backing assets in your cluster (e.g. cost by node, disk, and other assets). It’s used to identify spend drivers over time and to audit Allocation data. This view can also optionally show out-of-cluster assets by service, tag/label, etc.
Similar to our Allocation API, the Assets API uses our ETL pipeline, which aggregates data daily. This allows Kubecost to operate at enterprise scale with much higher performance.
Kubecost provides a variety of options for configuring your assets queries to view the information you need. Below is a table of the major configuration options, with in-depth explanations in this article for how they work.
Select the date range of the report by setting specific start and end dates, or using one of the preset options.
Here you can aggregate cost by native Kubernetes concepts. While selecting Single Aggregation, you will only be able to select one concept at a time. While selecting Multi Aggregation, you will be able to filter for multiple concepts at the same time. Assets will be by default aggregated by Service.
The Edit icon has additional options to filter your search:
Change the display of your recent assets by service. Daily provides a day-by-day breakdown of assets. Entire window creates a semicircle that shows each asset as a sizable portion based on total cost within the displayed time frame.
View either cumulative or run rate costs measured over the selected time window based on the assets being filtered for.
Cumulative Cost: represents the actual/historical spend captured by the Kubecost agent over the selected time window
Rate metrics: Monthly, daily, or hourly “run rate” cost, also used for projected cost figures, based on samples in the selected time window
Filter assets by category, service, or other means. When a filter is applied, only resources with this matching value will be shown.
The three horizontal dots icon will provide additional options for handling your reports:
Open Report: Open one of your saved reports
Download CSV: Download your current report as a CSV file
The assets metrics table displays your aggregate assets, with four columns to organize by.
Name: Name of the aggregate group
Credits: Amount deducted from total cost due to provider-applied credit. A negative number means the total cost was reduced.
Adjusted: Amount added to total cost based on reconciliation with cloud provider’s billing data.
Total cost: Shows the total cost of the aggregate asset factoring in additions or subtractions from the Credits and Adjusted columns.
Hovering over the gray info icon next to each asset will provide you with the hours run and hourly cost of the asset. To the left of each asset name is one of several Category icons (you can aggregate by these): Storage, Network, Compute, Management, and Other.
Gray bubble text may appear next to an asset. These are labels that have been manually assigned to the asset. To filter assets for a particular label, select the Edit search parameters icon, then select Label/Tag from the Filters dropdown and enter the complete name of the label.
You can select an aggregate asset to view all individual assets comprising it. Each individual asset should have a ProviderID.
After granting Kubecost permission to access cloud billing data, Kubecost adjusts its asset prices once cloud billing data becomes available, e.g. AWS Cost and Usage Report and the spot data feed. Until this data is available from cloud providers, Kubecost uses data from public cloud APIs to determine cost, or alternatively custom pricing sheets. This allows teams to have highly accurate estimates of asset prices in real-time and then become even more precise once cloud billing data becomes available, which is often 1-2 hours for spot nodes and up to a day for reserved instances/savings plans.
While cloud adjustments typically lag by roughly a day, there are certain adjustments, e.g. credits, that may continue to come in over the course of the month, and in some cases at the very end of the month, so reconciliation adjustments may continue to update over time.
This document describes how Kubecost calculates network costs.
Kubecost makes a best effort to allocate network transfer costs to the workloads generating those costs. The level of accuracy depends on several factors described below.
There are two primary factors when determining how network costs are calculated:
A default installation of Kubecost will use the onDemand rates for internet egress and proportionally assign those costs by pod using the metric container_network_transmit_bytes_total. This is not exactly the same as costs obtained via the network costs DaemonSet, but will be approximately similar.
When you enable the network costs DaemonSet, Kubecost has the ability to attribute the network-byte traffic to specific pods. This will allow the most accurate cost distribution, as Kubecost has per-pod metrics for source and destination traffic.
Kubecost uses cloud integration to pull actual cloud provider billing information. Without enabling cloud integration, these prices will be based on public onDemand pricing.
Cloud providers allocate data transfers as line-items on a per-node basis. Kubecost will allocate network transfer costs based on each pod's share of container_network_transmit_bytes_total of its node.
This results in accurate node-level costs. However, the pod or application responsible for the network transfer costs is only estimated.
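The proportional split described above can be sketched as follows. This is an illustration of the idea, not Kubecost's code; the function name, pod names, and byte counts are hypothetical.

```python
from typing import Dict


def allocate_node_network_cost(node_cost: float, pod_tx_bytes: Dict[str, float]) -> Dict[str, float]:
    """Split a node's network cost by each pod's share of transmitted bytes."""
    total = sum(pod_tx_bytes.values())
    if total == 0:
        # No traffic observed, so nothing can be attributed to any pod.
        return {pod: 0.0 for pod in pod_tx_bytes}
    return {pod: node_cost * tx / total for pod, tx in pod_tx_bytes.items()}


# Hypothetical node with $10 of billed transfer and two pods.
costs = allocate_node_network_cost(10.0, {"api": 750.0, "worker": 250.0})
```

Because the split uses transmitted bytes only, a pod that sends mostly free in-zone traffic can be charged the same as one sending billed egress, which is why the per-pod figure is an estimate.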
Enabling both cloud-integration and the networkCosts DaemonSet allows Kubecost to give the most accurate data transfer costs to each pod.
At this time, there is a minor limitation where Kubecost cannot determine accurate costs for pods that use hostNetwork. Today, these pods share all network costs with the node.
This section of the docs will break down how to navigate the Kubecost UI. The UI is composed of several primary dashboards which provide cost visualization, as well as multiple savings and governance tools. Below is the main Overview page, which contains several helpful panels for observing workload stats and trends.
The Kubecost Allocations dashboard allows you to quickly see allocated spend across all native Kubernetes concepts, e.g. namespace, k8s label, and service. It also allows for allocating cost to organizational concepts like team, product/project, department, or environment. This document explains the metrics presented and describes how you can control the data displayed in this view.
Kubecost provides a variety of options for configuring your allocations queries to view the information you need. Below is a table of the major configuration options, with in-depth explanations in this article for how they work.
Select the date range of the report, called the window, by setting specific start and end dates, or by using one of the preset options. You can use Select Start and Select End to establish custom date ranges as well.
Step size refers to the length of time of each group of data displayed on your dashboard across the window. Options are Default, Daily, Weekly, Monthly, and Quarterly. When retaining long periods of data through custom configurations (such as Prometheus), consider using larger step sizes to avoid potential display errors. The step size when selecting Default is dependent on the size of your window.
Here you can aggregate cost by namespace, deployment, service, and other native Kubernetes concepts. While selecting Single Aggregation, you will only be able to select one concept at a time. While selecting Multi Aggregation, you will be able to filter for multiple concepts at the same time.
Service in this context refers to a Kubernetes object that exposes an interface to outside consumers.
When aggregating by namespace, the Allocations dashboard will only display namespaces that have or have had workloads running in them. If you don't see a namespace on this dashboard, you should confirm whether the namespace is running a workload.
To find what pods are not part of the relevant label set, you can either apply an __unallocated__ label filter in this allocation view or explore variations of the following kubectl commands:
The Edit icon has additional options for configuring your query such as how to display your data, adding filters, and configuring shared resources.
As an example, if your cluster is only 25% utilized, as measured by the max of resource usage and requests, applying idle costs would proportionately increase the cost of each pod/namespace/deployment by 4x. This feature can be enabled by default in Settings.
The idle costs dropdown allows you to choose how you wish your idle costs to be displayed:
Hide: Hide idle costs completely.
Separate: Idle costs appear as their own cost, visualized as a gray-colored bar in your display table.
Share By Cluster: Idle costs are grouped by the cluster they belong to.
Share By Node: Idle costs are grouped by the node they belong to.
View Allocation data in the following formats:
Cost: Total cost per aggregation over date range
Cost over time: Cost per aggregation broken down over days or hours depending on date range
Efficiency over time: Shows resource efficiency over given date range
Proportional cost: Cost per aggregate displayed as a percentage of total cost over date range
Cost Treemap: Hierarchically structured view of costs in current aggregation
You can select Edit > Chart > Cost over time from the dropdown to have your data displayed on a per-day basis. Hovering over any day's data will provide a breakdown of your spending.
View either cumulative or run rate costs measured over the selected time window based on the resources allocated.
Cumulative Cost: represents the actual/historical spend captured by the Kubecost agent over the selected time window
Rate metrics: Monthly, daily, or hourly "run rate" cost, also used for projected cost figures, based on samples in the selected time window
Cost allocations are based on the following:
Resources allocated, i.e. max of resource requests and usage
The cost of each resource
The amount of time resources were provisioned
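Taken together, the three factors above suggest a simple product. The sketch below is an assumption about how they combine for a single resource, with a hypothetical price and time window; it is not a statement of Kubecost's exact cost model.

```python
def allocated_cost(request: float, usage: float, price_per_unit_hour: float, hours: float) -> float:
    """Cost of one resource: max(request, usage) times unit price times time provisioned."""
    return max(request, usage) * price_per_unit_hour * hours


# Hypothetical pod: requests 2 CPU, uses 0.5 CPU, at $0.25 per CPU-hour for 24 hours.
cpu_cost = allocated_cost(request=2.0, usage=0.5, price_per_unit_hour=0.25, hours=24.0)
```

Here the request drives the cost because it exceeds usage; a pod using more than it requested would be charged for its usage instead.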
Filter resources by namespace, clusterID, and/or Kubernetes label to more closely investigate a rise in spend or key cost drivers at different aggregations such as deployments or pods. When a filter is applied, only resources with this matching value will be shown. These filters are also applied to external out-of-cluster (OOC) asset tags. Supported filters are as follows:
Comma-separated lists are supported to filter by multiple categories, e.g. namespace filter equals kube-system,kubecost. Wild card filters are also supported, indicated by a * following the filter, e.g. namespace=kube* to return any namespace beginning with kube.
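The filter semantics described above (comma-separated values and a trailing * wildcard) can be sketched as follows. The `matches_filter` helper is hypothetical and only models the two behaviors mentioned in the text.

```python
def matches_filter(value: str, filter_expr: str) -> bool:
    """Return True if value matches any entry in a comma-separated filter."""
    for pattern in (p.strip() for p in filter_expr.split(",")):
        if not pattern:
            continue
        if pattern.endswith("*"):
            # A trailing * matches any value beginning with the prefix.
            if value.startswith(pattern[:-1]):
                return True
        elif value == pattern:
            return True
    return False


namespaces = ["kube-system", "kubecost", "default"]
exact_list = [ns for ns in namespaces if matches_filter(ns, "kube-system,kubecost")]
wildcard = [ns for ns in namespaces if matches_filter(ns, "kube*")]
```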
Select how shared costs set on the settings page will be shared among allocations. Pick from default shared resources, or select a custom shared resource. A custom shared resource can be selected in the Configure custom shared resources feature at the bottom of the Edit window.
The three horizontal dots icon (directly next to Save) will provide additional options for handling your report:
Open Report: Allows you to open one of your saved reports without first navigating to the Reports page
Alerts: Send one of four reports routinely: recurring, efficiency, budget, and spend change
Download CSV: Download your current report as a CSV file
Download PDF: Download your current report as a PDF file
Cost allocation metrics are available for both in-cluster and OOC resources:
The rightmost column in the Allocations metrics table allows you to perform additional actions on individual line items (functionality will vary based on how you aggregate):
Inspect: Opens an advanced cost overview of the namespace in a new tab.
Inspect Shared Costs: Opens an advanced cost overview of your shared costs in a new tab.
Cloud provider service keys can be used in various aspects of the Kubecost installation. This includes configuring cloud integrations, multi-cluster federation, and ETL backups. While automated IAM authentication via a Kubernetes service account like AWS IRSA is recommended, there are some scenarios where key-based authentication is preferred. When this method is used, rotating the keys at a pre-defined interval is a security best practice. Combinations of these features can be used, and therefore you may need to follow one or more of the below steps.
There are multiple methods for adding cloud provider keys to Kubecost when configuring a cloud integration. This article will cover all three procedures. Be sure to use the same method that was used during the initial installation of Kubecost when rotating keys. See the doc for additional details.
The preferred and most common is via the multi-cloud cloud-integration.json Kubernetes secret.
The second method is to define the appropriate secret in Kubecost's Helm values.
The final method to configure keys is via the Kubecost Settings page.
The primary sequence for setting up your key is:
Modify the appropriate Kubernetes secret, Helm value, or update via the Settings page.
Restart the Kubecost cost-analyzer pod.
Verify the new key is working correctly. Any authentication errors should be present early in the cost-model container logs from the cost-analyzer pod. Additionally, you can check the status of the cloud integration in the Kubecost UI via Settings > View Full Diagnostics.
There are two methods for enabling multi-clustering in Kubecost:
Depending on which method you are using, the key rotation process differs.
With Federated ETL objects, storage keys can be provided in two ways. The preferred method is using the secret defined by the Helm value .Values.kubecostModel.federatedStorageConfigSecret. The alternate method is to re-use the ETL backup secret defined with the .Values.kubecostModel.etlBucketConfigSecret Helm value.
Update the appropriate Kubernetes secret with the new key on each cluster.
Restart the Kubecost cost-analyzer pod.
Restart the Kubecost federator pod.
Verify the new key is working correctly by checking the cost-model container logs from the cost-analyzer pod for any object storage authentication errors. Additionally, verify there are no object storage errors in the federator pod logs.
Update the kubecost-thanos Kubernetes secret with the new key on each cluster.
Restart the prometheus server pod installed with Kubecost on all clusters (including the primary cluster) that write data to the Thanos object store. This will ensure the Thanos sidecar has the new key.
On the primary Kubecost cluster, restart the thanos-store pod.
Verify the new key is working correctly by checking the thanos-sidecar logs in the prometheus server pods for authentication errors to ensure they are able to write new block data to the object storage.
Verify the new key is working correctly by checking thanos-store pod logs on the primary cluster for authentication errors to ensure it is able to read block data from the object storage.
Modify the appropriate Kubernetes secret.
Restart the Kubecost cost-analyzer pod.
Verify the backups are still being written to the object storage.
Efficiency targets can depend on the SLAs of the application. See our for more details.
Network costs DaemonSet: Must be enabled in order to view network costs
Cloud integration: Optional, allows for accurate cloud billing information
Learn how to enable the network costs DaemonSet in seconds.
Cost aggregations are also visible by other meaningful organizational concepts, e.g. Team, Department, and Product. These aggregations are based on Kubernetes labels, referenced at both the pod and namespace level, with pod-level labels being favored over the namespace label when both are present. The Kubernetes label name used for these concepts can be configured in Settings or in your Helm values after setting kubecostProductConfigs.labelMappingConfigs.enabled to true. Workloads without the relevant label will be shown as __unallocated__.
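The label precedence described above (pod label first, then namespace label, otherwise __unallocated__) can be sketched as follows. The `resolve_owner` helper and the example label names are hypothetical.

```python
from typing import Dict

UNALLOCATED = "__unallocated__"


def resolve_owner(label_name: str, pod_labels: Dict[str, str], namespace_labels: Dict[str, str]) -> str:
    """Resolve an organizational concept, favoring the pod label over the namespace label."""
    if label_name in pod_labels:
        return pod_labels[label_name]
    if label_name in namespace_labels:
        return namespace_labels[label_name]
    return UNALLOCATED


# Hypothetical labels for three workloads aggregated by a "team" label.
from_pod = resolve_owner("team", {"team": "payments"}, {"team": "platform"})
from_namespace = resolve_owner("team", {}, {"team": "platform"})
missing = resolve_owner("team", {"app": "web"}, {})
```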
Kubernetes annotations can also be used for cost allocation purposes, but this requires enabling a Helm flag. To see the annotations, you must add them to the label groupings via Settings or in your Helm values. Annotations will not work as one-off labels added into reports directly; they only work when added to the label groups in Settings or in the Helm values.
Allocating proportionately distributes slack or idle cluster costs to tenants. Idle refers to resources that are provisioned but not being fully used or requested by a tenant.
To learn more about sharing idle costs, see .
You can also implement more advanced forms of filtering to include or exclude values including prefixes or suffixes for any of the above categories in the table. Selecting the filtering dropdown (default Equals) will show you all available filtering options. These are reflective of Kubecost's .
View Right-Sizing: Opens the page in a new tab.
Thanos federation makes use of the kubecost-thanos Kubernetes secret.
ETL backups rely on the secret defined by the Helm value .Values.kubecostModel.etlBucketConfigSecret.
Date Range (Last 7 days): Will report Last 7 days by default. Manually select your start and end date, or pick one of twelve preset options.
Aggregate By: Aggregate costs by one or several concepts. Add custom labels.
Save/Unsave: Save or unsave the current report.
Edit: Adjust cost metrics and how data is displayed.
Additional options icon: Additional options for opening and downloading reports.
Date Range: Will report Last 7 days by default. Manually select your start and end date, or choose a preset option.
Aggregate By: Aggregate costs by one or several concepts. Add custom labels.
Save/Unsave: Save or unsave the current report.
Edit: Includes multiple filtering tools including cost metric and shared resources.
Additional options icon: Additional options for opening and downloading reports.
Cluster: Limit results to workloads in a set of clusters with matching IDs. Note: clusterID is passed in values at install-time.
Node: Limit results to workloads running on nodes with a matching name.
Namespace: Limit results to workloads in a set of Kubernetes namespaces.
Label: Limit results to workloads with matching Kubernetes labels. Namespace labels are applied to all of its workloads. Supports filtering by the __unallocated__ field as well.
Service: Limit results to workloads based on Kubernetes service name.
Controller: Limit results to workloads based on Kubernetes controller name.
Controllerkind: Limit results to workloads based on Kubernetes controller type (DaemonSet, Deployment, Job, StatefulSet, ReplicaSet, etc.).
Pod: Limit results to workloads with a matching Kubernetes pod name.
CPU: The total cost of CPU allocated to this object, e.g. namespace or deployment. The amount of CPU allocated is the greater of CPU usage and CPU requested over the measured time window. The price of allocated CPU is based on cloud billing APIs or custom pricing sheets. Learn more.
GPU: The cost of GPUs requested by this object, as measured by resource limits. Prices are based on cloud billing prices or custom pricing sheets for on-prem deployments. Learn more.
RAM: The total cost of memory allocated to this object, e.g. namespace or deployment. The amount of memory allocated is the greater of memory usage and memory requested over the measured time window. The price of allocated memory is based on cloud billing APIs or custom pricing sheets. Learn more.
Persistent Volume (PV) Cost: The cost of persistent storage volumes claimed by this object. Prices are based on cloud billing prices or custom pricing sheets for on-prem deployments.
Network: The cost of network traffic based on internet egress, cross-zone egress, and other billed transfer. Note: these costs must be enabled. Learn more. When network traffic costs are not enabled, the node network costs from the cloud service provider's billing integration will be spread proportionally based on cost-weighted usage.
Load Balancer (LB) cost: The cost of the cloud-service load balancers that have been allocated.
Shared: The cost of shared resources allocated to this tenant. This field covers shared overhead, shared namespaces, and shared labels. Can be explored further via Inspect Shared Costs. Idle costs are not included in Shared costs.
Cost Efficiency: The percentage of requested CPU & memory dollars utilized over the measured time window. Values range from 0 to above 100 percent. Workloads with no requests but with usage OR workloads with usage > request can report efficiency above 100%.
Advanced Reporting is a beta feature. Read the documentation carefully.
Advanced Reporting allows teams to sculpt and tailor custom reports to easily view the information they care about. Providing an intersection between Kubernetes allocation and cloud assets data, this tool provides insight into important cost considerations for both workload and external infrastructure costs.
Begin by accessing the Reports page. Select Create a report, then select Advanced Report. The Advanced Reporting page opens.
Advanced Reporting will display your Allocations data and allow for similar configuring and editing. However, that data can now also intersect your cloud service, provider, or accounts.
Some line items will display a magnifying lens icon next to the name. Selecting this icon will provide a Cloud Breakdown which compares Kubernetes costs and out-of-cluster (OOC) costs. You will also see OOC costs broken down by cloud service provider (CSP).
The Advanced Reporting page manages the configurations which make up a report. Review the following tools which specify your query:
Date Range: Manually select your start and end date, or choose a preset option. Default is Last 7 days.
Aggregate By: Field by which to aggregate results, such as by Namespace, Cluster, etc.
The Service aggregation in this context refers to a Kubernetes object that exposes an interface to outside consumers, not a CSP feature.
Selecting Edit will open a slide panel with additional configuration options.
When a filter is applied, only results matching that value will display.
Field to handle default and custom shared resources (adjusted on the Settings page). Configure custom shared overhead costs, namespaces, and labels
After completing all configurations for your report, select Save. A name for your report based on your configuration will be auto-generated, but you have the option to provide a custom name. Finalize by selecting Save.
Like Allocations and Assets reports, Advanced Reports can be saved to your organization instead of locally.
Line items that possess any out-of-cluster (OOC) costs, i.e. cloud costs, will display a magnifying lens icon next to their name. Selecting this icon will open a slide panel that compares your K8s and OOC costs.
You can choose to aggregate those OOC costs by selecting the Cloud Breakdown button next to Aggregate By then selecting from one of the available options. You can aggregate by Provider, Service, Account, or use Custom Data Mapping to override default label mappings.
This feature is in beta. Please read the documentation carefully.
Kubecost can automatically implement its recommendations for container resource requests if you have the Cluster Controller component enabled. Using container request right-sizing (RRS) allows you to instantly optimize resource allocation across your entire cluster. You can easily eliminate resource over-allocation in your cluster, which paves the way for vast savings via cluster right-sizing and other optimizations.
There are no restrictions on receiving container RRS recommendations.
To adopt these recommendations, you must enable the Cluster Controller on that cluster. In order for Kubecost to apply a recommendation, it needs write access to your cluster, which is enabled with the Cluster Controller.
Select Savings in the left navigation, then select Right-size your container requests. The Request right-sizing recommendations page opens.
Select Customize to modify the right-sizing settings. Your customization settings will tell Kubecost how to calculate its recommendations, so make sure it properly represents your environment and activity:
Window: Duration of deployment activity Kubecost should observe
Profile: Select from Development, Production, or High Availability, which come with preconfigured values for the CPU/RAM target utilization fields. Selecting Custom will allow you to manually configure these fields.
CPU/RAM recommendation algorithm: Always configured to Max.
CPU/RAM target utilization: Refers to the percentage of used resources over total resources available.
Add Filters: Optional configuration to limit the deployments which will have right-sizing recommendations applied. This will provide greater flexibility in optimizing your environment. Ensure you select the plus icon next to the filter value text box to add the filter. Multiple filters can be added.
When finished, select Save.
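To show how the Max algorithm and a target utilization might combine, here is a minimal sketch. It assumes the recommended request is the maximum observed usage in the window divided by the target utilization; that formula, the `recommend_request` helper, and the sample values are assumptions for illustration and are not confirmed by this page.

```python
from typing import Sequence


def recommend_request(usage_samples: Sequence[float], target_utilization: float) -> float:
    """Assumed formula: peak usage in the window divided by the target utilization."""
    if not usage_samples:
        raise ValueError("at least one usage sample is required")
    if not 0 < target_utilization <= 1:
        raise ValueError("target_utilization must be in the range (0, 1]")
    return max(usage_samples) / target_utilization


# Hypothetical CPU usage samples (in cores) observed over the window.
samples = [0.20, 0.35, 0.50]
recommended_cpu = recommend_request(samples, target_utilization=0.8)
```

Under this assumption, a lower target utilization leaves more headroom above peak usage, which is why profiles for more critical environments would typically use lower targets.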
Your configured recommendations can also be downloaded as a CSV file by selecting the three dots button > Download CSV.
There are several ways to adopt Kubecost's container RRS recommendations, depending on how frequently you wish to utilize this feature for your container requests.
To apply RRS as you configured in one instance, select Resize Requests Now > Yes, apply the recommendation.
Also referred to as continuous container RRS, autoscaling allows you to configure a schedule to routinely apply RRS to your deployments. You can configure this by selecting Enable Autoscaling, selecting your Start Date and schedule, then confirming with Apply.
Both one-click and continuous container RRS can be configured via Savings Actions. On the Actions page, select Create Action, then select either:
Request Sizing: Will open the Container RRS page with the schedule window open to configure and apply.
Guided Sizing: Will open the Guided Sizing page and allow you to apply one-click RRS, then continuous cluster sizing.
Actions is currently in beta. Please read the documentation carefully.
Actions is only available with a Kubecost Enterprise plan.
The Actions page is where you can create scheduled savings actions that Kubecost will execute for you. The Actions page supports creating actions for multiple turndown and right-sizing features.
Actions can only be applied to your primary cluster. To use Actions on a secondary cluster, you must manually switch to that cluster via the front end.
The Actions page will exist inside the Savings folder in the left navigation, but must first be enabled before it appears. The two steps below which enable Kubecost Actions do not need to be performed sequentially as written.
Because the Actions page is currently a beta feature, it does not appear as part of Kubecost's base functionality. To enable experimental features, select Settings from the left navigation, then toggle on the Enable experimental features switch. Select Save at the bottom of the Settings page to confirm your changes. The Actions page will now appear in your left navigation, but you will not be able to perform any actions until you've enabled the Cluster Controller (see below).
Some features included in Kubecost Actions are only available in GKE/EKS environments. See the Cluster Controller doc for more clarity on which features you will have access to after enabling the Cluster Controller.
On the Actions page, select Create Action in the top right. The Create New Action window opens.
You will have the option to perform one of several available Actions:
Cluster Turndown: Schedule clusters to spin down when unused and back up when needed
Request Sizing: Ensure your containers aren't over-provisioned
Cluster Sizing: Configure your cluster in the most cost-effective way
Namespace Turndown: Schedule unused workloads to spin down
Guided Sizing: Continuous container and node right-sizing
Selecting one of these Actions will take you off the Actions page to an Action-specific page where you can configure and perform the action.
If the Cluster Controller was not properly enabled, the Create New Action window will inform you and limit functionality until the Cluster Controller has been successfully enabled.
Cluster Turndown is a scheduling feature that allows you to reduce costs for clusters when they are not actively being used, without spinning them down completely. This is done by temporarily removing all existing nodes except for master nodes. The Cluster Turndown page allows you to create a schedule for when to turn your cluster down and up again.
Selecting Cluster Turndown from the Create new action window will take you to the Cluster Turndown page. The page should display available clusters for turndown. Begin by selecting Create Schedule next to the cluster you wish to turn down. Select what date and time you wish to turn down the cluster, and what date and time you wish to turn it back up. Select Apply to finalize.
You can delete an existing turndown schedule by selecting the trash can icon.
Learn more about cluster turndown's advanced functionality here.
See the existing documentation on Automatic Request Right-Sizing to learn more about this feature. If you have successfully enabled the Cluster Controller, you can skip the Setup section of that article.
Cluster Sizing will provide right-sizing recommendations for your cluster by determining the cluster's needs based on the type of work running, and the resource requirements. You will receive a simple (uses one node type) and a complex (uses two or more node types) recommendation.
Kubecost may hide the complex recommendation when it is more expensive than the simple recommendation, and present a single recommendation instead.
Visiting the Cluster Sizing Recommendations page from the Create New Action window will immediately prompt you with a suggested recommendation that will replace your current node pools with the displayed node pools. You can select Adopt to immediately resize, or select Cancel if you want to continue exploring.
Learn more about cluster right-sizing functionality here.
Namespace turndown allows you to take action to delete your abandoned workloads. Instead of requiring the user to manually size down or delete their unused workloads, Kubecost can delete namespaces full of idle pods either once or on a continual basis. This can be helpful for routine cleanup of neglected resources. Namespace turndown is supported on all cluster types.
Selecting Namespace Turndown from the Create new action window will open the Namespace Turndown page.
Begin by providing a name for your Action in the Job Name field. For the schedule, provide a cron string that determines when the turndown occurs (leave this field as 0 0 * * * by default to perform turndown every night at midnight).
For schedule type, select Scheduled or Smart from the dropdown.
Scheduled turndown will delete all non-ignored namespaces.
Smart turndown will confirm that all workloads in the namespace are idle before deleting.
Then you can provide optional values for the following fields:
Ignore Targets: Filter out namespaces you don't want turned down. Supports "wildcard" filtering: by ending your filter with *, you can filter for multiple namespaces which include that filter. For example, entering kube* will prevent any namespace featuring kube from being turned down. Namespace turndown will ignore namespaces named kube-*, the default namespace, and the namespace the Cluster Controller is enabled on.
Ignore labels: Filter out key-value labels that you don't want turned down.
Select Create Schedule to finalize.
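The Ignore Targets wildcard behavior can be illustrated with a short Python sketch. This is not Kubecost's implementation: the function name `is_ignored` is hypothetical, and the sketch assumes that a trailing * matches any namespace that starts with the text before it.

```python
from typing import Iterable


def is_ignored(namespace: str, ignore_targets: Iterable[str]) -> bool:
    """Return True if the namespace matches any ignore target."""
    for target in ignore_targets:
        if target.endswith("*"):
            # Assumption: a trailing * is treated as a prefix match.
            if namespace.startswith(target[:-1]):
                return True
        elif namespace == target:
            return True
    return False


# Hypothetical namespaces, for illustration only.
targets = ["kube*", "default"]
candidates = ["kube-system", "kube-public", "default", "staging", "team-a"]
to_turn_down = [ns for ns in candidates if not is_ignored(ns, targets)]
print(to_turn_down)  # ['staging', 'team-a']
```

With these inputs, only staging and team-a remain eligible for turndown.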
Guided Kubernetes Sizing provides a one-click or continuous right-sizing solution in two steps, request sizing and then cluster sizing. These implementations function exactly like Kubecost's existing container and cluster right-sizing features.
In the first collapsible tab, you can configure your container request sizing.
The Auto resizing toggle switch will determine whether you want to perform a one-time resize, or a continuous auto-resize. Default is one-time (off).
Frequency: Only available when Auto resizing is toggled on. Determines how frequently right-sizing will occur. Options are Day, Week, Monthly, or Quarterly.
Start Time: Only available when Auto resizing is toggled on. Determines the day, and time of day, that auto-resizing will begin occurring. Will default to the current date and time if left blank.
Select Start One-Time Resize/Start Auto-Resizing Now to finalize.
In the second collapsible tab, you can configure continuous cluster sizing.
Architecture: Supports x86 or ARM.
Target Utilization: How much excess resource nodes should be configured with to account for variable or increasing resource consumption. Default is 0.8.
Frequency: Determines how frequently right-sizing will occur. Options are Day, Week, Monthly, or Quarterly.
Start Time: Determines the day, and time of day, that auto-resizing will begin occurring. Will default to the current date and time if left blank.
Select Enable Auto-Resizing Now to finalize.
Once you have successfully created an Action, you will see it on the Actions page under Scheduled Actions. Here you will be able to view a Schedule, the Next Run, Affected Workloads, and the Status. You can select Details to view more information about a specific Action, or delete the scheduled Action by selecting the trash can icon.
Kubecost alerts allow teams to receive updates on real-time Kubernetes spend. They are configurable via the Kubecost UI or Helm values. This resource gives an overview of how to configure alerts sent through email, Slack, and Microsoft Teams using Kubecost Helm chart values. Alerts are either created to monitor specific data sets and trends, or they must be toggled on or off. The following alert types are supported:
Allocation Budget: Sends an alert when spending crosses a defined threshold
Allocation Efficiency: Detects when a Kubernetes tenant is operating below a target cost-efficiency threshold
Allocation Recurring Update: Sends an alert with cluster spending across all or a subset of Kubernetes resources.
Allocation Spend Change: Sends an alert reporting unexpected spend increases relative to moving averages
Asset Budget: Sends an alert when spend for a particular set of assets crosses a defined threshold.
Asset Recurring Update: Sends an alert with asset spend across all or a subset of cloud resources.
Cloud Cost Budget: Sends an alert when the total cost of cloud spend goes over a set budget limit.
Monitor Cluster Health: Used to determine if the cluster's health score changes by a specific threshold. Can only be toggled on/off.
Monitor Kubecost Health: Used for production monitoring for the health of Kubecost itself. Can only be toggled on/off.
values.yaml is a source of truth. Alerts set through values.yaml will continually overwrite any manual alert settings set through the Kubecost UI.
The alert settings, under global.notifications.alertConfigs in cost-analyzer/values.yaml, accept four global fields:
frontendUrl: optional, your cost analyzer front-end URL used for linkbacks in alert bodies
globalSlackWebhookUrl: optional, a global Slack webhook used for alerts, enabled by default if provided
globalMsTeamsWebhookUrl: optional, a global Microsoft Teams webhook used for alerts, enabled by default if provided
globalAlertEmails: a global list of emails for alerts
Example Helm values.yaml:
In addition to all global... fields, every alert allows optional individual ownerContact (a list of email addresses), slackWebhookUrl (if different from globalSlackWebhookUrl), and msTeamsWebhookUrl (if different from globalMsTeamsWebhookUrl) fields. Alerts will default to the global settings if these optional fields are not supplied.
Defines spend budgets and alerts on budget overruns.
type: budget -- Alert type.
window: <N>d or <M>h -- The date range over which to query items. Configurable where 1 ≤ N ≤ 7, or 1 ≤ M ≤ 24.
aggregation: <agg-parameter>
filter: <value>,<value2>... -- Optional. Configurable, accepts any 1 or more values of aggregate type as comma-separated values.
threshold: <amount> -- Cost threshold in configured currency units.
Example Helm values.yaml:
Alerts when Kubernetes tenants, e.g. namespaces or label sets, are running below defined cost-efficiency thresholds.
type: efficiency -- Alert type.
window: <N>d or <M>h -- The date range over which to query items. Configurable where 1 ≤ N ≤ 7, or 1 ≤ M ≤ 24.
aggregation: <agg-parameter>
filter: <value>,<value2>... -- Optional. Configurable, accepts any 1 or more values of aggregate type as comma-separated values.
efficiencyThreshold: <value> -- Optional. Efficiency threshold ranging from 0.0 to 1.0.
spendThreshold: <amount> -- The cost threshold (i.e., budget) in configured currency units.
The example below sends a Slack alert when any namespace is running below 40% cost efficiency and has spent more than $100 during the last day.
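The trigger condition described above (efficiency below 40% and spend above $100) can be sketched in Python. This only illustrates how the two thresholds combine; it is not Kubecost's implementation, and the `NamespaceUsage` record and its field names are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class NamespaceUsage:
    """Hypothetical per-namespace summary for the alert window."""
    name: str
    efficiency: float  # cost efficiency, 0.0 to 1.0
    spend: float       # spend over the window, in configured currency units


def should_alert(usage: NamespaceUsage,
                 efficiency_threshold: float = 0.4,
                 spend_threshold: float = 100.0) -> bool:
    # Both conditions must hold: low efficiency AND meaningful spend.
    return (usage.efficiency < efficiency_threshold
            and usage.spend > spend_threshold)


# Illustrative data only.
usages = [
    NamespaceUsage("batch", efficiency=0.25, spend=340.0),
    NamespaceUsage("sandbox", efficiency=0.10, spend=12.0),
    NamespaceUsage("web", efficiency=0.70, spend=900.0),
]
print([u.name for u in usages if should_alert(u)])  # ['batch']
```

The spend threshold keeps small, inefficient namespaces such as sandbox from generating noise.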
Sends a recurring alert with a summary report of cost and efficiency metrics.
type: recurringUpdate -- Alert type.
window: <N>d or <M>h -- The date range over which to query items. Configurable where 1 ≤ N ≤ 7, or 1 ≤ M ≤ 24.
aggregation: <agg-parameter>
filter: <value>,<value2>... -- Optional. Configurable, accepts any 1 or more values of aggregate type as comma-separated values.
window values:
<N>d where N in [1, 7) -- for every N days
7d or weekly -- for 0:00:00 UTC every Monday
30d or monthly -- for 0:00:00 UTC on the first day of the month
aggregation values:
label -- requires the following format: label:<label_name>
annotation -- requires the following format: annotation:<annotation_name>
This example sends a recurring alert for allocation data for all namespaces every seven days:
Detects unexpected spend increases/decreases relative to historical moving averages.
type: spendChange -- Alert type.
window: <N>d or <M>h -- The date range over which to query items. Configurable where 1 ≤ N ≤ 7, or 1 ≤ M ≤ 24.
aggregation: <agg-parameter>
filter: <value>,<value2>... -- Optional. Configurable, accepts any 1 or more values of aggregate type as comma-separated values.
baselineWindow: <N>d -- Collect data from N days prior to queried items to establish cost baseline. Configurable, where N ≥ 1.
relativeThreshold: <N> -- Percentage of change from the baseline (positive or negative) which will trigger the alert. Configurable where N ≥ -1.
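A minimal Python sketch of how a relative threshold could be evaluated against a baseline. It rests on assumptions that are not confirmed by this document: the baseline is taken as a single average spend figure for a comparable window, a positive threshold triggers on increases, and a negative threshold triggers on decreases. The function names are hypothetical.

```python
def relative_change(current_spend: float, baseline_spend: float) -> float:
    """Fractional change of current spend relative to the baseline."""
    if baseline_spend <= 0:
        raise ValueError("baseline spend must be positive")
    return (current_spend - baseline_spend) / baseline_spend


def spend_change_triggered(current_spend: float,
                           baseline_spend: float,
                           relative_threshold: float) -> bool:
    change = relative_change(current_spend, baseline_spend)
    if relative_threshold >= 0:
        # Assumption: positive thresholds watch for increases.
        return change >= relative_threshold
    # Assumption: negative thresholds watch for decreases.
    return change <= relative_threshold


print(spend_change_triggered(120.0, 100.0, 0.2))    # True: 20% above baseline
print(spend_change_triggered(110.0, 100.0, 0.2))    # False: only 10% above
print(spend_change_triggered(70.0, 100.0, -0.25))   # True: 30% below baseline
```

Under this definition a change of -1 corresponds to spend dropping to zero, which is why a threshold below -1 could never be reached.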
Example Helm values.yaml:
Defines asset budgets and alerts when Kubernetes assets overrun the threshold set.
type: assetBudget -- Alert type.
window: <N>d or <M>h -- The date range over which to query items. Configurable where 1 ≤ N ≤ 7, or 1 ≤ M ≤ 24.
aggregation: <agg-parameter>
filter: <value>,<value2>... -- Optional. Configurable, accepts any 1 or more values of aggregate type as comma-separated values.
threshold: <amount> -- The cost threshold (i.e., budget) in configured currency units.
Example Helm values.yaml:
Sends a recurring alert with a Kubernetes assets summary report.
type: cloudReport -- Alert type.
window: <N>d or <M>h -- The date range over which to query items. Configurable where 1 ≤ N ≤ 7, or 1 ≤ M ≤ 24.
aggregation: <agg-parameter>
filter: <value>,<value2>... -- Optional. Configurable, accepts any 1 or more values of aggregate type as comma-separated values.
window values:
<N>d where N in [1, 7) -- for every N days
7d or weekly -- for 0:00:00 UTC every Monday
30d or monthly -- for 0:00:00 UTC on the first day of the month
aggregation values:
label -- requires the following format: label:<label_name>
annotation -- requires the following format: annotation:<annotation_name>
Two example alerts, one which provides weekly summaries of Kubernetes asset spend data aggregated by cluster, and one which provides weekly summaries of asset spend data for one specific cluster:
Defines cloud cost budgets and alerts when cloud spend overruns the threshold set.
type: cloudCostBudget -- Alert type.
window: <N>d or <M>h -- The date range over which to query items. Configurable where 1 ≤ N ≤ 7, or 1 ≤ M ≤ 24.
aggregation: <agg-parameter> -- Configurable, accepts service, account, provider, invoiceEntity, or label.
filter: <value>,<value2>... -- Optional. Configurable, accepts any 1 or more values of aggregate type as comma-separated values.
threshold: <amount> -- The cost threshold (i.e., budget) in configured currency units.
costMetric: <metric-type> -- Cost metric type. Accepts ListCost, NetCost, AmortizedNetCost, InvoicedCost, and AmortizedCost.
Cluster health alerts occur when the cluster health score changes by a specific threshold. The health score is calculated based on the following criteria:
Low Cluster Memory
Low Cluster CPU
Too Many Pods
Crash Looping Pods
Out of Memory Pods
Failed Jobs
Example Helm values.yaml:
Diagnostic alerts in Kubecost fire when an event impacts product uptime. This feature can be enabled in seconds from a values file. The following events are grouped into distinct categories that each result in a separate alert notification:
Prometheus is unreachable
Kubecost Metrics Availability:
Kubecost exported metrics missing over last 5 minutes
cAdvisor exported metrics missing over last 5 minutes
cAdvisor exported metrics missing expected labels in the last 5 minutes
Kubestate Metrics (KSM) exported metrics missing over last 5 minutes
Kubestate Metrics (KSM) unexpected version
Node Exporter metrics are missing over last 5 minutes.
Scrape Interval prometheus self-scraped metrics missing over last 5 minutes
CPU Throttling detected on cost-model in the last 10 minutes
Clusters Added/Removed (Enterprise Multicluster Support Only)
Required parameters:
type: diagnostic
window: <N>m -- configurable, N > 0
Optional parameters:
diagnostics -- object containing specific diagnostic checks to run (default is true for all). See configuration example below for options:
Example Helm values.yaml:
Cluster Health Alerts and Kubecost Health Alerts work differently from other alert types. While other alerts monitor cost data for cost or efficiency anomalies, these two monitor the health of Kubecost itself, as well as the health of the cluster running Kubecost. For this reason, multiple of these alert types cannot be created. In the UI, switches for these alert types can be toggled either on or off, managing a single instance of each, and allowing the settings of these single instances to be adjusted.
There is no validation around Cluster Health Alerts. If a Health Alert configuration is invalid, it will appear to save, but will not actually take effect. Please check carefully that the alert has a Window and Threshold properly specified.
Global recipients specify a default fallback recipient for each type of message. If an alert does not define any email recipients, its messages will be sent to any emails specified in the Global Recipients email list. Likewise, if an alert does not define a webhook, its messages will be sent to the global webhook, if one is present. Alerts that do define recipients will ignore the global setting for recipients of that type.
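The fallback rule for recipients can be sketched in Python as follows. This is an illustration of the rule as described, not Kubecost's code: `resolve_recipients` is a hypothetical helper, the addresses and URLs are placeholders, and only Slack webhooks are shown for brevity.

```python
from typing import List, Optional


def resolve_recipients(alert: dict,
                       global_emails: List[str],
                       global_webhook: Optional[str]) -> dict:
    """Pick per-alert recipients, falling back to globals per recipient type."""
    # An alert that defines its own emails ignores the global email list.
    emails = alert.get("ownerContact") or global_emails
    # Likewise, an alert-level webhook replaces the global webhook.
    webhook = alert.get("slackWebhookUrl") or global_webhook
    return {"emails": list(emails), "webhook": webhook}


# Placeholder values for illustration.
global_emails = ["finops@example.com"]
global_webhook = "https://hooks.example.com/global"

alert = {"type": "budget", "ownerContact": ["owner@example.com"]}
print(resolve_recipients(alert, global_emails, global_webhook))
# {'emails': ['owner@example.com'], 'webhook': 'https://hooks.example.com/global'}
```

Because the fallback is applied per recipient type, this alert uses its own email list but still posts to the global webhook.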
The remaining alert types all target a set of allocation data with window, aggregation, and filter parameters, and trigger based on the target data. The table results can be filtered using the Filter alerts search bar next to + Create Alert. This input can be used to filter based on alert name, type, aggregation, window, and/or filter.
Select + Create Alert to open the Create Alert window where you configure the details of your alert.
The fields for each alert type should resemble their corresponding Helm values in the above tables.
Alerts can also be edited, removed, and tested from the table. Editing opens a dialog similar to the alert creation dialog, for editing the chosen alert.
When creating an alert, you can have these alerts sent through email, Slack, or Microsoft Teams. You can customize the subject field for an email, and attach multiple recipients. Alerts sent via email will contain a PDF of your report which shows the Kubecost UI for your Allocation/Asset page(s). This can be helpful for distributing visual information to those without immediate access to Kubecost.
The Test arrow icons, as well as a separate Test Alert button in the Edit Alert window, can be used to issue a "test" alert. This can be useful to ensure that alerting infrastructure is working correctly and that an alert is properly configured. Issuing a test from the alert edit modal tests the alert with any modifications that have not yet been saved.
All times in UTC. Alert send times are determined by parsing the supplied window parameter. Alert diagnostics with the next and last scheduled run times are available via <your-kubecost-url>/model/alerts/status.
Supported: weekly and daily special cases, <N>d, <M>h (1 ≤ N ≤ 7, 1 ≤ M ≤ 24)
Currently unsupported: time zone adjustments, windows greater than 7d, windows less than 1h
An <N>d alert sends at 00:00 UTC N day(s) from now, i.e., N days from now rounded down to midnight.
For example, a 5d alert scheduled on Monday will send on Saturday at 00:00, and subsequently the next Thursday at 00:00.
An <N>h alert sends at the earliest time of day after now that is a multiple of N.
For example, a 6h alert scheduled at any time between 12 pm and 6 pm will send next at 6 pm and subsequently at 12 am the next day.
If 24 is not divisible by the hourly window, schedule at next multiple of <N>h after now, starting from the current day at 00:00.
For example, a 7h alert scheduled at 22:00 checks 00:00, 7:00, 14:00, and 21:00, before arriving at the next send time of 4:00 tomorrow.
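The scheduling rules above can be sketched in Python. This is an illustration that reproduces the three documented examples, not the scheduler's actual code: `next_send_time` is a hypothetical name, `now` is assumed to already be in UTC, and the weekly and daily special cases are not handled.

```python
import re
from datetime import datetime, timedelta, timezone


def next_send_time(window: str, now: datetime) -> datetime:
    """Return the next send time for an <N>d or <M>h window (now is UTC)."""
    match = re.fullmatch(r"(\d+)([dh])", window)
    if match is None:
        raise ValueError(f"unsupported window: {window!r}")
    count, unit = int(match.group(1)), match.group(2)

    # Both rules count from 00:00 UTC of the current day.
    midnight = now.replace(hour=0, minute=0, second=0, microsecond=0)

    if unit == "d":
        if not 1 <= count <= 7:
            raise ValueError("day windows must satisfy 1 <= N <= 7")
        # N days from now, rounded down to midnight.
        return midnight + timedelta(days=count)

    if not 1 <= count <= 24:
        raise ValueError("hour windows must satisfy 1 <= M <= 24")
    # Step through multiples of M hours until one falls after now.
    candidate = midnight
    while candidate <= now:
        candidate += timedelta(hours=count)
    return candidate


monday = datetime(2024, 1, 1, 9, 30, tzinfo=timezone.utc)  # a Monday
print(next_send_time("5d", monday))                     # 2024-01-06 00:00:00+00:00
print(next_send_time("6h", monday.replace(hour=13)))    # 2024-01-01 18:00:00+00:00
print(next_send_time("7h", monday.replace(hour=22, minute=0)))  # 2024-01-02 04:00:00+00:00
```

The same stepping loop covers both hourly cases: when 24 is divisible by the window the candidates line up with the clock each day, and otherwise the last step simply carries over into the next day.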
Review these steps to verify alerts are being passed to the Kubecost application correctly.
Check /model/alerts/configs to ensure the alerts system has been configured properly.
Check /model/alerts/status to ensure alerts have been scheduled correctly.
The status endpoint returns all of the running alerts including schedule metadata:
scheduledOn: The date and time (UTC) that the alert was scheduled.
lastRun: The date and time (UTC) that the alert last ran checks (will be set to 0001-01-01T00:00:00Z if the alert has never run).
nextRun: The date and time (UTC) that the alert will next run checks.
lastError: If running the alert checks fails for unexpected reasons, this field will contain the error message.
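A small Python sketch of how these fields could be checked once the status response has been fetched and decoded. The response shape is an assumption: the sketch expects a list of dictionaries carrying the four fields above, plus a hypothetical "type" key used only for labeling. Adjust it to the actual payload returned by your installation.

```python
ZERO_TIME = "0001-01-01T00:00:00Z"


def find_problem_alerts(statuses: list) -> list:
    """Return (label, reason) pairs for alerts that look unhealthy."""
    problems = []
    for status in statuses:
        label = status.get("type", "unknown")  # hypothetical labeling key
        if status.get("lastError"):
            problems.append((label, "lastError: " + status["lastError"]))
        elif status.get("nextRun", ZERO_TIME) == ZERO_TIME:
            # nextRun still at its zero value: the alert was never scheduled.
            problems.append((label, "nextRun has not been scheduled"))
    return problems


# Illustrative sample data, not real endpoint output.
sample = [
    {"type": "budget", "lastRun": ZERO_TIME,
     "nextRun": "2024-01-02T00:00:00Z", "lastError": ""},
    {"type": "efficiency", "lastRun": ZERO_TIME,
     "nextRun": ZERO_TIME, "lastError": ""},
    {"type": "spendChange", "lastRun": "2024-01-01T00:00:00Z",
     "nextRun": "2024-01-02T00:00:00Z", "lastError": "query failed"},
]
for label, reason in find_problem_alerts(sample):
    print(label, "->", reason)
```

A lastRun at its zero value is not flagged on its own, since it only means the alert has not run yet.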
If using Helm:
Run kubectl get configmap alert-configs -n kubecost -o json to view the alerts ConfigMap.
Ensure that the Helm values are successfully read into the ConfigMap under alerts.json under the data field. See below:
Ensure that the JSON string is successfully mapped to the appropriate configs.
Confirm that Kubecost has received configuration data:
Visit the Alerts page in the Kubecost UI to view configured alert settings as well as any of the alerts configured from Helm.
Alerts set up in the UI will be overwritten by Helm values.yaml if the pod restarts.
Additionally, confirm that the alerts scheduler has properly parsed and scheduled the next run for each alert by visiting <your-kubecost-url>/model/alerts/status to view individual alert parameters as well as the next and last scheduled run times for individual alerts.
Confirm that nextRun has been updated from "0001-01-01T00:00:00Z".
If nextRun fails to update, or alerts are not sent at the nextRun time, check pod logs by running kubectl logs $(kubectl get pods -n kubecost | awk '{print $1}' | grep "^kubecost-cost-analyzer.\{16\}") -n kubecost -c cost-model > kubecost-logs.txt
Common causes of misconfiguration include the following:
Unsupported CSV filters: spendChange alerts accept multiple filter values when comma-separated; other alert types do not.
Unsupported alert type: all alert type names are in camelCase. Check spelling and capitalization for all alert parameters.
Unsupported aggregation parameters: see the Allocation API doc for details.
Kubecost can provide and implement recommendations for right-sizing your supported clusters to ensure they are configured in the most cost-effective way. Recommendations are available for any and all clusters. In certain configurations, Kubecost is also capable of taking a recommendation and applying it directly to your cluster in a single step. These two processes are distinguished as viewing cluster recommendations and adopting cluster recommendations, respectively.
Kubecost is also able to implement cluster sizing recommendations on a user-scheduled interval, known as continuous cluster right-sizing.
You can access cluster right-sizing by selecting Savings in the left navigation, then selecting the Right-size your cluster nodes panel.
Kubecost will offer two recommendations: simple (uses one node type) and complex (uses two or more node types). Kubecost may hide the complex recommendation when it is more expensive than the simple recommendation, and present a single recommendation instead. These recommendations and their metrics will be displayed in a chart next to your existing configuration in order to compare values like total cost, node count, and usage.
Kubecost provides its right-sizing recommendations based on the characteristics of your cluster. You have the option to edit certain properties to generate relevant recommendations.
There are multiple dropdown menus to consider:
In the Cluster dropdown, you can select the individual cluster you wish to apply right-sizing recommendations to.
In the Window dropdown, select the number of days to query for your cluster's most recent activity. Options range from 1 day to 7 days. If your cluster has varying performance on different days of the week, it's better to select a longer interval for the most consistent recommendations.
You can toggle on Show optimization inputs to view resources which will determine the minimum size of your nodes. These resources are:
DaemonSet VCPUs/RAM: Resources allocated by DaemonSets on each node.
Max pod VCPUs/RAM: Largest resource allocation by any single Pod in the cluster.
Non-DaemonSet/static VCPUs/RAM: Sum of resources allocated to Pods not controlled by DaemonSets.
Finally, you can select Edit to provide information about the function of your cluster.
In the Profile dropdown, select the most relevant category of your cluster. You can select Production, Development, or High Availability.
Production: Stable cluster activity, will provide some extra space for potential spikes in activity.
Development: Cluster can tolerate a small amount of instability, will run cluster somewhat close to capacity.
High availability: Cluster should avoid instability at all costs, will size cluster with lots of extra space to account for unexpected spikes in activity.
In the Architecture dropdown, select either x86 or ARM. You may only see x86 as an option. This is normal. At the moment, ARM architecture recommendations are only supported on AWS clusters.
With this information provided, Kubecost can provide the most accurate recommendations for running your clusters efficiently. By following some additional steps, you will be able to adopt Kubecost's recommendation, applying it directly to your cluster.
To receive cluster right-sizing recommendations, you must first:
Have a GKE/EKS/AWS Kops cluster
To adopt cluster right-sizing recommendations, you must:
Have a GKE/EKS/AWS Kops cluster
In order for Kubecost to apply a recommendation, it needs write access to your cluster. Write access to your cluster is enabled with the Cluster Controller.
To adopt a recommendation, select Adopt recommendation > Adopt. Implementation of right-sizing for your cluster should take roughly 10-30 minutes.
Recommendations via Kubecost Actions can only be adopted on your primary cluster. To adopt recommendations on a secondary cluster via Kubecost Actions, you must first manually switch to that cluster's Kubecost frontend.
Continuous cluster right-sizing has the same requirements as adopting any cluster right-sizing recommendation. See above for a complete description of prerequisites.
If you are using Persistent Volumes (PVs) with AWS's Elastic Block Store (EBS) Container Storage Interface (CSI), you may run into a problem post-resize where pods are in a Pending state because of a "volume node affinity conflict". This may be because the pod needs to mount an already-created PV which is in an Availability Zone (AZ) without node capacity for the pod. This is a limitation of the EBS CSI.
Kubecost mitigates this problem by ensuring continuous cluster right-sizing creates at least one node per AZ by forcing NodeGroups to have a node count greater than or equal to the number of AZs of the EKS cluster. This will also prevent you from setting a minimum node count for your recommendation below the number of AZs for your cluster. If the EBS CSI continues to be problematic, you can consider switching your CSI to services like Elastic File System (EFS) or FSx for Lustre.
The Spot Readiness Checklist investigates your Kubernetes workloads to attempt to identify those that are candidates to be schedulable on Spot (preemptible) nodes. Spot nodes are deeply-discounted nodes (up to 90% cheaper) from your cloud provider that do not come with an availability guarantee. They can disappear at any time, though most cloud providers guarantee some sort of alert and a small shutdown window, on the order of tens of seconds to minutes, before the node disappears.
Spot-ready workloads, therefore, are workloads that can tolerate some level of instability in the nodes they run on. Examples of Spot-ready workloads are usually state-free: many microservices, Spark/Hadoop nodes, etc.
The Spot Checklist performs a series of checks that use your own workload configuration to determine readiness:
Controller Type (Deployment, StatefulSet, etc.)
Replica count
Local storage
Controller Pod Disruption Budget
Rolling update strategy (Deployment-only)
Manual annotation overrides
You can access the Spot Checklist in the Kubecost UI by selecting Settings > Spot Instances > Spot Checklist.
The checklist is configured to investigate a fixed set of controllers, currently only Deployments and StatefulSets.
Deployments are considered Spot-ready because they are relatively stateless, intended to only ensure a certain number of pods are running at a given time.
StatefulSets should generally be considered not Spot-ready; as their name implies, they usually represent stateful workloads that require the guarantees that StatefulSets provide. Scheduling StatefulSet pods on Spot nodes can lead to data loss.
Workloads with a configured replica count of 1 are not considered Spot-ready because if the single replica is removed from the cluster due to a Spot node outage, the workload goes down. Replica counts greater than 1 signify a level of Spot-readiness because workloads that can be replicated tend to also support a variable number of replicas that can occur as a result of replicas disappearing due to Spot node outages.
Currently, workloads are only checked for the presence of an emptyDir volume. If one is present, the workload is assumed to be not Spot-ready.
More generally, the presence of a writable volume implies a lack of Spot readiness. If a pod is shut down non-gracefully while it is in the middle of a write, data integrity could be compromised. More robust volume checks are currently under consideration.
If you are considering this check while evaluating your workloads for Spot-readiness, do not immediately discount them because of this check failing. Workloads should always be evaluated on a case-by-case basis and it is possible that an unnecessarily strict PDB was configured.
Deployments have multiple options for update strategies and by default they are configured with a Rolling Update Strategy (RUS) with 25% max unavailable. If a deployment has an RUS configured, we do a similar min available (calculated from max unavailable in rounded-down integer form and replica count) calculation as with PDBs, but threshold it at 0.9 instead of 0.5. Doing so ensures that default-configured deployments with replica counts greater than 3 will pass the check.
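The rolling update arithmetic can be worked through in a short Python sketch. It is an illustration rather than Kubecost's implementation. The comparison direction (the check passes when min available divided by replica count is at most 0.9) is an assumption inferred from the statement that default-configured deployments with more than 3 replicas pass, and the function name is hypothetical.

```python
import math


def rolling_update_check(replicas: int,
                         max_unavailable_fraction: float = 0.25,
                         threshold: float = 0.9) -> bool:
    """Return True if the rolling update strategy suggests Spot readiness."""
    # Max unavailable is converted to an integer by rounding down.
    max_unavailable = math.floor(replicas * max_unavailable_fraction)
    min_available = replicas - max_unavailable
    # Assumption: a ratio above the threshold means the availability
    # requirement is too strict for the workload to count as Spot-ready.
    return min_available / replicas <= threshold


for replicas in (1, 3, 4, 10):
    print(replicas, rolling_update_check(replicas))
# 1 False
# 3 False
# 4 True
# 10 True
```

With the default 25% max unavailable, up to 3 replicas round down to 0 unavailable pods, giving a ratio of 1.0 that fails the check. At 4 replicas one pod may be unavailable, the ratio drops to 0.75, and the check passes.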
We also support manually overriding the Spot readiness of a controller by annotating the controller itself or the namespace it is running in with spot.kubecost.com/spot-ready=true.
Kubecost marking a workload as Spot ready is not a guarantee. A domain expert should always carefully consider the workload before approving it to run on Spot nodes.
Most cloud providers support a mix of Spot and non-Spot nodes in the cluster and they have guides:
Different cloud providers have different guarantees on shutdown windows and automatic draining of Spot nodes that are about to be removed. Consult your provider’s documentation before introducing Spot nodes to your cluster.
Additionally, it is generally wise to use smaller size Spot nodes. This minimizes the scheduling impact of individual Spot nodes being reclaimed by your cloud provider. Consider one Spot node of 20 CPU cores and 120 GB RAM against 5 Spot nodes of 4 CPU and 24 GB. In the first case, that single node being reclaimed could force tens of pods to be rescheduled, potentially causing scheduling problems, especially if capacity is low and spinning up a new node takes too long. In the second case, fewer pods are forced to be rescheduled if a reclaim event occurs, thus lowering the likelihood of scheduling problems.
The Cloud Cost Explorer is a dashboard which provides visualization and filtering of your cloud spending. This dashboard includes the costs for all assets in your connected cloud accounts by pulling from those providers' Cost and Usage Reports (CURs) or other cloud billing reports.
If you haven't performed a successful billing integration with a cloud service provider, the Cloud Cost Explorer won't have cost data to display. Before using the Cloud Cost Explorer, make sure to read our Cloud Billing Integrations guide to get started, then see our specific articles for the cloud service providers you want to integrate with.
As of v1.104, Cloud Cost is enabled by default. If you are using v1.104+, you can skip the Installation and Configuration section.
For versions of Kubecost up to v1.103, Cloud Cost needs to be enabled first through Helm, using the following parameters:
Enabling Cloud Cost is required. Optional parameters include:
labelList.labels: Comma-separated list of labels; empty string indicates that the list is disabled
labelList.IsIncludeList: If true, label list is a white list; if false, it is a black list
topNItems: number of sampled "top items" to collect per day
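The include/exclude semantics of the label list can be sketched in Python. This illustrates the described behavior and is not Kubecost's code: `filter_labels` is a hypothetical helper, and the sketch assumes that a disabled (empty) list leaves all labels untouched.

```python
def filter_labels(resource_labels: dict,
                  label_list: str,
                  is_include_list: bool) -> dict:
    """Apply a comma-separated label list as an include or exclude list."""
    names = {name.strip() for name in label_list.split(",") if name.strip()}
    if not names:
        # Empty string: the list is disabled (assumed to keep every label).
        return dict(resource_labels)
    if is_include_list:
        return {k: v for k, v in resource_labels.items() if k in names}
    return {k: v for k, v in resource_labels.items() if k not in names}


# Illustrative labels only.
labels = {"team": "payments", "env": "prod", "owner": "alice"}
print(filter_labels(labels, "team,env", is_include_list=True))
# {'team': 'payments', 'env': 'prod'}
print(filter_labels(labels, "team,env", is_include_list=False))
# {'owner': 'alice'}
```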
While Cloud Cost is enabled, it is recommended to disable Cloud Usage, which is more memory-intensive.
Disabling Cloud Usage will restrict functionality of your Assets dashboard. This is intentional. Learn more about Cloud Usage here.
topNItems
Item-level data in the Cloud Cost Explorer is only a sample of the most expensive entries, determined by the Helm flag topNItems. This value can be increased substantially but can lead to higher memory consumption. If you receive a message in the UI "We don't have item-level data with the current filters applied" when attempting to filter, you may need to increase the value of topNItems (default is 1,000), or reconfigure your query.
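A short Python sketch of why filtering can come back empty when only the most expensive items are sampled. The sampling logic here is an assumption for illustration (a plain top-N selection by cost for one day), and the item names and costs are made up.

```python
import heapq


def top_n_items(daily_items: list, n: int) -> list:
    """Keep the n most expensive (name, cost) items for one day."""
    return heapq.nlargest(n, daily_items, key=lambda item: item[1])


# Made-up item costs for a single day.
day = [("vm-a", 5.0), ("bucket-b", 20.0), ("disk-c", 1.0), ("db-d", 12.0)]

sampled = top_n_items(day, n=2)
print(sampled)  # [('bucket-b', 20.0), ('db-d', 12.0)]

# A filter for a cheap item finds nothing in the sample.
print([item for item in sampled if item[0] == "disk-c"])  # []
```

Raising n brings cheaper items such as disk-c into the sample, at the price of holding more entries in memory.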
You can adjust your displayed metrics using the date range feature, represented by Last 7 days, the default range. This will control the time range of metrics that appear. Select the date range of the report by setting specific start and end dates, or by using one of the preset options.
You can adjust your displayed metrics by aggregating your cost by category. Supported fields are Workspace, Provider, Billing Account, Service, Item, as well as custom labels. The Cloud Cost Explorer dashboard supports single and multi-aggregation. See the table below for descriptions of each field.
Account: The ID of the billing account your cloud provider bill comes from (ex: AWS Management/Payer Account ID, GCP Billing Account ID, Azure Billing Account ID)
Provider: Cloud service provider (ex: AWS, Azure, GCP)
Invoice Entity: Cloud provider account (ex: AWS Account, Azure Subscription, GCP Project)
Service: Cloud provider services (ex: S3, microsoft.compute, BigQuery)
Item: Individual items from your cloud billing report(s)
Labels: Labels/tags on your cloud resources (ex: AWS tags, Azure tags, GCP labels)
Selecting the Edit button will allow for additional filtering and pricing display options for your cloud data.
You can filter displayed dashboard metrics by selecting Edit, then adding a filter. Filters can be created for the following categories to view costs exclusively for items (see descriptions of each category in the Aggregate filters table above):
Service
Account
Invoice Entity
Provider
Labels
Cost Metric
The Cost Metric dropdown allows you to adjust the displayed cost data based on different calculations. Cost Metric values are based on and calculated following standard FinOps dimensions and metrics, but may be calculated differently depending on your CSP. Learn more about how these metrics are calculated by each CSP in the Cloud Cost Metrics doc. The five available metrics supported by the Cloud Cost Explorer are:
Amortized Net Cost: Net Cost with cash upfront fees removed and amortized (default)
Net Cost: Costs inclusive of discounts and credits. Will also include one-time and recurring charges.
List Cost: CSP pricing without any discounts
Invoiced Cost: Pricing based on usage during billing period
Amortized Cost: Effective/upfront cost across the billing period
Your cloud cost spending will be displayed across your dashboard with several key metrics:
K8s Utilization: Percent of cost which can be traced back to Kubernetes cluster
Total cost: Total cloud spending
Sum of Sample Data: Displayed only when aggregating by Item. Sums only the sampled top-cost items for the selected timeframe, so the value may not match your CUR.
All line items, after aggregation, should be selectable, allowing you to drill down to further analyze your spending. For example, when aggregating cloud spend by Service, you can select an individual cloud service (AmazonEC2, for example) and view spending, K8s utilization, and other details unique to that item.
The Clusters dashboard provides a list of all your monitored clusters, as well as additional clusters detected in your cloud bill. The dashboard provides details about your clusters including cost, efficiency, and cloud provider. You are able to filter your list of clusters by when clusters were last seen, activity status, and by name (see below).
Monitoring of multiple clusters is only supported in Kubecost Enterprise plans. Learn more about Kubecost Enterprise's multi-cluster view here.
To enable the Clusters dashboard, you must perform these two steps:
Enable cloud integration for any and all cloud service providers you wish to view clusters with
Enable Cloud Costs
Enabling Cloud Costs through Helm can be done using the following parameters:
Clusters are primarily distinguished into three categories:
Clusters monitored by Kubecost (green circle next to cluster name)
Clusters not monitored by Kubecost (yellow circle next to cluster name)
Inactive clusters (gray circle next to cluster name)
For detail on how Kubecost identifies clusters, see Cloud Cost Metrics.
Monitored clusters are those that have cost metrics which will appear within your other Monitoring dashboards, like Allocations and Assets. Unmonitored clusters are clusters whose existence is determined from cloud integration, but haven't been added to Kubecost. Inactive clusters are clusters Kubecost once monitored, but haven't reported data over a certain period of time. This time period is three hours for Thanos-enabled clusters, and one hour for non-Thanos clusters.
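The three categories above can be sketched as a small classification function. This is only an illustration of the stated rule, not Kubecost's implementation: the function name cluster_status, its parameters, and the use of None for a cluster known only from the cloud bill are hypothetical. Only the one-hour and three-hour thresholds come from the text.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Inactivity thresholds taken from the text above.
THANOS_THRESHOLD = timedelta(hours=3)
DEFAULT_THRESHOLD = timedelta(hours=1)


def cluster_status(last_seen: Optional[datetime], now: datetime, thanos_enabled: bool) -> str:
    """Classify a cluster as 'monitored', 'inactive', or 'unmonitored'.

    last_seen is None for a cluster that is only known from cloud integration
    (hypothetical representation of an unmonitored cluster).
    """
    if last_seen is None:
        return "unmonitored"
    threshold = THANOS_THRESHOLD if thanos_enabled else DEFAULT_THRESHOLD
    # A cluster that has not reported within the threshold is treated as inactive.
    return "inactive" if now - last_seen > threshold else "monitored"
```

With this sketch, a cluster last seen two hours ago would count as inactive without Thanos but still monitored with Thanos enabled.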
Efficiency and Last Seen metrics are only provided for monitored clusters.
Efficiency is calculated as the amount of node capacity that is used, compared to what is available.
Selecting any metric in a specific cluster's row will take you to a Cluster Details page for that cluster which provides more extensive metrics, including assets and namespaces associated with that cluster and their respective cost metrics.
You are able to filter clusters through a window of when all clusters were last seen (default is Last 7 days). Although unmonitored clusters will not provide a metric for Last Seen, they will still appear in applicable windows.
You can also filter your clusters for Active, Inactive, or Unmonitored status, and search for clusters by name.
Reports are saved queries from your various Monitoring dashboards which can be referenced at a later date for convenience. Aggregation, filters, and other details of your query will be saved in the report, and the report can be opened at any time. Reports are currently supported by the Allocations, Assets, and Cloud Cost Explorer dashboards.
Reports can be managed via values.yaml or the Kubecost UI. This reference outlines the process of configuring saved reports through a values file, and provides documentation on the required and optional parameters.
Begin by selecting Create a report. There are five report types available. Three of these correspond to Kubecost's different monitoring dashboards. The other two are specialized beta features.
Allocation Report
Asset Report
Advanced Report (beta)
Cloud Cost Report
Selecting a monitoring report type will take you to the respective dashboard. Provide the details of the query, then select Save. The report will now be saved on your Reports page for easy access.
For help creating an Advanced Report (either type), select the respective hyperlink above for a step-by-step process.
After creating a report, you are able to share that report in recurring intervals via email as a PDF or CSV file. Shared reports replicate your saved query parameters every interval so you can view cost changes over time.
Sharing reports is only available for Allocations, Assets, and Cloud Cost Reports, not either type of Advanced Report.
In the line for the report you want to share, select the three horizontal dots icon in the Actions column. Select Share report from the menu. The Share Report window opens. Provide the following fields:
Interval: Interval that recurring reports will be sent out. Supports Daily, Weekly, and Monthly. Weekly reports default to going out Sunday at midnight. Monthly reports default to midnight on the first of the month. When selecting Monthly and resetting on a day of the month not found in every month, the report will reset at the latest available day of that month. For example, if you choose to reset on the 31st, it will reset on the 30th for months with only 30 days.
Format: Supports PDF or CSV.
Add email: Email(s) to distribute the report to.
Select Apply to finalize. When you have created a schedule for your report, the selected interval will be displayed in the Interval column of your Reports page.
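The monthly reset rule described under Interval (falling back to the latest available day when the chosen day does not exist in a month) can be illustrated with a short sketch. The function name monthly_send_date is hypothetical and this is not Kubecost's scheduling code; it only demonstrates the clamping behavior described above.

```python
import calendar
from datetime import date


def monthly_send_date(year: int, month: int, reset_day: int) -> date:
    """Return the day a monthly report resets, clamped to the month's last day."""
    # monthrange returns (weekday of first day, number of days in month).
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, min(reset_day, last_day))
```

For example, a reset day of 31 resolves to 30 April in April and to the last day of February in February.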
The saved report settings, under global.savedReports, accept two parameters:
enabled: determines whether Kubecost will read saved reports configured via values.yaml; default value is false
reports: a list of report parameter maps
The following fields apply to each map item under the reports key:
title: the title/name of your custom report; any non-empty string is accepted
window: the time window the allocation report covers. The following values are supported:
  keywords: today, week (week-to-date), month (month-to-date), yesterday, lastweek, lastmonth
  number of days: {N}d (last N days), e.g. 30d for the last 30 days
  date range: {start},{end} (comma-separated RFC-3339 date strings or Unix timestamps)
    e.g. 2021-01-01T00:00:00Z,2021-01-02T00:00:00Z for the single day of 1 January 2021
    e.g. 1609459200,1609545600 for the single day of 1 January 2021
  Note: for all window options, if a window is requested that spans "partial" days, the window will be rounded up to include the nearest full date(s).
    e.g. 2021-01-01T15:04:05Z,2021-01-02T20:21:22Z will return the two full days of 1 January 2021 and 2 January 2021
aggregateBy: the desired aggregation parameter, equivalent to Breakdown in the Kubecost UI. Supports:
cluster
container
controller
controllerKind
daemonset
department
deployment
environment
job
label (requires the following format: label:<label_name>)
namespace
node
owner
pod
product
service
statefulset
team
chartDisplay: can be one of category, series, efficiency, percentage, or treemap. See Cost Allocation Charts for more info.
idle: idle cost allocation; supports hide, shareByNode, shareByCluster, and separate
rate: can be one of cumulative, monthly, daily, hourly
accumulate: determines whether or not to sum Allocation costs across the entire window; equivalent to Resolution in the UI. Supports true (Entire window resolution) and false (Daily resolution)
sharedNamespaces: a list containing namespaces to share costs for
sharedOverhead: an integer representing overhead costs to share
sharedLabels: a list of labels to share costs for; requires the following format: label:<label_name>
filters: a list of maps consisting of a property and value
  property: supports cluster, node, namespace, and label
  value: property value(s) to filter on; supports wildcard filtering with a * suffix
    Special case label value examples: app:cost-analyzer, app:cost*
    Wildcard filters only apply to the label value, e.g., ap*:cost-analyzer is not valid
  Note: multiple filter properties evaluate as ANDs, multiple filter values evaluate as ORs
    e.g., (namespace=foo,bar), (node=fizz) evaluates as (namespace == foo || namespace == bar) && node == fizz
  Important: If no filters are used, supply an empty list []
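To make the filter semantics concrete (ANDs across properties, ORs across values, and the * suffix wildcard), here is a minimal sketch. It is not Kubecost's filter engine: the function names, the representation of an item as a flat dictionary, and the assumption that multiple values arrive as one comma-separated string are all illustrative.

```python
from typing import Dict, List


def value_matches(pattern: str, actual: str) -> bool:
    """Match exactly, or by prefix when the pattern ends with a * wildcard."""
    if pattern.endswith("*"):
        return actual.startswith(pattern[:-1])
    return pattern == actual


def matches_filters(item: Dict[str, str], filters: List[Dict[str, str]]) -> bool:
    """AND across filter properties, OR across the values of one property."""
    for f in filters:
        actual = item.get(f["property"])
        if actual is None:
            return False
        # Assumption: multiple values are given as a comma-separated string.
        patterns = [v.strip() for v in f["value"].split(",")]
        if not any(value_matches(p, actual) for p in patterns):
            return False
    # An empty filter list matches everything.
    return True
```

Under these assumptions, the filters (namespace=foo,bar) and (node=fizz) accept an item in namespace bar on node fizz, and reject an item in namespace foo on any other node.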
When reports are defined via values.yaml by setting global.savedReports.enabled = true in the values file, the reports defined in values.yaml are created when the Kubecost pod starts. Reports can still be freely created/deleted via the UI while the pod is running. However, when the pod restarts, whatever is defined in the values file supersedes any UI changes.
Generally, the ConfigMap, if present, serves as the source of truth at startup.
If saved reports are not provided via values.yaml, meaning global.savedReports.enabled = false, reports created via the UI are saved to a persistent volume and persist across pod restarts.
Review these steps to verify that saved reports are being passed to the Kubecost application correctly:
1. Confirm that global.savedReports.enabled is set to true
2. Ensure that the Helm values are successfully read into the ConfigMap
   Run helm template ./cost-analyzer -n kubecost > test-saved-reports-config.yaml
   Open test-saved-reports-config.yaml
   Find the section starting with # Source: cost-analyzer/templates/cost-analyzer-saved-reports-configmap.yaml
   Ensure that the Helm values are successfully read into the ConfigMap under the data field. Example below.
3. Ensure that the JSON string is successfully mapped to the appropriate configs
   Navigate to your Reports page in the Kubecost UI and ensure that the configured report parameters have been set by selecting the Report name.
Cost Center Report is a beta feature. Please share your feedback as we are in active development of this feature.
A Cost Center Report (CCR) allows you to join your Kubernetes resource costs with cloud-native services. For example, it allows combining S3 and/or BigQuery costs with the Kubernetes namespace that is consuming those services.
The reporting supports multiple types of resource matches in terms of labels/tags/accounts/K8s object names/etc.
Begin by selecting Reports in the left navigation. Then, select Create a report > Advanced Report - Cost Centers. The Cost Center Report page opens.
In the Report name field, enter a custom value name for your report. This name will appear on your Reports page for quick access after creation.
In the Cost center name field, enter the desired name for your Cost Center. Once a Report name and Cost center name have been provided, it should appear at the bottom of the page in the Report Preview. However, continue with this section to learn how to customize your Cost Center Report and complete its creation.
You can aggregate your cloud costs by a variety of fields (default is Service). Single and multi-aggregation, and custom labels, are supported. Then, select the desired cloud cost metric. Cloud cost metrics are calculated differently depending on your cloud service provider (CSP).
Certain selected cloud cost metrics may produce errors when forming your report preview. Use Amortized Net Cost, the default option, if you experience this error.
You can also provide custom filters to display only resources which match the filter value in your Cost Center Report. Select Filter and choose a filter type from the dropdown, then provide your filter value in the text field. Select the plus sign icon to add your filter.
Your Kubernetes workload data can be read as your Kubernetes allocations. You can aggregate and filter for your allocation data in the same way as your cloud cost data as described above. Default aggregation is Namespace.
Your cost center should automatically appear in the Report Preview. There is no need to finalize its creation; it will exist as long as all required fields have been provided. The Report Preview provides cost data for each cost center.
After configuring a cost center, you can select Collapse to close that configuration (this is only to condense the page view, it will not affect your overall reported data).
Any cloud provider tag or label can be used, but be sure to follow the Cloud Billing Integrations guide for any respective CSPs to ensure that they are included with the billing data.
When using tags and labels, separate the key and value with a colon (:). Example: owner:frontend.
A single CCR allows for the creation of multiple cost centers within it. To create an additional cost center, select Add cost center. This will open a new cost center tab and the functionality of creating a cost center will be the same.
You can delete a cost center by selecting Delete Cost Center in its tab, or selecting the trash can icon in the line of that cost center in the Report Preview.
When you are finished adding or deleting cost centers, select Done to finalize your CCR. You will be taken to a page for your reports. You can select individual cost centers for breakdowns of cloud costs and Kubernetes costs.
A cost center name is required in order for your cost center to appear in the Report Preview. However, if you select Done without giving a name to a cost center, it will appear in your Report with a blank space for a name. It can still be interacted with, but it is recommended to name all cost centers.
The Cost column per line item is the total cost of all other columns.
You can also adjust the window of spend data by selecting the Time window box and choosing either a preset or entering a custom range.
When viewing a breakdown of your cloud costs, you may see the same aggregate repeated multiple times. These are of the same property across multiple different days. When you expand the window range, you should naturally see the number of line items increase.
If you return to the Reports page, you will now see your CCR displayed amongst your other reports. Selecting the three horizontal dots in the Actions column of your CCR will allow you to Edit or Delete the CCR.
The Savings page provides miscellaneous functionality to help you use resources more effectively and assess wasteful spending. In the center of the page, you will see your estimated monthly savings available. The savings value is calculated from all enabled Savings features, across your clusters and the designated cluster profile via dropdowns in the top right of the page.
The Savings page provides an array of panels containing different insights capable of lowering your Kubernetes and cloud spend.
The monthly savings values on this page are precomputed every hour for performance reasons, while per-cluster views of these numbers, and the numbers on each individual Savings insight page, are computed live. This may result in some discrepancies between estimated savings values of the Savings page and the pages of individual Savings insights.
Reserve instances
You can archive individual Savings insights if you feel they are not helpful, or you cannot perform those functions within your organization or team. Archived Savings insights will not add to your estimated monthly savings available.
To temporarily archive a Savings insight, select the three horizontal dots icon inside its panel, then select Archive. You can unarchive an insight by selecting Unarchive.
You can also adjust your insight panels display by selecting View. From the View dropdown, you have the option to filter your insight panels by archived or unarchived insights. This allows you to effectively hide specific Savings insights after archiving them. Archived panels will appear grayed out, or disappear depending on your current filter.
By default, the Savings page and any displayed metrics (for example, estimated monthly savings available) will apply to all connected clusters. You can view metrics and insights for a single cluster by selecting it from the dropdown in the top right of the Savings page.
Functionality for most cloud insight features only exists when All Clusters is selected in the cluster dropdown. Individual clusters will usually only have access to Kubernetes insight features.
On the Savings page, as well as on certain individual Savings insights, you have the ability to designate a cluster profile. Savings recommendations such as right-sizing are calculated in part based on your current cluster profile:
Production: Expects stable cluster activity, will provide some extra space for potential spikes in activity.
Development: Cluster can tolerate small amount of instability, will run cluster somewhat close to capacity.
High availability: Cluster should avoid instability at all costs, will size cluster with lots of extra space to account for unexpected spikes in activity.
The Abandoned Workloads page can detect workloads which have not sent or received a meaningful rate of traffic over a configurable duration.
You can access the Abandoned Workloads page by selecting Savings in the left navigation, then selecting Manage abandoned workloads.
The Abandoned Workloads page will display front and center an estimated savings amount per month based on a number of detected workloads considered abandoned, defined by two values:
Traffic threshold (bytes/sec): This slider will determine a meaningful rate of traffic (bytes in and out per second) to detect activity of workloads. Only workloads below the threshold will be taken into account. Therefore, as you increase the threshold, you should observe the total detected workloads increase.
Window (days): From the main dropdown, you will be able to select the duration of time to check for activity. Presets include 2 days, 7 days, and 30 days. As you increase the duration, you should observe the total detected workloads increase.
Beneath your total savings value and slider scale, you will see a dashboard containing all abandoned workloads. The number of total line items should be equal to the number of workloads displayed underneath your total savings value.
You can filter your workloads through four dropdowns: clusters, namespaces, owners, and owner kinds.
Selecting an individual line item will expand the item, providing you with additional traffic data for that item.
Kubecost displays all local disks it detects with low usage, with recommendations for resizing and predicted cost savings.
You can access the Local Disks page by selecting Savings in the left navigation, then selecting Manage local disks.
You will see a table of all disks in your environment which fall under 20% current usage. For each disk, the table will display its connected cluster, its current utilization, resizing recommendation, and potential savings. Selecting an individual line item will take you offsite to a Grafana dashboard for more metrics relating to that disk.
In the Cluster dropdown, you can filter your table of disks to an individual cluster in your environment.
In the Profile dropdown, you can configure your desired overhead percentage, which refers to the percentage of extra usage you would like applied to each disk in relation to its current usage. The following overhead percentages are available:
Development (25%)
Production (50%)
High Availability (100%)
The value of your overhead percentage will affect your resizing recommendation and estimated savings: a higher overhead percentage will result in a higher average resize recommendation and lower average estimated savings. The overhead percentage is applied to your current usage (in GiB), then added to your usage to obtain a value which Kubecost should round up to for its resizing recommendation. For example, for a disk with a usage of 12 GiB, with Production (50%) selected from the Profile dropdown, 6 GiB (50% of 12) will be added to the usage, resulting in a resizing recommendation of 18 GiB.
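The arithmetic behind the resizing recommendation can be written out as a short sketch. The function and constant names are hypothetical, and rounding up to the next whole GiB is an assumption; the profile percentages and the 12 GiB example come from the text.

```python
import math

# Overhead percentages per profile, as listed above.
OVERHEAD_BY_PROFILE = {
    "Development": 0.25,
    "Production": 0.50,
    "High Availability": 1.00,
}


def recommended_disk_size_gib(usage_gib: float, profile: str) -> int:
    """Add the profile's overhead to current usage and round up.

    Assumption: the recommendation is rounded up to the next whole GiB.
    """
    overhead = OVERHEAD_BY_PROFILE[profile]
    return math.ceil(usage_gib * (1 + overhead))
```

For a disk using 12 GiB, this gives 15 GiB for Development, 18 GiB for Production, and 24 GiB for High Availability, which shows why a higher overhead percentage lowers the estimated savings.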
Kubecost can only provide detection of underused disks with recommendations for resizing. It does not assist with node turndown.
Kubecost displays all nodes with both low CPU and low RAM utilization, indicating they may need to be turned down or resized, while providing checks to ensure safe drainage can be performed.
You can access the Underutilized Nodes page by selecting Savings in the left navigation, then selecting Manage underutilized nodes.
To receive accurate recommendations, you should set the maximum utilization percentage for CPU/RAM for your cluster. This is so Kubecost can determine if your environment can perform successfully below the selected utilization once a node has been drained. This is visualized by the Maximum CPU/RAM Request Utilization slider bar. In the Profile dropdown, you can select three preset values, or a custom option:
Development: Sets the utilization to 80%.
Production: Sets the utilization to 65%.
High Availability: Sets the utilization to 50%.
Custom: Allows you to manually move the slider.
Kubecost provides recommendations by performing a Node Check and a Pod Check to determine if a node can be drained without creating problems for your environment. For example, if draining the node would put the cluster above the utilization request threshold, the Node Check will fail. Only a node that passes both Checks will be recommended for safe drainage. For nodes that fail at least one Check, selecting the node will provide a window of potential pod issues.
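The idea behind the Node Check can be sketched as follows. This is a simplified illustration under stated assumptions, not Kubecost's actual check: the Resources type, the function name, and the rule of comparing total requests against the capacity remaining after the node is removed are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Resources:
    cpu_cores: float
    ram_gib: float


def node_check_passes(
    cluster_requests: Resources,
    cluster_capacity: Resources,
    node_capacity: Resources,
    max_utilization: float,
) -> bool:
    """Return True if the cluster stays at or below max_utilization without the node."""
    remaining_cpu = cluster_capacity.cpu_cores - node_capacity.cpu_cores
    remaining_ram = cluster_capacity.ram_gib - node_capacity.ram_gib
    if remaining_cpu <= 0 or remaining_ram <= 0:
        # Draining the only node leaves nothing to schedule onto.
        return False
    cpu_ok = cluster_requests.cpu_cores / remaining_cpu <= max_utilization
    ram_ok = cluster_requests.ram_gib / remaining_ram <= max_utilization
    return cpu_ok and ram_ok
```

In this sketch, the same node can pass under the Production profile (65%) and fail under High Availability (50%), because the stricter profile leaves less room for the drained node's workloads.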
Kubecost does not directly assist in turning nodes down.
Kubecost displays all disks and IP addresses that are not utilized by any cluster. These may still incur charges, and so you should consider these orphaned resources for deletion.
You can access the Orphaned Resources page by selecting Savings in the left navigation, then selecting Manage orphaned resources.
Disks and IP addresses (collectively referred to as resources) will be displayed in a single table. Selecting an individual line item will expand its tab and provide more metrics about the resource, including cost per month, size (disks only), region, and a description of the resource.
You can filter your table of resources using two dropdowns:
The Resource dropdown will allow you to filter by resource type (Disk or IP Address).
The Region dropdown will filter by the region associated with the resource. Resources with the region “Global” cannot be filtered, and will only display when All has been selected.
Above your table will be an estimated monthly savings value. This value is the sum of all displayed resources’ savings. As you filter your table of resources, this value will naturally adjust.
For cross-functional convenience, you can copy the name of any resource by selecting the copy icon next to it.
Spot Commander is a Savings feature which identifies workloads where it is available and cost-effective to switch to Spot nodes, resizing the cluster in the process. Spot-readiness is determined through a Checklist which analyzes the workload and assesses the minimal cost required. It also generates CLI commands to help you implement the recommendation.
The recommended Spot cluster configuration uses all of the data available to Kubecost to compute a "resizing" of your cluster's nodes into a set of on-demand (standard) nodes O and a set of spot (preemptible) nodes S. This configuration is produced from applying a scheduling heuristic to the usage data for all of your workloads. This recommendation offers a more accurate picture of the savings possible from implementing spot nodes because nodes are what the cost of a cluster is made up of; once O and S have been determined, the savings are the current cost of your nodes minus the estimated cost of O and S.
The recommended configuration assumes that all workloads considered spot-ready by the Checklist will be schedulable on spot nodes and that workloads considered not spot-ready will only be schedulable on on-demand nodes. Kubernetes has taints and tolerations for achieving this behavior. Cloud providers usually have guides for using spot nodes with taints and tolerations in your managed cluster.
Different cloud providers have different guarantees on shutdown windows and automatic draining of spot nodes that are about to be removed. Consult your provider’s documentation before introducing spot nodes to your cluster.
Kubecost marking a workload as spot ready is not a guarantee. A domain expert should always carefully consider the workload before approving it to run on spot nodes.
Determining O and S is achieved by first partitioning all workloads on the cluster (based on the results of the Checklist) into sets: spot-ready workloads R and non-spot-ready workloads N. Kubecost consults its maximum resource usage data (in each Allocation, Kubecost records the maximum CPU and RAM used in the window) and determines the following for each of R and N:
The maximum CPU used by any workload
The maximum RAM used by any workload
The total CPU (sum of all individual maximums) required by non-DaemonSet workloads
The total RAM (sum of all individual maximums) required by non-DaemonSet workloads
The total CPU (sum of all individual maximums) required by DaemonSet workloads
The total RAM (sum of all individual maximums) required by DaemonSet workloads
Kubecost uses this data with a configurable target utilization (e.g., 90%) for R and N to create O and S:
Every node in O and S must reserve 100% - target utilization (e.g., 100% - 90% = 10%) of its CPU and RAM
Every node in O must be able to schedule the DaemonSet requirements in R and N
Every node in S must be able to schedule the DaemonSet requirements in R
With the remaining resources:
The largest CPU requirement in N must be schedulable on a node in O
The largest RAM requirement in N must be schedulable on a node in O
The largest CPU requirement in R must be schedulable on a node in S
The largest RAM requirement in R must be schedulable on a node in S
The total CPU requirements of N must be satisfiable by the total CPU available in O
The total RAM requirements of N must be satisfiable by the total RAM available in O
The total CPU requirements of R must be satisfiable by the total CPU available in S
The total RAM requirements of R must be satisfiable by the total RAM available in S
It is recommended to set the target utilization at or below 95% to allow resources for the operating system and the kubelet.
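The constraints on O and S listed above can be expressed as a validity check for a candidate configuration. The sketch below is an interpretation of those rules, not Kubecost's scheduling heuristic: the WorkloadSet and NodePool types and all function names are hypothetical, each pool is assumed to use a single node type, and DaemonSet requirements are assumed to be per-node totals.

```python
from dataclasses import dataclass


@dataclass
class WorkloadSet:
    max_cpu: float        # largest CPU requirement of any workload
    max_ram: float        # largest RAM requirement of any workload
    total_cpu: float      # sum of non-DaemonSet CPU requirements
    total_ram: float      # sum of non-DaemonSet RAM requirements
    daemonset_cpu: float  # DaemonSet CPU required on each node (assumption)
    daemonset_ram: float  # DaemonSet RAM required on each node (assumption)


@dataclass
class NodePool:
    node_cpu: float
    node_ram: float
    count: int


def pool_fits(pool: NodePool, workloads: WorkloadSet,
              ds_cpu: float, ds_ram: float, target_utilization: float) -> bool:
    """Check the per-node and total constraints for one pool."""
    # Reserve (100% - target utilization), then subtract DaemonSet requirements.
    cpu = pool.node_cpu * target_utilization - ds_cpu
    ram = pool.node_ram * target_utilization - ds_ram
    if cpu < 0 or ram < 0:
        return False  # the DaemonSets alone do not fit on a node
    largest_fits = workloads.max_cpu <= cpu and workloads.max_ram <= ram
    totals_fit = (workloads.total_cpu <= cpu * pool.count
                  and workloads.total_ram <= ram * pool.count)
    return largest_fits and totals_fit


def configuration_is_valid(on_demand: NodePool, spot: NodePool,
                           r: WorkloadSet, n: WorkloadSet,
                           target_utilization: float = 0.9) -> bool:
    """O hosts N plus the DaemonSets of R and N; S hosts R plus the DaemonSets of R."""
    o_ok = pool_fits(on_demand, n, r.daemonset_cpu + n.daemonset_cpu,
                     r.daemonset_ram + n.daemonset_ram, target_utilization)
    s_ok = pool_fits(spot, r, r.daemonset_cpu, r.daemonset_ram, target_utilization)
    return o_ok and s_ok
```

A search over node types and counts could use a check like this to discard candidates, then pick the cheapest remaining pair of pools.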
The configuration currently only recommends one node type for O and one node type for S, but we are considering adding multiple node type support. If your cluster requires specific node types for certain workloads, consider using Kubecost's recommendation as a launching point for a cluster configuration that supports your specific needs.
Configurable, accepts all aggregations supported by the .
Enable the on that cluster and perform the
If you have enabled, you can also perform immediate right-sizing by selecting Savings, then selecting Actions. On the Actions page, select Create Action > Cluster Sizing to receive immediate recommendations and the option to adopt them.
Continuous Cluster Right-Sizing is accessible via . On the Actions page, select Create Action > Guided Sizing. This feature implements both cluster right-sizing and .
For a tutorial on using Guided Sizing, see .
Using Cluster Autoscaler on AWS may result in a similar error. See more .
It is possible to configure a PodDisruptionBudget (PDB) for controllers that causes the scheduler to (where possible) adhere to certain availability requirements for the controller. If a controller has a PDB set up, we read it, compute its minimum available replicas, and use a simple threshold on the ratio min available / replicas to determine if the PDB indicates readiness. We chose to interpret a ratio of > 0.5 to indicate a lack of readiness because it implies a reasonably high availability requirement.
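The threshold rule above amounts to a one-line comparison. The sketch below is illustrative only: the function name is hypothetical, and treating a controller with zero replicas as not ready is an assumption.

```python
def pdb_indicates_spot_ready(min_available: int, replicas: int, threshold: float = 0.5) -> bool:
    """Return True when min available / replicas does not exceed the threshold."""
    if replicas <= 0:
        # Assumption: without replicas the ratio is undefined, so report not ready.
        return False
    return min_available / replicas <= threshold
```

For example, a controller with 4 replicas and a minimum of 1 available has a ratio of 0.25 and passes, while a minimum of 3 available gives 0.75 and is interpreted as a lack of readiness.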
The Checklist is now deployed alongside a recommended cluster configuration which automatically suggests a set of Spot and on-demand nodes to use in your cluster based on the Checklist. If you do not want to use that, read the following for some important information:
It is a good idea to use taints and tolerations to schedule only Spot-ready workloads on Spot nodes.
This article is the primary reference for installing Kubecost in an air-gapped environment with a user-managed container registry.
This section details all required and optional Kubecost images. Optional images are used depending on the specific configuration needed.
Please substitute the appropriate version for prod-x.xx.x. Latest releases can be found here.
To find the exact images used for each Kubecost release, a command such as this can be used:
The alpine/k8s image is not used in real deployments. It is only in the Helm chart for testing purposes.
Frontend: gcr.io/kubecost1/frontend
CostModel: gcr.io/kubecost1/cost-model
NetworkCosts: gcr.io/kubecost1/kubecost-network-costs (used for network-allocation)
Cluster controller: gcr.io/kubecost1/cluster-controller:v0.9.0 (used for write actions)
BusyBox: registry.hub.docker.com/library/busybox:latest (only for NFS)
quay.io/prometheus/prometheus
prom/node-exporter
quay.io/prometheus-operator/prometheus-config-reloader
grafana/grafana
kiwigrid/k8s-sidecar
thanosio/thanos
There are two options to configure asset prices in your on-premise Kubernetes environment:
Per-resource prices can be configured in a Helm values file (reference) or directly in the Kubecost Settings page. This allows you to directly supply the cost of certain Kubernetes resources, such as a CPU month, a RAM GB month, etc.
Use quotes if setting "0.00" for any item under kubecostProductConfigs.defaultModelPricing. Failure to do so will result in the value(s) not being written to the Kubecost cost-model's PV (/var/configs/default.json).
When setting CPU and RAM monthly prices, the values will be broken down to the hourly rate for the total monthly price set under kubecostProductConfigs.defaultModelPricing. The values will adjust accordingly in /var/configs/default.json in the kubecost cost-model container.
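As a rough illustration of the monthly-to-hourly breakdown, the sketch below divides a monthly price by a fixed number of hours. The constant of 730 hours per month is an assumption made for this example, and the function name is hypothetical; check the values written to /var/configs/default.json for the rates Kubecost actually derives.

```python
# Assumption: a month is treated as 730 hours for this illustration.
HOURS_PER_MONTH = 730


def hourly_rate(monthly_price: float) -> float:
    """Convert a monthly resource price into an hourly rate."""
    return monthly_price / HOURS_PER_MONTH
```

Under this assumption, a CPU priced at 73.00 per month corresponds to an hourly rate of about 0.10.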
This method allows each individual asset in your environment to have a unique price. This leverages the Kubecost custom CSV pipeline which is available on Enterprise plans.
Use a proxy for the AWS pricing API. You can set AWS_PRICING_URL via the extra env var to the address of your proxy.
SSO and RBAC are only officially supported on Kubecost Enterprise plans.
This guide will show you how to configure Kubecost integrations for SSO and RBAC with Okta.
To enable SSO for Kubecost, this tutorial will show you how to create an application in Okta.
Go to the Okta admin dashboard (https://[your-subdomain].okta.com/admin/dashboard) and select Applications from the left navigation. On the Applications page, select Create App Integration > SAML 2.0 > Next.
On the 'Create SAML Integration' page, provide a name for your app. Feel free to also use this for the App logo field. Then, select Next.
Your SSO URL should be your application root URL followed by '/saml/acs', like: https://[your-kubecost-address].com/saml/acs
Your Audience URI (SP Entity ID) should be set to your application root without a trailing slash: https://[your-kubecost-address].com
(Optional) If you intend to use RBAC: under Group Attribute Statements, enter a name (for example, kubecost_group) and a filter based on your group naming standards (for example, Starts with kubecost_). Then, select Next.
Provide any feedback as needed, then select Finish.
Return to the Applications page, select your newly created app, then select the Sign On tab. Copy the URL for Identity Provider metadata, and add that value to .Values.saml.idMetadataURL in this file.
To fully configure SAML 2.0, select View Setup Instructions, download the X.509 certificate, and name the file myservice.cert.
Create a secret using the certificate with the following command:
kubectl create secret generic kubecost-okta --from-file myservice.cert --namespace kubecost
Finally, add -f values-saml.yaml to your Kubecost Helm upgrade command:
At this point, test your SSO to ensure it is working properly before moving on to the next section.
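The SSO URL and Audience URI entered in the Okta steps above both derive from the application root URL. A minimal sketch of that derivation, assuming a hypothetical address (kubecost.example.com) and a hypothetical helper name:

```python
def okta_saml_urls(app_root: str) -> dict:
    """Derive the Okta SAML settings from the Kubecost application root URL."""
    root = app_root.rstrip("/")  # the Audience URI must not end with a slash
    return {
        "sso_url": f"{root}/saml/acs",  # application root followed by /saml/acs
        "audience_uri": root,           # application root, no trailing slash
    }


# Hypothetical address used only for illustration.
urls = okta_saml_urls("https://kubecost.example.com/")
print(urls["sso_url"])
print(urls["audience_uri"])
```

Stripping the trailing slash once and reusing the result keeps the two values consistent with each other, which helps avoid audience mismatches when testing SSO.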
The simplest form of RBAC in Kubecost is to have two groups: admin and readonly. If your goal is to simply have these two groups, you do not need to configure filters. This will result in the log message file corruption: '%!s(MISSING)', but this is expected.
The assertionName: "kubecost_group" value needs to match the name given in Step 5 of the Okta SSO Configuration section.
It's possible to combine filtering with admin/readonly rights.
These filters can be configured using groups or user attributes in your Okta directory. It is also possible to assign filters to specific users. The example below is using groups.
Filtering is configured very similarly to the admin/readonly above. The same group pattern match (kubecost_group) can be used for both, as is the case in this example:
The array of groups obtained during the authorization request will be matched to the subject key in the filters.json:
As an example, we will configure the following:
Admins will have full access to the Kubecost UI and have visibility to all resources
Kubecost users, by default, will not have visibility to any namespace and will be readonly. If a group doesn't have access to any resources, the Kubecost UI may appear to be broken.
The dev-namespaces group will have read-only access to the Kubecost UI and only have visibility to namespaces that are prefixed with dev- or are exactly nginx-ingress.
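The visibility rules in this example can be sketched in a few lines. This is not Kubecost's implementation or the filters.json schema: the mapping, the pattern syntax, and the function name are hypothetical, and the sketch only illustrates how an array of groups could be matched against per-group namespace rules, with the most permissive group winning.

```python
from fnmatch import fnmatchcase

# Hypothetical rules mirroring the example: group name -> namespace patterns.
# "*" means every namespace; an empty list means none.
GROUP_NAMESPACE_PATTERNS = {
    "kubecost_admin": ["*"],
    "kubecost_users": [],
    "kubecost_dev-namespaces": ["dev-*", "nginx-ingress"],
}


def can_view_namespace(groups, namespace):
    """Return True if any of the user's groups grants visibility to the namespace."""
    for group in groups:
        for pattern in GROUP_NAMESPACE_PATTERNS.get(group, []):
            if fnmatchcase(namespace, pattern):
                return True
    return False


print(can_view_namespace(["kubecost_users", "kubecost_dev-namespaces"], "dev-api"))
print(can_view_namespace(["kubecost_users"], "dev-api"))
```

A user who belongs only to kubecost_users sees nothing, which is why the UI may look broken for that group until another group grants visibility.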
Go to the Okta admin dashboard (https://[your-subdomain].okta.com/admin/dashboard) and select Directory > Groups from the left navigation. On the Groups page, select Add group.
Create groups for kubecost_users, kubecost_admin and kubecost_dev-namespaces by providing each value as the name with an optional description, then select Save. You will need to perform this step three times, once for each group.
Select each group, then select Assign people and add the appropriate users for testing. Select Done to confirm edits to a group. Kubecost admins will be part of both the read only kubecost_users and kubecost_admin groups. Kubecost will assign the most rights if there are conflicts.
Return to the Groups page. Select kubecost_users, then in the Applications tab, assign the Kubecost application. You do not need to assign the other kubecost_ groups to the Kubecost application because all users already have access in the kubecost_users group.
Modify filters.json as depicted above.
Create the ConfigMap using the following command:
You can modify the ConfigMap without restarting any pods.
Generate an X509 certificate and private key. Below is an example using OpenSSL:
openssl genpkey -algorithm RSA -out saml-encryption-key.pem -pkeyopt rsa_keygen_bits:2048
Generate a certificate signing request (CSR)
openssl req -new -key saml-encryption-key.pem -out request.csr
Request your organization's domain owner to sign the certificate, or generate a self-signed certificate:
openssl x509 -req -days 365 -in request.csr -signkey saml-encryption-key.pem -out saml-encryption-cert.cer
Go to your application, then under the General tab, edit the following SAML Settings:
Assertion Encryption: Encrypted
In the Encryption Algorithm box that appears, select AES256-CBC.
Select Browse Files in the Encryption Certificate field and upload your certificate file (saml-encryption-cert.cer).
Create a secret with the certificate. The file name must be saml-encryption-cert.cer.
kubectl create secret generic kubecost-saml-cert --from-file saml-encryption-cert.cer --namespace kubecost
Create a secret with the private key. The file name must be saml-encryption-key.pem.
kubectl create secret generic kubecost-saml-decryption-key --from-file saml-encryption-key.pem --namespace kubecost
Pass the following values via Helm into your values.yaml:
You can view the logs on the cost-model container. In this example, the assumption is that the prefix for Kubecost groups is kubecost_. This command is currently a work in progress.
kubectl logs deployment/kubecost-cost-analyzer -c cost-model --follow | grep -v -E 'resourceGroup|prometheus-server' | grep -i -E 'group|xmlname|saml|login|audience|kubecost_'
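For saved log files, the same two-stage filter can be reproduced offline. The sketch below mirrors the grep pipeline above (a case-sensitive exclusion followed by a case-insensitive inclusion); the sample lines are made up for illustration and do not represent real cost-model output.

```python
import re

# Stage 1: lines matching this pattern are dropped (like grep -v -E, case-sensitive).
EXCLUDE = re.compile(r"resourceGroup|prometheus-server")
# Stage 2: remaining lines must match this pattern (like grep -i -E).
INCLUDE = re.compile(r"group|xmlname|saml|login|audience|kubecost_", re.IGNORECASE)


def filter_log_lines(lines):
    """Keep the lines the grep pipeline would keep."""
    return [line for line in lines if not EXCLUDE.search(line) and INCLUDE.search(line)]


# Made-up log lines; real output will differ.
sample = [
    "INF SAML login request received",
    "INF resourceGroup lookup finished",
    "INF querying prometheus-server",
    "INF user groups: kubecost_admin",
    "INF allocation computed",
]
kept = filter_log_lines(sample)
for line in kept:
    print(line)
```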
When the group has been matched, you will see:
This is what you should expect to see:
To configure single app logout, read the Okta documentation on the subject. Then, update the values.saml:redirectURL value in your file.
Use this to assign individuals or groups access to your Kubecost application.
The file contains the admin and readonly groups in the RBAC section:
Filters are used to give visibility to a subset of objects in Kubecost. Examples of the various filters available are in and . RBAC filtering is capable of all the same types of filtering features as that of the .
The metrics listed below are emitted by Kubecost and scraped by Prometheus to help monitor the status of Kubecost data pipelines:
kubecost_allocation_data_status, which presents the time series status of the active allocation data
kubecost_asset_data_status, which presents the time series status of the active asset data
These metrics expose data status so that you can proactively alert on and analyze the allocation and asset data at a point in time.
The metrics below depict the status of active allocation data at a point in time. The resolution is either daily or hourly, which aligns one-to-one with the data status of the allocation daily and hourly stores. Each hourly and daily store has four types of status:
Empty: Depicts the total number of empty allocationSets in each store (hourly or daily) at a point in time.
Error: Depicts the total number of errors in the allocationSets in each store (hourly or daily) at a point in time.
Success: Depicts the total number of successful allocationSets in each store (hourly or daily) at a point in time.
Warning: Depicts the total number of warnings in all allocationSets in each store (hourly or daily) at a point in time.
The metrics below depict the status of active asset data at a point in time. The resolution is either daily or hourly, which aligns one-to-one with the data status of the asset daily and hourly stores. Each hourly and daily store has four types of status:
Empty: Depicts the total number of empty assetSets in each store (hourly or daily) at a point in time.
Error: Depicts the total number of errors in the assetSets in each store (hourly or daily) at a point in time.
Success: Depicts the total number of successful assetSets in each store (hourly or daily) at a point in time.
Warning: Depicts the total number of warnings in all assetSets in each store (hourly or daily) at a point in time.
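As a rough illustration of how the four status counts might feed an alert decision, consider the sketch below. Everything in it is hypothetical: the StoreStatus container is not a Kubecost type, the metric's label names are not shown here, and the rule (flag a store on any error, or when nothing succeeded) is just one possible policy.

```python
from dataclasses import dataclass


@dataclass
class StoreStatus:
    """Status counts for one store (hypothetical container, not a Kubecost type)."""
    resolution: str  # "hourly" or "daily"
    empty: int
    error: int
    success: int
    warning: int


def needs_attention(status: StoreStatus) -> bool:
    """Flag a store when any set reported an error or no set succeeded."""
    return status.error > 0 or status.success == 0


# Made-up counts for illustration.
hourly = StoreStatus(resolution="hourly", empty=2, error=0, success=46, warning=1)
daily = StoreStatus(resolution="daily", empty=0, error=1, success=14, warning=0)

for store in (hourly, daily):
    print(store.resolution, "needs attention:", needs_attention(store))
```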
kubecost_asset_data_status is written to Prometheus during the assetSet and assetLoad events.
kubecost_allocation_data_status is written to Prometheus during the allocationSet and allocationLoad events.
During the cleanup operation, the corresponding entries for each allocation and asset are deleted from the metrics, so that the metrics stay in parity with the respective allocation and asset stores.
Kubecost will display volumes unused by any pod. You can consider these volumes for deletion, or move them to a cheaper storage tier.
You can access the Unclaimed Volumes page by selecting Savings in the left navigation, then selecting Manage unclaimed volumes.
Volumes will be displayed in a table, and can be sorted By Owner or By Namespace. You can view owner, storage class, and size for your volumes.
Using the Cluster dropdown, you can filter volumes connected to an individual cluster in your environment.