1 of 4

Cluster Controller

The Cluster Controller is currently in beta. Please read the documentation carefully.

Kubecost's Cluster Controller allows you to access additional Savings features through automated processes. To function, the Cluster Controller requires write permission to certain resources on your cluster, and for this reason, the Cluster Controller is disabled by default.

The Cluster Controller enables features like:

Feature functionality

The Cluster Controller can be enabled on any cluster type, but certain functionality will only be enabled based on the cloud service provider (CSP) of the cluster and its type:

The Cluster Controller can only be enabled on your primary cluster.
The Controller itself and container RRS are available for all cluster types and configurations.
Cluster turndown, cluster right-sizing, and Kubecost Actions are only available for GKE, EKS, and Kops-on-AWS clusters, after setting up a provider service key.

Therefore, the 'Provider service key setup' section below is optional depending on your cluster environment, but will limit functionality if you choose to skip it. Read the caution banner in the below section for more details.

Provider service key setup

If you are enabling the Cluster Controller for a GKE/EKS/Kops AWS cluster, follow the specialized instructions for your CSP(s) below. If you aren't using a GKE/EKS Kops AWS cluster, skip ahead to the Deploying section below.

GKE setup

The following command performs the steps required to set up a service account. More info.

/bin/bash -c "$(curl -fsSL https://github.com/kubecost/cluster-turndown/releases/latest/download/gke-create-service-key.sh)" -- <Project ID> <Service Account Name> <Namespace> cluster-controller-service-key

To use this setup script, provide the following required parameters:

Project ID: The GCP project identifier. Can be found via: gcloud config get-value project
Namespace: The namespace which Kubecost will be installed, e.g kubecost
Service Account Name: The name of the service account to be created. Should be between 6 and 20 characters, e.g. kubecost-controller
Secret Name: The Kubecost will automatically look for a secret called cluster-controller-service-key. This can be changed by setting .Values.clusterController.secretName.

EKS setup

For EKS cluster provisioning, if using eksctl, make sure that you use the --managed option when creating the cluster. Unmanaged node groups should be upgraded to managed. More info.

Create a new User with AutoScalingFullAccess permissions, plus the following EKS-specific permissions:

{
    "Effect": "Allow",
    "Action": [
        "eks:ListClusters",
        "eks:DescribeCluster",
        "eks:DescribeNodegroup",
        "eks:ListNodegroups",
        "eks:CreateNodegroup",
        "eks:UpdateClusterConfig",
        "eks:UpdateNodegroupConfig",
        "eks:DeleteNodegroup",
        "eks:ListTagsForResource",
        "eks:TagResource",
        "eks:UntagResource"
    ],
    "Resource": "*"
},
{
    "Effect": "Allow",
    "Action": [
        "iam:GetRole",
        "iam:ListAttachedRolePolicies",
        "iam:PassRole"
    ],
    "Resource": "*"
}

Create a new file, service-key.json, and use the access key ID and secret access key to fill out the following template:

{
    "aws_access_key_id": "<ACCESS_KEY_ID>",
    "aws_secret_access_key": "<SECRET_ACCESS_KEY>"
}

Then, run the following to create the secret:

$ kubectl create secret generic cluster-controller-service-key -n <NAMESPACE> --from-file=service-key.json

Here is a full example of this process using the AWS CLI and a simple IAM user (requires jq):

NEW_IAM_USER
aws iam create-user \
    --user-name $NEW_IAM_USER

aws iam attach-user-policy \
    --user-name $NEW_IAM_USER \
    --policy-arn arn:aws:iam::aws:policy/AutoScalingFullAccess

read -r -d '' EKSPOLICY << EOM
{
    "Version": "2012-10-17",
    "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "eks:ListClusters",
            "eks:DescribeCluster",
            "eks:DescribeNodegroup",
            "eks:ListNodegroups",
            "eks:CreateNodegroup",
            "eks:UpdateClusterConfig",
            "eks:UpdateNodegroupConfig",
            "eks:DeleteNodegroup",
            "eks:ListTagsForResource",
            "eks:TagResource",
            "eks:UntagResource"
        ],
        "Resource": "*"
    },
    {
        "Effect": "Allow",
        "Action": [
            "iam:GetRole",
            "iam:ListAttachedRolePolicies",
            "iam:PassRole"
        ],
        "Resource": "*"
    }
    ]
}
EOM

aws iam put-user-policy \
    --user-name $NEW_IAM_USER \
    --policy-name "eks-permissions" \
    --policy-document "${EKSPOLICY}"

aws iam create-access-key \
    --user-name $NEW_IAM_USER --output json \
    > /tmp/aws-key.json

AAKI="$(jq -r '.AccessKey.AccessKeyId' /tmp/aws-key.json)"
ASAK="$(jq -r '.AccessKey.SecretAccessKey' /tmp/aws-key.json)"
kubectl create secret generic \
    cluster-controller-service-key \
    -n kubecost \
    --from-literal="service-key.json={\"aws_access_key_id\": \"${AAKI}\", \"aws_secret_access_key\": \"${ASAK}\"}"

Kops-on-AWS setup

Create a new user or IAM role with AutoScalingFullAccess permissions. JSON definition of those permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "autoscaling:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricAlarm",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeImages",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeInstances",
                "ec2:DescribeKeyPairs",
                "ec2:DescribeLaunchTemplateVersions",
                "ec2:DescribePlacementGroups",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcClassicLink"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": "autoscaling.amazonaws.com"
                }
            }
        }
    ]
}

Create a new file, service-key.json, and use the access key ID and secret access key to fill out the following template:

{
    "aws_access_key_id": "<ACCESS_KEY_ID>",
    "aws_secret_access_key": "<SECRET_ACCESS_KEY>"
}

Then run the following to create the secret:

$ kubectl create secret generic cluster-controller-service-key -n <NAMESPACE> --from-file=service-key.json

Deploying

You can now enable the Cluster Controller in the Helm chart by finding the clusterController Helm flag and setting enabled: true

clusterController:
  enabled: true

You may also enable via --set when running Helm install:

--set clusterController.enabled=true

Verify the Cluster Controller is running

You can verify that the Cluster Controller is running by issuing the following:

kubectl get pods -n kubecost -l app=kubecost-cluster-controller

Once the Cluster Controller has been enabled successfully, you should automatically have access to the listed Savings features.

Cluster Turndown

Cluster turndown is currently in beta. Please read the documentation carefully.

Cluster turndown is an automated scale down and scaleup of a Kubernetes cluster's backing nodes based on a custom schedule and turndown criteria. This feature can be used to reduce spend during down hours and/or reduce surface area for security reasons. The most common use case is to scale non-production (prod) environments (e.g. development (dev) clusters) to zero during off hours.

If you are upgrading from a pre-1.94 version of the Kubecost Helm chart, you will have to migrate your custom resources. turndownschedules.kubecost.k8s.io has been changed to turndownschedules.kubecost.com and finalizers.kubecost.k8s.io has been changed to finalizers.kubecost.com. See the TurndownSchedule Migration Guide for an explanation.

How it works

Cluster turndown is only available for clusters on GKE, EKS, or Kops-on-AWS.

Managed cluster strategy (e.g. GKE + EKS)

When the turndown schedule occurs, a new node pool with a single g1-small node is created. Taints are added to this node to only allow specific pods to be scheduled there. The cluster-turndown pod deployment is updated so the pod is allowed to schedule on the singleton node. Once the pod is moved to the new node, it will start back up and resume scale down. This is done by cordoning all nodes in the cluster (other than our new g1-small node), and then reducing the node pool sizes to 0.

GKE autoscaler strategy

Whenever there exists at least one NodePool with the cluster-autoscaler enabled, the cluster-turndown pod will:

Resize all non-autoscaling nodepools to 0
Schedule the turndown on one of the autoscaler nodepool nodes
Once it is brought back up (rescheduled to the selected node), the turndown pod will start a process called "flattening" which attempts to set deployment replicas to 0, turn off jobs, and annotate pods with labels that allow the autoscaler to do the rest of the work. Flattening persists pre-turndown values in the annotations of Kubernetes objects. The GKE autoscaler behavior is expected to handle the rest: removing now-unneeded nodes from the node pools. A limitation of this strategy is that the autoscaled node pools won't go below their configured minimum node count.
When turn up occurs, deployments and DaemonSets are "expanded" to their original sizes/replicas.

There are four annotations that can be applied for this process:

kubecost.kubernetes.io/job-suspend: Stores a bool containing the previous paused state of a kubernetes CronJob.
kubecost.kubernetes.io/turn-down-replicas: Stores the previous number of replicas set on the deployment.
kubecost.kubernetes.io/turn-down-rollout: Stores the previous maxUnavailable for the deployment rollout.
kubecost.kubernetes.io/safe-evict: Uses the cluster-autoscaler.kubernetes.io/safe-to-evict for autoscaling clusters to have the autoscaler preserve any deployments that previously had this annotation set, so scale up occurs, this value isn't unintentionally reset.

AWS Kops strategy

This turndown strategy schedules the cluster-turndown pod on the Master node, then resizes all Auto Scaling Groups (ASG) other than the master to 0. Similar to flattening in GKE (see above), the previous min/max/current values of the ASG prior to turndown will be set on the tag. When turn up occurs, those values can be read from the tags and restored to their original sizes. For the standard strategy, turn up will reschedule the turndown pod off the Master upon completion (occurs 5 minutes after turn up). This is to allow any modifications via Kops without resetting any cluster specific scheduling setup by turndown. The tag label used to store the min/max/current values for a node group is cluster.turndown.previous. Once turn up happens and the node groups are resized to their original size, the tag is deleted.

Setup

Prerequisites

kubectl
Enable the Cluster Controller

You will receive full turndown functionality once the Cluster Controller is enabled via a provider service key setup and Helm upgrade. Review the Cluster Controller doc linked above under Prerequisites for more information, then return here when you've confirmed the Cluster Controller is running.

Verify the pod is running

You can verify that the cluster-turndown pod is running with the following command:

$ kubectl get pods -l app=cluster-turndown -n turndown

Setting a turndown schedule

Turndown uses a Kubernetes Custom Resource Definition to create schedules. Here is an example resource located at artifacts/example-schedule.yaml:

apiVersion: kubecost.com/v1alpha1
kind: TurndownSchedule
metadata:
  name: example-schedule
  finalizers:
  - "finalizer.kubecost.com"
spec:
  start: 2020-03-12T00:00:00Z
  end: 2020-03-12T12:00:00Z
  repeat: daily

This definition will create a schedule that starts by turning down at the designated start date-time and turning back up at the designated end date-time. Both the start and end times should be in RFC3339 format, i.e. times based on offsets to UTC. There are three possible values for repeat:

none: Single schedule turndown and turnup.
daily: Start and end times will reschedule every 24 hours.
weekly: Start and end times will reschedule every 7 days.

To create this schedule, you may modify example-schedule.yaml to your desired schedule and run:

$ kubectl apply -f artifacts/example-schedule.yaml

Currently, updating a resource is not supported, so if the scheduling of the example-schedule.yaml fails, you will need to delete the resource via:

$ kubectl delete tds example-schedule

Then make the modifications to the schedule and re-apply.

Viewing a turndown schedule

The turndownschedule resource can be listed via kubectl as well:

$ kubectl get turndownschedules

or using the shorthand:

$ kubectl get tds

Details regarding the status of the turndown schedule can be found by outputting as a JSON or YAML:

$ kubectl get tds example-schedule -o yaml

apiVersion: kubecost.com/v1alpha1
kind: TurndownSchedule
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubecost.com/v1alpha1","kind":"TurndownSchedule","metadata":{"annotations":{},"finalizers":["finalizer.kubecost.com"],"name":"example-schedule"},"spec":{"end":"2020-03-17T00:35:00Z","repeat":"daily","start":"2020-03-17T00:20:00Z"}}
  creationTimestamp: "2020-03-17T00:18:39Z"
  finalizers:
  - finalizer.kubecost.com
  generation: 1
  name: example-schedule
  resourceVersion: "33573"
  selfLink: /apis/kubecost.com/v1alpha1/turndownschedules/example-schedule
  uid: d9b16aed-67e4-11ea-b591-42010a8e0075
spec:
  end: "2020-03-17T00:35:00Z"
  repeat: daily
  start: "2020-03-17T00:20:00Z"
status:
  current: scaledown
  lastUpdated: "2020-03-17T00:36:39Z"
  nextScaleDownTime: "2020-03-18T00:21:38Z"
  nextScaleUpTime: "2020-03-18T00:36:38Z"
  scaleDownId: 38ebf595-4e2b-46e9-951a-1e3ceff30536
  scaleDownMetadata:
    repeat: daily
    type: scaledown
  scaleUpID: 869ec89f-a8d8-450b-9ebb-71cd4d7fbaf8
  scaleUpMetadata:
    repeat: daily
    type: scaleup
  state: ScheduleSuccess

The status field displays the current status of the schedule including next schedule times, specific schedule identifiers, and the overall state of schedule.

state: The state of the turndown schedule. This can be:
- ScheduleSuccess: The schedule has been set and is waiting to run.
- ScheduleFailed: The scheduling failed due to a schedule already existing, scheduling for a date-time in the past.
- ScheduleCompleted: For schedules with repeat: none, the schedule will move to a completed state after turn up.
current: The next action to run.
lastUpdated: The last time the status was updated on the schedule.
nextScaleDownTime: The next time a turndown will be executed.
nextScaleUpTime: The next time at turn up will be executed.
scaleDownId: Specific identifier assigned by the internal scheduler for turndown.
scaleUpId: Specific identifier assigned by the internal scheduler for turn up.
scaleDownMetadata: Metadata attached to the scaledown job, assigned by the turndown scheduler.
scaleUpMetadata: Metadata attached to the scale up job, assigned by the turndown scheduler.

Canceling a turndown schedule

A turndown can be canceled before turndown actually happens or after. This is performed by deleting the resource:

$ kubectl delete tds example-schedule

Canceling while turndown is currently scaling down or scaling up will result in a delayed cancellation, as the schedule must complete its operation before processing the deletion/cancellation.

If the turndown schedule is canceled between a turndown and turn up, the turn up will occur automatically upon cancellation.

Using cluster turndown via UI

Cluster turndown has limited functionality via the Kubecost UI. To access cluster turndown in the UI, you must first enable Kubecost Actions. Once this is completed, you will be able to create and delete turndown schedules instantaneously for your supported clusters. Read more about turndown's UI functionality in this section of the above Kubecost Actions doc. Review the entire doc for more information on Kubecost Actions functionality and limitations.

Limitations

The internal scheduler only allows one schedule at a time to be used. Any additional schedule resources created will fail (kubectl get tds -o yaml will display the status).
Do not attempt to kubectl edit a turndown schedule. This is currently not supported. Recommended approach for modifying is to delete and then create a new schedule.
There is a 20-minute minimum time window between start and end of turndown schedule.

Kubescaler

This feature is in currently in alpha. Please read the documentation carefully.

Kubecost's Kubescaler implements continuous request right-sizing: the automatic application of Kubecost's high-fidelity recommendations to your containers' resource requests. This provides an easy way to automatically improve your allocation of cluster resources by improving efficiency.

Kubescaler can be enabled and configured on a per-workload basis so that only the workloads you want edited will be edited.

Setup

Kubescaler is part of Cluster Controller, and should be configured after the Cluster Controller is enabled.

Usage

Kubescaler is configured on a workload-by-workload basis via annotations. Currently, only deployment workloads are supported.

Annotation

Description

Example(s)

request.autoscaling.kubecost.com/enabled

Whether to autoscale the workload. See note on KUBESCALER_RESIZE_ALL_DEFAULT.

true, false

request.autoscaling.kubecost.com/frequencyMinutes

How often to autoscale the workload, in minutes. If unset, a conservative default is used.

73

request.autoscaling.kubecost.com/scheduleStart

Optional augmentation to the frequency parameter. If both are set, the workload will be resized on the scheduled frequency, aligned to the start. If frequency is 24h and the start is midnight, the workload will be rescheduled at (about) midnight every day. Formatted as RFC3339.

2022-11-28T00:00:00Z

cpu.request.autoscaling.kubecost.com/targetUtilization

Target utilization (CPU) for the recommendation algorithm. If unset, the backing recommendation service's default is used.

0.8

memory.request.autoscaling.kubecost.com/targetUtilization

Target utilization (Memory/RAM) for the recommendation algorithm. If unset, the backing recommendation service's default is used.

0.8

request.autoscaling.kubecost.com/recommendationQueryWindow

Value of the window parameter to be used when acquiring recommendations. See Request sizing API for explanation of window parameter. If setting up autoscaling for a CronJob, it is strongly recommended to set this to a value greater than the duration between Job runs. For example, if you have a weekly CronJob, this parameter should be set to a value greater than 7d to ensure a recommendation is available.

2d

Notable Helm values:

Helm value

Description

Example(s)

clusterController.kubescaler.resizeAllDefault

If true, Kubescaler will switch to default-enabled for all workloads unless they are annotated with request.autoscaling.kubecost.com/enabled=false. This is recommended for low-stakes clusters where you want to prioritize workload efficiency without reworking deployment specs for all workloads.

true

Supported workload types

Kubescaler supports:

apps/v1 Deployments
apps/v1 DaemonSets
batch/v1 CronJobs (K8s v1.21+). No attempt will be made to autoscale a CronJob until it has run at least once.

Kubescaler cannot support:

"Uncontrolled" Pods. Learn more here.

Example

export NS="kubecost"
export DEP="kubecost-cost-analyzer"
export AN_ENABLE="request.autoscaling.kubecost.com/enabled=true"
export AN_FREQ="request.autoscaling.kubecost.com/frequencyMinutes=660"
export AN_TCPU="cpu.request.autoscaling.kubecost.com/targetUtilization=0.9"
export AN_TMEM="memory.request.autoscaling.kubecost.com/targetUtilization=0.9"
export AN_WINDOW="request.autoscaling.kubecost.com/recommendationQueryWindow=3d"

kubectl annotate -n "${NS}" deployment "${DEP}" "${AN_ENABLE}"
kubectl annotate -n "${NS}" deployment "${DEP}" "${AN_FREQ}"
kubectl annotate -n "${NS}" deployment "${DEP}" "${AN_TCPU}"
kubectl annotate -n "${NS}" deployment "${DEP}" "${AN_TMEM}"
kubectl annotate -n "${NS}" deployment "${DEP}" "${AN_WINDOW}"

Kubescaler will take care of the rest. It will apply the best-available recommended requests to the annotated controller every 11 hours. If the recommended requests exceed the current limits, the update is currently configured to set the request to the current limit.

To check current requests for your Deployments, use the following command:

kubectl get deployment -n "kubecost" -o=jsonpath="{range .items[*]}"deployment/"{.metadata.name}{'\n'}{range .spec.template.spec.containers[*]}{.name}{'\t'}{.resources.requests}{'\n'}{end}{'\n'}{end}"

TurndownSchedule Migration Guide

In v1.94 of Kubecost, the turndownschedules.kubecost.k8s.io/v1alpha1 Custom Resource Definition (CRD) was moved to turndownschedules.kubecost.com/v1alpha1 to adhere to Kubernetes policy for CRD domain namespacing. This is a breaking change for users of Cluster Controller's turndown functionality. Please follow this guide for a successful migration of your turndown schedule resources.

Note: As part of this change, the CRD was updated to use apiextensions.k8s.io/v1 because v1beta1 was removed in K8s v1.22. If using Kubecost v1.94+, Cluster Controller's turndown functionality will not work on K8s versions before the introduction of apiextensions.k8s.io/v1.

Scenario 1: You have deployed Cluster Controller but don't use turndown

In this situation, you've deployed Kubecost's Cluster Controller at some point using --set clusterController.enabled=true, but you don't use the turndown functionality.

That means that this command should return one line:

kubectl get crd turndownschedules.kubecost.k8s.io

And this command should return no resources:

kubectl get turndownschedules.kubecost.k8s.io

This situation is easy! You can do nothing, and turndown should continue to behave correctly because kubectl get turndownschedule and related commands will correctly default to the new turndownschedules.kubecost.com/v1alpha1 CRD after you upgrade to Kubecost v1.94 or higher.

If you would like to be fastidious and clean up the old CRD, simply run kubectl delete crd turndownschedules.kubecost.k8s.io after upgrading Kubecost to v1.94 or higher.

Scenario 2: You currently use turndown

In this situation, you've deployed Kubecost's Cluster Controller at some point using --set clusterController.enabled=true and you have at least one turndownschedule.kubecost.k8s.io resource currently present in your cluster.

That means that this command should return one line:

kubectl get crd turndownschedules.kubecost.k8s.io

And this command should return at least one resource:

kubectl get turndownschedules.kubecost.k8s.io

We have a few steps to perform if you want Cluster Controller's turndown functionality to continue to behave according to your already-defined turndown schedules.

Upgrade Kubecost to v1.94 or higher with --set clusterController.enabled=true
Make sure the new CRD has been defined after your Kubecost upgrade
This command should return a line:
```
kubectl get crd turndownschedules.kubecost.com
```

Copy your existing turndownschedules.kubecost.k8s.io resources into the new CRD

kubectl get turndownschedules.kubecost.k8s.io -o yaml \
    | sed 's|kubecost.k8s.io|kubecost.com|' \
    | kubectl apply -f -

(optional) Delete the old turndownschedules.kubecost.k8s.io CRD
Because the CRDs have a finalizer on them, we have to follow this workaround to remove the finalizer from our old resources. This lets us clean up without locking up.
```
kubectl patch \
    crd/turndownschedules.kubecost.k8s.io \
    -p '{"metadata":{"finalizers":[]}}' \
    --type=merge
```
Note: The following command may be unnecessary because Helm should automatically remove the turndownschedules.kubecost.k8s.io resource during the upgrade. The removal will remain in a pending state until the finalizer patch above is implemented.
```
kubectl delete crd turndownschedules.kubecost.k8s.io
```

Cluster Controller

The Cluster Controller is currently in beta. Please read the documentation carefully.

The Cluster Controller enables features like:

Feature functionality

The Cluster Controller can be enabled on any cluster type, but certain functionality will only be enabled based on the cloud service provider (CSP) of the cluster and its type:

The Cluster Controller can only be enabled on your primary cluster.
The Controller itself and container RRS are available for all cluster types and configurations.
Cluster turndown, cluster right-sizing, and Kubecost Actions are only available for GKE, EKS, and Kops-on-AWS clusters, after setting up a provider service key.

Provider service key setup

GKE setup

The following command performs the steps required to set up a service account. More info.

/bin/bash -c "$(curl -fsSL https://github.com/kubecost/cluster-turndown/releases/latest/download/gke-create-service-key.sh)" -- <Project ID> <Service Account Name> <Namespace> cluster-controller-service-key

To use this setup script, provide the following required parameters:

Project ID: The GCP project identifier. Can be found via: gcloud config get-value project
Namespace: The namespace which Kubecost will be installed, e.g kubecost
Service Account Name: The name of the service account to be created. Should be between 6 and 20 characters, e.g. kubecost-controller
Secret Name: The Kubecost will automatically look for a secret called cluster-controller-service-key. This can be changed by setting .Values.clusterController.secretName.

EKS setup

For EKS cluster provisioning, if using eksctl, make sure that you use the --managed option when creating the cluster. Unmanaged node groups should be upgraded to managed. More info.

Create a new User with AutoScalingFullAccess permissions, plus the following EKS-specific permissions:

{
    "Effect": "Allow",
    "Action": [
        "eks:ListClusters",
        "eks:DescribeCluster",
        "eks:DescribeNodegroup",
        "eks:ListNodegroups",
        "eks:CreateNodegroup",
        "eks:UpdateClusterConfig",
        "eks:UpdateNodegroupConfig",
        "eks:DeleteNodegroup",
        "eks:ListTagsForResource",
        "eks:TagResource",
        "eks:UntagResource"
    ],
    "Resource": "*"
},
{
    "Effect": "Allow",
    "Action": [
        "iam:GetRole",
        "iam:ListAttachedRolePolicies",
        "iam:PassRole"
    ],
    "Resource": "*"
}

Create a new file, service-key.json, and use the access key ID and secret access key to fill out the following template:

{
    "aws_access_key_id": "<ACCESS_KEY_ID>",
    "aws_secret_access_key": "<SECRET_ACCESS_KEY>"
}

Then, run the following to create the secret:

$ kubectl create secret generic cluster-controller-service-key -n <NAMESPACE> --from-file=service-key.json

Here is a full example of this process using the AWS CLI and a simple IAM user (requires jq):

NEW_IAM_USER
aws iam create-user \
    --user-name $NEW_IAM_USER

aws iam attach-user-policy \
    --user-name $NEW_IAM_USER \
    --policy-arn arn:aws:iam::aws:policy/AutoScalingFullAccess

read -r -d '' EKSPOLICY << EOM
{
    "Version": "2012-10-17",
    "Statement": [
    {
        "Effect": "Allow",
        "Action": [
            "eks:ListClusters",
            "eks:DescribeCluster",
            "eks:DescribeNodegroup",
            "eks:ListNodegroups",
            "eks:CreateNodegroup",
            "eks:UpdateClusterConfig",
            "eks:UpdateNodegroupConfig",
            "eks:DeleteNodegroup",
            "eks:ListTagsForResource",
            "eks:TagResource",
            "eks:UntagResource"
        ],
        "Resource": "*"
    },
    {
        "Effect": "Allow",
        "Action": [
            "iam:GetRole",
            "iam:ListAttachedRolePolicies",
            "iam:PassRole"
        ],
        "Resource": "*"
    }
    ]
}
EOM

aws iam put-user-policy \
    --user-name $NEW_IAM_USER \
    --policy-name "eks-permissions" \
    --policy-document "${EKSPOLICY}"

aws iam create-access-key \
    --user-name $NEW_IAM_USER --output json \
    > /tmp/aws-key.json

AAKI="$(jq -r '.AccessKey.AccessKeyId' /tmp/aws-key.json)"
ASAK="$(jq -r '.AccessKey.SecretAccessKey' /tmp/aws-key.json)"
kubectl create secret generic \
    cluster-controller-service-key \
    -n kubecost \
    --from-literal="service-key.json={\"aws_access_key_id\": \"${AAKI}\", \"aws_secret_access_key\": \"${ASAK}\"}"

Kops-on-AWS setup

Create a new user or IAM role with AutoScalingFullAccess permissions. JSON definition of those permissions:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "autoscaling:*",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "cloudwatch:PutMetricAlarm",
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "ec2:DescribeAccountAttributes",
                "ec2:DescribeAvailabilityZones",
                "ec2:DescribeImages",
                "ec2:DescribeInstanceAttribute",
                "ec2:DescribeInstances",
                "ec2:DescribeKeyPairs",
                "ec2:DescribeLaunchTemplateVersions",
                "ec2:DescribePlacementGroups",
                "ec2:DescribeSecurityGroups",
                "ec2:DescribeSpotInstanceRequests",
                "ec2:DescribeSubnets",
                "ec2:DescribeVpcClassicLink"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "elasticloadbalancing:DescribeLoadBalancers",
                "elasticloadbalancing:DescribeTargetGroups"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": "iam:CreateServiceLinkedRole",
            "Resource": "*",
            "Condition": {
                "StringEquals": {
                    "iam:AWSServiceName": "autoscaling.amazonaws.com"
                }
            }
        }
    ]
}

Create a new file, service-key.json, and use the access key ID and secret access key to fill out the following template:

{
    "aws_access_key_id": "<ACCESS_KEY_ID>",
    "aws_secret_access_key": "<SECRET_ACCESS_KEY>"
}

Then run the following to create the secret:

$ kubectl create secret generic cluster-controller-service-key -n <NAMESPACE> --from-file=service-key.json

Deploying

You can now enable the Cluster Controller in the Helm chart by finding the clusterController Helm flag and setting enabled: true

clusterController:
  enabled: true

You may also enable via --set when running Helm install:

--set clusterController.enabled=true

Verify the Cluster Controller is running

You can verify that the Cluster Controller is running by issuing the following:

kubectl get pods -n kubecost -l app=kubecost-cluster-controller

Once the Cluster Controller has been enabled successfully, you should automatically have access to the listed Savings features.

Cluster Turndown

Cluster turndown is currently in beta. Please read the documentation carefully.

How it works

Cluster turndown is only available for clusters on GKE, EKS, or Kops-on-AWS.

Managed cluster strategy (e.g. GKE + EKS)

GKE autoscaler strategy

Whenever there exists at least one NodePool with the cluster-autoscaler enabled, the cluster-turndown pod will:

Resize all non-autoscaling nodepools to 0
Schedule the turndown on one of the autoscaler nodepool nodes
Once it is brought back up (rescheduled to the selected node), the turndown pod will start a process called "flattening" which attempts to set deployment replicas to 0, turn off jobs, and annotate pods with labels that allow the autoscaler to do the rest of the work. Flattening persists pre-turndown values in the annotations of Kubernetes objects. The GKE autoscaler behavior is expected to handle the rest: removing now-unneeded nodes from the node pools. A limitation of this strategy is that the autoscaled node pools won't go below their configured minimum node count.
When turn up occurs, deployments and DaemonSets are "expanded" to their original sizes/replicas.

There are four annotations that can be applied for this process:

kubecost.kubernetes.io/job-suspend: Stores a bool containing the previous paused state of a kubernetes CronJob.
kubecost.kubernetes.io/turn-down-replicas: Stores the previous number of replicas set on the deployment.
kubecost.kubernetes.io/turn-down-rollout: Stores the previous maxUnavailable for the deployment rollout.
kubecost.kubernetes.io/safe-evict: Uses the cluster-autoscaler.kubernetes.io/safe-to-evict for autoscaling clusters to have the autoscaler preserve any deployments that previously had this annotation set, so scale up occurs, this value isn't unintentionally reset.

AWS Kops strategy

Setup

Prerequisites

kubectl
Enable the Cluster Controller

Verify the pod is running

You can verify that the cluster-turndown pod is running with the following command:

$ kubectl get pods -l app=cluster-turndown -n turndown

Setting a turndown schedule

Turndown uses a Kubernetes Custom Resource Definition to create schedules. Here is an example resource located at artifacts/example-schedule.yaml:

apiVersion: kubecost.com/v1alpha1
kind: TurndownSchedule
metadata:
  name: example-schedule
  finalizers:
  - "finalizer.kubecost.com"
spec:
  start: 2020-03-12T00:00:00Z
  end: 2020-03-12T12:00:00Z
  repeat: daily

none: Single schedule turndown and turnup.
daily: Start and end times will reschedule every 24 hours.
weekly: Start and end times will reschedule every 7 days.

To create this schedule, you may modify example-schedule.yaml to your desired schedule and run:

$ kubectl apply -f artifacts/example-schedule.yaml

Currently, updating a resource is not supported, so if the scheduling of the example-schedule.yaml fails, you will need to delete the resource via:

$ kubectl delete tds example-schedule

Then make the modifications to the schedule and re-apply.

Viewing a turndown schedule

The turndownschedule resource can be listed via kubectl as well:

$ kubectl get turndownschedules

or using the shorthand:

$ kubectl get tds

Details regarding the status of the turndown schedule can be found by outputting as a JSON or YAML:

$ kubectl get tds example-schedule -o yaml

apiVersion: kubecost.com/v1alpha1
kind: TurndownSchedule
metadata:
  annotations:
    kubectl.kubernetes.io/last-applied-configuration: |
      {"apiVersion":"kubecost.com/v1alpha1","kind":"TurndownSchedule","metadata":{"annotations":{},"finalizers":["finalizer.kubecost.com"],"name":"example-schedule"},"spec":{"end":"2020-03-17T00:35:00Z","repeat":"daily","start":"2020-03-17T00:20:00Z"}}
  creationTimestamp: "2020-03-17T00:18:39Z"
  finalizers:
  - finalizer.kubecost.com
  generation: 1
  name: example-schedule
  resourceVersion: "33573"
  selfLink: /apis/kubecost.com/v1alpha1/turndownschedules/example-schedule
  uid: d9b16aed-67e4-11ea-b591-42010a8e0075
spec:
  end: "2020-03-17T00:35:00Z"
  repeat: daily
  start: "2020-03-17T00:20:00Z"
status:
  current: scaledown
  lastUpdated: "2020-03-17T00:36:39Z"
  nextScaleDownTime: "2020-03-18T00:21:38Z"
  nextScaleUpTime: "2020-03-18T00:36:38Z"
  scaleDownId: 38ebf595-4e2b-46e9-951a-1e3ceff30536
  scaleDownMetadata:
    repeat: daily
    type: scaledown
  scaleUpID: 869ec89f-a8d8-450b-9ebb-71cd4d7fbaf8
  scaleUpMetadata:
    repeat: daily
    type: scaleup
  state: ScheduleSuccess

The status field displays the current status of the schedule including next schedule times, specific schedule identifiers, and the overall state of schedule.

state: The state of the turndown schedule. This can be:
- ScheduleSuccess: The schedule has been set and is waiting to run.
- ScheduleFailed: The scheduling failed due to a schedule already existing, scheduling for a date-time in the past.
- ScheduleCompleted: For schedules with repeat: none, the schedule will move to a completed state after turn up.
current: The next action to run.
lastUpdated: The last time the status was updated on the schedule.
nextScaleDownTime: The next time a turndown will be executed.
nextScaleUpTime: The next time at turn up will be executed.
scaleDownId: Specific identifier assigned by the internal scheduler for turndown.
scaleUpId: Specific identifier assigned by the internal scheduler for turn up.
scaleDownMetadata: Metadata attached to the scaledown job, assigned by the turndown scheduler.
scaleUpMetadata: Metadata attached to the scale up job, assigned by the turndown scheduler.

Canceling a turndown schedule

A turndown can be canceled before turndown actually happens or after. This is performed by deleting the resource:

$ kubectl delete tds example-schedule

Canceling while turndown is currently scaling down or scaling up will result in a delayed cancellation, as the schedule must complete its operation before processing the deletion/cancellation.

If the turndown schedule is canceled between a turndown and turn up, the turn up will occur automatically upon cancellation.

Using cluster turndown via UI

Limitations

The internal scheduler only allows one schedule at a time to be used. Any additional schedule resources created will fail (kubectl get tds -o yaml will display the status).
Do not attempt to kubectl edit a turndown schedule. This is currently not supported. Recommended approach for modifying is to delete and then create a new schedule.
There is a 20-minute minimum time window between start and end of turndown schedule.