NVIDIA GPU Monitoring Configurations

Monitoring GPU utilization

Kubecost relies on metrics from NVIDIA DCGM Exporter to understand GPU utilization. Kubecost searches for GPU metrics by default, but because DCGM Exporter is the provider of those metrics, it is a required component for GPU monitoring and must be installed if it is not already present. In many cases DCGM Exporter is already installed in your cluster, for example if you currently monitor NVIDIA GPUs with other software. If not, follow the instructions below to install and configure DCGM Exporter on each of your GPU-enabled clusters.

Install DCGM Exporter

DCGM Exporter is an implementation of NVIDIA Data Center GPU Manager (DCGM) for Kubernetes that exports metrics in Prometheus format. It runs the DCGM software on Kubernetes nodes that contain NVIDIA devices and makes DCGM metrics available to external tools such as Kubecost.

DCGM Exporter runs as a DaemonSet, and its pods are intended to run only on nodes with one or more NVIDIA GPUs. Because Kubernetes clusters commonly mix nodes with and without GPUs, you use one or more labels to restrict the DCGM Exporter pods to nodes containing NVIDIA GPUs. If DCGM Exporter pods run on nodes without NVIDIA GPUs, they enter a CrashLoopBackOff state. The label(s) you use may vary by Kubernetes cloud provider or platform. There are multiple approaches to selecting the label(s) used to attract the DCGM Exporter pods to applicable nodes.

  1. Use a label already provided by your cloud provider (if applicable, varies by cloud provider).

  2. Use a custom label you define on your GPU nodes. For example, by defining a custom label at the node pool level in your cloud provider.

  3. Use a label assigned automatically by Kubernetes Node Feature Discovery (NFD).

The first two options require no additional cluster components be installed while the third requires the Kubernetes Node Feature Discovery (NFD) component. Kubecost recommends using an existing label assigned to your GPU nodes (provided by the cloud provider or yourself), if possible, as this is a simpler installation path.
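For the second option, a custom label can also be applied directly with kubectl. The following is a minimal sketch, assuming the placeholder label mylabel=myvalue and a node name of your own; the KUBECTL variable is an illustrative hook (not a kubectl feature) that lets you preview the command by setting it to "echo kubectl".

```shell
# Hypothetical helper: apply a custom label to a single GPU node.
# KUBECTL is an illustrative override for dry runs, e.g. KUBECTL="echo kubectl".
label_gpu_node() {
  local node="$1" key="$2" value="$3"
  ${KUBECTL:-kubectl} label node "$node" "${key}=${value}" --overwrite
}

# List the nodes that currently carry the label.
list_labeled_nodes() {
  local key="$1" value="$2"
  ${KUBECTL:-kubectl} get nodes -l "${key}=${value}"
}

# Example (placeholder names):
#   label_gpu_node my-gpu-node-1 mylabel myvalue
#   list_labeled_nodes mylabel myvalue
```

Labels applied to an individual node do not carry over to replacement nodes, so defining the label at the node pool level remains preferable for autoscaled pools.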

In addition to the label requirement, a successful installation of DCGM Exporter may need additional values, which vary by cloud provider and worker node operating system. This guide includes installation instructions for three scenarios: General Quickstart, GKE, and Node Feature Discovery.

DCGM Exporter may also be deployed via the NVIDIA GPU operator. However, the operator is a more complex component with specialized requirements and, as such, is outside the current scope of this documentation.

These instructions have been verified on DCGM Exporter version 3.3.6-3.4.2, but earlier v3 releases should work as well.

General Quickstart

DCGM Exporter can be installed on most Kubernetes clusters with only a few values, as long as a preexisting label can be used to identify GPU nodes. This label may be provided by a cloud vendor or by yourself. Follow these steps to get started with DCGM Exporter.

In the values below, replace mylabel and myvalue with your own label key and value. This label combination should be unique to NVIDIA GPU nodes.

values-dcgm.yaml
serviceMonitor:
  enabled: false

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: mylabel
          operator: In
          values:
          - "myvalue"

extraConfigMapVolumes:
  - name: exporter-metrics-volume
    configMap:
      name: exporter-metrics-config-map
      items:
      - key: metrics
        path: dcp-metrics-included.csv

extraVolumeMounts:
  - name: exporter-metrics-volume
    mountPath: /etc/dcgm-exporter/dcp-metrics-included.csv
    readOnly: true
    subPath: dcp-metrics-included.csv

Install DCGM Exporter using the values defined.

helm upgrade -i dcgm dcgm-exporter \
  --repo https://nvidia.github.io/dcgm-exporter/helm-charts \
  -n dcgm-exporter --create-namespace \
  -f values-dcgm.yaml

Ensure the DCGM Exporter pods are in a running state and only on the nodes with NVIDIA GPUs.

kubectl -n dcgm-exporter get pods
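On clusters with many GPU nodes, scanning the pod list by eye can be tedious. The sketch below is one possible filter, not part of DCGM Exporter or Kubecost: it reads the output of kubectl get pods with the --no-headers flag and reports any pod whose STATUS column is not Running. The function name is hypothetical.

```shell
# Reads `kubectl get pods --no-headers` output on stdin and prints the name
# and status of every pod whose STATUS column (third field) is not "Running".
# Returns non-zero when at least one such pod is found.
pods_not_running() {
  awk 'NF >= 3 && $3 != "Running" { print $1, $3; bad = 1 }
       END { exit (bad ? 1 : 0) }'
}

# Usage against a live cluster:
#   kubectl -n dcgm-exporter get pods --no-headers | pods_not_running
```

Adding -o wide to the kubectl command also shows the node each pod landed on, which helps confirm that pods run only on GPU nodes.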

Finally, perform a validation step to ensure that metrics are working as expected. See the Validation section for details.

GKE

To install DCGM Exporter on a GKE cluster where the worker nodes use the default Container-Optimized OS (COS), use the following values. The GKE-provided label cloud.google.com/gke-accelerator is used to attract DCGM Exporter pods to nodes with NVIDIA GPUs.

These values have been verified on GKE 1.27 and DCGM Exporter 3.3.6-3.4.2. Ensure you check and follow the current values structure of the target version of DCGM Exporter to be installed if different.

values-dcgm.yaml
serviceMonitor:
  enabled: false

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: cloud.google.com/gke-accelerator
              operator: Exists

tolerations:
  - operator: Exists

securityContext:
  privileged: true

extraHostVolumes:
  - name: vulkan-icd-mount
    hostPath: /home/kubernetes/bin/nvidia/vulkan/icd.d
  - name: nvidia-install-dir-host
    hostPath: /home/kubernetes/bin/nvidia

extraConfigMapVolumes:
  - name: exporter-metrics-volume
    configMap:
      name: exporter-metrics-config-map
      items:
      - key: metrics
        path: dcp-metrics-included.csv

extraVolumeMounts:
  - name: nvidia-install-dir-host
    mountPath: /usr/local/nvidia
    readOnly: true
  - name: vulkan-icd-mount
    mountPath: /etc/vulkan/icd.d
    readOnly: true
  - name: exporter-metrics-volume
    mountPath: /etc/dcgm-exporter/dcp-metrics-included.csv
    subPath: dcp-metrics-included.csv

extraEnv:
- name: DCGM_EXPORTER_KUBERNETES_GPU_ID_TYPE
  value: device-name

Install DCGM Exporter from the available Helm chart while supplying the values defined above.

helm upgrade -i dcgm dcgm-exporter \
  --repo https://nvidia.github.io/dcgm-exporter/helm-charts \
  -n dcgm-exporter --create-namespace \
  -f values-dcgm.yaml

Ensure the DCGM Exporter pods are in a running state and only on the nodes with NVIDIA GPUs.

kubectl -n dcgm-exporter get pods

For additional information on installing DCGM Exporter in Google Cloud, see here.

Finally, perform a validation step to ensure that metrics are working as expected. See the Validation section for details.

Node Feature Discovery

These instructions are useful for installing DCGM Exporter on any Kubernetes cluster, whether it is run by a cloud provider or self-managed on-premises. They leverage the Kubernetes Node Feature Discovery (NFD) component, which requires installing an additional infrastructure component. Following these steps is recommended when you are not on GKE or do not have a preexisting label that identifies NVIDIA GPU nodes.

When following these instructions on a cloud provider, there may be additional values or steps required depending on the component installed.

Node Feature Discovery (NFD) is a Kubernetes utility that automatically discovers information about and capabilities of your worker nodes and saves this information as labels applied to the node. For example, NFD will discover the CPU details, OS, and PCI cards installed in a worker node on which the NFD pod runs. These labels can be useful in a number of scenarios beyond installation of DCGM Exporter. Some example labels are shown below.

<snip>
feature.node.kubernetes.io/cpu-cpuid.ADX: "true"
feature.node.kubernetes.io/cpu-cpuid.AESNI: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX: "true"
feature.node.kubernetes.io/cpu-cpuid.AVX2: "true"
<snip>

When run on a node with an NVIDIA GPU, NFD will apply the label feature.node.kubernetes.io/pci-10de.present="true". This label can then be used to attract DCGM Exporter pods to NVIDIA GPU nodes automatically.

10DE is the PCI vendor ID assigned to NVIDIA Corporation.

NFD may be installed either standalone or as a component of the NVIDIA device plugin for Kubernetes. When installing NFD via the device plugin, you enable the GPU Feature Discovery (GFD) component at the same time. GFD uses the labels written by NFD to locate NVIDIA GPU nodes and write NVIDIA-specific information about the discovered GPUs to the node.

Cloud providers often install the device plugin on GPU nodes automatically. Therefore, in order to deploy GFD and NFD you may be required to upgrade or uninstall/reinstall the device plugin, which is a more advanced procedure. See instructions from your cloud provider first and refer to the NVIDIA device plugin for Kubernetes repository for further details.

To install NFD as a standalone component, follow the NFD deployment guide. A quick start command is also shown below. In some cases, you may have taints applied to GPU nodes which must be tolerated by the NFD DaemonSet. If so, it is recommended to use the Helm installation guide to define tolerations.

# This command uses Kustomize to deploy Kubernetes resources from a specific version.
# Refer to the NFD releases to choose the latest, or most applicable, version.
kubectl apply -k "https://github.com/kubernetes-sigs/node-feature-discovery/deployment/overlays/default?ref=v0.16.3"

Once NFD is installed, ensure one pod is running on your node(s) with NVIDIA GPUs.

kubectl -n node-feature-discovery get pods

After a few moments, check the labels of one such node to ensure the feature.node.kubernetes.io/pci-10de.present="true" label has been applied.

kubectl get no <my_node_name> -o yaml | yq .metadata.labels

An abridged output of the labels written to an EKS node is shown below.

<snip>
feature.node.kubernetes.io/kernel-version.full: 5.10.219-208.866.amzn2.x86_64
feature.node.kubernetes.io/kernel-version.major: "5"
feature.node.kubernetes.io/kernel-version.minor: "10"
feature.node.kubernetes.io/kernel-version.revision: "219"
feature.node.kubernetes.io/pci-10de.present: "true"
feature.node.kubernetes.io/pci-1d0f.present: "true"
feature.node.kubernetes.io/storage-nonrotationaldisk: "true"
<snip>
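To check for the label without reading the full list, the label output can be piped into a small filter. This is a minimal sketch with a hypothetical function name; it assumes the labels arrive on stdin in the YAML form shown above.

```shell
# Reads node labels in YAML form on stdin and succeeds only if the
# NVIDIA PCI vendor label written by NFD is present and set to "true".
has_nvidia_pci_label() {
  grep -Eq '^[[:space:]]*feature\.node\.kubernetes\.io/pci-10de\.present:[[:space:]]*"?true"?[[:space:]]*$'
}

# Usage against a live cluster (replace the node name placeholder):
#   kubectl get no my_node_name -o yaml | yq .metadata.labels | has_nvidia_pci_label \
#     && echo "NVIDIA GPU label present"
```

Alternatively, kubectl get nodes -l feature.node.kubernetes.io/pci-10de.present=true lists every node that carries the label.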

With NFD having successfully discovered NVIDIA PCI devices and assigned the feature.node.kubernetes.io/pci-10de.present="true" label, install DCGM Exporter using this label to attract pods to GPU nodes. When following this process on GKE, additional values may be required to successfully run DCGM Exporter. See the GKE section for more details.

values-dcgm.yaml
serviceMonitor:
  enabled: false

affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
      - matchExpressions:
        - key: feature.node.kubernetes.io/pci-10de.present
          operator: In
          values:
          - "true"

extraConfigMapVolumes:
  - name: exporter-metrics-volume
    configMap:
      name: exporter-metrics-config-map
      items:
      - key: metrics
        path: dcp-metrics-included.csv

extraVolumeMounts:
  - name: exporter-metrics-volume
    mountPath: /etc/dcgm-exporter/dcp-metrics-included.csv
    readOnly: true
    subPath: dcp-metrics-included.csv

Install DCGM Exporter using the values defined.

helm upgrade -i dcgm dcgm-exporter \
  --repo https://nvidia.github.io/dcgm-exporter/helm-charts \
  -n dcgm-exporter --create-namespace \
  -f values-dcgm.yaml

Ensure the DCGM Exporter pods are in a running state and only on the nodes with NVIDIA GPUs.

kubectl -n dcgm-exporter get pods

Finally, perform a validation step to ensure that metrics are working as expected. See the Validation section for details.

Customizing Metrics

DCGM Exporter presents a number of useful metrics by default. However, there are many more metrics available from DCGM which are not enabled by default. Kubecost may collect additional metrics about NVIDIA GPUs if they are emitted by DCGM Exporter. Configuring DCGM Exporter to emit additional metrics requires modification of the metrics configuration ConfigMap. Follow the procedure below to configure DCGM Exporter to emit additional metrics. Please be aware that emission of additional DCGM Exporter metrics does not necessarily mean Kubecost will collect and make use of them. This procedure should only be followed at the explicit advice of Kubecost support.

This procedure assumes you have installed DCGM Exporter according to one of the processes outlined in the Install DCGM Exporter section. Specifically, it assumes you have used the provided Helm values to mount the ConfigMap included with DCGM Exporter. If that is not the case or you had DCGM Exporter already installed, you may need to modify your deployment accordingly.

Modify the metrics ConfigMap

In this step, you update the ConfigMap used by DCGM Exporter to include additional metrics. Because this ConfigMap takes comma-separated values (CSV), you must append the new metrics to the ConfigMap in the same format. Rather than modify the ConfigMap directly by using an imperative command such as kubectl edit configmap, it is preferable and more reliable to dump the ConfigMap first, edit the values, and re-apply it. If using a GitOps approach, check with your cluster administrator as you may need to make modifications in git rather than in the cluster directly, otherwise changes may be reverted.

Export the metrics ConfigMap to your local system.

kubectl -n dcgm-exporter get cm exporter-metrics-config-map -o yaml > exporter-metrics-config-map.yaml

Open the exporter-metrics-config-map.yaml YAML file in your editor of choice.

Under the metrics key, scroll to the bottom and insert as new lines the additional metrics you wish DCGM Exporter to emit. You must provide these metrics in CSV format which is <metric>, <type>, <description>. The <type> is especially important as the wrong type will render DCGM Exporter unable to start because the metric configuration will be invalid.

As an example, provide the following new entries at the bottom of the metrics key. Take care to indent the lines the same way as the existing entries. Lines beginning with the # character are comments.

# Kubecost custom metrics
DCGM_FI_PROF_SM_ACTIVE,          gauge, The ratio of cycles an SM has at least 1 warp assigned (in %).
DCGM_FI_PROF_SM_OCCUPANCY,       gauge, The ratio of number of warps resident on an SM (in %).
DCGM_FI_DEV_MEM_MAX_OP_TEMP,     gauge, Maximum operating temperature for the memory of this GPU.
DCGM_FI_DEV_GPU_MAX_OP_TEMP,     gauge, Maximum operating temperature for this GPU.
DCGM_FI_DEV_POWER_MGMT_LIMIT,    gauge, Current Power limit for the device.
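Because an invalid entry can prevent DCGM Exporter from starting, it can help to check new lines before applying the ConfigMap. The following is a rough sketch, not an official validator: it only checks that each non-comment line has at least three comma-separated fields, a plausible field name, and a known type. The accepted type set (gauge, counter, label) is an assumption based on the types that appear in DCGM Exporter's default metrics file.

```shell
# Reads metric definition lines on stdin and prints each line that does not
# look like "<metric>, <type>, <description>". Returns non-zero if any
# invalid line was found. The type list is an assumption (see lead-in).
check_metric_lines() {
  awk '
    /^[ \t]*(#|$)/ { next }                  # skip comments and blank lines
    {
      n = split($0, f, ",")
      name = f[1]; type = f[2]
      gsub(/^[ \t]+|[ \t]+$/, "", name)      # trim surrounding whitespace
      gsub(/^[ \t]+|[ \t]+$/, "", type)
      if (n < 3 || name !~ /^[A-Za-z_][A-Za-z0-9_]*$/ || type !~ /^(gauge|counter|label)$/) {
        print "invalid: " $0
        bad = 1
      }
    }
    END { exit (bad ? 1 : 0) }
  '
}

# Usage: paste the new lines into a scratch file and run
#   check_metric_lines < new-metrics.csv
```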

Save the changes to the exporter-metrics-config-map.yaml YAML file and apply it back to the cluster using kubectl apply.

kubectl -n dcgm-exporter apply -f exporter-metrics-config-map.yaml

The following output may be displayed. Disregard the warning if present.

Warning: resource configmaps/exporter-metrics-config-map is missing the kubectl.kubernetes.io/last-applied-configuration annotation which is required by kubectl apply. kubectl apply should only be used on resources created declaratively by either kubectl create --save-config or kubectl apply. The missing annotation will be patched automatically.
configmap/exporter-metrics-config-map configured

Restart DCGM Exporter

After the changes are applied, you must restart the DCGM Exporter DaemonSet which will cause the new pods to read the modified ConfigMap.

kubectl -n dcgm-exporter rollout restart daemonset dcgm-dcgm-exporter

After a few moments, check the DCGM Exporter pods to ensure that all are in a running state. If any are in a CrashLoopBackOff state, there may be errors in the ConfigMap you edited in the previous step. Inspect and rectify any errors and try again.
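Instead of waiting an arbitrary interval, the restart can be followed with kubectl rollout status, which blocks until the DaemonSet finishes rolling out or the timeout expires. This is a minimal sketch using the DaemonSet name from the restart command above; the function name, the default timeout of 120s, and the KUBECTL override are illustrative assumptions.

```shell
# Wait for the restarted DaemonSet to finish rolling out.
# KUBECTL is an illustrative override for dry runs, e.g. KUBECTL="echo kubectl".
wait_for_dcgm_rollout() {
  local timeout="${1:-120s}"
  ${KUBECTL:-kubectl} -n dcgm-exporter rollout status daemonset dcgm-dcgm-exporter \
    --timeout="$timeout"
}

# Example: wait up to five minutes.
#   wait_for_dcgm_rollout 300s
```

A non-zero exit status indicates the rollout did not complete in time, which is a cue to inspect the pod logs and the edited ConfigMap.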

Validation

To validate your DCGM Exporter configuration, port-forward into the DCGM Exporter service and ensure first that metrics are being exposed.

kubectl -n dcgm-exporter port-forward svc/dcgm-dcgm-exporter 9400:9400

Use cURL to perform a GET request against the service and verify that multiple metrics and their values are shown.

curl localhost:9400/metrics

An output similar to below should be shown.

# HELP DCGM_FI_DEV_SM_CLOCK SM clock frequency (in MHz).
# TYPE DCGM_FI_DEV_SM_CLOCK gauge
DCGM_FI_DEV_SM_CLOCK{gpu="0",UUID="GPU-93ef0036-98de-4946-648a-eca7040afbeb",device="nvidia0",modelName="Tesla T4",Hostname="myhost1.compute.internal"} 300
<snip>
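The full metrics output can be long. As a convenience, the sketch below condenses it into a list of distinct DCGM metric names with a sample count for each, which makes it easier to confirm that an expected metric is present. The function name is hypothetical and the parsing assumes the Prometheus text format shown above.

```shell
# Reads Prometheus text-format metrics on stdin and prints each distinct
# DCGM metric name followed by the number of samples seen for it.
dcgm_metric_summary() {
  awk '
    /^#/ || NF == 0 { next }            # skip HELP/TYPE comments and blank lines
    {
      name = $0
      sub(/[{ ].*$/, "", name)          # keep text before the first "{" or space
      if (name ~ /^DCGM_/) count[name]++
    }
    END { for (n in count) print n, count[n] }
  ' | sort
}

# Usage while the port-forward is active:
#   curl -s localhost:9400/metrics | dcgm_metric_summary
```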

If Kubecost has already been installed, next check the bundled Prometheus instance to ensure that the metrics from DCGM Exporter have been collected and are visible. This command exposes the Prometheus web interface on local port 8080.

kubectl -n kubecost port-forward svc/kubecost-prometheus-server 8080:80

Open the Prometheus web interface in your browser by navigating to http://localhost:8080. In the search box, begin typing the prefix for a metric, for example DCGM_FI_DEV_POWER_USAGE. Click Execute to run the query and verify that data is returned.
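The same check can be made from the command line through the Prometheus HTTP API instant query endpoint (/api/v1/query). This is a minimal sketch, assuming the port-forward above is still active on local port 8080; the function name and the CURL override are illustrative.

```shell
# Run an instant query against a Prometheus server and print the JSON response.
# CURL is an illustrative override for dry runs, e.g. CURL="echo curl".
prom_instant_query() {
  local base_url="$1" expr="$2"
  ${CURL:-curl} -s --get "${base_url}/api/v1/query" --data-urlencode "query=${expr}"
}

# Example:
#   prom_instant_query http://localhost:8080 DCGM_FI_DEV_POWER_USAGE
```

A response with "status":"success" and a non-empty result array indicates that Prometheus has collected the metric.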
