Note: the steps in this article were verified on OpenShift 4.18.


Before adding NVIDIA GPU monitoring to the OpenShift console, NVIDIA GPU support must already be installed and configured on the OpenShift cluster.

Install the GPU Console Plugin

  1. Install Helm.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
  2. Add the Helm repository, then install and deploy console-plugin-nvidia-gpu.
helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
helm repo update
helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu
  3. Check the status of the deployed resources.
$ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME                                             READY   STATUS    RESTARTS   AGE
pod/console-plugin-nvidia-gpu-7bcf8b4b99-f2tp2   1/1     Running   0          35s
 
NAME                                TYPE        CLUSTER-IP      EXTERNAL-IP   PORT(S)    AGE
service/console-plugin-nvidia-gpu   ClusterIP   172.30.10.228   <none>        9443/TCP   35s
 
NAME                                        READY   UP-TO-DATE   AVAILABLE   AGE
deployment.apps/console-plugin-nvidia-gpu   1/1     1            1           35s
 
NAME                                                   DESIRED   CURRENT   READY   AGE
replicaset.apps/console-plugin-nvidia-gpu-7bcf8b4b99   1         1         1       35s
  4. Check whether the OpenShift console configuration already has a plugin list, and whether console-plugin-nvidia-gpu is in it.
$ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
["networking-console-plugin","monitoring-plugin"]
  5. Depending on the result, run one of the following two commands.
# If the command returned nothing (spec.plugins is not set), create the list
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge
 
# If spec.plugins already lists other plugins, append to it (a merge patch would overwrite the existing list)
$ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json
  6. Refresh the OpenShift console home page. It now shows the status and monitoring data of the detected GPUs.
  7. Open the “Compute” - “GPUs” menu and select a physical GPU on the page to see its monitoring metrics. The lower part of the page also lists the workloads running on that GPU.
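
The check-then-patch steps above can be folded into one small script. This is a minimal sketch rather than part of the upstream instructions: the function names `choose_plugin_patch` and `enable_gpu_console_plugin` are hypothetical, and the sketch assumes the input is exactly what the `jsonpath` query prints (an empty string when `spec.plugins` is unset, otherwise a JSON array of plugin names).

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide how to enable the plugin, given the output of
#   oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
# Prints one of: "already-enabled", "merge", "json".
choose_plugin_patch() {
  local current="$1"
  local plugin="console-plugin-nvidia-gpu"
  case "$current" in
    *"\"$plugin\""*) echo "already-enabled" ;;  # plugin is already in the list
    ""|"[]"|"null")  echo "merge" ;;            # no usable list yet: create it
    *)               echo "json" ;;             # list exists: append to it
  esac
}

# Apply the matching patch. Assumes `oc` is logged in with sufficient rights.
enable_gpu_console_plugin() {
  local current
  current="$(oc get consoles.operator.openshift.io cluster --output=jsonpath='{.spec.plugins}')"
  case "$(choose_plugin_patch "$current")" in
    already-enabled)
      echo "console-plugin-nvidia-gpu is already enabled" ;;
    merge)
      oc patch consoles.operator.openshift.io cluster --type=merge \
        --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' ;;
    json)
      oc patch consoles.operator.openshift.io cluster --type=json \
        --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' ;;
  esac
}
```

Calling `enable_gpu_console_plugin` after the Helm install should be safe to repeat, since the function does nothing once the plugin name is already in the list. Keeping the decision in `choose_plugin_patch` separate from the `oc` calls also makes the logic easy to check without a cluster.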

Install the NVIDIA DCGM Dashboard

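  1. Run the following command to enable dcgmExporter. It sets the dcgmExporter configuration in the ClusterPolicy to use the console-plugin-nvidia-gpu config.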
$ oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge
  2. Download the dcgmExporter dashboard definition, then create a ConfigMap from it and label the ConfigMap.
$ curl -LfO https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json
$ oc create configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed --from-file=dcgm-exporter-dashboard.json

# To enable the dashboard in the Administrator view in the OpenShift Web UI 
$ oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/dashboard=true"

# To enable the dashboard for the developer view
$ oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/odc-dashboard=true"

$ oc get cm nvidia-dcgm-exporter-dashboard -n openshift-config-managed --show-labels
NAME                             DATA   AGE   LABELS
nvidia-dcgm-exporter-dashboard   1      31s   console.openshift.io/dashboard=true,console.openshift.io/odc-dashboard=true
  3. Finally, open the “Observe” - “Dashboards” menu in the OpenShift console and select NVIDIA DCGM Exporter Dashboard to see the tracked GPU metrics.
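
If the dashboard does not appear in the menu, one thing worth checking is whether both labels are actually present on the ConfigMap. The following is a minimal sketch of such a check, not an upstream tool: the function names `missing_dashboard_labels` and `ensure_dashboard_labels` are hypothetical, and the sketch assumes the label string has the comma-separated form shown in the LABELS column of `oc get cm ... --show-labels`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print each required dashboard label that is absent
# from the comma-separated label string passed as $1.
missing_dashboard_labels() {
  local labels="$1" want
  for want in "console.openshift.io/dashboard=true" \
              "console.openshift.io/odc-dashboard=true"; do
    case ",$labels," in
      *",$want,"*) ;;          # label present, nothing to report
      *) echo "$want" ;;       # label missing
    esac
  done
}

# Read the current labels from the cluster and add whichever are missing.
# Assumes `oc` is logged in and the ConfigMap has already been created.
ensure_dashboard_labels() {
  local labels label
  labels="$(oc get cm nvidia-dcgm-exporter-dashboard -n openshift-config-managed \
    --show-labels --no-headers | awk '{print $NF}')"
  missing_dashboard_labels "$labels" | while read -r label; do
    oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "$label"
  done
}
```

Wrapping the label string in commas before matching keeps `console.openshift.io/dashboard=true` from being mistaken for a substring of some longer label value.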

References

https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/enable-gpu-monitoring-dashboard.html
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/enable-gpu-op-dashboard.html
https://github.com/stratus-ss/openshift-ai/blob/main/docs/rendered/OpenShift_AI_User_Interface.md
https://github.com/rh-aiservices-bu/rhoai-uwm/tree/main/rhoai-uwm-grafana/overlays/rhoai-uwm-user-grafana-app
https://ai-on-openshift.io/odh-rhoai/kserve-uwm-dashboard-metrics/
