OpenShift AI - Adding NVIDIA GPU Monitoring to the Console
Note: the steps in this article have been verified on OpenShift 4.18.
Before adding NVIDIA GPU monitoring to the OpenShift console, NVIDIA GPU support must already be installed and configured on the OpenShift cluster.
Install the GPU Console Plugin
- Install Helm.
curl -fsSL -o get_helm.sh https://raw.githubusercontent.com/helm/helm/main/scripts/get-helm-3
chmod 700 get_helm.sh
./get_helm.sh
- Add the Helm repository, then install and deploy console-plugin-nvidia-gpu.
helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu
helm repo update
helm install -n nvidia-gpu-operator console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu
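If these steps may be re-run against a cluster where the release already exists, `helm install` fails on the second attempt. A minimal sketch of a re-runnable variant, assuming the same repository, release name, and namespace as above, uses `helm upgrade --install` instead. The function name `install_gpu_plugin` is hypothetical.

```shell
# Sketch: re-runnable install of the console plugin chart.
# `install_gpu_plugin` is a hypothetical helper name; the namespace argument
# defaults to the one used in this article.
install_gpu_plugin() {
  namespace="${1:-nvidia-gpu-operator}"
  helm repo add rh-ecosystem-edge https://rh-ecosystem-edge.github.io/console-plugin-nvidia-gpu &&
  helm repo update &&
  # `upgrade --install` installs the release if absent, upgrades it otherwise.
  helm upgrade --install -n "$namespace" console-plugin-nvidia-gpu rh-ecosystem-edge/console-plugin-nvidia-gpu
}
```

Calling `install_gpu_plugin` with no arguments targets the nvidia-gpu-operator namespace.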
- Check the deployment status of the related resources.
$ oc -n nvidia-gpu-operator get all -l app.kubernetes.io/name=console-plugin-nvidia-gpu
Warning: apps.openshift.io/v1 DeploymentConfig is deprecated in v4.14+, unavailable in v4.10000+
NAME READY STATUS RESTARTS AGE
pod/console-plugin-nvidia-gpu-7bcf8b4b99-f2tp2 1/1 Running 0 35s
NAME TYPE CLUSTER-IP EXTERNAL-IP PORT(S) AGE
service/console-plugin-nvidia-gpu ClusterIP 172.30.10.228 <none> 9443/TCP 35s
NAME READY UP-TO-DATE AVAILABLE AGE
deployment.apps/console-plugin-nvidia-gpu 1/1 1 1 35s
NAME DESIRED CURRENT READY AGE
replicaset.apps/console-plugin-nvidia-gpu-7bcf8b4b99 1 1 1 35s
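Instead of re-running the status query until the pod reports Running, the rollout can be waited on explicitly. This is a minimal sketch, assuming the Deployment name and namespace shown in the output above and an `oc` session that is already logged in. The function name `wait_for_gpu_plugin` and the 120-second default timeout are illustrative choices.

```shell
# Sketch: block until the console plugin Deployment has finished rolling out.
# `wait_for_gpu_plugin` and the default timeout are hypothetical choices.
wait_for_gpu_plugin() {
  namespace="${1:-nvidia-gpu-operator}"
  timeout="${2:-120s}"
  # Returns non-zero if the rollout does not complete within the timeout.
  oc -n "$namespace" rollout status deployment/console-plugin-nvidia-gpu --timeout="$timeout"
}
```

For example, `wait_for_gpu_plugin nvidia-gpu-operator 300s` waits up to five minutes.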
- Check which plugins are currently enabled in the OpenShift console configuration, and whether console-plugin-nvidia-gpu is among them.
$ oc get consoles.operator.openshift.io cluster --output=jsonpath="{.spec.plugins}"
["networking-console-plugin","monitoring-plugin"]
- Then run one of the following commands, depending on the result.
# If the plugins list is empty or not set
$ oc patch consoles.operator.openshift.io cluster --patch '{ "spec": { "plugins": ["console-plugin-nvidia-gpu"] } }' --type=merge
# If the plugins list already contains other plugins (as in the output above)
$ oc patch consoles.operator.openshift.io cluster --patch '[{"op": "add", "path": "/spec/plugins/-", "value": "console-plugin-nvidia-gpu" }]' --type=json
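The choice between the two patch commands matters: a merge patch replaces the whole `spec.plugins` list, so it is only safe when the list is empty or unset, whereas the JSON patch appends to an existing list and keeps the plugins already enabled. The decision can be sketched as a small shell function that takes the output of the jsonpath query as its argument. The function name `choose_plugin_patch` and its three return words are hypothetical.

```shell
# Sketch: decide which patch style to use from the jsonpath output.
# Prints "none"  if console-plugin-nvidia-gpu is already enabled,
#        "merge" if the plugins list is empty or unset,
#        "json"  if other plugins are present and must be preserved.
# `choose_plugin_patch` is a hypothetical helper name.
choose_plugin_patch() {
  plugins="$1"
  case "$plugins" in
    *'"console-plugin-nvidia-gpu"'*) echo "none" ;;
    ''|'[]'|'null') echo "merge" ;;
    *) echo "json" ;;
  esac
}
```

A possible call is `choose_plugin_patch "$(oc get consoles.operator.openshift.io cluster --output=jsonpath='{.spec.plugins}')"`, after which the matching `oc patch` command is run.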
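- Refresh the OpenShift console home page. It now shows the status of the detected GPUs along with their monitoring data.
- Go to the “Compute” - “GPUs” menu and select a physical GPU on the page to view its monitoring metrics. Below the metrics, the page also lists the workloads running on that GPU.
Install the NVIDIA DCGM Dashboard
- Run the following command to enable dcgmExporter by pointing it at the console-plugin-nvidia-gpu metrics configuration in the ClusterPolicy.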
$ oc patch clusterpolicies.nvidia.com gpu-cluster-policy --patch '{ "spec": { "dcgmExporter": { "config": { "name": "console-plugin-nvidia-gpu" } } } }' --type=merge
- Download the dcgmExporter dashboard definition file, then create a ConfigMap from it and label it.
$ curl -LfO https://github.com/NVIDIA/dcgm-exporter/raw/main/grafana/dcgm-exporter-dashboard.json
$ oc create configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed --from-file=dcgm-exporter-dashboard.json
# To enable the dashboard in the Administrator view in the OpenShift Web UI
$ oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/dashboard=true"
# To enable the dashboard for the developer view
$ oc label configmap nvidia-dcgm-exporter-dashboard -n openshift-config-managed "console.openshift.io/odc-dashboard=true"
$ oc get cm nvidia-dcgm-exporter-dashboard -n openshift-config-managed --show-labels
NAME DATA AGE LABELS
nvidia-dcgm-exporter-dashboard 1 31s console.openshift.io/dashboard=true,console.openshift.io/odc-dashboard=true
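To check the labels in a script rather than by eye, the LABELS column can be tested for each expected key=value pair. The following is a minimal sketch in plain shell that works on the comma-separated label string shown above. The function name `has_label` is hypothetical.

```shell
# Sketch: succeed if a comma-separated label string contains an exact
# key=value pair. `has_label` is a hypothetical helper name.
has_label() {
  labels="$1"
  wanted="$2"
  # Wrap both sides in commas so only whole entries match.
  case ",$labels," in
    *",$wanted,"*) return 0 ;;
    *) return 1 ;;
  esac
}
```

For instance, `has_label "$labels" "console.openshift.io/dashboard=true"` succeeds for the label string in the output above.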
- Finally, go to the “Observe” - “Dashboards” menu in the OpenShift console and select NVIDIA DCGM Exporter Dashboard from the dashboard list to view the collected GPU metrics.
References
https://docs.nvidia.com/datacenter/cloud-native/openshift/latest/enable-gpu-monitoring-dashboard.html
https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/openshift/enable-gpu-op-dashboard.html
https://github.com/stratus-ss/openshift-ai/blob/main/docs/rendered/OpenShift_AI_User_Interface.md
https://github.com/rh-aiservices-bu/rhoai-uwm/tree/main/rhoai-uwm-grafana/overlays/rhoai-uwm-user-grafana-app
https://ai-on-openshift.io/odh-rhoai/kserve-uwm-dashboard-metrics/