This article documents the process of deploying the full DeepSeek-R1 model on two 8-GPU H100 servers, along with the problems encountered along the way.

Test environment

GPU info
8× H100 GPUs, 80 GB of memory each

NIC info
mlx5_3 and mlx5_4 are 100G management NICs; the other eight are 400G NICs used for the RoCE network

# ibv_devices
    device                 node GUID
    ------              ----------------
    mlx5_0              adffe98933p9efjl
    mlx5_1              adffe98933p9efjl
    mlx5_2              adffe98933p9efjl
    mlx5_3              adffe98933p9efjl
    mlx5_4              adffe98933p9efjl
    mlx5_5              adffe98933p9efjl
    mlx5_6              adffe98933p9efjl
    mlx5_7              adffe98933p9efjl
    mlx5_8              adffe98933p9efjl
    mlx5_9              adffe98933p9efjl
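To tell which mlx5_* device is which physical NIC, it helps to map each RDMA device to its Ethernet interface. MLNX_OFED ships ibdev2netdev for exactly this; the sysfs loop below is a minimal sketch that needs no extra tools (device names on your host may differ):

```shell
# List every RDMA device and the netdev it is bound to, so the two
# 100G management NICs can be told apart from the 400G RoCE NICs.
for d in /sys/class/infiniband/*/; do
  if [ ! -d "$d" ]; then echo "no RDMA devices found on this host"; break; fi
  dev=$(basename "$d")
  # the mlx5 driver exposes the bound netdev under device/net/
  netif=$(ls "$d/device/net" 2>/dev/null)
  echo "$dev -> ${netif:-<no netdev>}"
done
```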

Software versions
NIC driver: 24.10-1.1.4
GPU driver: 550.54.14
CUDA: 12.4
OS: ubuntu-22.04
kernel: 5.15.0-92-generic
pytorch: 2.5.1
NCCL: 2.21.5
openmpi: 4.1.3
sglang: 0.4.4.post1

Running nccl-tests

With the NIC and GPU drivers installed, first check that nccl-tests runs across both machines and that the measured inter-node bandwidth meets expectations.
Build
Download nccl-2.21.5-1 and the master branch of nccl-tests, then build both:

cd /root/nccl-2.21.5-1
make -j24

cd /root/nccl-tests-master
make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/root/nccl-2.21.5-1/build

Run
Use mpirun to launch nccl-tests across the two machines:

#cat /root/hostfile
10.0.0.1 slots=8
10.0.0.2 slots=8

mpirun --allow-run-as-root --mca btl_tcp_if_include eth0 --hostfile /root/hostfile -x LD_LIBRARY_PATH=/root/nccl-2.21.5-1/build/lib -x NCCL_IB_HCA=^=mlx5_3,mlx5_4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_GID_INDEX=3 /root/nccl-tests-master/build/all_reduce_perf -b 4G -e 4G -f 2 -g 1

LD_LIBRARY_PATH=/root/nccl-2.21.5-1/build/lib: path to the NCCL library built above
NCCL_IB_HCA=^=mlx5_3,mlx5_4: exclude the management NICs mlx5_3 and mlx5_4, so the eight 400G NICs are used
NCCL_SOCKET_IFNAME=eth0: interface used for NCCL bootstrap communication between GPUs; eth0 is the interface of mlx5_3
NCCL_IB_GID_INDEX=3: GID index used for RDMA traffic; index 3 usually corresponds to RoCE v2 over IPv4 and can be checked with the show_gids command
The -b and -e arguments of all_reduce_perf should be set fairly large, otherwise the peak bandwidth cannot be measured
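When show_gids is not installed, the mlx5 driver exposes the same GID information in sysfs. A small sketch to confirm what GID index 3 maps to, assuming device mlx5_0 port 1 (adjust the names for your host):

```shell
# Read the GID type and address for one index straight from sysfs.
DEV=mlx5_0
PORT=1
IDX=3
TYPE_FILE=/sys/class/infiniband/$DEV/ports/$PORT/gid_attrs/types/$IDX
if [ -r "$TYPE_FILE" ]; then
  echo "$DEV gid $IDX type: $(cat "$TYPE_FILE")"   # expect: RoCE v2
  echo "$DEV gid $IDX addr: $(cat /sys/class/infiniband/$DEV/ports/$PORT/gids/$IDX)"
else
  echo "device $DEV not present on this host"
fi
```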

Problems
A few problems came up while running nccl-tests; the fixes are described below, and these settings are best applied right after every boot. When debugging, turn on NCCL logging via the environment variables below, or search for the error keywords in the NCCL GitHub issues

export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
  1. Make sure the nvidia-fabricmanager service is running
systemctl status nvidia-fabricmanager
  2. Load nvidia-peermem so the NICs can use GPUDirect RDMA; without it, bandwidth stays low
    See: https://pavlokhmel.com/enable-gpudirect-rdma-and-benchmark-with-perftest-nccl-test-nvidia-hpcg-and-pytorch-resnet50.html
modprobe nvidia_peermem
  3. Disable ACS
    See:
    https://github.com/NVIDIA/nccl/issues/214
    https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
for BDF in `lspci -d "*:*:*" | awk '{print $1}'`; do
  # skip if it doesn't support ACS
  sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
  if [ $? -ne 0 ]; then
    continue
  fi
  sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done
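After the loop above, it is worth confirming that no bridge still reports ACS as enabled; an `SrcValid+` flag in an ACSCtl line means the write did not take effect on that device:

```shell
# Dump the ACS control lines of all PCI devices; every line should show
# "SrcValid-" once ACS is disabled. Run as root for complete output.
lspci -vvv 2>/dev/null | grep -i "ACSCtl" \
  || echo "no ACSCtl lines visible (need root, or no ACS-capable bridges)"
```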

Installing sglang

Install
This article uses the sglang inference engine for testing. As before, install sglang and its dependencies in a conda virtual environment:

# create the test virtual environment
conda create -n test python=3.10
# activate the test virtual environment
conda activate test

# install sglang
pip3 install sglang==0.4.4.post1
# install torch
pip3 install torch==2.5.1
# the packages below were all reported missing at runtime; they are listed here in one go
pip3 install Pillow orjson uvicorn uvloop fastapi psutil vllm sgl_kernel decord pynvml torchao pyzmq
pip3 install transformers==4.48.3
pip3 install flashinfer_python
conda install libsqlite=3.48.0
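Before launching the server, a quick import smoke test catches any still-missing dependency up front rather than mid-startup (a sketch; the version prints are just informational):

```shell
# Import the core packages and print their versions; a missing
# dependency is reported instead of raising a traceback later.
python3 - <<'EOF'
try:
    import sglang, torch, transformers
    print("sglang", sglang.__version__,
          "| torch", torch.__version__,
          "| transformers", transformers.__version__)
except ImportError as e:
    print("missing package:", e.name)
EOF
```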

Launch
Place the model under /root/DeepSeek-R1-671B

# set these environment variables on both machines
# sglang uses both the nccl and gloo distributed backends of pytorch
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=^=mlx5_3,mlx5_4
export NCCL_IB_GID_INDEX=3

# run on machine 1
python3 -m sglang.launch_server --model /root/DeepSeek-R1-671B --tp 16 --nccl-init-addr 10.0.0.1:3000 --nnodes 2 --node-rank 0 --trust-remote-code
# run on machine 2
python3 -m sglang.launch_server --model /root/DeepSeek-R1-671B --tp 16 --nccl-init-addr 10.0.0.1:3000 --nnodes 2 --node-rank 1 --trust-remote-code
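Once both ranks finish loading the weights, the server on node 1 answers HTTP requests. A quick sanity check, assuming sglang's default port 30000 (no --port was passed above) and its native /get_model_info and /generate endpoints:

```shell
# Query the model info, then send one short generation request.
curl -s http://10.0.0.1:30000/get_model_info || echo "server not reachable yet"
curl -s http://10.0.0.1:30000/generate \
  -H "Content-Type: application/json" \
  -d '{"text": "Hello, my name is", "sampling_params": {"max_new_tokens": 16}}' \
  || echo "server not reachable yet"
```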

Benchmarking
Download the dataset:

wget https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json 

Run the benchmark:

python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 64 --random-output 512 --random-range-ratio 1 --num-prompts 1000 --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json

Benchmark results

============ Serving Benchmark Result ============
Backend:                                 sglang
Traffic request rate:                    inf
Max request concurrency:                 not set
Successful requests:                     1000
Benchmark duration (s):                  177.61
Total input tokens:                      64000
Total generated tokens:                  512000
Total generated tokens (retokenized):    510866
Request throughput (req/s):              5.63
Input token throughput (tok/s):          360.34
Output token throughput (tok/s):         2882.70
Total token throughput (tok/s):          3243.04
Concurrency:                             662.91
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   117740.19
Median E2E Latency (ms):                 88516.13
---------------Time to First Token----------------
Mean TTFT (ms):                          28372.62
Median TTFT (ms):                        4495.75
P99 TTFT (ms):                           90799.23
---------------Inter-Token Latency----------------
Mean ITL (ms):                           174.89
Median ITL (ms):                         157.05
P95 ITL (ms):                            179.40
P99 ITL (ms):                            372.32
Max ITL (ms):                            27183.86
==================================================
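The headline numbers can be recomputed from the raw totals, which is a useful cross-check when comparing runs; they agree with the report above up to rounding of the printed duration. The concurrency line assumes it follows Little's law (mean E2E latency × request rate), which matches the reported value here:

```shell
# Re-derive the throughput figures from the raw benchmark totals.
DURATION=177.61      # Benchmark duration (s)
PROMPTS=1000         # Successful requests
IN_TOKENS=64000      # Total input tokens
OUT_TOKENS=512000    # Total generated tokens
MEAN_E2E_MS=117740.19
awk -v d=$DURATION -v p=$PROMPTS -v i=$IN_TOKENS -v o=$OUT_TOKENS -v e=$MEAN_E2E_MS 'BEGIN {
  printf "req/s      : %.2f\n", p / d            # ~5.63
  printf "in tok/s   : %.2f\n", i / d            # ~360.34
  printf "out tok/s  : %.2f\n", o / d            # ~2882.72
  printf "total tok/s: %.2f\n", (i + o) / d      # ~3243.06
  printf "concurrency: %.1f\n", (e / 1000) * (p / d)  # ~662.9
}'
```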