Deploying the Full DeepSeek-R1-671B on Two 8x H100 Nodes
This post documents the process of deploying the full DeepSeek-R1 on two 8-GPU H100 servers, along with the problems encountered along the way.
Test Environment
GPU Information
8x H100 GPUs with 80 GB of memory each
NIC Information
mlx5_3 and mlx5_4 are 100G management NICs; the other eight are 400G NICs used for the RoCE network.
# ibv_devices
device node GUID
------ ----------------
mlx5_0 adffe98933p9efjl
mlx5_1 adffe98933p9efjl
mlx5_2 adffe98933p9efjl
mlx5_3 adffe98933p9efjl
mlx5_4 adffe98933p9efjl
mlx5_5 adffe98933p9efjl
mlx5_6 adffe98933p9efjl
mlx5_7 adffe98933p9efjl
mlx5_8 adffe98933p9efjl
mlx5_9 adffe98933p9efjl
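To see which Ethernet interface each mlx5 device maps to (useful below when setting NCCL_SOCKET_IFNAME), the ibdev2netdev tool shipped with the Mellanox driver can be used; the output line shown is only illustrative:
# map RDMA devices to Ethernet interfaces
ibdev2netdev
# example output line: mlx5_3 port 1 ==> eth0 (Up)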
Software Versions
NIC driver: 24.10-1.1.4
GPU driver: 550.54.14
CUDA: 12.4
OS: ubuntu-22.04
kernel: 5.15.0-92-generic
pytorch: 2.5.1
NCCL: 2.21.5
openmpi: 4.1.3
sglang: 0.4.4.post1
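A quick way to double-check the versions above on each node (a minimal sketch; adjust paths to your installation):
# GPU driver and CUDA toolkit versions
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -1
nvcc --version | grep release
# PyTorch and its bundled NCCL version
python3 -c "import torch; print(torch.__version__, torch.cuda.nccl.version())"
# MLNX_OFED (NIC driver) version
ofed_info -s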
Running nccl-test
With the NIC and GPU drivers installed, first check that nccl-test runs across both machines and verify that the inter-node bandwidth meets expectations.
Build
Download nccl-2.21.5-1 and the master branch of nccl-tests, then build them:
cd /root/nccl-2.21.5-1
make -j24
cd /root/nccl-tests-master
make -j MPI=1 MPI_HOME=/usr/lib/x86_64-linux-gnu/openmpi CUDA_HOME=/usr/local/cuda NCCL_HOME=/root/nccl-2.21.5-1/build
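Before moving to two nodes, a single-node run is a useful sanity check for the build (not part of the original flow; -g 8 uses the local 8 GPUs in one process):
LD_LIBRARY_PATH=/root/nccl-2.21.5-1/build/lib /root/nccl-tests-master/build/all_reduce_perf -b 8 -e 4G -f 2 -g 8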
Run
Use mpirun to run nccl-test across the two machines:
# cat /root/hostfile
10.0.0.1 slots=8
10.0.0.2 slots=8
mpirun --allow-run-as-root --mca btl_tcp_if_include eth0 --hostfile /root/hostfile -x LD_LIBRARY_PATH=/root/nccl-2.21.5-1/build/lib -x NCCL_IB_HCA=^=mlx5_3,mlx5_4 -x NCCL_SOCKET_IFNAME=eth0 -x NCCL_IB_GID_INDEX=3 /root/nccl-tests-master/build/all_reduce_perf -b 4G -e 4G -f 2 -g 1
LD_LIBRARY_PATH=/root/nccl-2.21.5-1/build/lib: path to the NCCL library built above
NCCL_IB_HCA=^=mlx5_3,mlx5_4: excludes the management NICs mlx5_3 and mlx5_4 so that the other eight 400G NICs are used
NCCL_SOCKET_IFNAME=eth0: the interface NCCL uses for bootstrap communication between GPUs; eth0 corresponds to mlx5_3
NCCL_IB_GID_INDEX=3: the NIC GID index used for RDMA traffic, usually 3, which corresponds to RoCE v2 over IPv4; it can be inspected with the show_gids command
The -b and -e arguments of all_reduce_perf should be set fairly large, otherwise the peak bandwidth will not be reached.
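For reference, GID indexes can be listed with the show_gids script shipped with MLNX_OFED; the row below is illustrative, not taken from the test machines:
# show_gids mlx5_0
# DEV     PORT  INDEX  GID                                      IPv4      VER  DEV
# mlx5_0  1     3      0000:0000:0000:0000:0000:ffff:0a00:0101  10.0.1.1  v2   eth2
# INDEX 3 with VER v2 plus an IPv4 address means RoCE v2 over IPv4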
Problems
A few problems came up while running nccl-test; the fixes are described below, and these settings are best applied right after boot. When hitting a problem, enable NCCL logging with the environment variables below, or search for the error keywords in the NCCL GitHub issues.
export NCCL_DEBUG=INFO
export NCCL_DEBUG_SUBSYS=ALL
- Make sure the nvidia-fabricmanager service is running
systemctl status nvidia-fabricmanager
- Load nvidia-peermem to enable GPUDirect RDMA on the NICs; without it, performance will not reach the expected level
See: https://pavlokhmel.com/enable-gpudirect-rdma-and-benchmark-with-perftest-nccl-test-nvidia-hpcg-and-pytorch-resnet50.html
modprobe nvidia_peermem
- Disable ACS (see the persistence sketch after the script below)
See:
https://github.com/NVIDIA/nccl/issues/214
https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/troubleshooting.html
for BDF in $(lspci -d "*:*:*" | awk '{print $1}'); do
    # skip if the device doesn't support ACS
    sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w > /dev/null 2>&1
    if [ $? -ne 0 ]; then
        continue
    fi
    sudo setpci -v -s ${BDF} ECAP_ACS+0x6.w=0000
done
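These settings do not survive a reboot. One possible way to persist them (a sketch; the file names are my own choices, not from the original post):
# load nvidia_peermem automatically at boot
echo nvidia_peermem > /etc/modules-load.d/nvidia_peermem.conf
# start the fabric manager at boot
systemctl enable nvidia-fabricmanager
# for ACS, save the loop above as e.g. /usr/local/bin/disable-acs.sh and
# invoke it from a systemd oneshot service or /etc/rc.local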
Installing sglang
Install
This post uses the sglang inference engine for testing. As before, a conda virtual environment is used to install the sglang-related packages.
# create the test virtual environment
conda create -n test python=3.10
# enter the test virtual environment
conda activate test
# install sglang
pip3 install sglang==0.4.4.post1
# install torch
pip3 install torch==2.5.1
# the packages below were all reported missing during execution; they are listed here in one go
pip3 install Pillow orjson uvicorn uvloop fastapi psutil vllm sgl_kernel decord pynvml torchao pyzmq
pip3 install transformers==4.48.3
pip3 install flashinfer_python
conda install libsqlite=3.48.0
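A quick import check before launching, to confirm the environment is intact (a minimal sketch):
python3 -c "import sglang; print(sglang.__version__)"
python3 -c "import torch; print(torch.cuda.is_available(), torch.cuda.device_count())"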
Launch
Place the model under /root/DeepSeek-R1-671B.
# set these environment variables on both machines
# sglang uses both PyTorch's NCCL and Gloo distributed communication backends
export GLOO_SOCKET_IFNAME=eth0
export NCCL_SOCKET_IFNAME=eth0
export NCCL_IB_HCA=^=mlx5_3,mlx5_4
export NCCL_IB_GID_INDEX=3
# run on machine 1
python3 -m sglang.launch_server --model /root/DeepSeek-R1-671B --tp 16 --nccl-init-addr 10.0.0.1:3000 --nnodes 2 --node-rank 0 --trust-remote-code
# run on machine 2
python3 -m sglang.launch_server --model /root/DeepSeek-R1-671B --tp 16 --nccl-init-addr 10.0.0.1:3000 --nnodes 2 --node-rank 1 --trust-remote-code
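Once both ranks are up, the server can be smoke-tested from either machine. This assumes sglang's default listen port of 30000; the prompt is arbitrary:
curl http://10.0.0.1:30000/generate -H "Content-Type: application/json" -d '{"text": "Hello, who are you?", "sampling_params": {"max_new_tokens": 32}}'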
Benchmarking
Download the dataset
wget https://hf-mirror.com/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
Run the benchmark
python3 -m sglang.bench_serving --backend sglang --dataset-name random --random-input 64 --random-output 512 --random-range-ratio 1 --num-prompts 1000 --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json
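Note that with --dataset-name random the ShareGPT file only serves as a token source, while the request shapes come from the --random-* flags. To benchmark on the real ShareGPT prompt/response length distribution instead, bench_serving's sharegpt dataset option should work along these lines:
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 1000 --dataset-path /root/ShareGPT_V3_unfiltered_cleaned_split.json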
Benchmark results
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: inf
Max request concurrency: not set
Successful requests: 1000
Benchmark duration (s): 177.61
Total input tokens: 64000
Total generated tokens: 512000
Total generated tokens (retokenized): 510866
Request throughput (req/s): 5.63
Input token throughput (tok/s): 360.34
Output token throughput (tok/s): 2882.70
Total token throughput (tok/s): 3243.04
Concurrency: 662.91
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 117740.19
Median E2E Latency (ms): 88516.13
---------------Time to First Token----------------
Mean TTFT (ms): 28372.62
Median TTFT (ms): 4495.75
P99 TTFT (ms): 90799.23
---------------Inter-Token Latency----------------
Mean ITL (ms): 174.89
Median ITL (ms): 157.05
P95 ITL (ms): 179.40
P99 ITL (ms): 372.32
Max ITL (ms): 27183.86
==================================================