
Local Deployment of the Full DeepSeek-R1/V3 (A Detailed Walkthrough)
TL;DR: 1. So far, SGLang's optimizations for the full DeepSeek-V3 and R1 are not as large as expected. 2. Multi-node deployment (TP parallelism) only slightly reduces per-GPU memory, and not linearly: from about 93 GB to about 83 GB per card. 3. The officially recommended performance options showed no obvious speedup. 4. Whether multiple nodes can truly run TP and DP parallelism at the same time remains doubtful.
Deploying the Full DeepSeek-R1/V3
Single-Node Deployment
Environment
Machine | OS | CUDA |
---|---|---|
H20*8 | Ubuntu 22.04 | 12.4 |
Model Download
In our test, even with 1000 Mbps of elastic public bandwidth (about 120 MB/s of download speed), the download still took nearly 2 hours, so we strongly recommend downloading the model from ModelScope.
apt install git-lfs
git lfs install
git lfs clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1.git
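If git lfs clone gets interrupted, one alternative is the ModelScope Python SDK's snapshot download. This is only a sketch: it assumes a recent modelscope package that exposes snapshot_download, and /mnt/model is an example cache path.
# Alternative download via the ModelScope SDK
pip install modelscope
python3 -c "from modelscope import snapshot_download; snapshot_download('deepseek-ai/DeepSeek-R1', cache_dir='/mnt/model')"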
Installing and Starting Docker
The images and containers take a fair amount of disk space, so we recommend changing Docker's storage location up front to avoid being unable to operate containers later because the disk fills up.
# Add the Docker package repository
sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common
sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository -y "deb [arch=$(dpkg --print-architecture)] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
# Install Docker CE, the containerd.io container runtime, and the Docker Buildx and Compose plugins
sudo apt-get -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Change Docker's storage path (the key is "data-root"; very old Docker versions use the legacy key "graph")
vim /etc/docker/daemon.json
{
  "data-root": "/mnt/docker"
}
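If the Docker daemon is already running when you edit daemon.json, restart it so the new data-root takes effect, then verify the storage path (a quick check):
# Restart Docker so the new data-root takes effect (only needed if the daemon is already running)
sudo systemctl restart docker
# Verify the storage path
sudo docker info | grep "Docker Root Dir"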
# Install nvidia-container-toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
&& curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get install -y nvidia-container-toolkit
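On some machines the toolkit installation alone does not register the NVIDIA runtime with Docker; the following step, taken from NVIDIA's container-toolkit install guide, is usually also needed before --gpus all works:
# Register the NVIDIA runtime with Docker
sudo nvidia-ctk runtime configure --runtime=docker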
# Start Docker
sudo systemctl start docker
# Enable the Docker daemon to start automatically on boot
sudo systemctl enable docker
Downloading and Starting the SGLang Container
Here we pull the image from the Alibaba Cloud registry.
# Pull the SGLang container image
sudo docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207
# Start the SGLang container
sudo docker run -t -d --name="sglang-test" --ipc=host --cap-add=SYS_PTRACE --network=host --gpus all --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v /mnt/model:/mnt egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207
# Enter the container
docker exec -it sglang-test bash
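Once inside the container, it is worth confirming that all 8 GPUs are visible before going further:
# Inside the container: confirm all GPUs are visible
nvidia-smi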
Parameter Configuration
# Go to where the model directory is mapped inside the container (this depends on where you downloaded the model)
cd /mnt/DeepSeek-R1
# Edit the generation config (changing the model parameters is not recommended; keep the defaults)
vim generation_config.json
Parameter | Description | Recommended value |
---|---|---|
_from_model_config | whether the generation config was loaded from the model's default config | keep default |
bos_token_id | token ID that marks the start of generated text | keep default |
eos_token_id | token ID that marks the end of generated text | keep default |
do_sample | whether sampling is enabled | keep default |
temperature | sampling temperature, controls the randomness of generation | 0.6 |
top_p | cumulative probability threshold for nucleus sampling | 0.95 |
transformers_version | version of the Transformers library used by the model | keep default |
If you want to change SGLang's own parameter configuration, you can refer to the following.
# Find where the sglang files are located
find / -name "*sglang*"
# From the results, locate the sglang library installed in the container (the overlay2 hash below will differ on your machine)
cd /mnt/docker/overlay2/066029ada9b53919beaa9949840d89c881d38c18a995cb489676266274ee4018/diff/usr/local/lib/python3.10/dist-packages/sglang
# View the argument definitions (recommended: only read this file, and pass parameters on the launch command line instead)
vim srt/server_args.py
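Instead of editing server_args.py, you can also list every supported launch flag directly and pass the ones you need on the command line, as done in the launch step below:
# List all SGLang server arguments
python3 -m sglang.launch_server --help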
Starting the Model Service
The first startup takes quite a while, so it is best to run it in the background (see the nohup sketch after the launch commands below).
# Start the service
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --port 30000 --mem-fraction-static 0.9 --tp 8 --trust-remote-code
# Allow external requests (bind to all interfaces)
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --port 30000 --mem-fraction-static 0.9 --tp 8 --trust-remote-code --host 0.0.0.0
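A minimal way to keep the service running in the background and follow the startup log, assuming /mnt/sglang.log as an example log path:
# Launch in the background and follow the log (log path is just an example)
nohup python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --port 30000 --mem-fraction-static 0.9 --tp 8 --trust-remote-code --host 0.0.0.0 > /mnt/sglang.log 2>&1 &
tail -f /mnt/sglang.log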
Inference Test
# curl example
curl http://localhost:30000/generate \
-H "Content-Type: application/json" \
-d '{
"text": "deepseek中有几个e?",
"sampling_params": {
"max_new_tokens": 3000,
"temperature": 0
}
}'
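Before moving on to a client, two quick sanity checks can confirm the server is up. /v1/models is part of the OpenAI-compatible API; the /health endpoint is assumed to be available in this SGLang version:
# Quick sanity checks on the running server
curl http://localhost:30000/health
curl http://localhost:30000/v1/models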
# OpenAI-compatible API
import openai
client = openai.Client(
    base_url="http://localhost:30000/v1", api_key="EMPTY")  # replace localhost with the public IP for remote access (open the corresponding IP and port in the security group)
# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(response)
Performance Evaluation
DeepSeek-V3
The offline evaluation takes roughly 20 minutes.
Download the dataset:
git lfs clone https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split.git
# Enter the sglang-test container
docker exec -it sglang-test bash
# Run the DeepSeek-V3 offline throughput benchmark (stop the inference service first)
python3 -m sglang.bench_offline_throughput --model-path /mnt/DeepSeek-V3 --trust-remote-code --tensor-parallel-size 8 --num-prompts 4 --dataset-name random --random-input 5000 --random-output 1000 --random-range-ratio 1.0 --disable-radix-cache --prefill-only-one-req True --mem-fraction-static 0.9 --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json
# DeepSeek-V3 offline benchmark result
====== Offline Throughput Benchmark Result =======
Backend: engine
Successful requests: 4
Benchmark duration (s): 54.80
Total input tokens: 20000
Total generated tokens: 4000
Request throughput (req/s): 0.07
Input token throughput (tok/s): 364.98
Output token throughput (tok/s): 73.00
Total token throughput (tok/s): 437.98
==================================================
The online (serving) benchmark results below are provided for reference only.
# Enter the sglang-test container
docker exec -it sglang-test bash
# Run the DeepSeek-V3 online (serving) benchmark
python3 -m sglang.bench_serving --backend sglang --model /mnt/DeepSeek-V3 --port 30000 --dataset-name random --request-rate-range 1,2,4,8 --random-input 1024 --random-output 128 --random-range-ratio 1.0 --multi --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json
# DeepSeek-V3 output:
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='/mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json', model='/mnt/DeepSeek-V3', tokenizer=None, num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, max_concurrency=None, multi=True, request_rate_range='1,2,4,8', output_file=None, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, lora_name=None, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256)
#Input tokens: 1024000
#Output tokens: 128000
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [16:38<00:00, 1.00it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 1
Max reqeuest concurrency: not set
Successful requests: 1000
Benchmark duration (s): 998.79
Total input tokens: 1024000
Total generated tokens: 128000
Total generated tokens (retokenized): 127187
Request throughput (req/s): 1.00
Input token throughput (tok/s): 1025.24
Output token throughput (tok/s): 128.16
Total token throughput (tok/s): 1153.40
Concurrency: 12.50
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 12485.21
Median E2E Latency (ms): 12397.13
---------------Time to First Token----------------
Mean TTFT (ms): 496.60
Median TTFT (ms): 415.34
P99 TTFT (ms): 1085.47
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 94.40
Median TPOT (ms): 93.63
P99 TPOT (ms): 128.08
---------------Inter-token Latency----------------
Mean ITL (ms): 94.41
Median ITL (ms): 69.16
P99 ITL (ms): 376.26
==================================================
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [09:01<00:00, 1.85it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 2
Max reqeuest concurrency: not set
Successful requests: 1000
Benchmark duration (s): 541.18
Total input tokens: 1024000
Total generated tokens: 128000
Total generated tokens (retokenized): 127206
Request throughput (req/s): 1.85
Input token throughput (tok/s): 1892.16
Output token throughput (tok/s): 236.52
Total token throughput (tok/s): 2128.68
Concurrency: 44.26
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 23951.96
Median E2E Latency (ms): 24760.97
---------------Time to First Token----------------
Mean TTFT (ms): 828.14
Median TTFT (ms): 609.77
P99 TTFT (ms): 4613.55
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 182.08
Median TPOT (ms): 188.13
P99 TPOT (ms): 210.65
---------------Inter-token Latency----------------
Mean ITL (ms): 182.10
Median ITL (ms): 102.29
P99 ITL (ms): 932.91
==================================================
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [07:11<00:00, 2.32it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 4
Max reqeuest concurrency: not set
Successful requests: 1000
Benchmark duration (s): 431.12
Total input tokens: 1024000
Total generated tokens: 128000
Total generated tokens (retokenized): 127197
Request throughput (req/s): 2.32
Input token throughput (tok/s): 2375.21
Output token throughput (tok/s): 296.90
Total token throughput (tok/s): 2672.11
Concurrency: 271.39
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 117002.49
Median E2E Latency (ms): 120702.03
---------------Time to First Token----------------
Mean TTFT (ms): 96414.64
Median TTFT (ms): 100729.00
P99 TTFT (ms): 197647.17
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 162.11
Median TPOT (ms): 155.82
P99 TPOT (ms): 187.84
---------------Inter-token Latency----------------
Mean ITL (ms): 162.33
Median ITL (ms): 103.53
P99 ITL (ms): 1457.23
==================================================
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [07:02<00:00, 2.37it/s]
============ Serving Benchmark Result ============
Backend: sglang
Traffic request rate: 8
Max reqeuest concurrency: not set
Successful requests: 1000
Benchmark duration (s): 422.48
Total input tokens: 1024000
Total generated tokens: 128000
Total generated tokens (retokenized): 127166
Request throughput (req/s): 2.37
Input token throughput (tok/s): 2423.77
Output token throughput (tok/s): 302.97
Total token throughput (tok/s): 2726.74
Concurrency: 372.94
----------------End-to-End Latency----------------
Mean E2E Latency (ms): 157559.86
Median E2E Latency (ms): 158652.08
---------------Time to First Token----------------
Mean TTFT (ms): 140367.45
Median TTFT (ms): 142588.55
P99 TTFT (ms): 288377.24
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms): 135.37
Median TPOT (ms): 123.44
P99 TPOT (ms): 179.09
---------------Inter-token Latency----------------
Mean ITL (ms): 135.44
Median ITL (ms): 103.35
P99 ITL (ms): 386.42
==================================================
DeepSeek-R1
# Enter the sglang-test container
docker exec -it sglang-test bash
# Run the DeepSeek-R1 offline throughput benchmark
python3 -m sglang.bench_offline_throughput --model-path /mnt/DeepSeek-R1 --trust-remote-code --tensor-parallel-size 8 --num-prompts 4 --dataset-name random --random-input 5000 --random-output 1000 --random-range-ratio 1.0 --disable-radix-cache --prefill-only-one-req True --mem-fraction-static 0.9 --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json
server_args=ServerArgs(model_path='/mnt/DeepSeek-R1', tokenizer_path='/mnt/DeepSeek-R1', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='/mnt/DeepSeek-R1', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='127.0.0.1', port=30000, mem_fraction_static=0.9, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=True, tp_size=8, stream_interval=1, stream_output=False, random_seed=422651568, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=True, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False)
# DeepSeek-R1 offline benchmark result
====== Offline Throughput Benchmark Result =======
Backend: engine
Successful requests: 4
Benchmark duration (s): 54.86
Total input tokens: 20000
Total generated tokens: 4000
Request throughput (req/s): 0.07
Input token throughput (tok/s): 364.53
Output token throughput (tok/s): 72.91
Total token throughput (tok/s): 437.44
==================================================
Configuring WebUI Access
To access the model from a browser, you can set up Open WebUI. The steps below follow the Alibaba Cloud documentation.
# Pull the base image used to run Open WebUI.
sudo docker pull alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/python:3.11.1
# Start the Open WebUI service.
# Set the model service address
OPENAI_API_BASE_URL=http://127.0.0.1:30000/v1
# Create the data directory; make sure it exists and lives under /mnt
sudo mkdir -p /mnt/open-webui-data
# Start the open-webui service
# Watch the system disk space; 100 GB or more is recommended
sudo docker run -d -t --network=host --name open-webui \
-e ENABLE_OLLAMA_API=False \
-e OPENAI_API_BASE_URL=${OPENAI_API_BASE_URL} \
-e DATA_DIR=/mnt/open-webui-data \
-e HF_HUB_OFFLINE=1 \
-v /mnt/open-webui-data:/mnt/open-webui-data \
alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/python:3.11.1 \
/bin/bash -c "pip config set global.index-url http://mirrors.cloud.aliyuncs.com/pypi/simple/ && \
pip config set install.trusted-host mirrors.cloud.aliyuncs.com && \
pip install --upgrade pip && \
pip install open-webui && \
mkdir -p /usr/local/lib/python3.11/site-packages/google/colab && \
open-webui serve"
# Run the following command to monitor the installation progress in real time and wait for it to finish.
sudo docker logs -f open-webui
# Look for a message like the following in the log output:
INFO: Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
# This indicates the service has started successfully and is listening on port 8080.
# From your local machine, open http://<public IP>:8080 in a browser; on first login, create an admin account as prompted.
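Before opening the page from your local machine, you can first confirm on the server that Open WebUI is actually listening on port 8080 (a quick check):
# Confirm Open WebUI is listening on port 8080
curl -sI http://127.0.0.1:8080 | head -n 1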
Multi-Node Deployment
The prerequisites for multi-node deployment are the same as for single-node deployment: every node needs the model downloaded and the corresponding Docker image started, and then the following commands are run inside each node's container (example: two H20 nodes with 8 GPUs each).
# node 1
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# node 2
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
# Additional notes
# Start the master node first, then the worker node
# --dist-init-addr must be set to the master node's IP on every node
# After the service is up, send requests to the master node's IP
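If the nodes have multiple network interfaces, cross-node initialization can hang or pick the wrong NIC. A commonly used workaround (not something tested in this writeup) is to pin the NCCL/GLOO socket interface before launching on every node; the interface name eth0 below is only a placeholder, check yours with ip addr:
# Optional: pin cross-node communication to a specific NIC (run on every node before launching; eth0 is a placeholder)
export NCCL_SOCKET_IFNAME=eth0
export GLOO_SOCKET_IFNAME=eth0
# Optional: verbose NCCL logs for debugging cross-node startup
export NCCL_DEBUG=INFO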
Inference Test
import openai
client = openai.Client(
    base_url="http://<Master IP>:<port>/v1", api_key="EMPTY")
# Set base_url to your actual configuration: Master IP is the master node's IP address, and port is the --port value from the launch command
# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "Who are you?"},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(response)
Performance Options
Option | Description | Test result |
---|---|---|
--enable-dp-attention | Enables data-parallel attention, intended for high-QPS scenarios; further reduces memory usage and improves throughput | Errors out when enabled together with TP parallelism |
--enable-torch-compile | A major PyTorch 2.0 feature that compiles and optimizes the model ahead of time | Adds about 5 extra hours to the first service startup; no obvious performance gain afterwards |
--torch-compile-max-bs | Maximum batch size optimized by torch.compile | Usually used together with --enable-torch-compile; works best at small batch sizes (recommended 1-8); no obvious gain in our tests |
# node 1
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --enable-torch-compile --torch-compile-max-bs 1 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000
# node 2
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --enable-torch-compile --torch-compile-max-bs 1 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000
Summary
1. So far, SGLang's optimizations for the full DeepSeek-V3 and R1 are not as large as expected.
2. Multi-node deployment (TP parallelism) only slightly reduces per-GPU memory usage, and not linearly: from about 93 GB to about 83 GB per card.
3. The officially recommended performance options showed no obvious performance improvement.
4. Whether multi-node deployment truly supports running TP and DP parallelism at the same time remains doubtful.