Deploying Full-Scale DeepSeek-R1/V3

Single-Node Deployment

Environment

| Machine type | OS | CUDA |
| --- | --- | --- |
| H20 * 8 | Ubuntu 22.04 | 12.4 |


Model Download

In our test, even with 1000 Mbps of elastic bandwidth (about 120 MB/s of download throughput), the download still takes nearly 2 hours, so we strongly recommend pulling the model from ModelScope.

apt install git-lfs
git lfs install
git lfs clone https://www.modelscope.cn/deepseek-ai/DeepSeek-R1.git
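
If git-lfs is slow or unavailable, the ModelScope Python SDK can be used instead. A minimal sketch, assuming the modelscope package is installed (pip install modelscope) and that /mnt/model is the directory you intend to mount into the container later; the exact subdirectory layout under cache_dir may differ, so check the printed path:

from modelscope import snapshot_download

# Download the DeepSeek-R1 weights; the returned path is where the files actually land.
model_dir = snapshot_download("deepseek-ai/DeepSeek-R1", cache_dir="/mnt/model")
print("Model downloaded to:", model_dir)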


Installing and Starting Docker

Since the model requires a fair amount of disk space, we recommend changing Docker's storage location up front, so you do not run out of space and become unable to operate containers later.

# Add the Docker package repository
sudo apt-get -y install apt-transport-https ca-certificates curl software-properties-common
sudo curl -fsSL http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu/gpg | sudo apt-key add -
sudo add-apt-repository -y "deb [arch=$(dpkg --print-architecture)] http://mirrors.cloud.aliyuncs.com/docker-ce/linux/ubuntu $(lsb_release -cs) stable"
# Install Docker CE, the containerd.io runtime, and the buildx/compose plugins
sudo apt-get -y install docker-ce docker-ce-cli containerd.io docker-buildx-plugin docker-compose-plugin
# Change Docker's storage path (JSON does not allow inline comments)
vim /etc/docker/daemon.json
{
  "data-root": "/mnt/docker"
}
# "data-root" is the storage (mount) path; on very old Docker versions the key was "graph"
# Install nvidia-container-toolkit
curl -fsSL https://nvidia.github.io/libnvidia-container/gpgkey | sudo gpg --dearmor -o /usr/share/keyrings/nvidia-container-toolkit-keyring.gpg \
  && curl -s -L https://nvidia.github.io/libnvidia-container/stable/deb/nvidia-container-toolkit.list | \
    sed 's#deb https://#deb [signed-by=/usr/share/keyrings/nvidia-container-toolkit-keyring.gpg] https://#g' | \
    sudo tee /etc/apt/sources.list.d/nvidia-container-toolkit.list
sudo apt-get update
sudo apt-get install -y nvidia-container-toolkit
# Register the NVIDIA runtime with Docker so that --gpus works
sudo nvidia-ctk runtime configure --runtime=docker
# Start Docker
sudo systemctl start docker
# Enable the Docker daemon to start automatically at boot
sudo systemctl enable docker

Downloading and Starting the SGLang Container

Here we pull the image from Alibaba Cloud's registry.

# Pull the SGLang container image
sudo docker pull egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207
# Start the SGLang container
sudo docker run -t -d --name="sglang-test"  --ipc=host --cap-add=SYS_PTRACE --network=host --gpus all --privileged --ulimit memlock=-1 --ulimit stack=67108864 -v /mnt/model:/mnt egslingjun-registry.cn-wulanchabu.cr.aliyuncs.com/egslingjun/inference-nv-pytorch:preview-25.02-vllm0.6.4.post1-sglang0.4.2.post1-pytorch2.5-cuda12.4-20250207
# Enter the container
docker exec -it sglang-test bash

Parameter Configuration

# Go to where the model directory is mapped inside the container (this depends on where you downloaded it)
cd /mnt/DeepSeek-R1
# Edit the generation config (modifying the model's default parameters is not recommended; keep the defaults)
vim generation_config.json


| Parameter | Description | Recommended value |
| --- | --- | --- |
| _from_model_config | Whether defaults are loaded from the model's own config | keep default |
| bos_token_id | Token ID that marks the start of generated text | keep default |
| eos_token_id | Token ID that marks the end of generated text | keep default |
| do_sample | Whether sampling is enabled | keep default |
| temperature | Temperature; controls the randomness of generation | 0.6 |
| top_p | Cumulative probability threshold for nucleus sampling | 0.95 |
| transformers_version | Version of the Transformers library the model was saved with | keep default |
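
If you just want to confirm the values the checkpoint ships with before deciding whether to change anything, a quick sketch that reads the file at the path mounted above:

import json

# Print the generation defaults shipped with the checkpoint.
with open("/mnt/DeepSeek-R1/generation_config.json") as f:
    cfg = json.load(f)

for key in ("do_sample", "temperature", "top_p", "bos_token_id", "eos_token_id"):
    print(f"{key} = {cfg.get(key)}")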

If you want to change SGLang's own parameter configuration, you can also refer to the following.

# Locate the sglang-related files
find / -name "*sglang*"
# From the results, find where the sglang library is installed inside the container
cd /mnt/docker/overlay2/066029ada9b53919beaa9949840d89c881d38c18a995cb489676266274ee4018/diff/usr/local/lib/python3.10/dist-packages/sglang
# Open the server argument definitions (viewing only is recommended; pass parameters on the launch command line instead)
vim srt/server_args.py
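
Instead of hunting through the overlay2 path, you can also inspect the defaults from Python inside the container. A minimal sketch, assuming the ServerArgs dataclass lives in sglang.srt.server_args as the srt/server_args.py file above suggests:

import dataclasses

from sglang.srt.server_args import ServerArgs

# List every launch_server argument together with its default value.
for field in dataclasses.fields(ServerArgs):
    if field.default is not dataclasses.MISSING:
        default = field.default
    elif field.default_factory is not dataclasses.MISSING:
        default = field.default_factory()
    else:
        default = "(required)"
    print(f"{field.name:40s} {default}")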

Starting the Model Service

The first startup takes quite a while, so it is recommended to run the service in the background.

# Start the service
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --port 30000 --mem-fraction-static 0.9 --tp 8 --trust-remote-code
# Allow external requests by binding to all interfaces
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --port 30000 --mem-fraction-static 0.9 --tp 8 --trust-remote-code --host 0.0.0.0
# To keep the service alive after the shell exits, wrap the command with nohup, e.g.:
# nohup python3 -m sglang.launch_server ... > /mnt/sglang.log 2>&1 &
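
Because loading the weights across 8 GPUs can take a long time, it helps to poll the server before sending real traffic. A minimal readiness check, assuming the OpenAI-compatible /v1/models route is available (the same API family used in the test below):

import time

import requests

BASE_URL = "http://localhost:30000"  # replace with the public IP when checking remotely


def wait_until_ready(timeout_s: int = 3600, interval_s: int = 15) -> bool:
    """Poll the OpenAI-compatible model list endpoint until the server answers."""
    deadline = time.time() + timeout_s
    while time.time() < deadline:
        try:
            r = requests.get(f"{BASE_URL}/v1/models", timeout=5)
            if r.status_code == 200:
                print("Server is ready:", r.json())
                return True
        except requests.RequestException:
            pass  # server still starting up
        time.sleep(interval_s)
    return False


if __name__ == "__main__":
    wait_until_ready()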


Inference Test

# curl example
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
    "text": "How many times does the letter e appear in deepseek?",
    "sampling_params": {
      "max_new_tokens": 3000,
      "temperature": 0
    }
  }'
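
The same native /generate call can be issued from Python with requests, which is convenient for scripting; a sketch mirroring the curl payload above:

import requests

# Python equivalent of the curl request against SGLang's native /generate endpoint.
resp = requests.post(
    "http://localhost:30000/generate",
    json={
        "text": "How many times does the letter e appear in deepseek?",
        "sampling_params": {"max_new_tokens": 3000, "temperature": 0},
    },
    timeout=600,
)
print(resp.json())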
# OpenAI-compatible API
import openai
client = openai.Client(
    base_url="http://localhost:30000/v1", api_key="EMPTY")  # change localhost to the public IP for remote access (the security group must allow that IP and port)

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "你是谁"},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(response)
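
For interactive use you may prefer tokens streamed back as they are generated rather than waiting for the whole reply; a sketch using the same OpenAI-compatible endpoint:

import openai

client = openai.Client(base_url="http://localhost:30000/v1", api_key="EMPTY")

# Stream the reply token by token instead of waiting for the full completion.
stream = client.chat.completions.create(
    model="default",
    messages=[{"role": "user", "content": "Introduce yourself in one sentence."}],
    temperature=0.6,
    max_tokens=256,
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content:
        print(chunk.choices[0].delta.content, end="", flush=True)
print()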


Performance Evaluation

DeepSeek-V3

The offline evaluation takes roughly 20 minutes.

# Download the dataset
git lfs clone https://www.modelscope.cn/datasets/gliang1001/ShareGPT_V3_unfiltered_cleaned_split.git
# Enter the sglang-test container
docker exec -it sglang-test bash
# Run the DeepSeek-V3 offline throughput benchmark (stop the running inference service first)
python3 -m sglang.bench_offline_throughput --model-path  /mnt/DeepSeek-V3  --trust-remote-code  --tensor-parallel-size 8  --num-prompts 4  --dataset-name random --random-input 5000 --random-output 1000 --random-range-ratio 1.0  --disable-radix-cache  --prefill-only-one-req True    --mem-fraction-static 0.9  --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json  

# DeepSeek-V3 offline benchmark results
====== Offline Throughput Benchmark Result =======
Backend:                                 engine    
Successful requests:                     4         
Benchmark duration (s):                  54.80     
Total input tokens:                      20000     
Total generated tokens:                  4000      
Request throughput (req/s):              0.07      
Input token throughput (tok/s):          364.98    
Output token throughput (tok/s):         73.00     
Total token throughput (tok/s):          437.98    
==================================================

The online evaluation results below are provided for reference only.

# Enter the sglang-test container
docker exec -it sglang-test bash
# Run the DeepSeek-V3 online (serving) benchmark
python3 -m sglang.bench_serving --backend sglang --model /mnt/DeepSeek-V3 --port 30000 --dataset-name random --request-rate-range 1,2,4,8 --random-input 1024 --random-output 128 --random-range-ratio 1.0 --multi --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json 

# DeepSeek-V3 serving benchmark output:
Namespace(backend='sglang', base_url=None, host='0.0.0.0', port=30000, dataset_name='random', dataset_path='/mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json', model='/mnt/DeepSeek-V3', tokenizer=None, num_prompts=1000, sharegpt_output_len=None, sharegpt_context_len=None, random_input_len=1024, random_output_len=128, random_range_ratio=1.0, request_rate=inf, max_concurrency=None, multi=True, request_rate_range='1,2,4,8', output_file=None, disable_tqdm=False, disable_stream=False, return_logprob=False, seed=1, disable_ignore_eos=False, extra_request_body=None, apply_chat_template=False, profile=False, lora_name=None, gsp_num_groups=64, gsp_prompts_per_group=16, gsp_system_prompt_len=2048, gsp_question_len=128, gsp_output_len=256)

#Input tokens: 1024000
#Output tokens: 128000
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [16:38<00:00,  1.00it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    1         
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  998.79    
Total input tokens:                      1024000   
Total generated tokens:                  128000    
Total generated tokens (retokenized):    127187    
Request throughput (req/s):              1.00      
Input token throughput (tok/s):          1025.24   
Output token throughput (tok/s):         128.16    
Total token throughput (tok/s):          1153.40   
Concurrency:                             12.50     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   12485.21  
Median E2E Latency (ms):                 12397.13  
---------------Time to First Token----------------
Mean TTFT (ms):                          496.60    
Median TTFT (ms):                        415.34    
P99 TTFT (ms):                           1085.47   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          94.40     
Median TPOT (ms):                        93.63     
P99 TPOT (ms):                           128.08    
---------------Inter-token Latency----------------
Mean ITL (ms):                           94.41     
Median ITL (ms):                         69.16     
P99 ITL (ms):                            376.26    
==================================================
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [09:01<00:00,  1.85it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    2         
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  541.18    
Total input tokens:                      1024000   
Total generated tokens:                  128000    
Total generated tokens (retokenized):    127206    
Request throughput (req/s):              1.85      
Input token throughput (tok/s):          1892.16   
Output token throughput (tok/s):         236.52    
Total token throughput (tok/s):          2128.68   
Concurrency:                             44.26     
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   23951.96  
Median E2E Latency (ms):                 24760.97  
---------------Time to First Token----------------
Mean TTFT (ms):                          828.14    
Median TTFT (ms):                        609.77    
P99 TTFT (ms):                           4613.55   
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          182.08    
Median TPOT (ms):                        188.13    
P99 TPOT (ms):                           210.65    
---------------Inter-token Latency----------------
Mean ITL (ms):                           182.10    
Median ITL (ms):                         102.29    
P99 ITL (ms):                            932.91    
==================================================
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [07:11<00:00,  2.32it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    4         
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  431.12    
Total input tokens:                      1024000   
Total generated tokens:                  128000    
Total generated tokens (retokenized):    127197    
Request throughput (req/s):              2.32      
Input token throughput (tok/s):          2375.21   
Output token throughput (tok/s):         296.90    
Total token throughput (tok/s):          2672.11   
Concurrency:                             271.39    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   117002.49 
Median E2E Latency (ms):                 120702.03 
---------------Time to First Token----------------
Mean TTFT (ms):                          96414.64  
Median TTFT (ms):                        100729.00 
P99 TTFT (ms):                           197647.17 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          162.11    
Median TPOT (ms):                        155.82    
P99 TPOT (ms):                           187.84    
---------------Inter-token Latency----------------
Mean ITL (ms):                           162.33    
Median ITL (ms):                         103.53    
P99 ITL (ms):                            1457.23   
==================================================
Starting initial single prompt test run...
Initial test run completed. Starting main benchmark run...
100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [07:02<00:00,  2.37it/s]

============ Serving Benchmark Result ============
Backend:                                 sglang    
Traffic request rate:                    8         
Max reqeuest concurrency:                not set   
Successful requests:                     1000      
Benchmark duration (s):                  422.48    
Total input tokens:                      1024000   
Total generated tokens:                  128000    
Total generated tokens (retokenized):    127166    
Request throughput (req/s):              2.37      
Input token throughput (tok/s):          2423.77   
Output token throughput (tok/s):         302.97    
Total token throughput (tok/s):          2726.74   
Concurrency:                             372.94    
----------------End-to-End Latency----------------
Mean E2E Latency (ms):                   157559.86 
Median E2E Latency (ms):                 158652.08 
---------------Time to First Token----------------
Mean TTFT (ms):                          140367.45 
Median TTFT (ms):                        142588.55 
P99 TTFT (ms):                           288377.24 
-----Time per Output Token (excl. 1st token)------
Mean TPOT (ms):                          135.37    
Median TPOT (ms):                        123.44    
P99 TPOT (ms):                           179.09    
---------------Inter-token Latency----------------
Mean ITL (ms):                           135.44    
Median ITL (ms):                         103.35    
P99 ITL (ms):                            386.42    
==================================================
DeepSeek-R1

# Enter the sglang-test container
docker exec -it sglang-test bash
# Run the DeepSeek-R1 offline throughput benchmark
python3 -m sglang.bench_offline_throughput --model-path  /mnt/DeepSeek-R1  --trust-remote-code  --tensor-parallel-size 8  --num-prompts 4  --dataset-name random --random-input 5000 --random-output 1000 --random-range-ratio 1.0  --disable-radix-cache  --prefill-only-one-req True    --mem-fraction-static 0.9  --dataset-path /mnt/ShareGPT_V3_unfiltered_cleaned_split/ShareGPT_V3_unfiltered_cleaned_split.json

server_args=ServerArgs(model_path='/mnt/DeepSeek-R1', tokenizer_path='/mnt/DeepSeek-R1', tokenizer_mode='auto', load_format='auto', trust_remote_code=True, dtype='auto', kv_cache_dtype='auto', quantization_param_path=None, quantization=None, context_length=None, device='cuda', served_model_name='/mnt/DeepSeek-R1', chat_template=None, is_embedding=False, revision=None, skip_tokenizer_init=False, host='127.0.0.1', port=30000, mem_fraction_static=0.9, max_running_requests=None, max_total_tokens=None, chunked_prefill_size=8192, max_prefill_tokens=16384, schedule_policy='lpm', schedule_conservativeness=1.0, cpu_offload_gb=0, prefill_only_one_req=True, tp_size=8, stream_interval=1, stream_output=False, random_seed=422651568, constrained_json_whitespace_pattern=None, watchdog_timeout=300, download_dir=None, base_gpu_id=0, log_level='info', log_level_http=None, log_requests=False, show_time_cost=False, enable_metrics=False, decode_log_interval=40, api_key=None, file_storage_pth='sglang_storage', enable_cache_report=False, dp_size=1, load_balance_method='round_robin', ep_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', lora_paths=None, max_loras_per_batch=8, lora_backend='triton', attention_backend='flashinfer', sampling_backend='flashinfer', grammar_backend='outlines', speculative_draft_model_path=None, speculative_algorithm=None, speculative_num_steps=5, speculative_num_draft_tokens=64, speculative_eagle_topk=8, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, disable_radix_cache=True, disable_jump_forward=False, disable_cuda_graph=False, disable_cuda_graph_padding=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, disable_mla=False, disable_overlap_schedule=False, enable_mixed_chunk=False, enable_dp_attention=False, enable_ep_moe=False, enable_torch_compile=False, torch_compile_max_bs=32, cuda_graph_max_bs=160, cuda_graph_bs=None, torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, allow_auto_truncate=False, enable_custom_logit_processor=False, tool_call_parser=None, enable_hierarchical_cache=False)

# DeepSeek-R1 offline benchmark results
====== Offline Throughput Benchmark Result =======
Backend:                                 engine    
Successful requests:                     4         
Benchmark duration (s):                  54.86     
Total input tokens:                      20000     
Total generated tokens:                  4000      
Request throughput (req/s):              0.07      
Input token throughput (tok/s):          364.53    
Output token throughput (tok/s):         72.91     
Total token throughput (tok/s):          437.44    
==================================================

Configuring Web UI Access

To access the model through a browser, you can set up Open WebUI. The configuration steps can also be found in the Alibaba Cloud documentation.

# Pull the Python base image used to run Open WebUI.
sudo docker pull alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/python:3.11.1

# Start the Open WebUI service.

# Set the model service address
OPENAI_API_BASE_URL=http://127.0.0.1:30000/v1


# Create the data directory; make sure it exists and is located under /mnt
sudo mkdir -p /mnt/open-webui-data

# Start the open-webui service
# Watch the system-disk usage; at least 100 GB of free space is recommended
sudo docker run -d -t --network=host --name open-webui \
-e ENABLE_OLLAMA_API=False \
-e OPENAI_API_BASE_URL=${OPENAI_API_BASE_URL} \
-e DATA_DIR=/mnt/open-webui-data \
-e HF_HUB_OFFLINE=1 \
-v /mnt/open-webui-data:/mnt/open-webui-data \
alibaba-cloud-linux-3-registry.cn-hangzhou.cr.aliyuncs.com/alinux3/python:3.11.1 \
/bin/bash -c "pip config set global.index-url http://mirrors.cloud.aliyuncs.com/pypi/simple/ && \
pip config set install.trusted-host mirrors.cloud.aliyuncs.com && \
pip install --upgrade pip && \
pip install open-webui && \
mkdir -p /usr/local/lib/python3.11/site-packages/google/colab && \
open-webui serve"

# Run the following command to monitor the installation progress in real time and wait for it to finish.
sudo docker logs -f open-webui

# Look for a message like the following in the log output:
INFO:     Uvicorn running on http://0.0.0.0:8080 (Press CTRL+C to quit)
# This indicates the service has started successfully and is listening on port 8080.
# From a local machine, open http://<public IP>:8080 in a browser; on first login, follow the prompts to create an admin account.

Multi-Node Deployment

The prerequisites for multi-node deployment are the same as for the single-node case: every node must download the model and start the corresponding Docker image, and then the commands below are run inside each node's container (example: two H20 nodes, each with 8 GPUs).

# node 1
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000


# node 2
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000

# Notes
# Start the master node first, then the other node(s)
# --dist-init-addr must point to the master node's IP on every node
# Once the service is up, send requests to the master node's IP


Inference Test

import openai
client = openai.Client(
    base_url="http://<Master IP>:<port>/v1", api_key="EMPTY")
# Adjust base_url to your deployment: Master IP is the master node's IP address, and port is the --port value from the launch command

# Chat completion
response = client.chat.completions.create(
    model="default",
    messages=[
        {"role": "system", "content": "You are a helpful AI assistant"},
        {"role": "user", "content": "你是谁"},
    ],
    temperature=0.6,
    max_tokens=1024,
)
print(response)


Performance Tuning Parameters

| Flag | Description | Test result |
| --- | --- | --- |
| --enable-dp-attention | Enables data-parallel attention, intended for high-QPS scenarios; further lowers memory usage and raises throughput | Errors out when enabled together with TP parallelism |
| --enable-torch-compile | A headline PyTorch 2.0 feature; compiles and optimizes the model ahead of time | Adds roughly 5 hours to the first service start; no obvious performance gain once the service is up |
| --torch-compile-max-bs | Maximum batch size covered by torch.compile optimization | Usually used together with --enable-torch-compile; works best at small batch sizes (1-8 recommended); no obvious performance gain in our tests |

The launch commands used for these tests:
# node 1
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --enable-torch-compile --torch-compile-max-bs 1 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 0 --trust-remote-code --host 0.0.0.0 --port 40000 



# node 2
python3 -m sglang.launch_server --model-path /mnt/DeepSeek-R1 --tp 16 --enable-torch-compile --torch-compile-max-bs 1 --dist-init-addr <Master IP>:20000 --nnodes 2 --node-rank 1 --trust-remote-code --host 0.0.0.0 --port 40000  


Summary

1. So far, SGLang's optimizations for full-scale DeepSeek-V3 and R1 are not as large as one might expect.
2. Multi-node deployment (TP parallelism) lowers per-GPU memory usage slightly (not linearly): from about 93 GB per GPU to about 83 GB per GPU.
3. The officially recommended performance-tuning options did not show a clear improvement in our tests.
4. Whether multi-node deployment truly supports running TP and DP parallelism at the same time remains doubtful.
