使用SGLang实现Qwen3.6-27B模型推理

javahust

284人浏览 · 2026-04-23 16:54:45

javahust · 2026-04-23 16:54:45 发布

启动指令

docker run -it --rm \
  --name sglang-qwen36-27b \
  --gpus all \
  --shm-size 16GB \
  -p 30082:8000 \
  -v /data/models:/data/models \
  docker.m.daocloud.io/lmsysorg/sglang:v0.5.10 \
  python3 -m sglang.launch_server \
  --model-path /data/models/Qwen3.6-27B \
  --served-model-name Qwen3.6-27B \
  --host 0.0.0.0 \
  --port 8000 \
  --tp-size 2 \
  --context-length 1000 \
  --mem-fraction-static 0.8

启动日志

==========
== CUDA ==
==========

CUDA Version 12.9.1

Container image Copyright (c) 2016-2023, NVIDIA CORPORATION & AFFILIATES. All rights reserved.

This container image and its contents are governed by the NVIDIA Deep Learning Container License.
By pulling and using the container, you accept the terms and conditions of this license:
https://developer.nvidia.com/ngc/nvidia-deep-learning-container-license

A copy of this license is made available in this container at /NGC-DL-CONTAINER-LICENSE for your convenience.

/sgl-workspace/sglang/python/sglang/launch_server.py:51: UserWarning: 'python -m sglang.launch_server' is still supported, but 'sglang serve' is the recommended entrypoint.
  Example: sglang serve --model-path <model> [options]
  warnings.warn(
Disabling overlap schedule since mamba no_buffer is not compatible with overlap schedule, try to use --disable-radix-cache if overlap schedule is necessary
/sgl-workspace/sglang/python/sglang/srt/entrypoints/http_server.py:172: FastAPIDeprecationWarning: ORJSONResponse is deprecated, FastAPI now serializes data directly to JSON bytes via Pydantic when a return type or response model is set, which is faster and doesn't need a custom response class. Read more in the FastAPI docs: https://fastapi.tiangolo.com/advanced/custom-response/#orjson-or-response-model and https://fastapi.tiangolo.com/tutorial/response-model/
  from sglang.srt.utils.json_response import (
[2026-04-23 08:36:15] server_args=ServerArgs(model_path='/data/models/Qwen3.6-27B', tokenizer_path='/data/models/Qwen3.6-27B', tokenizer_mode='auto', tokenizer_worker_num=1, skip_tokenizer_init=False, load_format='auto', model_loader_extra_config='{}', trust_remote_code=False, context_length=1000, is_embedding=False, enable_multimodal=None, revision=None, model_impl='auto', host='0.0.0.0', port=8000, fastapi_root_path='', grpc_mode=False, skip_server_warmup=False, warmups=None, nccl_port=None, checkpoint_engine_wait_weights_before_ready=False, ssl_keyfile=None, ssl_certfile=None, ssl_ca_certs=None, ssl_keyfile_password=None, enable_ssl_refresh=False, dtype='auto', quantization=None, quantization_param_path=None, kv_cache_dtype='auto', enable_fp32_lm_head=False, modelopt_quant=None, modelopt_checkpoint_restore_path=None, modelopt_checkpoint_save_path=None, modelopt_export_path=None, quantize_and_serve=False, rl_quant_profile=None, mem_fraction_static=0.8, max_running_requests=None, max_queued_requests=None, max_total_tokens=None, chunked_prefill_size=8192, enable_dynamic_chunking=False, max_prefill_tokens=16384, prefill_max_requests=None, schedule_policy='fcfs', enable_priority_scheduling=False, disable_priority_preemption=False, default_priority_value=None, abort_on_priority_when_disabled=False, schedule_low_priority_values_first=False, priority_scheduling_preemption_threshold=10, schedule_conservativeness=1.0, page_size=1, swa_full_tokens_ratio=0.8, disable_hybrid_swa_memory=False, radix_eviction_policy='lru', enable_prefill_delayer=False, prefill_delayer_max_delay_passes=30, prefill_delayer_token_usage_low_watermark=None, prefill_delayer_forward_passes_buckets=None, prefill_delayer_wait_seconds_buckets=None, device='cuda', tp_size=2, pp_size=1, pp_max_micro_batch_size=None, pp_async_batch_depth=0, stream_interval=1, stream_response_default_include_usage=False, incremental_streaming_output=False, enable_streaming_session=False, random_seed=765992415, constrained_json_whitespace_pattern=None, constrained_json_disable_any_whitespace=False, watchdog_timeout=300, soft_watchdog_timeout=None, dist_timeout=None, download_dir=None, model_checksum=None, base_gpu_id=0, gpu_id_step=1, sleep_on_idle=False, use_ray=False, custom_sigquit_handler=None, log_level='info', log_level_http=None, log_requests=False, log_requests_level=2, log_requests_format='text', log_requests_target=None, uvicorn_access_log_exclude_prefixes=[], crash_dump_folder=None, show_time_cost=False, enable_metrics=False, enable_mfu_metrics=False, enable_metrics_for_all_schedulers=False, tokenizer_metrics_custom_labels_header='x-custom-labels', tokenizer_metrics_allowed_custom_labels=None, extra_metric_labels=None, bucket_time_to_first_token=None, bucket_inter_token_latency=None, bucket_e2e_request_latency=None, collect_tokens_histogram=False, prompt_tokens_buckets=None, generation_tokens_buckets=None, gc_warning_threshold_secs=0.0, decode_log_interval=40, enable_request_time_stats_logging=False, kv_events_config=None, enable_trace=False, otlp_traces_endpoint='localhost:4317', export_metrics_to_file=False, export_metrics_to_file_dir=None, api_key=None, admin_api_key=None, served_model_name='Qwen3.6-27B', weight_version='default', chat_template=None, hf_chat_template_name=None, completion_template=None, file_storage_path='sglang_storage', enable_cache_report=False, reasoning_parser=None, tool_call_parser=None, tool_server=None, sampling_defaults='model', dp_size=1, load_balance_method='round_robin', attn_cp_size=1, moe_dp_size=1, dist_init_addr=None, nnodes=1, node_rank=0, json_model_override_args='{}', preferred_sampling_params=None, enable_lora=None, enable_lora_overlap_loading=None, max_lora_rank=None, lora_target_modules=None, lora_paths=None, max_loaded_loras=None, max_loras_per_batch=8, lora_eviction_policy='lru', lora_backend='csgmv', max_lora_chunk_size=16, experts_shared_outer_loras=None, attention_backend='fa3', decode_attention_backend=None, prefill_attention_backend=None, sampling_backend='flashinfer', grammar_backend='xgrammar', mm_attention_backend=None, fp8_gemm_runner_backend='auto', fp4_gemm_runner_backend='auto', nsa_prefill_backend=None, nsa_decode_backend=None, disable_flashinfer_autotune=False, mamba_backend='triton', speculative_algorithm=None, speculative_draft_model_path=None, speculative_draft_model_revision=None, speculative_draft_load_format=None, speculative_num_steps=None, speculative_eagle_topk=None, speculative_num_draft_tokens=None, speculative_accept_threshold_single=1.0, speculative_accept_threshold_acc=1.0, speculative_token_map=None, speculative_attention_mode='prefill', speculative_draft_attention_backend=None, speculative_moe_runner_backend='auto', speculative_moe_a2a_backend=None, speculative_draft_model_quantization=None, speculative_ngram_min_bfs_breadth=1, speculative_ngram_max_bfs_breadth=10, speculative_ngram_match_type='BFS', speculative_ngram_max_trie_depth=18, speculative_ngram_capacity=10000000, enable_multi_layer_eagle=False, ep_size=1, moe_a2a_backend='none', moe_runner_backend='auto', flashinfer_mxfp4_moe_precision='default', enable_flashinfer_allreduce_fusion=False, enforce_disable_flashinfer_allreduce_fusion=False, enable_aiter_allreduce_fusion=False, deepep_mode='auto', ep_num_redundant_experts=0, ep_dispatch_algorithm=None, init_expert_location='trivial', enable_eplb=False, eplb_algorithm='auto', eplb_rebalance_num_iterations=1000, eplb_rebalance_layers_per_chunk=None, eplb_min_rebalancing_utilization_threshold=1.0, expert_distribution_recorder_mode=None, expert_distribution_recorder_buffer_size=1000, enable_expert_distribution_metrics=False, deepep_config=None, moe_dense_tp_size=None, elastic_ep_backend=None, enable_elastic_expert_backup=False, mooncake_ib_device=None, max_mamba_cache_size=None, mamba_ssm_dtype=None, mamba_full_memory_ratio=0.9, mamba_scheduler_strategy='no_buffer', mamba_track_interval=256, linear_attn_backend='triton', linear_attn_decode_backend=None, linear_attn_prefill_backend=None, enable_hierarchical_cache=False, hicache_ratio=2.0, hicache_size=0, hicache_write_policy='write_through', hicache_io_backend='kernel', hicache_mem_layout='layer_first', hicache_storage_backend=None, hicache_storage_prefetch_policy='best_effort', hicache_storage_backend_extra_config=None, enable_hisparse=False, hisparse_config=None, enable_lmcache=False, kt_weight_path=None, kt_method='AMXINT4', kt_cpuinfer=None, kt_threadpool_count=2, kt_num_gpu_experts=None, kt_max_deferred_experts_per_token=None, dllm_algorithm=None, dllm_algorithm_config=None, enable_double_sparsity=False, ds_channel_config_path=None, ds_heavy_channel_num=32, ds_heavy_token_num=256, ds_heavy_channel_type='qk', ds_sparse_decode_threshold=4096, cpu_offload_gb=0, offload_group_size=-1, offload_num_in_group=1, offload_prefetch_step=1, offload_mode='cpu', multi_item_scoring_delimiter=None, disable_radix_cache=False, cuda_graph_max_bs=256, cuda_graph_bs=[1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 104, 112, 120, 128, 136, 144, 152, 160, 168, 176, 184, 192, 200, 208, 216, 224, 232, 240, 248, 256], disable_cuda_graph=False, disable_cuda_graph_padding=False, enable_profile_cuda_graph=False, enable_cudagraph_gc=False, enable_layerwise_nvtx_marker=False, enable_nccl_nvls=False, enable_symm_mem=False, disable_flashinfer_cutlass_moe_fp4_allgather=False, enable_tokenizer_batch_encode=False, disable_tokenizer_batch_decode=False, disable_outlines_disk_cache=False, disable_custom_all_reduce=False, enable_mscclpp=False, enable_torch_symm_mem=False, pre_warm_nccl=False, disable_overlap_schedule=True, enable_mixed_chunk=False, enable_dp_attention=False, enable_dp_lm_head=False, enable_two_batch_overlap=False, enable_single_batch_overlap=False, tbo_token_distribution_threshold=0.48, enable_torch_compile=False, disable_piecewise_cuda_graph=True, enforce_piecewise_cuda_graph=False, enable_torch_compile_debug_mode=False, torch_compile_max_bs=32, piecewise_cuda_graph_max_tokens=8192, piecewise_cuda_graph_tokens=[4, 8, 12, 16, 20, 24, 28, 32, 48, 64, 80, 96, 112, 128, 144, 160, 176, 192, 208, 224, 240, 256, 288, 320, 352, 384, 416, 448, 480, 512, 576, 640, 704, 768, 832, 896, 960, 1024, 1280, 1536, 1792, 2048, 2304, 2560, 2816, 3072, 3328, 3584, 3840, 4096, 4608, 5120, 5632, 6144, 6656, 7168, 7680, 8192], piecewise_cuda_graph_compiler='eager', torchao_config='', enable_nan_detection=False, enable_p2p_check=False, triton_attention_reduce_in_fp32=False, triton_attention_num_kv_splits=8, triton_attention_split_tile_size=None, num_continuous_decode_steps=1, delete_ckpt_after_loading=False, enable_memory_saver=False, enable_weights_cpu_backup=False, enable_draft_weights_cpu_backup=False, allow_auto_truncate=False, enable_custom_logit_processor=False, flashinfer_mla_disable_ragged=False, disable_shared_experts_fusion=False, disable_chunked_prefix_cache=False, disable_fast_image_processor=False, keep_mm_feature_on_device=False, enable_return_hidden_states=False, enable_return_routed_experts=False, scheduler_recv_interval=1, numa_node=None, enable_deterministic_inference=False, rl_on_policy_target=None, enable_attn_tp_input_scattered=False, gc_threshold=None, enable_nsa_prefill_context_parallel=False, nsa_prefill_cp_mode='round-robin-split', enable_fused_qk_norm_rope=False, enable_precise_embedding_interpolation=False, enable_fused_moe_sum_all_reduce=False, enable_prefill_context_parallel=False, prefill_cp_mode='in-seq-split', enable_dynamic_batch_tokenizer=False, dynamic_batch_tokenizer_batch_size=32, dynamic_batch_tokenizer_batch_timeout=0.002, debug_tensor_dump_output_folder=None, debug_tensor_dump_layers=None, debug_tensor_dump_input_file=None, debug_tensor_dump_inject=False, disaggregation_mode='null', disaggregation_transfer_backend='mooncake', disaggregation_bootstrap_port=8998, disaggregation_ib_device=None, disaggregation_decode_enable_offload_kvcache=False, num_reserved_decode_tokens=512, disaggregation_decode_polling_interval=1, encoder_only=False, language_only=False, encoder_transfer_backend='zmq_to_scheduler', encoder_urls=[], enable_adaptive_dispatch_to_encoder=False, custom_weight_loader=[], weight_loader_disable_mmap=False, remote_instance_weight_loader_seed_instance_ip=None, remote_instance_weight_loader_seed_instance_service_port=None, remote_instance_weight_loader_send_weights_group_ports=None, remote_instance_weight_loader_backend='nccl', remote_instance_weight_loader_start_seed_via_transfer_engine=False, engine_info_bootstrap_port=6789, modelexpress_config=None, enable_pdmux=False, pdmux_config_path=None, sm_group_num=8, enable_broadcast_mm_inputs_process=False, enable_prefix_mm_cache=False, mm_enable_dp_encoder=False, mm_process_config={}, limit_mm_data_per_request=None, enable_mm_global_cache=False, decrypted_config_file=None, decrypted_draft_config_file=None, forward_hooks=None)
[2026-04-23 08:36:19] Using default HuggingFace chat template with detected content format: openai
[2026-04-23 08:36:25 TP0] Init torch distributed begin.
[2026-04-23 08:36:25 TP1] Init torch distributed begin.
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 0 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[Gloo] Rank 1 is connected to 1 peer ranks. Expected number of connected peer ranks is : 1
[2026-04-23 08:36:25 TP0] sglang is using nccl==2.28.3
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[Gloo] Rank 0 is connected to 0 peer ranks. Expected number of connected peer ranks is : 0
[2026-04-23 08:36:25 TP0] Init torch distributed ends. elapsed=0.74 s, mem usage=0.90 GB
[2026-04-23 08:36:25 TP1] Init torch distributed ends. elapsed=0.45 s, mem usage=0.90 GB
[2026-04-23 08:36:25 TP1] Load weight begin. avail mem=91.88 GB
[2026-04-23 08:36:25 TP0] Load weight begin. avail mem=90.63 GB
[2026-04-23 08:36:26 TP1] Multimodal attention backend not set. Use fa3.
[2026-04-23 08:36:26 TP1] Using fa3 as multimodal attention backend.
[2026-04-23 08:36:26 TP0] Multimodal attention backend not set. Use fa3.
[2026-04-23 08:36:26 TP0] Using fa3 as multimodal attention backend.
`torch_dtype` is deprecated! Use `dtype` instead!
`torch_dtype` is deprecated! Use `dtype` instead!
[2026-04-23 08:36:26 TP1] using attn output gate!
[2026-04-23 08:36:26 TP0] using attn output gate!
Multi-thread loading shards:  93% Completed | 14/15 [01:09<00:07,  7.10s/it][2026-04-23 08:37:43 TP1] Load weight end. elapsed=77.43 s, type=Qwen3_5ForConditionalGeneration, avail mem=66.24 GB, mem usage=25.64 GB.
Multi-thread loading shards: 100% Completed | 15/15 [01:17<00:00,  5.16s/it]
[2026-04-23 08:37:43 TP0] Load weight end. elapsed=77.66 s, type=Qwen3_5ForConditionalGeneration, avail mem=65.00 GB, mem usage=25.64 GB.
[2026-04-23 08:37:43 TP0] Using KV cache dtype: torch.bfloat16
[2026-04-23 08:37:43 TP0] Mamba Cache is allocated. max_mamba_cache_size: 309, conv_state size: 0.43GB, ssm_state size: 21.80GB 
[2026-04-23 08:37:43 TP1] Mamba Cache is allocated. max_mamba_cache_size: 309, conv_state size: 0.43GB, ssm_state size: 21.80GB 
[2026-04-23 08:37:43 TP0] KV Cache is allocated. #tokens: 809954, K size: 12.36 GB, V size: 12.36 GB
[2026-04-23 08:37:43 TP1] KV Cache is allocated. #tokens: 809954, K size: 12.36 GB, V size: 12.36 GB
[2026-04-23 08:37:43 TP1] Memory pool end. avail mem=19.27 GB
[2026-04-23 08:37:43 TP0] Memory pool end. avail mem=18.02 GB
[2026-04-23 08:37:43 TP1] Using hybrid linear attention backend for hybrid GDN models.
[2026-04-23 08:37:43 TP0] Linear attention kernel backend: decode=triton, prefill=triton
[2026-04-23 08:37:43 TP0] Using hybrid linear attention backend for hybrid GDN models.
[2026-04-23 08:37:43 TP0] GDN kernel dispatcher: decode=TritonGDNKernel, extend=TritonGDNKernel, verify=TritonGDNKernel packed_decode=True
[2026-04-23 08:37:43 TP1] Capture cuda graph begin. This can take up to several minutes. avail mem=19.17 GB
[2026-04-23 08:37:43 TP0] Capture cuda graph begin. This can take up to several minutes. avail mem=17.93 GB
[2026-04-23 08:37:43 TP0] Capture cuda graph bs [1, 2, 4, 8, 12, 16, 24, 32, 40, 48, 56, 64, 72, 80, 88, 96, 103]
Capturing batches (bs=103 avail_mem=17.83 GB):   0%|                                                              | 0/17 [00:00<?, ?it/s]2026-04-23 08:37:45,176 - CUTE_DSL - WARNING - [handle_import_error] - Unexpected error during package walk: cutlass.cute.experimental
2026-04-23 08:37:45,176 - CUTE_DSL - WARNING - [handle_import_error] - Unexpected error during package walk: cutlass.cute.experimental
[2026-04-23 08:37:45 TP0] Unexpected error during package walk: cutlass.cute.experimental
[2026-04-23 08:37:45 TP1] Unexpected error during package walk: cutlass.cute.experimental
Capturing batches (bs=1 avail_mem=17.53 GB): 100%|███████████████████████████████████████████████████████| 17/17 [00:12<00:00,  1.38it/s]
[2026-04-23 08:37:56 TP0] Registering 2193 cuda graph addresses
[2026-04-23 08:37:56 TP1] Capture cuda graph end. Time elapsed: 13.02 s. mem usage=0.41 GB. avail mem=18.76 GB.
[2026-04-23 08:37:56 TP1] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-23 08:37:56 TP0] Capture cuda graph end. Time elapsed: 13.03 s. mem usage=0.41 GB. avail mem=17.52 GB.
[2026-04-23 08:37:56 TP0] Disable piecewise CUDA graph because --disable-piecewise-cuda-graph is set
[2026-04-23 08:37:58 TP0] max_total_num_tokens=809954, chunked_prefill_size=8192, max_prefill_tokens=16384, max_running_requests=103, context_len=1000, available_gpu_mem=17.52 GB
[2026-04-23 08:37:59] INFO:     Started server process [1]
[2026-04-23 08:37:59] INFO:     Waiting for application startup.
[2026-04-23 08:37:59] Using default chat sampling params from model generation config: {'temperature': 1.0, 'top_k': 20, 'top_p': 0.95}
[2026-04-23 08:37:59] INFO:     Application startup complete.
[2026-04-23 08:37:59] INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
[2026-04-23 08:38:00] INFO:     127.0.0.1:53908 - "GET /model_info HTTP/1.1" 200 OK
[2026-04-23 08:38:12 TP0] Prefill batch, #new-seq: 1, #new-token: 80, #cached-token: 0, full token usage: 0.00, mamba usage: 0.00, #running-req: 0, #queue-req: 0, cuda graph: False, input throughput (token/s): 0.00
[2026-04-23 08:38:12] INFO:     127.0.0.1:53918 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2026-04-23 08:38:12] The server is fired up and ready to roll!

测试调用

curl -X POST http://81.70.247.xx:30082/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "Qwen3.6-27B",
    "messages": [
      {"role": "user", "content": [
        {"type": "text", "text": "你好"}
      ]}
    ],
    "max_tokens": 512,
    "stream": false
  }'

DeepSeek技术社区

欢迎加入DeepSeek 技术社区。在这里，你可以找到志同道合的朋友，共同探索AI技术的奥秘。

更多推荐

我把 Claude Code 的安全系统扒了个底朝天：四层管线 + 五层权限 + 三平台沙箱

DeepSeek技术社区

大模型选型指南：结合具体行业场景，谈谈 Claude 4.8 的长程上下文与逻辑推理优势

DeepSeek技术社区

我花了一周时间部署odysseus，对比ChatGPT/Claude的结果如下

odysseus 26天78K星，自托管AI工作空间最火项目。我花一周实际部署，对比ChatGPT/Claude/Copilot的结果：部署耗时约3小时，混合模式月费$8-12（原SaaS订阅$70+）。功能覆盖度方面，聊天和Agent功能基本覆盖SaaS方案，额外提供邮件/笔记/日历集成、本地全文搜索、多模型切换、自定义Agent定时任务。差距在于聊天流畅度、移动端缺失、文档协作功能有限。适合有