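Qwen3-4B-Thinking-GGUF Deployment Tutorial: Exposing and Visualizing vLLM Prometheus Metrics

This article shows how to automate the deployment of the Qwen3-4B-Thinking-2507-GPT-5-Codex-Distill-GGUF image on the 星图 (Xingtu) GPU platform for text generation and code writing. The image targets automated code generation and text-writing workloads: vLLM serves the model, and Prometheus and Grafana provide performance monitoring and visualization.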

飙车致死法厄同 · Published 2026-03-14 03:40:23

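1. Model Overview and Environment Setup

Qwen3-4B-Thinking-2507-GPT-5-Codex-Distill-GGUF is a text-generation model fine-tuned from Qwen3-4B-Thinking-2507. It was developed by TeichAI and is released under the Apache-2.0 license. The model was fine-tuned on 1,000 examples from OpenAI GPT-5-Codex, which makes it well suited to code generation and text-writing tasks.

1.1 System Requirements

Before deploying, make sure your system meets the following minimum requirements:

Operating system: Ubuntu 20.04 or later
GPU: NVIDIA GPU (RTX 3090 or better recommended)
GPU memory: at least 16 GB
RAM: 32 GB or more
Storage: 50 GB of free space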
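
The GPU memory requirement above can be checked programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH; the function names and the 15,000 MiB threshold are illustrative choices (a nominal 16 GB card usually reports slightly less than 16,384 MiB), not part of the original tutorial.

```python
import subprocess
from typing import List

# Assumed threshold: a nominal "16 GB" card usually reports a little under 16384 MiB.
MIN_GPU_MEMORY_MIB = 15_000


def parse_gpu_memory_mib(nvidia_smi_output: str) -> List[int]:
    """Parse one total-memory value (in MiB) per non-empty line."""
    return [int(line) for line in nvidia_smi_output.splitlines() if line.strip()]


def query_gpu_memory_mib() -> List[int]:
    """Ask nvidia-smi for the total memory of every visible GPU."""
    result = subprocess.run(
        ["nvidia-smi", "--query-gpu=memory.total", "--format=csv,noheader,nounits"],
        capture_output=True,
        text=True,
        check=True,
    )
    return parse_gpu_memory_mib(result.stdout)


def has_enough_gpu_memory(memory_mib: List[int], minimum: int = MIN_GPU_MEMORY_MIB) -> bool:
    """Return True if at least one GPU meets the minimum."""
    return any(size >= minimum for size in memory_mib)
```

On the target machine, `has_enough_gpu_memory(query_gpu_memory_mib())` returns whether at least one GPU passes the check.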

1.2 Installing the Base Environment

First, install the required system dependencies:

sudo apt update
sudo apt install -y python3-pip python3-dev git curl

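Next, install the CUDA toolkit. The repository URLs below are for Ubuntu 20.04 on x86_64; use the ones that match your distribution, and pick a CUDA version supported by your NVIDIA driver: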

wget https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/cuda-ubuntu2004.pin
sudo mv cuda-ubuntu2004.pin /etc/apt/preferences.d/cuda-repository-pin-600
sudo apt-key adv --fetch-keys https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/3bf863cc.pub
sudo add-apt-repository "deb https://developer.download.nvidia.com/compute/cuda/repos/ubuntu2004/x86_64/ /"
sudo apt update
sudo apt -y install cuda

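2. Model Deployment and Verification

2.1 Deploying the Model with vLLM

vLLM is a high-throughput inference and serving framework for large language models. It can load single-file GGUF models, although its GGUF support is still experimental, so expect lower performance than with native Hugging Face weights.

First, install vLLM: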

pip install vllm
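
Before moving on, it can be worth confirming that the package actually landed in the active Python environment. A minimal sketch using only the standard library; the helper name `installed_version` is hypothetical:

```python
from importlib import metadata
from typing import Optional


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("vllm:", installed_version("vllm") or "not installed")
```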

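Then download the model and start the server:

# Create the model directory
mkdir -p /root/workspace/models
cd /root/workspace/models

# Download the model (replace the placeholder with the actual download link)
wget [model download link] -O Qwen3-4B-Thinking-GGUF.bin

# Start the vLLM demo API server in the background.
# vLLM has no --log-file flag, so the log is captured with shell redirection.
python -m vllm.entrypoints.api_server \
    --model /root/workspace/models/Qwen3-4B-Thinking-GGUF.bin \
    --trust-remote-code \
    --port 8000 \
    > /root/workspace/llm.log 2>&1 &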

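2.2 Verifying the Service

Once the server has started, check the log to confirm that the deployment succeeded:

cat /root/workspace/llm.log

Output along the following lines means the server is up (the exact messages vary between vLLM versions):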

INFO 07-10 12:34:56 api_server.py:150] Loading model weights...
INFO 07-10 12:35:23 api_server.py:167] Model loaded successfully
INFO 07-10 12:35:23 api_server.py:189] Starting API server on port 8000
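
Reading the log by hand works, but loading the weights can take a while, so polling the server is often more convenient. The sketch below assumes the server answers `GET /health` with HTTP 200 once it is ready (vLLM's API servers provide such a route; verify it for your version). The function names, attempt count, and delay are illustrative.

```python
import time
import urllib.error
import urllib.request
from typing import Callable


def http_ok(url: str, timeout: float = 5.0) -> bool:
    """Return True if a GET request to url answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(
    url: str = "http://localhost:8000/health",  # assumed health route
    attempts: int = 60,
    delay_seconds: float = 5.0,
    probe: Callable[[str], bool] = http_ok,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll url until the probe succeeds or the attempts run out."""
    for attempt in range(attempts):
        if probe(url):
            return True
        if attempt < attempts - 1:
            sleep(delay_seconds)
    return False
```

Calling `wait_until_healthy()` blocks for up to about five minutes with the defaults and returns `False` if the server never became reachable.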

3. Exposing and Collecting Prometheus Metrics

3.1 Enabling vLLM's Prometheus Metrics

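vLLM has built-in Prometheus support, but it has no --metrics-port option. The OpenAI-compatible server exposes the metrics at the /metrics path on the same port as the API, so stop the server from section 2.1 and start the OpenAI-compatible entrypoint instead:

python -m vllm.entrypoints.openai.api_server \
    --model /root/workspace/models/Qwen3-4B-Thinking-GGUF.bin \
    --trust-remote-code \
    --port 8000 \
    > /root/workspace/llm.log 2>&1 &

The metrics are then available at http://localhost:8000/metrics. Note that this server offers OpenAI-style routes such as /v1/completions rather than the demo server's /generate.

3.2 Configuring Prometheus to Scrape the Metrics

Install Prometheus:

wget https://github.com/prometheus/prometheus/releases/download/v2.47.0/prometheus-2.47.0.linux-amd64.tar.gz
tar xvfz prometheus-2.47.0.linux-amd64.tar.gz
cd prometheus-2.47.0.linux-amd64

Edit the prometheus.yml configuration file and add the vLLM metrics endpoint under scrape_configs:

scrape_configs:
  - job_name: 'vllm'
    metrics_path: /metrics  # Prometheus default, shown for clarity
    static_configs:
      - targets: ['localhost:8000']  # vLLM serves /metrics on its API port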
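
To see what Prometheus will scrape, the endpoint can be read directly. This is a minimal sketch, assuming the metrics are served at http://localhost:8000/metrics (where vLLM's OpenAI-compatible server puts them; adjust the URL to your setup) and that samples carry no trailing timestamp. The helper names are hypothetical, and the parser only handles plain sample lines.

```python
import urllib.request
from typing import Dict


def parse_metrics(text: str, prefix: str = "vllm:") -> Dict[str, float]:
    """Parse Prometheus text-format samples whose name starts with prefix.

    Keys keep the label set, e.g. 'vllm:num_requests_running{model_name="m"}'.
    """
    samples: Dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip blanks, HELP and TYPE lines
            continue
        if not line.startswith(prefix):
            continue
        # Assumes no timestamp, so the value is the last space-separated field.
        name, _, value = line.rpartition(" ")
        samples[name] = float(value)
    return samples


def fetch_metrics(url: str = "http://localhost:8000/metrics") -> Dict[str, float]:
    """Download the metrics page and return the vLLM samples."""
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_metrics(response.read().decode("utf-8"))
```

If `fetch_metrics()` returns an empty dictionary or fails to connect, Prometheus will not get any vLLM data either, so this is a quick way to separate server problems from scrape-configuration problems.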

Start Prometheus:

./prometheus --config.file=prometheus.yml
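
Once Prometheus is running, its HTTP API can confirm that the vLLM target is being scraped. A minimal sketch, assuming Prometheus listens on its default port 9090 and the job is named `vllm` as in the configuration above; the function names are hypothetical.

```python
import json
import urllib.parse
import urllib.request
from typing import Dict


def parse_up_response(body: str) -> Dict[str, bool]:
    """Map each target instance to True (up) or False (down)."""
    payload = json.loads(body)
    if payload.get("status") != "success":
        raise ValueError(f"Prometheus query failed: {payload}")
    return {
        series["metric"].get("instance", "unknown"): series["value"][1] == "1"
        for series in payload["data"]["result"]
    }


def query_up(job: str = "vllm", prometheus: str = "http://localhost:9090") -> Dict[str, bool]:
    """Run the instant query up{job="..."} against the Prometheus HTTP API."""
    query = urllib.parse.urlencode({"query": f'up{{job="{job}"}}'})
    with urllib.request.urlopen(f"{prometheus}/api/v1/query?{query}", timeout=10) as response:
        return parse_up_response(response.read().decode("utf-8"))
```

An empty result from `query_up()` means Prometheus has no target for that job yet; a `False` value means the target is configured but the scrape is failing.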

4. Visualizing the Metrics with Grafana

4.1 Installing Grafana

sudo apt-get install -y apt-transport-https software-properties-common wget
# apt-key is deprecated, so store the Grafana signing key in a dedicated keyring
sudo mkdir -p /etc/apt/keyrings/
wget -q -O - https://apt.grafana.com/gpg.key | gpg --dearmor | sudo tee /etc/apt/keyrings/grafana.gpg > /dev/null
echo "deb [signed-by=/etc/apt/keyrings/grafana.gpg] https://apt.grafana.com stable main" | sudo tee /etc/apt/sources.list.d/grafana.list
sudo apt-get update
sudo apt-get install -y grafana
sudo systemctl start grafana-server
sudo systemctl enable grafana-server

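4.2 Configuring the Grafana Data Source

Open the Grafana UI (http://localhost:3000 by default; the initial login is admin / admin)
Add a Prometheus data source
Set the URL to http://localhost:9090
Click "Save & Test" to save the configuration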
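
The same data source can be created without the UI through Grafana's HTTP API. The following is a rough sketch, assuming a fresh installation that still uses the default admin / admin credentials and basic authentication; the function names and the data source name "Prometheus" are illustrative.

```python
import base64
import json
import urllib.request
from typing import Any, Dict


def datasource_payload(name: str = "Prometheus", url: str = "http://localhost:9090") -> Dict[str, Any]:
    """Build the JSON body for Grafana's POST /api/datasources."""
    return {"name": name, "type": "prometheus", "url": url, "access": "proxy", "isDefault": True}


def create_datasource(
    grafana: str = "http://localhost:3000",
    user: str = "admin",       # assumed default credentials
    password: str = "admin",
) -> int:
    """Create the data source and return the HTTP status code."""
    credentials = base64.b64encode(f"{user}:{password}".encode()).decode()
    request = urllib.request.Request(
        f"{grafana}/api/datasources",
        data=json.dumps(datasource_payload()).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {credentials}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

Scripting this step mainly pays off when the monitoring stack is rebuilt often; for a one-off setup the UI steps above are just as good.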

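4.3 Importing the vLLM Dashboard

vLLM provides an official Grafana dashboard template that can be imported directly:

In the Grafana UI, click "+" → "Import"
Enter dashboard ID 18674 (the official vLLM dashboard)
Select the Prometheus data source configured earlier
Click "Import" to finish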
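
If the imported dashboard lacks a panel you need, one can be added with a PromQL query. The sketch below builds a quantile query over a Prometheus histogram; the metric name `vllm:time_to_first_token_seconds` is an assumption about what your vLLM version exports, so check the /metrics output for the exact names. The helper name is hypothetical.

```python
def latency_quantile_query(metric: str, quantile: float = 0.95, window: str = "5m") -> str:
    """Build a PromQL expression for a quantile of a Prometheus histogram."""
    if not 0.0 <= quantile <= 1.0:
        raise ValueError("quantile must be between 0 and 1")
    # Histograms expose cumulative <metric>_bucket series labelled by "le".
    return f"histogram_quantile({quantile}, sum by (le) (rate({metric}_bucket[{window}])))"


# Assumed metric name; verify it against your server's /metrics output.
print(latency_quantile_query("vllm:time_to_first_token_seconds"))
```

The printed expression can be pasted into the query field of a new Grafana panel that uses the Prometheus data source.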

5. Building a Front End with Chainlit

5.1 Installing Chainlit

pip install chainlit

5.2 Creating the Chainlit App

Create a file named app.py:

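import chainlit as cl
import requests

# /generate is the endpoint of the demo server (vllm.entrypoints.api_server) from section 2.1
VLLM_URL = "http://localhost:8000/generate"


@cl.on_message
async def main(message: cl.Message):
    payload = {
        "prompt": message.content,
        "max_tokens": 1024,
        "temperature": 0.7,
    }

    # Call the vLLM API in a worker thread so the blocking HTTP request
    # does not stall Chainlit's event loop
    response = await cl.make_async(requests.post)(VLLM_URL, json=payload, timeout=300)
    response.raise_for_status()

    # The demo server returns {"text": [...]}, one string per generated
    # sequence, and each string starts with the original prompt
    text = response.json()["text"][0]
    if text.startswith(message.content):
        text = text[len(message.content):]

    # Send the completion back to the user
    await cl.Message(content=text).send()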
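
The /generate route exists only on vLLM's demo server. If the model is served through the OpenAI-compatible entrypoint (`vllm.entrypoints.openai.api_server`) instead, the request and response shapes differ. The helpers below are a minimal sketch of that variant; their names are hypothetical, and `MODEL_NAME` assumes the served model name equals the path passed to `--model`.

```python
from typing import Any, Dict

COMPLETIONS_URL = "http://localhost:8000/v1/completions"
# Assumption: the served model name defaults to the value passed to --model.
MODEL_NAME = "/root/workspace/models/Qwen3-4B-Thinking-GGUF.bin"


def build_completion_request(
    prompt: str, max_tokens: int = 1024, temperature: float = 0.7
) -> Dict[str, Any]:
    """Build the JSON body for an OpenAI-style completions request."""
    return {
        "model": MODEL_NAME,
        "prompt": prompt,
        "max_tokens": max_tokens,
        "temperature": temperature,
    }


def extract_completion_text(response_json: Dict[str, Any]) -> str:
    """Return the text of the first choice in a completions response."""
    return response_json["choices"][0]["text"]
```

In app.py, this amounts to posting `build_completion_request(message.content)` to `COMPLETIONS_URL` and sending `extract_completion_text(response.json())` back to the user. With either variant, the app is started with `chainlit run app.py`.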