ms-swift：一机多卡微调GLM-4-9b-chat模型操作

ms-swift是魔搭社区提供的大模型与多模态大模型微调部署框架，现已支持450+大模型与150+多模态大模型的训练（预训练、微调、人类对齐）、推理、评测、量化与部署。

weixin_38472918

2033人浏览 · 2025-03-02 23:15:19

weixin_38472918 · 2025-03-02 23:15:19 发布

简介：

ms-swift是魔搭社区提供的大模型与多模态大模型微调部署框架，现已支持450+大模型与150+多模态大模型的训练（预训练、微调、人类对齐）、推理、评测、量化与部署。其中大模型包括：Qwen2.5、InternLM3、GLM4、Llama3.3、Mistral、DeepSeek-R1、Yi1.5、TeleChat2、Baichuan2、Gemma2等模型，多模态大模型包括：Qwen2.5-VL、Qwen2-Audio、Llama3.2-Vision、Llava、InternVL2.5、MiniCPM-V-2.6、GLM4v、Xcomposer2.5、Yi-VL、DeepSeek-VL2、Phi3.5-Vision、GOT-OCR2等模型。
有兴趣的朋友可以去了解一下：
ms-swift官方地址

操作步骤：

准备一个一机多卡的服务，大家自己准备哈，也可以去购买云服务。
我准备的是2个32G显存的，内存是160G

#新建虚拟环境
python3 -m venv myenv(虚拟环境名称）
#进入虚拟环境
source myenv/bin/activate

1、安装ms-swift

从源代码安装：
# pip install git+https://github.com/modelscope/ms-swift.git
git clone https://github.com/modelscope/ms-swift.git
cd ms-swift
pip install -e .

也可以使用pip进行安装：

pip install ms-swift -U

2、下载模型

pip install modelscope
modelscope download --model ZhipuAI/glm-4-9b-chat

3、将模型复制到合适的位置

cp -r /root/.cache/modelscope/hub/models/ZhipuAI/glm-4-9b-chat ./

4、添加数据集

新建存放数据集的目录

mkdir data

然后上传数据集，一般有两个jsonl文件，一个是需要训练的数据集train.jsonl，另外一个是需要验证数据集的dev.jsonl，如下图所示
在这里插入图片描述
数据集的格式是这样的：
下面是一条数据集的格式，是一条json

{
	"messages": [{
		"role": "system",
		"content": "你是一名软件测试工程师。你的任务是根据需求内容生成测试用例。"
	}, {
		"role": "user",
		"content": "你需要用完整的需求内容来理解业务关联和逻辑，然后结合你对业务的理解针对指定的需求内容生成需要的测试用例，需要考虑测试用例覆盖度"
	}, {
		"role": "assistant",
		"content": "好的，我将生成符合要求的测试用例，同时严格按照python list包含dict的格式生成"
	}, {
		"role": "user",
		"content": "请为需求中的输入框最大输入长度限制35个汉字，汉字不做一个汉字等于两个字符的拆分。编写测试用例"
	}, {
		"role": "assistant",
		"content": "[{'case': '验证输入框对HTML注入攻击的防护能力。', 'precondition': '1、学员已登录系统。\\n2、学习计划页面已打开。\\n3、入框可用。', 'operation_steps': \"1、在输入框中输入含有HTML标签的文本，例如 <script>alert('XSS');</script>。\\n2、点击“确定”按钮。\\n3、观察页面提示信息。\", 'expected_results': '输入框应过滤或转义HTML标签，不执行任何HTML或JavaScript代码。\\n提交后，页面不应显示弹窗或其他由注入代码引起的副作用。'}]"
	}]
}

如果有多条json这样的：
在这里插入图片描述

5、配置ms-swift微调命令

新建一个lora_run.sh，内容如下：

# Use `--template default`
nproc_per_node=2

CUDA_VISIBLE_DEVICES=0,1 \
MASTER_PORT=29501 \
NPROC_PER_NODE=$nproc_per_node \
swift sft \
    --model /root/glm-4-9b-chat \
    --train_type lora \
    --dataset '/root/data/train.jsonl' \
              '/root/data/dev.jsonl' \
    --torch_dtype bfloat16 \
    --template default \
    --num_train_epochs 10 \
    --per_device_train_batch_size 1 \
    --per_device_eval_batch_size 4 \
    --learning_rate 5e-4 \
    --target_modules all-linear \
    --gradient_accumulation_steps $(expr 16 / $nproc_per_node) \
    --eval_steps 500 \
    --max_steps 1000 \
    --save_steps 500 \
    --logging_steps 5 \
    --output_dir output \
    --system 'You are a helpful assistant.' \
    --warmup_ratio 0.05 \
    --dataloader_num_workers 4 \
    --max_new_tokens 3000 \
    --ddp_find_unused_parameters False \

CUDA_VISIBLE_DEVICES：控制使用哪些GPU，默认使用所有卡，类似数值id，如果有4个GPU，可以设置为：1,2,3,4。
MASTER_PORT：torchrun中–master_port的参数透传。默认为29500，其实就是端口号
NPROC_PER_NODE：就是所使用的线程数，用几个显卡就几个线程数。
–model：模型id或模型本地路径
–train_type: 可选为: ‘lora’、‘full’、‘longlora’、‘adalora’、‘llamapro’、‘adapter’、‘vera’、‘boft’、‘fourierft’、‘reft’。默认为’lora’
–dataset:数据集路径

output_dir: 用于保存模型和其他输出的目录。
max_steps: 训练的最大步数。
per_device_train_batch_size: 每个设备（如 GPU）的训练批次大小。
dataloader_num_workers: 加载数据时使用的工作线程数量。
remove_unused_columns: 是否移除数据中未使用的列。
save_strategy: 模型保存策略（例如，每隔多少步保存一次）。
save_steps: 每隔多少步保存一次模型。
log_level: 日志级别（如 info）。
logging_strategy: 日志记录策略。
logging_steps: 每隔多少步记录一次日志。
per_device_eval_batch_size: 每个设备的评估批次大小。
evaluation_strategy: 评估策略（例如，每隔多少步进行一次评估）。
eval_steps: 每隔多少步进行一次评估。
predict_with_generate: 是否使用生成模式进行预测。
generation_config 部分
max_new_tokens: 生成的最大新 token 数量。
这里就不过多解释，具体的命令可以去官方文档上查看：[ms-swift官方文档命令说明]

6、执行lora_run.sh

记得给此Shell脚本执行的权限：

chmod 755 lora_run.sh
#执行该脚本
./lora_run.sh

下面正在训练的模型：
在这里插入图片描述

7、合并模型

# Since `output/vx-xxx/checkpoint-xxx` is trained by swift and contains an `args.json` file,
# there is no need to explicitly set `--model`, `--system`, etc., as they will be automatically read.
swift export \
    --adapters /root/output/v0-20250302-112152/checkpoint-1000 \
    --merge_lora true

8、推理模型

使用训练后的模型推理：

# Since `swift/test_lora` is trained by swift and contains an `args.json` file,
# there is no need to explicitly set `--model`, `--system`, etc., as they will be automatically read.
# To disable this behavior, please set `--load_args false`.
CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --adapters /root/output/v0-20250302-112152/checkpoint-1000 \
    --stream true \
    --temperature 0 \
    --max_new_tokens 2048

执行此脚本后，输入问题即可

使用合并的模型进行推理：

CUDA_VISIBLE_DEVICES=0 \
swift infer \
    --model /root/output/v0-20250302-112152/checkpoint-1000-merged \
    --infer_backend pt \
    --temperature 0 \
    --max_new_tokens 2048

9、遇到的问题

1）当使用lora训练，并使用torchrun启动时，会出现如下报错：

RuntimeError: Expected to mark a variable ready only once. This error is caused by one of the following reasons: 1) Use of a module parameter outside the `forward` function. Please make sure model parameters are not shared across multiple concurrent forward-backward passes. or try to use _set_static_graph() as a workaround if this module graph does not change during training loop.2) Reused parameters in multiple reentrant backward passes. For example, if you use multiple `checkpoint` functions to wrap the same part of your model, it would result in the same set of parameters been used by different reentrant backward passes multiple times, and hence marking a variable ready multiple times. DDP does not support such use cases in default. You can try to use _set_static_graph() as a workaround if your module graph does not change over iterations.

解决办法：
所以最后添加了此参数：
–ddp_find_unused_parameters False

2）在对训练后模型进行推理时，发现回答问题不全

如下图所示：
猜测应该是训练时读取数据集的数据不全导致在这里插入图片描述
解决办法：
所以在lora_run.sh脚本中添加了–max_length 2048 \，参数配置

3）进行推理时，报错

AttributeError: ‘ChatGLMForConditionalGeneration’ object has no attribute ‘_extract_past_from_model_output’
解决办法：
在使用transformers4.49.0版本时都会遇到此问题，只能降版本到transformers4.48.3