Overview of locally fine-tuning DeepSeek-R1-Distill-Qwen-14B with Unsloth

清水即心

清水即心 · Published 2025-03-23 19:21:40

1 Environment setup

1.1 CUDA

The lab server runs CUDA 12.4.

1.2 conda

Install Miniconda.
Point conda and pip at a faster mirror.
Create a virtual environment with python=3.10:

conda create --name <env_name> python=3.10

1.3 PyTorch

Install PyTorch. I started with version 2.5.1 and eventually switched to 2.6.0.

1.4 Unsloth

pip install unsloth

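Installing Unsloth uninstalled the PyTorch I already had and pulled in a CPU-only build of 2.6.0, so PyTorch has to be reinstalled afterwards. The command I ended up using is:

pip3 install torch==2.6.0 torchvision==0.21.0 torchaudio==2.6.0 --index-url https://download.pytorch.org/whl/cu124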
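Since the CPU-only build gets installed silently, it is worth confirming which build is active before going further. The following is a minimal sketch; the helper name describe_torch_build is my own and not part of PyTorch.

```python
import importlib.util


def describe_torch_build():
    """Summarize the installed PyTorch build, or report that it is missing."""
    if importlib.util.find_spec("torch") is None:
        return "torch is not installed in this environment"
    import torch

    # torch.version.cuda is None for CPU-only wheels and a string such as
    # "12.4" for wheels built against CUDA (for example the cu124 index).
    return (
        f"torch {torch.__version__}, built for CUDA {torch.version.cuda}, "
        f"GPU visible: {torch.cuda.is_available()}"
    )


print(describe_torch_build())
```

If the output says "built for CUDA None", the CPU wheel is still the one being imported and the reinstall command above needs to be run again.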

1.5 wandb

Step 1: register an account on the Weights & Biases website ("Weights & Biases: The AI Developer Platform").

Step 2: install the client:

pip install wandb
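The training script later in this post does not configure wandb explicitly, so the following is only a sketch of one common way to do it through environment variables. WANDB_PROJECT and WANDB_MODE are standard wandb variables; the project name is a made-up placeholder, and whether the Trainer reports to wandb by default depends on the installed transformers version.

```python
import os

# Hypothetical project name, chosen for illustration only.
os.environ.setdefault("WANDB_PROJECT", "deepseek-r1-distill-14b-sft")

# "offline" keeps the run logs on disk so they can be uploaded later with
# `wandb sync`; switch to "online" to stream metrics during training.
os.environ.setdefault("WANDB_MODE", "offline")

# Show the wandb-related settings that the training process will inherit.
wandb_settings = {k: v for k, v in os.environ.items() if k.startswith("WANDB_")}
print(wandb_settings)
```

These lines need to run before the trainer is created. Passing report_to="wandb" to TrainingArguments makes the choice explicit instead of relying on the default.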

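2 Downloading the model

The base model for fine-tuning is DeepSeek-R1-Distill-Qwen-14B, downloaded from the ModelScope community.

The server's network connection is slow. I tried FTP and the command-line downloader, and finally settled on the SDK.

# download_model.py
from modelscope.hub.snapshot_download import snapshot_download

# Local cache directory for the downloaded model
cache_dir = "./models"
# Download the model and print where it ended up
model_dir = snapshot_download('deepseek-ai/DeepSeek-R1-Distill-Qwen-14B',
                              cache_dir=cache_dir)
print(model_dir)

Run the script with: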

nohup python download_model.py > output.log 2>&1 &

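I had to restart the download several times; it resumes from where it left off after each restart.

Check the log file output.log to confirm the result: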

tail output.log
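After several interrupted runs, the log alone may not make it obvious that every weight file arrived. As a rough cross-check, the sketch below compares the shard names listed in model.safetensors.index.json (the index file that sharded safetensors checkpoints ship with) against the files on disk. The function name missing_shards is my own, and this assumes the model was published as sharded safetensors.

```python
import json
from pathlib import Path


def missing_shards(model_dir):
    """Return shard files named in the safetensors index that are absent on disk."""
    model_dir = Path(model_dir)
    index_path = model_dir / "model.safetensors.index.json"
    if not index_path.exists():
        raise FileNotFoundError(f"no index file at {index_path}")
    # "weight_map" maps each tensor name to the shard file that stores it.
    weight_map = json.loads(index_path.read_text(encoding="utf-8"))["weight_map"]
    expected = sorted(set(weight_map.values()))
    return [name for name in expected if not (model_dir / name).exists()]
```

Calling missing_shards(model_dir) with the path returned by snapshot_download should give an empty list once the download is complete. It only checks that the files exist, not that their contents are intact.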

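3 Building the dataset

The dataset was built with the easy-dataset tool, which converts documents into a JSONL dataset.

Tool: https://github.com/ConardLi/easy-dataset

Tutorial: "如何把领域文献批量转换为可供模型微调的数据集？" ("How to batch-convert domain literature into a dataset for model fine-tuning"), a video on bilibili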
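The training script in section 4 reads the instruction and output fields of every record, so it can save a failed run to check the exported JSONL first. This is a minimal sketch under that assumption; find_bad_records is a helper name I made up, and it is not part of easy-dataset.

```python
import json

# Fields that the training script in section 4 reads from each record.
REQUIRED_FIELDS = ("instruction", "output")


def find_bad_records(jsonl_path):
    """Return (line_number, reason) pairs for records the training script cannot use."""
    problems = []
    with open(jsonl_path, encoding="utf-8") as f:
        for lineno, line in enumerate(f, start=1):
            if not line.strip():
                continue  # skip blank lines
            try:
                record = json.loads(line)
            except json.JSONDecodeError as exc:
                problems.append((lineno, f"invalid JSON: {exc.msg}"))
                continue
            if not isinstance(record, dict):
                problems.append((lineno, "record is not a JSON object"))
                continue
            for field in REQUIRED_FIELDS:
                value = record.get(field)
                if not isinstance(value, str) or not value.strip():
                    problems.append((lineno, f"missing or empty '{field}'"))
    return problems
```

An empty result means every non-blank line parsed and carries both fields as non-empty strings.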

4 Starting the fine-tuning run

The code is as follows:

import os
# Set this before importing unsloth/torch. The value is the index of the GPU to use, e.g. 1 selects the second GPU.
os.environ["CUDA_VISIBLE_DEVICES"] = "2"
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from unsloth import is_bfloat16_supported

# Used to load datasets (from the Hugging Face Hub or from local files)
from datasets import load_dataset

max_seq_length = 2048
dtype = None
load_in_4bit = False

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "../../../jupyter/models/deepseek-ai/DeepSeek-R1-Distill-Qwen-14B",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    device_map={"": "cuda:0"}
)

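# Prompt template, kept in Chinese to match the dataset. In English it reads roughly:
# "Below is an instruction that describes a task, paired with an input that provides
# further context. Write a response that appropriately completes the request.
# Before answering, think the question through carefully so that the answer is
# logical and accurate." The Instruction line means "You are an XXX."
train_prompt_style = """下面是一条描述任务的指令，与提供进一步上下文的输入配对。
写一个适当完成请求的响应。
在回答之前，仔细思考问题，确保逻辑清晰、准确的回答.

### Instruction:
你是一名XXX。

### Question:
{}

### Response:
{}"""

# End-of-sequence token, appended to every sample so the model learns where to stop
EOS_TOKEN = tokenizer.eos_token

def formatting_prompts_func(examples):
    questions = examples["instruction"]
    #cots = examples["Complex_CoT"]
    answers = examples["output"]
    texts = []
    for question, answer in zip(questions, answers):
        text = train_prompt_style.format(question, answer) + EOS_TOKEN
        texts.append(text)
    return {
        "text": texts,
    }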


#cache_dir = "./local_dataset_cache"
#dataset = load_dataset("FreedomIntelligence/medical-o1-reasoning-SFT","en", split = "train[0:100]",trust_remote_code=True, cache_dir=cache_dir)

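# Load the local JSONL files
dataset = load_dataset("json", split="train", data_files="./local_dataset_cache/jingxin/")
print(dataset[0])

# Apply the prompt template to every record
dataset = dataset.map(formatting_prompts_func, batched=True)
print(dataset["text"][0])

# Wrap the model with LoRA adapters for fine-tuning
model = FastLanguageModel.get_peft_model(
    model,  # model to fine-tune
    r=16,   # LoRA rank: dimension of the low-rank matrices. Larger values add capacity but also parameters and GPU memory.
    target_modules=[    # names of the modules that receive LoRA adapters
        "q_proj",   # query projection
        "k_proj",   # key projection
        "v_proj",   # value projection
        "o_proj",   # attention output projection
        "gate_proj",    # MLP gate projection
        "up_proj",      # MLP up projection
        "down_proj",    # MLP down projection
    ],
    lora_alpha=16,  # scaling factor, commonly set equal to r
    lora_dropout=0, # dropout on the LoRA layers (a guard against overfitting); 0 disables it
    bias="none",    # whether to train bias terms; "none" (the default) trains no biases
    use_gradient_checkpointing="unsloth",  # True or "unsloth" for very long context
    random_state=3407,
    use_rslora=False,   # rsLoRA (rank-stabilized LoRA) is an improved LoRA variant; False means standard LoRA
    loftq_config=None,  # LoftQ (LoRA-fine-tuning-aware quantization) configuration; None leaves it disabled
)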

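# SFTTrainer from Hugging Face's TRL library: supervised fine-tuning, suitable for LoRA and other low-rank adapter methods
trainer = SFTTrainer(
    model=model,    # model to fine-tune
    tokenizer=tokenizer,    # tokenizer
    train_dataset=dataset,  # training dataset
    dataset_text_field="text",  # name of the field that holds the training text
    max_seq_length=max_seq_length,  # maximum sequence length; longer samples are truncated, shorter ones padded
    dataset_num_proc=2, # number of processes used for dataset preprocessing

    # core training configuration
    args=TrainingArguments(
        per_device_train_batch_size=2,  # batch size per GPU: 2 samples per forward/backward pass
        gradient_accumulation_steps=4,  # accumulate gradients over 4 batches before each parameter update
        # Use num_train_epochs = 1, warmup_ratio for full training runs!
        num_train_epochs = 3,   # total number of passes over the dataset
        warmup_steps=5, # warmup: the learning rate rises linearly from 0 to the target over the first 5 steps
        # max_steps=60,
        learning_rate=2e-4, # target learning rate, reached at the end of warmup
        fp16=not is_bfloat16_supported(),   # mixed-precision training to reduce GPU memory; fp16 only when bf16 is unavailable
        bf16=is_bfloat16_supported(),
        logging_steps=10,   # log every 10 steps
        optim="adamw_8bit", # optimizer: 8-bit AdamW
        weight_decay=0.01,  # weight decay coefficient, a guard against overfitting
        lr_scheduler_type="linear", # learning-rate schedule; "linear" decays the rate linearly
        seed=3407,  # random seed, so that runs are reproducible
        output_dir="outputs",   # output directory for checkpoints and logs
    ),
)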

# Quick qualitative check of the model
def test_model_func(questions):
    # Switch the model to inference mode
    FastLanguageModel.for_inference(model)
    for question in questions:
        inputs = tokenizer([question], return_tensors="pt").to("cuda")
        outputs = model.generate(
            input_ids=inputs.input_ids,
            attention_mask=inputs.attention_mask,
            max_new_tokens=1200,
            use_cache=True,
        )
        response = tokenizer.batch_decode(outputs)
        print(response)


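# Start fine-tuning
trainer_stats = trainer.train()

# Fine-tuning finished
print(trainer_stats)

# Quick qualitative check; the questions mean "Hello?" and "What skills do you have?"
questions=["你好?", "你都有啥技能？"]
test_model_func(questions)

# Save the LoRA adapter weights and the tokenizer
new_model_local = "./models/trained/DeepSeek-R1-Distill-Qwen-14B-jingxin"
model.save_pretrained(new_model_local)
tokenizer.save_pretrained(new_model_local)

# Merge the adapter into the base model and save the result as 16-bit weights
model.save_pretrained_merged(new_model_local, tokenizer, save_method = "merged_16bit",)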
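Two back-of-the-envelope numbers help when reading the configuration above: how many weights the LoRA adapters add at r=16, and roughly how many optimizer steps the run takes. The sketch below is an estimate under stated assumptions. The architecture constants are what I believe the Qwen2.5-14B base uses and should be checked against config.json in the downloaded model directory; the sample count of 1000 is made up; and the step count is approximate because the Trainer's rounding can differ slightly between versions.

```python
import math

# Assumed architecture values for the Qwen2.5-14B base model. Verify them
# against config.json in the downloaded model directory.
HIDDEN = 5120          # hidden_size
INTERMEDIATE = 13824   # intermediate_size
LAYERS = 48            # num_hidden_layers
KV_DIM = 8 * 128       # num_key_value_heads * head_dim


def lora_trainable_params(r):
    """LoRA adds r * (in_features + out_features) weights per adapted linear layer."""
    shapes = {
        "q_proj": (HIDDEN, HIDDEN),
        "k_proj": (HIDDEN, KV_DIM),
        "v_proj": (HIDDEN, KV_DIM),
        "o_proj": (HIDDEN, HIDDEN),
        "gate_proj": (HIDDEN, INTERMEDIATE),
        "up_proj": (HIDDEN, INTERMEDIATE),
        "down_proj": (INTERMEDIATE, HIDDEN),
    }
    per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
    return per_layer * LAYERS


def approx_optimizer_steps(num_samples, per_device_batch=2, grad_accum=4, epochs=3):
    """Approximate number of parameter updates on a single GPU."""
    batches_per_epoch = math.ceil(num_samples / per_device_batch)
    updates_per_epoch = math.ceil(batches_per_epoch / grad_accum)
    return updates_per_epoch * epochs


print(lora_trainable_params(16))     # 68812800
print(approx_optimizer_steps(1000))  # 375
```

With batch size 2 and 4 accumulation steps, each update sees 8 samples. On a hypothetical 1000-sample dataset, the 5 warmup steps are only a small fraction of the roughly 375 updates.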

5 Exporting to GGUF and importing into Ollama

5.1 Exporting the GGUF file

model.save_pretrained_gguf("dir", tokenizer, quantization_method = "q8_0")

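This step is error-prone. I ran it in a Jupyter notebook, so after each failure I could fix the problem and rerun the cell.

The error I hit was a failed llama.cpp build. The fix is to download llama.cpp and build it locally; after that the GGUF export runs.

Repository: GitHub - ggml-org/llama.cpp: LLM inference in C/C++

Build instructions: llama.cpp/docs/build.md at master · ggml-org/llama.cpp · GitHub

Cloning the repository from GitHub with git clone also failed, which tends to happen even when going through a proxy or VPN. The fix is to configure an HTTP proxy for git itself.

Tutorial: "Git报错： Failed to connect to github.com port 443 解决方案" ("Git error: Failed to connect to github.com port 443: solution"), CSDN blog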
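Because the export can fail partway through, a quick sanity check on the resulting file is cheap insurance before handing it to Ollama. A GGUF file begins with the 4-byte magic GGUF followed by a 32-bit format version. The sketch below reads only that header and assumes the usual little-endian layout; the function name and the file path in the usage note are hypothetical.

```python
import struct


def read_gguf_version(path):
    """Return the GGUF format version, or raise ValueError if the magic is wrong."""
    with open(path, "rb") as f:
        header = f.read(8)
    if len(header) < 8 or header[:4] != b"GGUF":
        raise ValueError(f"{path} does not start with the GGUF magic bytes")
    # Bytes 4..8 hold the format version as a little-endian unsigned 32-bit integer.
    (version,) = struct.unpack("<I", header[4:8])
    return version
```

For example, read_gguf_version("dir/model.gguf") (a hypothetical file name) should return a small integer. A ValueError usually means the export was interrupted or the path points at the wrong file.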

5.2 Importing into Ollama

After installing Ollama, create a Modelfile and then run:

ollama create <model_name> -f ./Modelfile

Tutorial: https://zhuanlan.zhihu.com/p/27483137957
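For reference, the smallest Modelfile just points at the exported GGUF file with a FROM line. The sketch below writes one from Python. The GGUF path and the temperature value are placeholders of my own, not the settings used in this post, and a fine-tuned chat model may also need a TEMPLATE section that matches its prompt format.

```python
from pathlib import Path


def write_modelfile(gguf_path, modelfile_path="Modelfile", temperature=0.6):
    """Write a minimal Ollama Modelfile and return its text."""
    lines = [
        f"FROM {gguf_path}",                       # path to the exported GGUF file
        f"PARAMETER temperature {temperature}",    # optional sampling parameter
    ]
    text = "\n".join(lines) + "\n"
    Path(modelfile_path).write_text(text, encoding="utf-8")
    return text
```

Calling write_modelfile("./dir/model.gguf") with the actual GGUF path creates ./Modelfile, which is the file the ollama create command above expects.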
