I. Final Results

In this tutorial I want the large model to reply in a warmer, more personable tone. Regarding dataset selection: during the initial inference runs I found that, although the chat model answered quite well, its replies felt a bit formulaic, e.g.:

After training, it answers like this:

II. Preparing the Base Model for Training

1. First, download the model from the ModelScope community, at: https://www.modelscope.cn/models/deepseek-ai/deepseek-llm-7b-chat/files

As shown below:
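If you would rather script the download than click through the page, a minimal sketch using the modelscope package's snapshot_download (the cache_dir below is just an example target directory):

from modelscope import snapshot_download

# Download the model files into a local directory (the path is illustrative)
model_dir = snapshot_download('deepseek-ai/deepseek-llm-7b-chat', cache_dir='D:/deepseek')
print(model_dir)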

Then load the downloaded model into Ollama and run it to check the results, e.g.:

For the details of deploying it to Ollama, see: Converting a downloaded .bin model to GGUF format and importing it into Ollama
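As a rough sketch of that workflow (the GGUF file name and model name here are illustrative; the conversion itself is covered in the linked article), you point a Modelfile at the converted weights and register the model:

# Modelfile
FROM ./deepseek-llm-7b-chat.gguf

Then:

ollama create deepseek-7b-chat -f Modelfile
ollama run deepseek-7b-chat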

2. Q&A after training

III. Downloading the Training Dataset

1. Finding a training dataset online

The EmoLLM project contains many ready-made datasets, generated by large models such as Zhipu Qingyan, ERNIE Bot, Tongyi Qianwen, and iFlytek Spark. After downloading, the directory looks like this:

2. Building your own dataset

If you have other needs, you can build your own dataset by following EmoLLM's data-generation approach, or look at the following files:

3. Using the data

Here I used the single-turn dialogue datasets for fine-tuning, namely:
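Each record in these files is a JSON object with prompt and completion fields (the data.py script below reads exactly these two keys); a hypothetical entry looks like this:

[
    {
        "prompt": "最近总是失眠,压力好大,该怎么办?",
        "completion": "听起来你最近承受了很多压力,先别责怪自己……"
    }
]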

IV. Preparing the Training Environment

1. Install conda

For the conda installation steps, see: https://blog.csdn.net/sunxiaoju/article/details/146035560

2. Create a virtual environment

This article's code uses python=3.10; any environment with python>=3.9 will do. Because the model is fairly large, make sure you have at least 15 GB of GPU memory for this fine-tune. We need to install the Python libraries listed below; before that, make sure CUDA is installed in your environment (for the installation steps, see: CUDA installation guide).

conda create --name ds.llm.7b.chat python=3.10

As shown below:
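After creating the environment, activate it before installing anything:

conda activate ds.llm.7b.chat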

3. Install the required libraries

Run the following commands one by one:

pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install transformers
pip install accelerate
pip install peft
pip install bitsandbytes
pip install swanlab
pip install pandas
pip install datasets
pip install loguru

Note: torch must be installed with the command from the official site, otherwise the default installation will not support CUDA; the command also has to match your CUDA version. You can pick the command corresponding to your CUDA version at https://pytorch.org/get-started/locally/, e.g.:

After installation, you can verify it as follows; if the result is True, CUDA is available, i.e., the GPU can be used.
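For example:

import torch

print(torch.__version__)          # the installed torch build, e.g. a +cu126 wheel
print(torch.cuda.is_available())  # True means CUDA, and hence the GPU, is usable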

If you run into connection timeouts, just retry a few times.

While installing swanlab, the following message appears, and pip eventually times out downloading many candidate versions:

INFO: pip is looking at multiple versions of pydantic to determine which version is compatible with other requirements. This could take a while.

That is, pip is backtracking through many versions of pydantic to find one compatible with the other requirements, which can take a long time.

As shown below:

Install pydantic separately by running:

pip install pydantic==2.10.6

As shown below:

Then run again:

pip install swanlab

This time the installation succeeds, as shown below:

V. Training

1. Prepare the dataset

First create a directory named deepseek-finetune-lora, create a data directory inside it, and copy single_turn_dataset_1.json and single_turn_dataset_2.json from the EmoLLM\datasets directory into it, as shown below:
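This is the layout assumed by the relative paths in the scripts that follow:

deepseek-finetune-lora/
    data/
        single_turn_dataset_1.json
        single_turn_dataset_2.json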

Create a data.py file in the deepseek-finetune-lora directory with the following content:

import json
from transformers import AutoTokenizer
from tqdm import tqdm

# Prepare the dataset; this time we fine-tune on single-turn dialogues
## Each dataset has roughly 15k entries, about 30k after merging; the replies are fairly
## human-like, and at the end we will check whether the fine-tuned model picks up that tone
def len_data(data_path):
    with open(data_path, 'r') as file:
        data = json.load(file)
    # Return the number of entries in the file
    num_entries = len(data)
    return num_entries

# Merge the two datasets, about 30k entries in total
def merge_data(data1_path, data2_path):
    # Read the first JSON file
    with open(data1_path, 'r', encoding='utf-8') as file:
        data1 = json.load(file)

    # Read the second JSON file
    with open(data2_path, 'r', encoding='utf-8') as file:
        data2 = json.load(file)

    # Concatenate the two lists
    merged_data = data1 + data2

    # Write the merged data to a new file
    with open('./data/merged_data.json', 'w', encoding='utf-8') as file:
        json.dump(merged_data, file, ensure_ascii=False, indent=4)

    # Print the number of entries after merging
    print(f"The merged dataset contains {len(merged_data)} entries.")

# Reformat the raw dataset and save it as JSONL for the later steps
def data_process(data_path, output_path):
    # Accumulated results
    all_results = []
    with open(data_path, 'r', encoding='utf-8') as file:
        data_json = json.load(file)
        # Process each record
        for i, data in enumerate(data_json):
            # Conversation id
            conversation_id = i + 1
            # Single-turn dialogue, so no inner loop is needed
            conversation = []

            try:
                human_text = data["prompt"]
                assistant_text = data["completion"]

                conversation_texts = {"human": human_text, "assistant": assistant_text}
                conversation.append(conversation_texts)
            except KeyError:
                # Skip records that are missing the "prompt" or "completion" field
                continue

            result = {"conversation_id": conversation_id, "conversation": conversation}

            all_results.append(result)
    # Write the results to the output file (JSONL: one JSON object per line)
    with open(output_path, 'w', encoding='utf-8') as file:
        for item in tqdm(all_results, desc="Writing to File"):
            file.write(json.dumps(item, ensure_ascii=False) + '\n')

    print(f"Saved {len(all_results)} entries to {output_path}")

# Preprocess one record into the format required for fine-tuning
def finetune_data(data: dict, tokenizer, max_seq_length):
    conversation = data["conversation"]
    input_ids, attention_mask, labels = [], [], []

    for i, conv in enumerate(conversation):
        human_text = conv["human"]
        assistant_text = conv["assistant"]

        # Same prompt template as in finetune.py
        input_text = "Human:" + human_text + "\n\nAssistant:"

        input_tokenizer = tokenizer(
            input_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )
        output_tokenizer = tokenizer(
            assistant_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )

        input_ids += (
                input_tokenizer["input_ids"] + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
        )
        attention_mask += input_tokenizer["attention_mask"] + output_tokenizer["attention_mask"] + [1]
        # Mask the prompt tokens with -100 so only the assistant reply contributes to the loss
        labels += ([-100] * len(input_tokenizer["input_ids"]) + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
        )

    if len(input_ids) > max_seq_length:  # truncate to the maximum length
        input_ids = input_ids[:max_seq_length]
        attention_mask = attention_mask[:max_seq_length]
        labels = labels[:max_seq_length]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

if __name__ == '__main__':
    merge_data('./data/single_turn_dataset_1.json', './data/single_turn_dataset_2.json')
    data_path = './data/merged_data.json'
    output_path = './data/single_datas.jsonl'
    data_process(data_path, output_path)

    model_path = "D:/deepseek/deepseek-llm-7b-chat"
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)

    with open(output_path, "r", encoding="utf-8") as f:
        data = [json.loads(line) for line in f]

    # Preview the first processed record
    data = data[0]
    print(data)
    max_seq_length = 2048
    print(finetune_data(data,tokenizer,max_seq_length))

The dataset locations need to be set, e.g.:

merge_data('./data/single_turn_dataset_1.json', './data/single_turn_dataset_2.json')

The data is then converted to JSONL by:

data_process(data_path, output_path)
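Each line of the resulting single_datas.jsonl is one JSON object in the shape produced by data_process (the texts are elided here):

{"conversation_id": 1, "conversation": [{"human": "...", "assistant": "..."}]}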

Then change into the deepseek-finetune-lora directory, as shown below:

Finally, run the following command:

python ./data.py

The execution result is shown below:

Check the corresponding directory:
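The data directory should now contain the merged file and the JSONL output alongside the two originals:

data/
    merged_data.json              # ~30k merged records
    single_datas.jsonl            # one conversation per line, ready for fine-tuning
    single_turn_dataset_1.json
    single_turn_dataset_2.json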

2. Run the training

In the deepseek-finetune-lora directory, create a new model directory, then create a finetune.py file with the following code:

import argparse

import pandas as pd
from datasets import Dataset
from loguru import logger
from transformers import (
    TrainingArguments,
    AutoModelForCausalLM,
    Trainer,
    DataCollatorForSeq2Seq,
    AutoTokenizer,
)
import torch
from peft import LoraConfig, get_peft_model, TaskType
from swanlab.integration.transformers import SwanLabCallback
import bitsandbytes as bnb
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import os

# Configure command-line arguments
def configuration_parameter():
    parser = argparse.ArgumentParser(description="LoRA fine-tuning for deepseek model")

    # Model path arguments
    parser.add_argument("--model_name_or_path", type=str, default="D:/deepseek/deepseek-llm-7b-chat",
                        help="Path to the model directory downloaded locally")
    parser.add_argument("--output_dir", type=str, default="D:/deepseek/deepseek-llm-7b-chat-lora/model",
                        help="Directory to save the fine-tuned model and checkpoints")

    # Dataset path
    parser.add_argument("--train_file", type=str, default="./data/single_datas.jsonl",
                        help="Path to the training data file in JSONL format")

    # Training hyperparameters
    parser.add_argument("--num_train_epochs", type=int, default=3,
                        help="Number of training epochs")
    parser.add_argument("--per_device_train_batch_size", type=int, default=4,
                        help="Batch size per device during training")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=16,
                        help="Number of updates steps to accumulate before performing a backward/update pass")
    parser.add_argument("--learning_rate", type=float, default=2e-4,
                        help="Learning rate for the optimizer")
    parser.add_argument("--max_seq_length", type=int, default=2048,
                        help="Maximum sequence length for the input")
    parser.add_argument("--logging_steps", type=int, default=1,
                        help="Number of steps between logging metrics")
    parser.add_argument("--save_steps", type=int, default=500,
                        help="Number of steps between saving checkpoints")
    parser.add_argument("--save_total_limit", type=int, default=1,
                        help="Maximum number of checkpoints to keep")
    parser.add_argument("--lr_scheduler_type", type=str, default="constant_with_warmup",
                        help="Type of learning rate scheduler")
    parser.add_argument("--warmup_steps", type=int, default=100,
                        help="Number of warmup steps for learning rate scheduler")

    # LoRA-specific parameters
    parser.add_argument("--lora_rank", type=int, default=16,
                        help="Rank of LoRA matrices")
    parser.add_argument("--lora_alpha", type=int, default=32,
                        help="Alpha parameter for LoRA")
    parser.add_argument("--lora_dropout", type=float, default=0.05,
                        help="Dropout rate for LoRA")

    # Distributed-training parameters
    # PyTorch distributed training needs local_rank; the main process has local_rank = 0
    parser.add_argument("--local_rank", type=int, default=int(os.environ.get("LOCAL_RANK", -1)), help="Local rank for distributed training")
    # Whether to enable distributed training: False = single-machine mode, True = distributed
    # (note: argparse's type=bool treats any non-empty string as True)
    parser.add_argument("--distributed", type=bool, default=False, help="Enable distributed training")

    # Extra optimization and hardware-related parameters
    parser.add_argument("--gradient_checkpointing", type=bool, default=True,
                        help="Enable gradient checkpointing to save memory")
    parser.add_argument("--optim", type=str, default="adamw_torch",
                        help="Optimizer to use during training")
    parser.add_argument("--train_mode", type=str, default="lora",
                        help="lora or qlora")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")
    parser.add_argument("--fp16", type=bool, default=True,
                        help="Use mixed precision (FP16) training")
    parser.add_argument("--report_to", type=str, default=None,
                        help="Reporting tool for logging (e.g., tensorboard)")
    parser.add_argument("--dataloader_num_workers", type=int, default=0,
                        help="Number of workers for data loading")
    parser.add_argument("--save_strategy", type=str, default="steps",
                        help="Strategy for saving checkpoints ('steps', 'epoch')")
    parser.add_argument("--weight_decay", type=float, default=0,
                        help="Weight decay for the optimizer")
    parser.add_argument("--max_grad_norm", type=float, default=1,
                        help="Maximum gradient norm for clipping")
    parser.add_argument("--remove_unused_columns", type=bool, default=True,
                        help="Remove unused columns from the dataset")

    args = parser.parse_args()
    return args


def find_all_linear_names(model, train_mode):
    """
    找出所有全连接层,为所有全连接添加adapter
    """
    assert train_mode in ['lora', 'qlora']
    cls = bnb.nn.Linear4bit if train_mode == 'qlora' else nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])

    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    lora_module_names = list(lora_module_names)
    logger.info(f'LoRA target module names: {lora_module_names}')
    return lora_module_names

def setup_distributed(args):
    """Initialize the distributed environment"""
    if args.distributed:
        if args.local_rank == -1:
            raise ValueError("local_rank was not initialized; make sure the script is launched with a distributed launcher such as torchrun.")

        # Initialize the distributed process group
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(args.local_rank)
        print(f"Distributed training enabled, local rank: {args.local_rank}")
    else:
        print("Distributed training not enabled; running in single-machine mode.")

# Load the model and build the trainer
def load_model(args, train_dataset, data_collator):
    # Initialize the distributed environment
    setup_distributed(args)
    # Load the base model
    model_kwargs = {
        "trust_remote_code": True,
        "torch_dtype": torch.float16 if args.fp16 else torch.bfloat16,
        "use_cache": False if args.gradient_checkpointing else True,
        "device_map": "auto" if not args.distributed else None,
    }
    model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, **model_kwargs)
    # Make sure the model's input embeddings participate in training
    model.enable_input_require_grads()
    # Move the model to the right device
    if args.distributed:
        model.to(args.local_rank)
        model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)
    # Which modules get LoRA parameters injected
    target_modules = find_all_linear_names(model.module if isinstance(model, DDP) else model, args.train_mode)
    # LoRA settings
    config = LoraConfig(
        r=args.lora_rank,
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
        bias="none",
        target_modules=target_modules,
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False
    )
    use_bfloat16 = torch.cuda.is_bf16_supported()  # check whether the device supports bf16
    # Configure the training arguments
    train_args = TrainingArguments(
        output_dir=args.output_dir,
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        logging_steps=args.logging_steps,
        num_train_epochs=args.num_train_epochs,
        save_steps=args.save_steps,
        learning_rate=args.learning_rate,
        save_on_each_node=True,
        gradient_checkpointing=args.gradient_checkpointing,
        report_to=args.report_to,
        seed=args.seed,
        optim=args.optim,
        local_rank=args.local_rank if args.distributed else -1,
        ddp_find_unused_parameters=False,  # skip DDP unused-parameter checks
        fp16=args.fp16,
        bf16=not args.fp16 and use_bfloat16,
    )
    # Apply the PEFT config to the model
    model = get_peft_model(model.module if isinstance(model, DDP) else model, config)  # make sure the unwrapped model is passed
    print("model:", model)
    model.print_trainable_parameters()
    ### SwanLab experiment tracking
    swanlab_config = {
        "lora_rank": args.lora_rank,
        "lora_alpha": args.lora_alpha,
        "lora_dropout": args.lora_dropout,
        "dataset":"single-data-3w"

    }
    swanlab_callback = SwanLabCallback(
        project="deepseek-finetune",
        experiment_name="deepseek-llm-7b-chat-lora",
        description="DeepSeek有很多模型,V2太大了,这里选择llm-7b-chat的,希望能让回答更加人性化",
        workspace=None,
        config=swanlab_config,
    )
    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,
        data_collator=data_collator,
        callbacks=[swanlab_callback],
    )
    return trainer


# Preprocess one record (same logic as finetune_data in data.py)
def process_data(data: dict, tokenizer, max_seq_length):
    conversation = data["conversation"]
    input_ids, attention_mask, labels = [], [], []

    for i, conv in enumerate(conversation):
        human_text = conv["human"].strip()
        assistant_text = conv["assistant"].strip()

        input_text = "Human:" + human_text + "\n\nnAssistant:"

        input_tokenizer = tokenizer(
            input_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )
        output_tokenizer = tokenizer(
            assistant_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )

        input_ids += (
                input_tokenizer["input_ids"] + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
        )
        attention_mask += input_tokenizer["attention_mask"] + output_tokenizer["attention_mask"] + [1]
        # Mask the prompt tokens with -100 so only the assistant reply contributes to the loss
        labels += ([-100] * len(input_tokenizer["input_ids"]) + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
                   )

    if len(input_ids) > max_seq_length:  # truncate to the maximum length
        input_ids = input_ids[:max_seq_length]
        attention_mask = attention_mask[:max_seq_length]
        labels = labels[:max_seq_length]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}


# Training entry point
def main():
    args = configuration_parameter()
    print("***************** Loading tokenizer *************************")
    # Load the tokenizer
    model_path = args.model_name_or_path
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
    print("***************** Processing data *************************")
    # Load and preprocess the data
    data = pd.read_json(args.train_file, lines=True)
    train_ds = Dataset.from_pandas(data)
    train_dataset = train_ds.map(process_data, fn_kwargs={"tokenizer": tokenizer, "max_seq_length": args.max_seq_length},
                                 remove_columns=train_ds.column_names)
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True, return_tensors="pt")
    print(train_dataset, data_collator)
    # Build the trainer and load the model
    print("***************** Training *************************")
    trainer = load_model(args, train_dataset, data_collator)
    trainer.train()
    # Save the fine-tuned model (LoRA adapter)
    final_save_path = args.output_dir
    trainer.save_model(final_save_path)


if __name__ == "__main__":
    main()

Run it as follows:

python ./finetune.py

As shown below:

At this point swanlab asks you to make a choice: 1 creates a new account, 2 uses an existing account, and 3 does not save the results. Before choosing, register an account on the SwanLab website and note down the API Key, as shown below:

Then choose 1 here, as shown below:

Then enter the API Key. Note: the input is not echoed. As shown below:

A training project is then created automatically on swanlab, as shown below:

After a full 8 hours, the training finished, as shown below:

The model files produced by the training are shown below:
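Incidentally, the script runs in single-machine mode by default. If you have multiple GPUs, its --distributed path can be launched with torchrun, which sets the LOCAL_RANK environment variable the script reads; a sketch (untested here, and note that argparse's type=bool treats any non-empty value as True):

torchrun --nproc_per_node=2 ./finetune.py --distributed true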

3. Merge the model

Create a merge_model.py file in the deepseek-finetune-lora directory with the following code:

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
import shutil

# Make sure all of the original model's non-weight files also end up in merge_path
def copy_files_not_in_B(A_path, B_path):
    if not os.path.exists(A_path):
        raise FileNotFoundError(f"The directory {A_path} does not exist.")
    if not os.path.exists(B_path):
        os.makedirs(B_path)

    # Collect all non-weight files under path A
    files_in_A = os.listdir(A_path)
    files_in_A = set([file for file in files_in_A if not (".bin" in file or "safetensors" in file)])

    files_in_B = set(os.listdir(B_path))

    # Find all files that exist in A but not in B
    files_to_copy = files_in_A - files_in_B

    # Copy each file or folder into path B
    for file in files_to_copy:
        src_path = os.path.join(A_path, file)
        dst_path = os.path.join(B_path, file)

        if os.path.isdir(src_path):
            # Copy the directory and its contents
            shutil.copytree(src_path, dst_path)
        else:
            # Copy the file
            shutil.copy2(src_path, dst_path)

def merge_lora_to_base_model():
    model_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat'  # path to the original model
    adapter_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat-lora/model'  # where the fine-tuned adapter was saved
    save_path = 'output/merge-model'

    # Create the folder if it does not exist
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)

    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    # Load the saved adapter
    model = PeftModel.from_pretrained(model, adapter_name_or_path, device_map="auto", trust_remote_code=True)
    # Merge the adapter into the base model
    merged_model = model.merge_and_unload()  # PEFT's method for merging adapter weights into the base model
    # Save the merged model
    tokenizer.save_pretrained(save_path)
    merged_model.save_pretrained(save_path, safe_serialization=False)
    copy_files_not_in_B(model_name_or_path, save_path)
    print(f"合并后的模型已保存至: {save_path}")


if __name__ == '__main__':
    merge_lora_to_base_model()

Note: three paths need to be configured:

model_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat'  # path to the original model, i.e., where it was downloaded
adapter_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat-lora/model'  # where the fine-tuned adapter was saved, i.e., the model produced by training
save_path = 'output/merge-model'  # where the merged model will be saved

Then start the merge:

python ./merge_model.py

The execution result is as follows:
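If everything worked, output/merge-model should hold the merged weights plus the non-weight files copied over by copy_files_not_in_B; roughly (the shard names are illustrative):

output/merge-model/
    config.json                          # copied from the base model
    tokenizer_config.json                # tokenizer files written by save_pretrained
    pytorch_model-00001-of-00002.bin     # merged FP16 weights (.bin because safe_serialization=False)
    ...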

4. Convert the model to GGUF and deploy it to Ollama

For the specific steps, see: Converting a downloaded .bin model to GGUF format and importing it into Ollama

The final result is as follows:

Reference: https://zhuanlan.zhihu.com/p/9812641926
