
DeepSeek-llm-7B-Chat Fine-Tuning Steps
Final Results
In this tutorial I want the model to reply in a warmer, more personable way. As for the choice of dataset: when I first ran inference, I found that although the chat model answers quite well, the replies feel a bit formulaic, for example:
After training, the replies look like this:
I. Preparing the Base Model
1. First, download the model from the ModelScope community, at: https://www.modelscope.cn/models/deepseek-ai/deepseek-llm-7b-chat/files
As shown below:
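If you prefer to script the download instead of clicking through the web page, a minimal sketch using the modelscope Python SDK also works (this is my own addition, not part of the original steps; install it with pip install modelscope, and note that cache_dir here is only an example, the files end up in a subdirectory under it):
from modelscope import snapshot_download
model_dir = snapshot_download('deepseek-ai/deepseek-llm-7b-chat', cache_dir='D:/deepseek')
print(model_dir)  # the local directory containing the downloaded model files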
Then load the downloaded model into ollama and run it to check the output, e.g.:
For details on how to deploy it to ollama, see: Converting a downloaded .bin model to gguf format and importing it into ollama
2. Q&A after training
II. Downloading the Training Dataset
1. Finding a training dataset online
The EmoLLM project provides many ready-made datasets, generated with large models such as Zhipu Qingyan, ERNIE Bot, Tongyi Qianwen, and iFLYTEK Spark. After downloading, the directory looks like this:
2. Building your own dataset
If you have other needs, you can follow EmoLLM's data-generation approach to build your own dataset, or look at the following files:
3. Using the data
For this fine-tuning run I used the single-turn dialogue datasets, namely:
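Judging from the fields the preprocessing script below reads, each record in these single-turn files should be a JSON object containing "prompt" and "completion" keys. A quick sanity check (the path is an assumption, adjust it to wherever you put the EmoLLM datasets):
import json
with open('EmoLLM/datasets/single_turn_dataset_1.json', 'r', encoding='utf-8') as f:
    records = json.load(f)
print(len(records))       # roughly 15k entries per file
print(records[0].keys())  # should include 'prompt' and 'completion'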
III. Preparing the Training Environment
1. Install conda
For the conda installation steps, see: https://blog.csdn.net/sunxiaoju/article/details/146035560
2. Create a virtual environment
The code in this post uses Python 3.10; any environment with Python >= 3.9 will work. Because the model is fairly large, make sure you have at least 15 GB of GPU memory available for this fine-tuning run. We need to install the Python libraries listed below; before doing so, make sure CUDA is installed in your environment (see the CUDA installation guide).
conda create --name ds.llm.7b.chart python==3.10
As shown below:
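Before installing any packages with pip, remember to activate the newly created environment so that everything goes into it rather than into the base environment:
conda activate ds.llm.7b.chart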
3. Install the required libraries
Run the following commands one by one:
pip3 install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
pip install transformers
pip install accelerate
pip install peft
pip install bitsandbytes
pip install swanlab
pip install pandas
pip install datasets
pip install loguru
Note: torch must be installed with the command from the official PyTorch site to get CUDA support; the default pip installation does not support CUDA, and the command depends on your CUDA version. You can pick the command matching your CUDA version at https://pytorch.org/get-started/locally/, for example:
After installation you can verify it as follows; if the result is True, CUDA is available, which means the GPU can be used.
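For example, in a Python shell (this uses only the standard torch API, nothing specific to this tutorial):
import torch
print(torch.__version__)          # should end in +cu126 if the CUDA build was installed
print(torch.cuda.is_available())  # True means CUDA, i.e. the GPU, is usable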
If you run into connection timeouts, simply rerun the command a few times.
When installing swanlab, pip may print the following message and then spend a long time downloading many versions until it times out:
INFO: pip is looking at multiple versions of pydantic to determine which version is compatible with other requirements. This could take a while.
In other words, pip is backtracking through multiple versions of pydantic to find one compatible with the other requirements, which can take a very long time.
As shown below:
Install pydantic separately by running:
pip install pydantic==2.10.6
As shown below:
Then run again:
pip install swanlab
This time the installation succeeds, as shown below:
IV. Starting the Training
1. Prepare the dataset
First create a directory named deepseek-finetune-lora, create a data subdirectory inside it, and copy single_turn_dataset_1.json and single_turn_dataset_2.json from the EmoLLM\datasets directory into it, as shown below:
Create a data.py file in the deepseek-finetune-lora directory with the following content:
import json
from transformers import AutoTokenizer, AutoModelForCausalLM, GenerationConfig
from tqdm import tqdm

# Dataset preparation: this run fine-tunes on single-turn conversations.
# Each source file has roughly 15k entries, so merging the two gives about 30k samples
# with fairly human-sounding replies; the goal is to see whether the fine-tuned model
# picks up that warmer tone.

def len_data(data_path):
    with open(data_path, 'r') as file:
        data = json.load(file)
    # Number of entries in the file
    num_entries = len(data)
    return num_entries

# Merge the two datasets (about 30k entries in total)
def merge_data(data1_path, data2_path):
    # Read the first JSON file
    with open(data1_path, 'r', encoding='utf-8') as file:
        data1 = json.load(file)
    # Read the second JSON file
    with open(data2_path, 'r', encoding='utf-8') as file:
        data2 = json.load(file)
    # Concatenate the two lists
    merged_data = data1 + data2
    # Write the merged data to a new file
    with open('./data/merged_data.json', 'w', encoding='utf-8') as file:
        json.dump(merged_data, file, ensure_ascii=False, indent=4)
    # Report how many entries the merged dataset contains
    print(f"The merged dataset contains {len(merged_data)} entries.")

# Convert the original dataset format and save it as JSONL for the later steps
def data_process(data_path, output_path):
    # Collected results
    all_results = []
    with open(data_path, 'r', encoding='utf-8') as file:
        data_json = json.load(file)
    # Process each record
    for i, data in enumerate(data_json):
        # Conversation id
        conversation_id = i + 1
        # Single-turn dialogue, so no inner loop is needed
        conversation = []
        try:
            human_text = data["prompt"]
            assistant_text = data["completion"]
            conversation_texts = {"human": human_text, "assistant": assistant_text}
            conversation.append(conversation_texts)
        except KeyError:
            # Skip records that lack complete "prompt" and "completion" fields
            continue
        result = {"conversation_id": conversation_id, "conversation": conversation}
        all_results.append(result)
    # Write the results to the output file (JSONL format)
    with open(output_path, 'w', encoding='utf-8') as file:
        for item in tqdm(all_results, desc="Writing to File"):
            file.write(json.dumps(item, ensure_ascii=False) + '\n')
    print(f"Saved {len(all_results)} entries to {output_path}")

# Preprocess one sample into the format required for fine-tuning
def finetune_data(data: dict, tokenizer, max_seq_length):
    conversation = data["conversation"]
    input_ids, attention_mask, labels = [], [], []
    for i, conv in enumerate(conversation):
        human_text = conv["human"]
        assistant_text = conv["assistant"]
        input_text = "human:" + human_text + "\n\nassistant:"
        input_tokenizer = tokenizer(
            input_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )
        output_tokenizer = tokenizer(
            assistant_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )
        input_ids += (
            input_tokenizer["input_ids"] + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
        )
        attention_mask += input_tokenizer["attention_mask"] + output_tokenizer["attention_mask"] + [1]
        # Mask the prompt tokens with -100 so the loss is only computed on the assistant reply
        labels += (
            [-100] * len(input_tokenizer["input_ids"]) + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
        )
    if len(input_ids) > max_seq_length:  # truncate to the maximum sequence length
        input_ids = input_ids[:max_seq_length]
        attention_mask = attention_mask[:max_seq_length]
        labels = labels[:max_seq_length]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

if __name__ == '__main__':
    merge_data('./data/single_turn_dataset_1.json', './data/single_turn_dataset_2.json')
    data_path = './data/merged_data.json'
    output_path = './data/single_datas.jsonl'
    data_process(data_path, output_path)
    model_path = "D:/deepseek/deepseek-llm-7b-chat"
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
    with open(output_path, "r", encoding="utf-8") as f:
        data = [json.loads(line) for line in f]
    data = data[0]
    print(data)
    max_seq_length = 2048
    print(finetune_data(data, tokenizer, max_seq_length))
You need to set the dataset paths, e.g.:
merge_data('./data/single_turn_dataset_1.json', './data/single_turn_dataset_2.json')
The data is then converted to JSONL by:
data_process(data_path, output_path)
Then change into the deepseek-finetune-lora directory, as shown below:
Finally run:
python ./data.py
The output looks like this:
Check the corresponding directory:
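For reference, each line of the generated single_datas.jsonl is one JSON object in the shape produced by data_process above (the text is elided here, this is not real data):
{"conversation_id": 1, "conversation": [{"human": "...", "assistant": "..."}]}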
2. Run the training
Inside the deepseek-finetune-lora directory, create a model directory, then create a finetune.py file with the following code:
import argparse
from os.path import join
import pandas as pd
from datasets import Dataset
from loguru import logger
from transformers import (
    TrainingArguments,
    AutoModelForCausalLM,
    Trainer,
    DataCollatorForSeq2Seq,
    AutoTokenizer,
)
import torch
from peft import LoraConfig, get_peft_model, TaskType
from swanlab.integration.transformers import SwanLabCallback
import bitsandbytes as bnb
from torch import nn
from torch.nn.parallel import DistributedDataParallel as DDP
import torch.distributed as dist
import os

# Configuration parameters
def configuration_parameter():
    parser = argparse.ArgumentParser(description="LoRA fine-tuning for deepseek model")
    # Model path arguments
    parser.add_argument("--model_name_or_path", type=str, default="D:/deepseek/deepseek-llm-7b-chat",
                        help="Path to the model directory downloaded locally")
    parser.add_argument("--output_dir", type=str, default="D:/deepseek/deepseek-llm-7b-chat-lora/model",
                        help="Directory to save the fine-tuned model and checkpoints")
    # Dataset path
    parser.add_argument("--train_file", type=str, default="./data/single_datas.jsonl",
                        help="Path to the training data file in JSONL format")
    # Training hyperparameters
    parser.add_argument("--num_train_epochs", type=int, default=3,
                        help="Number of training epochs")
    parser.add_argument("--per_device_train_batch_size", type=int, default=4,
                        help="Batch size per device during training")
    parser.add_argument("--gradient_accumulation_steps", type=int, default=16,
                        help="Number of updates steps to accumulate before performing a backward/update pass")
    parser.add_argument("--learning_rate", type=float, default=2e-4,
                        help="Learning rate for the optimizer")
    parser.add_argument("--max_seq_length", type=int, default=2048,
                        help="Maximum sequence length for the input")
    parser.add_argument("--logging_steps", type=int, default=1,
                        help="Number of steps between logging metrics")
    parser.add_argument("--save_steps", type=int, default=500,
                        help="Number of steps between saving checkpoints")
    parser.add_argument("--save_total_limit", type=int, default=1,
                        help="Maximum number of checkpoints to keep")
    parser.add_argument("--lr_scheduler_type", type=str, default="constant_with_warmup",
                        help="Type of learning rate scheduler")
    parser.add_argument("--warmup_steps", type=int, default=100,
                        help="Number of warmup steps for learning rate scheduler")
    # LoRA-specific parameters
    parser.add_argument("--lora_rank", type=int, default=16,
                        help="Rank of LoRA matrices")
    parser.add_argument("--lora_alpha", type=int, default=32,
                        help="Alpha parameter for LoRA")
    parser.add_argument("--lora_dropout", type=float, default=0.05,
                        help="Dropout rate for LoRA")
    # Distributed training parameters
    # For distributed training with PyTorch, local_rank must be provided; the main process has local_rank = 0
    parser.add_argument("--local_rank", type=int, default=int(os.environ.get("LOCAL_RANK", -1)),
                        help="Local rank for distributed training")
    # Whether to enable distributed training: False = single machine, True = distributed
    parser.add_argument("--distributed", type=bool, default=False, help="Enable distributed training")
    # Additional optimization and hardware-related parameters
    parser.add_argument("--gradient_checkpointing", type=bool, default=True,
                        help="Enable gradient checkpointing to save memory")
    parser.add_argument("--optim", type=str, default="adamw_torch",
                        help="Optimizer to use during training")
    parser.add_argument("--train_mode", type=str, default="lora",
                        help="lora or qlora")
    parser.add_argument("--seed", type=int, default=42,
                        help="Random seed for reproducibility")
    parser.add_argument("--fp16", type=bool, default=True,
                        help="Use mixed precision (FP16) training")
    parser.add_argument("--report_to", type=str, default=None,
                        help="Reporting tool for logging (e.g., tensorboard)")
    parser.add_argument("--dataloader_num_workers", type=int, default=0,
                        help="Number of workers for data loading")
    parser.add_argument("--save_strategy", type=str, default="steps",
                        help="Strategy for saving checkpoints ('steps', 'epoch')")
    parser.add_argument("--weight_decay", type=float, default=0,
                        help="Weight decay for the optimizer")
    parser.add_argument("--max_grad_norm", type=float, default=1,
                        help="Maximum gradient norm for clipping")
    parser.add_argument("--remove_unused_columns", type=bool, default=True,
                        help="Remove unused columns from the dataset")
    args = parser.parse_args()
    return args

def find_all_linear_names(model, train_mode):
    """
    Find all linear (fully connected) layers so a LoRA adapter can be attached to each of them.
    """
    assert train_mode in ['lora', 'qlora']
    cls = bnb.nn.Linear4bit if train_mode == 'qlora' else nn.Linear
    lora_module_names = set()
    for name, module in model.named_modules():
        if isinstance(module, cls):
            names = name.split('.')
            lora_module_names.add(names[0] if len(names) == 1 else names[-1])
    if 'lm_head' in lora_module_names:  # needed for 16-bit
        lora_module_names.remove('lm_head')
    lora_module_names = list(lora_module_names)
    logger.info(f'LoRA target module names: {lora_module_names}')
    return lora_module_names

def setup_distributed(args):
    """Initialize the distributed environment."""
    if args.distributed:
        if args.local_rank == -1:
            raise ValueError("local_rank was not initialized correctly; make sure the script is launched with a distributed launcher such as torchrun.")
        # Initialize the distributed process group
        dist.init_process_group(backend="nccl")
        torch.cuda.set_device(args.local_rank)
        print(f"Distributed training enabled, local rank: {args.local_rank}")
    else:
        print("Distributed training not enabled; running in single-process mode.")

# Load the model and build the Trainer
def load_model(args, train_dataset, data_collator):
    # Initialize the distributed environment
    setup_distributed(args)
    # Load the model
    model_kwargs = {
        "trust_remote_code": True,
        "torch_dtype": torch.float16 if args.fp16 else torch.bfloat16,
        "use_cache": False if args.gradient_checkpointing else True,
        "device_map": "auto" if not args.distributed else None,
    }
    model = AutoModelForCausalLM.from_pretrained(args.model_name_or_path, **model_kwargs)
    # Make sure the input embeddings participate in gradient computation
    model.enable_input_require_grads()
    # Move the model to the correct device
    if args.distributed:
        model.to(args.local_rank)
        model = DDP(model, device_ids=[args.local_rank], output_device=args.local_rank)
    # Decide which modules get LoRA parameters injected
    target_modules = find_all_linear_names(model.module if isinstance(model, DDP) else model, args.train_mode)
    # LoRA configuration
    config = LoraConfig(
        r=args.lora_rank,
        lora_alpha=args.lora_alpha,
        lora_dropout=args.lora_dropout,
        bias="none",
        target_modules=target_modules,
        task_type=TaskType.CAUSAL_LM,
        inference_mode=False
    )
    use_bfloat16 = torch.cuda.is_bf16_supported()  # check whether the device supports bf16
    # Training arguments
    train_args = TrainingArguments(
        output_dir=args.output_dir,
        per_device_train_batch_size=args.per_device_train_batch_size,
        gradient_accumulation_steps=args.gradient_accumulation_steps,
        logging_steps=args.logging_steps,
        num_train_epochs=args.num_train_epochs,
        save_steps=args.save_steps,
        learning_rate=args.learning_rate,
        save_on_each_node=True,
        gradient_checkpointing=args.gradient_checkpointing,
        report_to=args.report_to,
        seed=args.seed,
        optim=args.optim,
        local_rank=args.local_rank if args.distributed else -1,
        ddp_find_unused_parameters=False,  # optimization for DDP parameter checks
        fp16=args.fp16,
        bf16=not args.fp16 and use_bfloat16,
    )
    # Apply the PEFT configuration to the model
    model = get_peft_model(model.module if isinstance(model, DDP) else model, config)  # make sure the underlying model is passed
    print("model:", model)
    model.print_trainable_parameters()
    ### Experiment tracking with SwanLab
    swanlab_config = {
        "lora_rank": args.lora_rank,
        "lora_alpha": args.lora_alpha,
        "lora_dropout": args.lora_dropout,
        "dataset": "single-data-3w"
    }
    swanlab_callback = SwanLabCallback(
        project="deepseek-finetune",
        experiment_name="deepseek-llm-7b-chat-lora",
        description="DeepSeek has many models; V2 is too large, so llm-7b-chat is used here, hoping to make the answers more human.",
        workspace=None,
        config=swanlab_config,
    )
    trainer = Trainer(
        model=model,
        args=train_args,
        train_dataset=train_dataset,
        data_collator=data_collator,
        callbacks=[swanlab_callback],
    )
    return trainer

# Data preprocessing
def process_data(data: dict, tokenizer, max_seq_length):
    conversation = data["conversation"]
    input_ids, attention_mask, labels = [], [], []
    for i, conv in enumerate(conversation):
        human_text = conv["human"].strip()
        assistant_text = conv["assistant"].strip()
        input_text = "Human:" + human_text + "\n\nAssistant:"
        input_tokenizer = tokenizer(
            input_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )
        output_tokenizer = tokenizer(
            assistant_text,
            add_special_tokens=False,
            truncation=True,
            padding=False,
            return_tensors=None,
        )
        input_ids += (
            input_tokenizer["input_ids"] + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
        )
        attention_mask += input_tokenizer["attention_mask"] + output_tokenizer["attention_mask"] + [1]
        # Mask the prompt tokens with -100 so the loss only covers the assistant reply
        labels += (
            [-100] * len(input_tokenizer["input_ids"]) + output_tokenizer["input_ids"] + [tokenizer.eos_token_id]
        )
    if len(input_ids) > max_seq_length:  # truncate to the maximum sequence length
        input_ids = input_ids[:max_seq_length]
        attention_mask = attention_mask[:max_seq_length]
        labels = labels[:max_seq_length]
    return {"input_ids": input_ids, "attention_mask": attention_mask, "labels": labels}

# Training entry point
def main():
    args = configuration_parameter()
    print("***************** Loading tokenizer *************************")
    # Load the tokenizer
    model_path = args.model_name_or_path
    tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True, use_fast=False)
    print("***************** Processing data *************************")
    # Load and preprocess the training data
    data = pd.read_json(args.train_file, lines=True)
    train_ds = Dataset.from_pandas(data)
    train_dataset = train_ds.map(process_data,
                                 fn_kwargs={"tokenizer": tokenizer, "max_seq_length": args.max_seq_length},
                                 remove_columns=train_ds.column_names)
    data_collator = DataCollatorForSeq2Seq(tokenizer=tokenizer, padding=True, return_tensors="pt")
    print(train_dataset, data_collator)
    # Build the trainer and start training
    print("***************** Training *************************")
    trainer = load_model(args, train_dataset, data_collator)
    trainer.train()
    # Save the final adapter
    final_save_path = join(args.output_dir)
    trainer.save_model(final_save_path)

if __name__ == "__main__":
    main()
Run the following command:
python ./finetune.py
As shown below:
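By default the script runs in single-process mode. If you want to try its distributed branch (--distributed) on a multi-GPU Linux machine, it has to be launched through a distributed launcher such as torchrun so that LOCAL_RANK gets set; a rough, untested sketch for two GPUs would be:
torchrun --nproc_per_node=2 ./finetune.py --distributed True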
At this point swanlab asks you to choose an option: 1 creates a new account, 2 uses an existing account, and 3 does not save the results. Before choosing, register an account on the SwanLab website and note down your API Key, as shown below:
Then choose 1 here, as shown below:
Then enter the API Key. Note that the input is not echoed to the screen, as shown below:
A training project is then created automatically on SwanLab, as shown below:
After about 8 hours of running, the training finished, as shown below:
The model files produced by training are shown below:
3. Merge the model
Create a merge_model.py file in the deepseek-finetune-lora directory with the following code:
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
import os
import shutil

# Make sure none of the original model's auxiliary files are missing from the merge path
def copy_files_not_in_B(A_path, B_path):
    if not os.path.exists(A_path):
        raise FileNotFoundError(f"The directory {A_path} does not exist.")
    if not os.path.exists(B_path):
        os.makedirs(B_path)
    # Collect all non-weight files under path A
    files_in_A = os.listdir(A_path)
    files_in_A = set([file for file in files_in_A if not (".bin" in file or "safetensors" in file)])
    files_in_B = set(os.listdir(B_path))
    # Files that exist in A but not in B
    files_to_copy = files_in_A - files_in_B
    # Copy each file or directory into B
    for file in files_to_copy:
        src_path = os.path.join(A_path, file)
        dst_path = os.path.join(B_path, file)
        if os.path.isdir(src_path):
            # Copy the directory and its contents
            shutil.copytree(src_path, dst_path)
        else:
            # Copy a single file
            shutil.copy2(src_path, dst_path)

def merge_lora_to_base_model():
    model_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat'  # path of the original (base) model
    adapter_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat-lora/model'  # path where the fine-tuned adapter was saved
    save_path = 'output/merge-model'
    # Create the output directory if it does not exist
    if not os.path.exists(save_path):
        os.makedirs(save_path)
    tokenizer = AutoTokenizer.from_pretrained(model_name_or_path, trust_remote_code=True)
    model = AutoModelForCausalLM.from_pretrained(
        model_name_or_path,
        trust_remote_code=True,
        low_cpu_mem_usage=True,
        torch_dtype=torch.float16,
        device_map="auto"
    )
    # Load the saved adapter
    model = PeftModel.from_pretrained(model, adapter_name_or_path, device_map="auto", trust_remote_code=True)
    # Merge the adapter into the base model
    merged_model = model.merge_and_unload()  # PEFT method that folds the adapter weights into the base model
    # Save the merged model
    tokenizer.save_pretrained(save_path)
    merged_model.save_pretrained(save_path, safe_serialization=False)
    copy_files_not_in_B(model_name_or_path, save_path)
    print(f"The merged model has been saved to: {save_path}")

if __name__ == '__main__':
    merge_lora_to_base_model()
Note: three paths need to be configured:
model_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat'  # path of the original model, i.e. where it was downloaded
adapter_name_or_path = 'D:/deepseek/deepseek-llm-7b-chat-lora/model'  # path where the fine-tuned adapter was saved, i.e. the model produced by training
save_path = 'output/merge-model'  # path where the merged model will be saved
Then start the merge:
python ./merge_model.py
The result looks like this:
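Before converting to gguf, you can optionally sanity-check the merged model directly with transformers. This is a minimal sketch of my own, not part of the original scripts; it reuses the Human/Assistant prompt template the training data was built with, and the sample question is only an illustration:
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

merged_path = 'output/merge-model'
tokenizer = AutoTokenizer.from_pretrained(merged_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(merged_path, torch_dtype=torch.float16,
                                             device_map="auto", trust_remote_code=True)
prompt = "Human:最近压力好大,晚上总是睡不着,怎么办?\n\nAssistant:"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=256)
# print only the newly generated reply
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))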
4. Convert the model to gguf and deploy it to ollama
For the detailed steps, see: Converting a downloaded .bin model to gguf format and importing it into ollama
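The linked post covers the details. Very roughly, and depending on your llama.cpp version (the converter script name and flags have changed over time, so treat this as an approximation rather than exact commands), the flow is to convert the merged HF model to gguf and then create an ollama model from it:
python convert_hf_to_gguf.py output/merge-model --outfile deepseek-7b-chat-lora.gguf --outtype f16
# Modelfile contents: FROM ./deepseek-7b-chat-lora.gguf
ollama create deepseek-7b-chat-lora -f Modelfile
ollama run deepseek-7b-chat-lora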
The final result looks like this: