Fine-Tuning DeepSeek-R1 & Building a RAG System (Part 3)

Using LangChain to vectorize data and store it in Milvus

若水寒刃


若水寒刃 · Published 2025-04-12 16:16:04


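Fine-tuning DeepSeek-R1 and building a RAG system (building the RAG system)

RAG (Retrieval-Augmented Generation)

RAG lets a large model retrieve up-to-date, non-public data at query time. For private data, or data from a specialized domain, pairing the model with a RAG lookup against a knowledge base is a good option.

Before RAG retrieval can work, the following preparation is needed.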

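Data preparation

Document collection (gather the raw documents) -> document formatting (strip formatting, images, and other non-text content) -> document chunking -> metadata attachment (add the source, creation time, and other metadata)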
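
The chunking and metadata steps above can be sketched in plain Python. This is only an illustration and not the LangChain splitter used later in this post: the function name `chunk_with_metadata`, the metadata keys `source`, `created_at`, and `start`, and the fixed character windows are all assumptions made for the example.

```python
from datetime import datetime, timezone


def chunk_with_metadata(text, source, chunk_size=300, overlap=50):
    """Split text into overlapping character windows and tag each chunk."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    created_at = datetime.now(timezone.utc).isoformat()
    step = chunk_size - overlap  # how far the window moves each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append({
            "text": text[start:start + chunk_size],
            "metadata": {"source": source, "created_at": created_at, "start": start},
        })
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks


sample = "".join(chr(ord("a") + i % 26) for i in range(700))
chunks = chunk_with_metadata(sample, source="data.txt")
```

Each chunk shares its first 50 characters with the tail of the previous one, so a sentence cut at a chunk boundary still appears whole in at least one chunk.
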
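Vectorization

Choose an embedding model -> generate the vectors. The embedding model converts text into vector representations so that semantic retrieval is efficient. Different large models, depending on their purpose and architecture, call for somewhat different embedding models. General-purpose examples are OpenAI text-embedding-ada-002, BGE, and Sentence-BERT; there are also domain-specific models.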
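
To show why vectors make semantic retrieval possible, here is a small sketch that ranks texts by cosine similarity. The three-dimensional vectors are made up by hand for the example and are not the output of any real embedding model; `cosine_similarity` and `toy_vectors` are illustrative names.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


# Hand-made vectors standing in for real embeddings (assumption).
toy_vectors = {
    "refund policy": [0.9, 0.1, 0.0],
    "how to get my money back": [0.8, 0.2, 0.1],
    "OCR for images": [0.0, 0.2, 0.9],
}
query_vector = [0.85, 0.15, 0.05]  # pretend embedding of a refund question

ranked = sorted(
    toy_vectors,
    key=lambda name: cosine_similarity(query_vector, toy_vectors[name]),
    reverse=True,
)
print(ranked)
```

The two refund-related texts point in nearly the same direction as the query, so they rank ahead of the unrelated OCR text even though they share no keywords with each other.
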
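Index construction

Choose a vector store (for example Milvus, Faiss, or Chroma) -> load the data into it

I chose Milvus, which is open source.

Install the vector database (the one-step Docker install is the most convenient; on Windows, Docker Desktop is recommended)

Milvus can be deployed in three ways:

Milvus Lite is the lightweight version, suited to local testing.

Milvus Standalone is the single-machine version. Deploy it with Docker where possible; on Windows, Docker Desktop is recommended.

Milvus Distributed is the distributed version, suited to large-scale vector search.

I installed Milvus Standalone on a Linux machine: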
```
curl -sfL https://raw.githubusercontent.com/milvus-io/milvus/master/scripts/standalone_embed.sh -o standalone_embed.sh

bash standalone_embed.sh start
```
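Once the installation finishes, Milvus comes with a built-in web UI that you can open directly at http://xxxx:9091/webui

Query processing stage

User query -> query preprocessing -> query vectorization -> vector similarity search / hybrid search -> result post-processing -> top-k results (the k items most similar to the query)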
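
The similarity search and top-k step can be illustrated with a brute-force version in plain Python. This is a conceptual sketch only: Milvus uses an index instead of scanning every vector, and `l2_distance`, `top_k`, and the two-dimensional sample vectors are assumptions made for the example. L2 distance is used because the code later in this post also sets `metric_type` to `L2`.

```python
import heapq
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def top_k(query_vec, items, k=3):
    """items is a list of (text, vector); return k (distance, text) pairs, nearest first."""
    scored = ((l2_distance(query_vec, vec), text) for text, vec in items)
    return heapq.nsmallest(k, scored)


items = [
    ("a", [0.0, 0.0]),
    ("b", [1.0, 0.0]),
    ("c", [5.0, 5.0]),
    ("d", [0.0, 2.0]),
]
nearest = top_k([0.0, 0.0], items, k=2)
```

With an L2 metric a smaller distance means a closer match, which is why the sketch uses `nsmallest` instead of sorting in descending order.
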

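The data prepared for the database is shown below.

Contents of data.txt (a Chinese-language sample document, reproduced as-is):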

# 基于RAG的智能客服系统技术白皮书  
## 一、摘要  
本文提出一种融合向量检索与知识图谱的RAG架构，在客服场景中实现 **92.7%** 的意图识别准确率，响应速度提升至 **3.2秒/次**，有效降低人工介入率 **40%**。核心技术包括：  
1. 混合检索模块（向量检索+知识图谱）  
2. 动态语义分块策略  
3. 多模态内容解析（图片OCR、表格结构化）  


## 二、系统架构设计  
### 2.1 混合检索模块  
系统采用 **768维BERT向量** 进行语义向量化，结合Neo4j知识图谱实现逻辑推理。在10万条客服对话数据测试中：  
- **混合检索F1值**：0.91（较单一向量检索提升15%）  
- **不同检索方式性能对比**：  

| 指标         | 向量检索 | 知识图谱 | 混合检索 |  
|--------------|----------|----------|----------|  
| 准确率       | 82.3%    | 85.6%    | **92.7%**|  
| 响应时间     | 1.8s     | 2.5s     | 3.2s     |  
| 召回率       | 78.5%    | 83.2%    | **91.0%**|  

### 2.2 多模态支持  
- 集成Tesseract 5.3进行图片OCR解析，成功解析 **98.6%** 的图片内容  
- 处理流程：文档解析 → OCR识别 → 向量生成（见下图1）  


## 三、关键技术实现  
### 3.1 动态分块策略  
采用SentencePiece语义切分算法，将文档分割为 **平均300字/块**，相邻块保留 **50字重叠**。金融领域实测效果：  
- 检索相关性提升 **22%**  
- 错误率下降 **18%**  

**代码示例（Python）**：  
‍```python  
def semantic_chunking(text):  
    # 基于BERT的句间相似度计算  
    sentences = split_sentences(text)  
    chunks = []  
    for i in range(0, len(sentences), 5):  
        chunk = ' '.join(sentences[i:i+5])  
        chunks.append(chunk)  
    return chunks

Connect to Milvus and store the vector data

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.embeddings import ModelScopeEmbeddings
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility
# Configuration
MILVUS_HOST = "47.113.203.110"
MILVUS_PORT = "19530"
COLLECTION_NAME = "rag_knowledge_custom"

# Connect to Milvus and (re)create the collection
def define_collection():
    # Connect to Milvus
    connections.connect(
        alias="default",
        host=MILVUS_HOST,
        port=MILVUS_PORT
    )

    # Drop the collection if it already exists
    if utility.has_collection(COLLECTION_NAME):
        utility.drop_collection(COLLECTION_NAME)

    # Define the collection fields
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000),
        # Embedding dimension: iic/nlp_gte_sentence-embedding_chinese-base outputs 768-dim vectors
        # (thenlper/gte-large-zh would need dim=1024 instead)
        FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=768)
    ]
    schema = CollectionSchema(fields=fields,description="创建rag知识库")
    collection = Collection(COLLECTION_NAME, schema=schema)
    return collection

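# Load and split the source file (Markdown content stored as data.txt)
def dreal_md_file():
    # 1. Load the document
    file_path = "/mnt/workspace/pythonCode/data.txt"
    # Alternative: load it as Markdown with UnstructuredMarkdownLoader
    # loader = UnstructuredMarkdownLoader(file_path)
    # Read as UTF-8 explicitly so the Chinese text also loads on Windows
    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()

    # 2. Split the document into chunks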
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=50,
        separators=["\n\n", "\n", " ", ""]
    )
    docs = text_splitter.split_documents(documents)
    return docs

# Insert the data into Milvus
def insert_milvus():
    split_docs = dreal_md_file()
    # Connect and get the freshly created collection
    collection = define_collection()
    # Embedding model: iic/nlp_gte_sentence-embedding_chinese-base
    embeddings = ModelScopeEmbeddings(model_id="iic/nlp_gte_sentence-embedding_chinese-base")
    texts = [doc.page_content for doc in split_docs]
    print(texts)
    embeds = embeddings.embed_documents(texts)
    # Build the column-ordered data set (the auto-generated id column is omitted)
    data = [
        texts,
        embeds
    ]
    # Insert the data into Milvus
    collection.insert(data)
    # Flush so the inserted data is persisted
    collection.flush()

    # Build the index
    index_params = {
        "metric_type": "L2",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 128}
    }
    collection.create_index(field_name="embeddings", index_params=index_params)
    collection.load()
    return collection

if __name__ == '__main__':
    print("Saving to the vector knowledge base: start")
    insert_milvus()
    print("Saving to the vector knowledge base: done")
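
One thing worth checking before `collection.insert(data)` is that every embedding has the same length as the `dim` declared in the schema, and that every chunk fits in the `text` field. The helper below is a hypothetical sketch (`validate_rows` is not part of pymilvus); it counts the text length in UTF-8 bytes, which is a conservative assumption for Chinese text.

```python
def validate_rows(texts, embeds, dim=768, max_length=5000):
    """Raise ValueError if the rows would not fit the collection schema."""
    if len(texts) != len(embeds):
        raise ValueError(f"{len(texts)} texts but {len(embeds)} embeddings")
    for i, (text, vec) in enumerate(zip(texts, embeds)):
        if len(vec) != dim:
            raise ValueError(f"row {i}: embedding has {len(vec)} dims, schema expects {dim}")
        # Conservative assumption: measure the VARCHAR length in UTF-8 bytes
        if len(text.encode("utf-8")) > max_length:
            raise ValueError(f"row {i}: text exceeds {max_length} bytes")
    return len(texts)


# Example with dummy vectors; real ones would come from embed_documents()
row_count = validate_rows(["多模态支持", "混合检索模块"], [[0.0] * 768, [0.1] * 768])
```

A check like this turns a dimension mismatch (for example after swapping to a 1024-dim embedding model without updating the schema) into a clear error message before anything is sent to Milvus.
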

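After the script runs, the inserted data shows up in the web UI.

The embedding model used here is iic/nlp_gte_sentence-embedding_chinese-base, a model from China.

If you use HuggingFaceEmbeddings instead (which is what I started with locally), quite a few packages have to be installed: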

pip install unstructured -i https://mirrors.aliyun.com/pypi/simple/
pip install markdown -i https://mirrors.aliyun.com/pypi/simple/
pip install datasets==3.2.0 -i https://mirrors.aliyun.com/pypi/simple/
pip install sortedcontainers -i https://mirrors.aliyun.com/pypi/simple/
pip install simplejson -i https://mirrors.aliyun.com/pypi/simple/
pip install sentence-transformers -i https://mirrors.aliyun.com/pypi/simple/
pip install modelscope -i https://mirrors.aliyun.com/pypi/simple/
pip install addict -i https://mirrors.aliyun.com/pypi/simple/
pip install datasets -i https://mirrors.aliyun.com/pypi/simple/
pip install accelerate -i https://mirrors.aliyun.com/pypi/simple/

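You also need to download NLTK's nltk_data package.
After downloading it, rename the packages folder to nltk_data and put it under your Python Lib directory, for example:
C:\Users\XXX\AppData\Local\Programs\Python\Python310\Lib
The complete nltk_data.zip has been uploaded and is shown at the top of the article.

The package versions in my local Python environment are as follows: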

addict                           2.4.0
aiofiles                         24.1.0
aiohappyeyeballs                 2.6.1
aiohttp                          3.11.16
aiosignal                        1.3.2
annotated-types                  0.7.0
anyio                            4.9.0
async-timeout                    4.0.3
attrs                            25.3.0
backoff                          2.2.1
beautifulsoup4                   4.13.3
certifi                          2025.1.31
cffi                             1.17.1
chardet                          5.2.0
charset-normalizer               3.4.1
click                            8.1.8
colorama                         0.4.6
cryptography                     44.0.2
dataclasses-json                 0.6.7
datasets                         3.2.0
dill                             0.3.8
distro                           1.9.0
emoji                            2.14.1
eval_type_backport               0.2.2
exceptiongroup                   1.2.2
filelock                         3.18.0
filetype                         1.2.0
frozenlist                       1.5.0
fsspec                           2024.9.0
greenlet                         3.1.1
grpcio                           1.67.1
h11                              0.14.0
html5lib                         1.1
httpcore                         1.0.7
httpx                            0.28.1
httpx-sse                        0.4.0
huggingface-hub                  0.30.2
idna                             3.10
Jinja2                           3.1.6
jiter                            0.9.0
joblib                           1.4.2
jsonpatch                        1.33
jsonpointer                      3.0.0
langchain                        0.3.23
langchain-community              0.3.21
langchain-core                   0.3.51
langchain-modelscope-integration 0.1.0
langchain-openai                 0.2.14
langchain-text-splitters         0.3.8
langdetect                       1.0.9
langsmith                        0.3.27
lxml                             5.3.2
Markdown                         3.7
MarkupSafe                       3.0.2
marshmallow                      3.26.1
modelscope                       1.24.1
mpmath                           1.3.0
multidict                        6.2.0
multiprocess                     0.70.16
mypy-extensions                  1.0.0
nest-asyncio                     1.6.0
networkx                         3.4.2
nltk                             3.9.1
numpy                            2.2.4
olefile                          0.47
openai                           1.72.0
orjson                           3.10.16
packaging                        24.2
pandas                           2.2.3
pillow                           11.1.0
pip                              25.0.1
propcache                        0.3.1
protobuf                         6.30.2
psutil                           7.0.0
pyarrow                          19.0.1
pycparser                        2.22
pydantic                         2.11.3
pydantic_core                    2.33.1
pydantic-settings                2.8.1
pymilvus                         2.5.6
pypdf                            5.4.0
python-dateutil                  2.9.0.post0
python-dotenv                    1.1.0
python-iso639                    2025.2.18
python-magic                     0.4.27
python-oxmsg                     0.0.2
pytz                             2025.2
PyYAML                           6.0.2
RapidFuzz                        3.13.0
regex                            2024.11.6
requests                         2.32.3
requests-toolbelt                1.0.0
safetensors                      0.5.3
scikit-learn                     1.6.1
scipy                            1.15.2
sentence-transformers            4.0.2
setuptools                       78.1.0
simplejson                       3.20.1
six                              1.17.0
sniffio                          1.3.1
sortedcontainers                 2.4.0
soupsieve                        2.6
SQLAlchemy                       2.0.40
sympy                            1.13.1
tenacity                         9.1.2
threadpoolctl                    3.6.0
tiktoken                         0.9.0
tokenizers                       0.21.1
torch                            2.6.0
tqdm                             4.67.1
transformers                     4.51.1
typing_extensions                4.13.1
typing-inspect                   0.9.0
typing-inspection                0.4.0
tzdata                           2025.2
ujson                            5.10.0
unstructured                     0.17.2
unstructured-client              0.32.3
urllib3                          2.3.0
webencodings                     0.5.1
wrapt                            1.17.2
xxhash                           3.5.0
yarl                             1.19.0
zstandard                        0.23.0

The complete code for building the RAG system is as follows:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.document_loaders import TextLoader
from langchain_community.document_loaders import UnstructuredMarkdownLoader
from langchain_community.embeddings import ModelScopeEmbeddings
from pymilvus import connections, FieldSchema, CollectionSchema, DataType, Collection, utility


# Configuration
MILVUS_HOST = "47.113.203.110"
MILVUS_PORT = "19530"
COLLECTION_NAME = "rag_knowledge_custom"
embeddings = ModelScopeEmbeddings(model_id="iic/nlp_gte_sentence-embedding_chinese-base")


# Connect to Milvus and (re)create the collection
def define_collection():
    # Connect to Milvus
    connections.connect(
        alias="default",
        host=MILVUS_HOST,
        port=MILVUS_PORT
    )

    # Drop the collection if it already exists
    if utility.has_collection(COLLECTION_NAME):
        utility.drop_collection(COLLECTION_NAME)

    # Define the collection fields
    fields = [
        FieldSchema(name="id", dtype=DataType.INT64, is_primary=True, auto_id=True),
        FieldSchema(name="text", dtype=DataType.VARCHAR, max_length=5000),
        # Embedding dimension: iic/nlp_gte_sentence-embedding_chinese-base outputs 768-dim vectors
        # (thenlper/gte-large-zh would need dim=1024 instead)
        FieldSchema(name="embeddings", dtype=DataType.FLOAT_VECTOR, dim=768)
    ]
    schema = CollectionSchema(fields=fields,description="创建rag知识库")
    collection = Collection(COLLECTION_NAME, schema=schema)
    return collection

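# Load and split the source file (Markdown content stored as data.txt)
def dreal_md_file():
    # 1. Load the document
    file_path = "/mnt/workspace/pythonCode/data.txt"
    # Read as UTF-8 explicitly so the Chinese text also loads on Windows
    loader = TextLoader(file_path, encoding="utf-8")
    documents = loader.load()

    # 2. Split the document into chunks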
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=300,
        chunk_overlap=50,
        separators=["\n\n", "\n", " ", ""]
    )
    docs = text_splitter.split_documents(documents)
    return docs

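# Insert the data into Milvus
def insert_milvus():
    split_docs = dreal_md_file()
    # Connect and get the freshly created collection
    collection = define_collection()
    # Uses the module-level embedding model iic/nlp_gte_sentence-embedding_chinese-base, developed by Alibaba DAMO Academy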

    texts = [doc.page_content for doc in split_docs]
    print(texts)
    embeds = embeddings.embed_documents(texts)
    data = [
        texts,
        embeds
    ]
    # Insert the data into Milvus
    collection.insert(data)
    # Flush so the inserted data is persisted
    collection.flush()

    # Build the index
    index_params = {
        "metric_type": "L2",
        "index_type": "IVF_FLAT",
        "params": {"nlist": 128}
    }
    collection.create_index(field_name="embeddings", index_params=index_params)
    collection.load()
    return collection

from modelscope import AutoModelForCausalLM, AutoTokenizer

model_name = "deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name,device_map="auto")

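# Retrieve the top_k most similar chunks (3 by default) and answer from them
def rag_answer(collection,question, top_k=3):
    # Embed the question
    question_vector = embeddings.embed_query(question)
    print("\nQuestion vector: "+str(question_vector))
    # Search Milvus for similar text
    search_params = {"metric_type": "L2", "params": {"nprobe": 10}}
    # nprobe is the number of IVF clusters probed per query; limit is the number of hits returned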
    results = collection.search(
        data=[question_vector],
        anns_field="embeddings",  # the vector field defined in the schema above
        param=search_params,
        limit=top_k,
        output_fields=["text"]
    )
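    print("\nSearch results: "+str(results))
    # Assemble the context by hand and build the prompt, combining the retrieved chunks with the user's question.
    # The idea is to let the generation model focus on the most relevant pieces of knowledge.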
    context = "\n".join([hit.entity.get("text") for hit in results[0]])
    prompt = f"基于以下上下文：\n{context}\n\n问题：{question}\n回答："
    
    # Generate the answer
    inputs = tokenizer(prompt, return_tensors="pt", max_length=1024, truncation=True).to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=200)
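    # Decode only the newly generated tokens so the answer does not repeat the prompt
    answer = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)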
    print("\nAnswer:\n"+str(answer))
    return answer

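if __name__ == '__main__':
    # print("Saving to the vector knowledge base: start")
    collection = insert_milvus()
    # print("Saving to the vector knowledge base: done")
    rag_answer(collection,"多模态支持")  # the query means "multimodal support"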
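
One possible refinement of the prompt-building step: the tokenizer call above truncates at 1024 tokens, and truncation normally cuts from the end, so a very long context could push the question itself out of the prompt. The sketch below caps the context before the prompt is built and drops duplicate chunks. `build_prompt` and `max_context_chars` are hypothetical names, the character budget is an arbitrary assumption, and the hits are passed as plain `(distance, text)` pairs instead of Milvus result objects.

```python
def build_prompt(question, hits, max_context_chars=1500):
    """hits is a list of (distance, text); a smaller L2 distance means a closer match."""
    seen = set()
    parts = []
    used = 0
    for _, text in sorted(hits, key=lambda hit: hit[0]):
        text = text.strip()
        if not text or text in seen:
            continue  # skip empty and duplicate chunks
        if used + len(text) > max_context_chars:
            break  # stop before the context budget is exceeded
        seen.add(text)
        parts.append(text)
        used += len(text)
    context = "\n".join(parts)
    # Same prompt template as rag_answer() above
    return f"基于以下上下文：\n{context}\n\n问题：{question}\n回答："


hits = [(0.9, "B"), (0.1, "A"), (0.5, "A")]
prompt = build_prompt("q", hits)
```

Inside `rag_answer`, the pairs could be built from the search results, for example from each hit's distance and its `text` field, before calling a helper like this.
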