For installing the Postgres database and the pgvector extension, see: here

The test document is a short English story in PDF format; the download link is in the resource links at the top of the article.

First, define a .env file that stores the Qwen model's API key, base URL, and model name. This walkthrough uses qwen-plus:

DASHSCOPE_API_KEY=your-key
BASE_URL=https://dashscope.aliyuncs.com/compatible-mode/v1
MODEL=qwen-plus
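
The notebook later reads these values through load_dotenv() and os.getenv. As a minimal sketch, the settings could be checked up front so that a missing key fails early instead of surfacing as an authentication error later. The helper name read_settings is hypothetical and not part of the original notebook.

```python
import os
from typing import Dict, Mapping, Optional

# The three variable names defined in the .env file above.
REQUIRED_KEYS = ("DASHSCOPE_API_KEY", "BASE_URL", "MODEL")

def read_settings(env: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Return the required settings, raising ValueError if any is missing or empty.

    Hypothetical helper: reads from os.environ unless a mapping is passed in.
    """
    source = os.environ if env is None else env
    missing = [key for key in REQUIRED_KEYS if not source.get(key)]
    if missing:
        raise ValueError("missing settings: " + ", ".join(missing))
    return {key: source[key] for key in REQUIRED_KEYS}
```

In the notebook this would be called as read_settings() right after load_dotenv().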

Open a Jupyter notebook and start with the imports. psycopg2 is used to connect to the database. The versions used are:

psycopg2: 2.9.11
langchain: 1.0.5

import os
import json
import psycopg2
from langchain_community.document_loaders.pdf import PyPDFLoader
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI
from langchain_community.embeddings import DashScopeEmbeddings
from dotenv import load_dotenv

load_dotenv()

Set the document path, the database connection string, and the vector dimension:

PDF_PATH = "./testrag.pdf"
PG_DSN = "dbname=postgres user=postgres password=admin host=localhost port=5432"
VECTOR_DIM = 1024

Load the test PDF and split it into chunks:

loader = PyPDFLoader(PDF_PATH)
docs = loader.load()
splitter = RecursiveCharacterTextSplitter(chunk_size=200, chunk_overlap=30)
chunks = splitter.split_documents(docs)
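
To make chunk_size and chunk_overlap concrete, here is a simplified fixed-window sketch. It is not the algorithm RecursiveCharacterTextSplitter uses (that splitter prefers to break on separators such as paragraphs and spaces); it only illustrates how consecutive chunks share chunk_overlap characters. The function name sliding_chunks is hypothetical.

```python
from typing import List

def sliding_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 30) -> List[str]:
    """Cut text into fixed-size windows where neighbours share chunk_overlap characters."""
    if chunk_size <= 0 or not 0 <= chunk_overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= chunk_overlap < chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window already reaches the end of the text.
        if start + chunk_size >= len(text):
            break
    return chunks
```

With chunk_size=200 and chunk_overlap=30, each window in this sketch starts 170 characters after the previous one.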

Define a helper function to embed the text chunks:

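# Create the embedding client once and reuse it for every call.
embedding_client = DashScopeEmbeddings(
    model="text-embedding-v4",
    dashscope_api_key=os.getenv("DASHSCOPE_API_KEY")
)

def get_embedding(texts):
    """Embed a list of strings and return one vector per input string."""
    # embed_documents expects a list, so wrap a single string.
    if isinstance(texts, str):
        texts = [texts]
    return embedding_client.embed_documents(texts)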
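
Because the table column is declared with a fixed dimension (VECTOR_DIM = 1024 above), a vector of any other length is rejected at insert time. A minimal sketch of a check that could run before writing to the database follows; the helper name check_dimension is hypothetical.

```python
from typing import Sequence

def check_dimension(vector: Sequence[float], expected_dim: int = 1024) -> Sequence[float]:
    """Return the vector unchanged, or raise ValueError if its length is not expected_dim."""
    if len(vector) != expected_dim:
        raise ValueError(f"expected {expected_dim} dimensions, got {len(vector)}")
    return vector
```

In the notebook this could wrap a single embedding, for example check_dimension(vec, VECTOR_DIM).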

Connect to the database and create a table to store the vectors:

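conn = psycopg2.connect(PG_DSN)
cur = conn.cursor()
# Enable pgvector in this database (does nothing if it is already enabled).
cur.execute("CREATE EXTENSION IF NOT EXISTS vector")
# DDL cannot take bound parameters, so the dimension is formatted in.
# VECTOR_DIM is a trusted constant, not user input.
cur.execute("""
    CREATE TABLE IF NOT EXISTS doc_chunks (
        id SERIAL PRIMARY KEY,
        content TEXT,
        metadata TEXT,
        embedding vector(%d)
    )
""" % VECTOR_DIM)
conn.commit()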

Embed each chunk and store it in the database:

for chunk in chunks:
    # get_embedding returns a list of vectors; unwrap the single vector for this chunk.
    emb = get_embedding([chunk.page_content])
    if hasattr(emb, "tolist"):
        emb = emb.tolist()
    if isinstance(emb, list) and len(emb) == 1 and isinstance(emb[0], (list, tuple)):
        emb = emb[0]
    cur.execute(
        "INSERT INTO doc_chunks (content, metadata, embedding) VALUES (%s, %s, %s)",
        (chunk.page_content, json.dumps(chunk.metadata), emb)
    )
conn.commit()
print(f"Inserted {len(chunks)} chunks with embeddings")
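
pgvector also accepts a vector written as text in the form [x1,x2,...]. As an alternative to passing a Python list, the embedding could be serialized explicitly, which makes the value sent to the database easy to inspect. This is a minimal sketch; the helper name to_pgvector_literal is hypothetical and not part of the original notebook.

```python
import math
from typing import Iterable

def to_pgvector_literal(vector: Iterable[float]) -> str:
    """Serialize numbers into pgvector's text format, e.g. [1.0,2.5,-0.25]."""
    values = [float(x) for x in vector]
    # Reject NaN and infinity here so the problem is reported before the insert.
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinite values")
    return "[" + ",".join(repr(v) for v in values) + "]"
```

The resulting string would be passed as the parameter with an explicit cast, for example VALUES (%s, %s, %s::vector).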

Result of the insert in the database:

Query:

query_text = "had she drag him down?"
# get_embedding returns a list of vectors; index 0 is the query vector.
query_emb = get_embedding([query_text])
print(query_emb[0])
# <-> is pgvector's Euclidean (L2) distance operator; a smaller distance means a closer match.
cur.execute(
    "SELECT content, metadata, embedding <-> %s::vector AS distance FROM doc_chunks ORDER BY distance ASC LIMIT 3",
    (query_emb[0],)
)
#cur.execute("SELECT content, metadata from doc_chunks order by embedding <-> %s::vector", (query_emb[0],))
results = cur.fetchall()
print("Most relevant content:")
print(results)
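
The SQL above orders rows by the <-> operator, which in pgvector is Euclidean (L2) distance, and keeps the three closest. The same ranking can be sketched in plain Python to show what the database computes. This is an illustration on in-memory data, not a replacement for the query; the names l2_distance and top_k are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

def l2_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two vectors of equal length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def top_k(query: Sequence[float],
          rows: Sequence[Tuple[str, Sequence[float]]],
          k: int = 3) -> List[Tuple[str, float]]:
    """Return the k (content, distance) pairs closest to the query, nearest first."""
    scored = [(content, l2_distance(query, emb)) for content, emb in rows]
    scored.sort(key=lambda item: item[1])  # ascending distance, like ORDER BY distance ASC
    return scored[:k]
```

For example, with rows [("a", [0, 0]), ("b", [3, 4]), ("c", [1, 0])] and query [0, 0], top_k(..., k=2) keeps "a" and "c" and drops "b", whose distance is 5.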

Query results:
