chromadb

ggaofeng · Published 2025-04-05 20:19:51

chromadb is a lightweight vector database that can be used together with RAG frameworks such as llama-index. Under the hood it is built on SQLite.

Getting Started - Chroma Docs

1. Install

$ pip install chromadb

$ pip install chromadb-client   # in client/server mode, a machine that only acts as the client can install just this package

2. Client/server mode is available

$ chroma run --path ./db_path

Saving data to: ./db_path
Connect to Chroma at: http://localhost:8000
The file chroma.sqlite3 is created automatically under the ./db_path directory.

Connecting from the client:

import chromadb
client = chromadb.HttpClient(host='localhost', port=8000)

The client can also run in embedded mode, inside your own process:

import chromadb
chroma_client = chromadb.Client()

The following is the embedded client with persistence:

client = chromadb.PersistentClient(path="/path/to/save/to")
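The three ways of obtaining a client shown above can be wrapped in one small helper. This is only a sketch: the function name make_client and the mode strings "http", "memory" and "persistent" are made up for illustration. Only chromadb.HttpClient, chromadb.Client and chromadb.PersistentClient come from the library.

```python
def make_client(mode, host="localhost", port=8000, path="./db_path"):
    """Return a Chroma client for the given mode (hypothetical helper)."""
    modes = ("http", "memory", "persistent")
    if mode not in modes:
        raise ValueError(f"unknown mode {mode!r}, expected one of {modes}")

    # Imported lazily so that the argument check above still works on a
    # machine where chromadb is not installed.
    import chromadb

    if mode == "http":
        # Talks to a server started with `chroma run`
        return chromadb.HttpClient(host=host, port=port)
    if mode == "persistent":
        # Embedded, with data stored under `path`
        return chromadb.PersistentClient(path=path)
    # Embedded, kept in memory only
    return chromadb.Client()
```

A call such as make_client("persistent", path="/path/to/save/to") would then replace the direct constructor call.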

3. Create a collection

collection = client.get_or_create_collection(name="my_collection")
By default the all-MiniLM-L6-v2 embedding model is used. You can also supply your own embedding function:

from chromadb.utils import embedding_functions

# Local directory that holds the downloaded model
model_path = r'D:\PycharmProjects\example\models\bge-large-zh-v1.5'
sentence_transformer_ef = embedding_functions.SentenceTransformerEmbeddingFunction(model_name=model_path)

# create_collection raises an error if a collection with this name already exists on this client
collection = chroma_client.create_collection(name="my_collection", embedding_function=sentence_transformer_ef)

https://zhuanlan.zhihu.com/p/649413704

Note: when downloading the embedding model, the 1_Pooling directory must be downloaded too. Otherwise, loading the local model with SentenceTransformer fails with huggingface_hub.errors.HFValidationError: Repo id must be in the form 'repo_name' or 'namespace/repo
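A quick check before loading the model can catch the missing directory early. The sketch below is plain Python, not part of chromadb or sentence-transformers. The helper name missing_model_parts is hypothetical, and the only required entry it knows about is the 1_Pooling directory mentioned in the note above. Pass further names through the required parameter if your model needs them.

```python
from pathlib import Path


def missing_model_parts(model_dir, required=("1_Pooling",)):
    """Return the names from `required` that do not exist inside model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).exists()]


if __name__ == "__main__":
    import tempfile

    # Demonstration on an empty temporary directory instead of a real model
    with tempfile.TemporaryDirectory() as tmp:
        print(missing_model_parts(tmp))  # 1_Pooling is reported as missing
```

With a real model you would call missing_model_parts(model_path) and stop if the returned list is not empty.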

4. A few simple functions

collection.count()  # number of records in the collection
collection.get()    # the stored records (ids, documents, metadatas)

5. Add documents

collection.add(
    # "Beijing has many tourist attractions", "Xi'an has many universities"
    documents=["北京的旅游景点很多", "西安有很多大学"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
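The documents, metadatas and ids lists passed to add must all have the same length, one entry per document. When the documents come from a loop or a file, it can help to build the three lists in one place. The helper below is a sketch only: build_records, its id_prefix parameter and the sequential id scheme are assumptions made for illustration.

```python
def build_records(documents, source, id_prefix="id"):
    """Build equal-length ids/documents/metadatas lists for collection.add()."""
    documents = list(documents)
    return {
        # id1, id2, ... in the same order as the documents
        "ids": [f"{id_prefix}{i}" for i in range(1, len(documents) + 1)],
        "documents": documents,
        # Every document gets the same metadata in this sketch
        "metadatas": [{"source": source} for _ in documents],
    }


records = build_records(["北京的旅游景点很多", "西安有很多大学"], source="my_source")
print(records["ids"])
# With a collection created as in step 3:
# collection.add(**records)
```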

Update documents (upsert updates the record when the id already exists and inserts it otherwise):

collection.upsert(
    documents=["北京的旅游景点很多", "西安有很多大学"],
    metadatas=[{"source": "my_source"}, {"source": "my_source"}],
    ids=["id1", "id2"]
)
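Because upsert is keyed on the id, one way to make repeated imports idempotent is to derive each id from the document text, so the same text always lands on the same record. This is a sketch of that idea, not something the original post or Chroma prescribes. The stable_id helper, the "doc-" prefix and the choice of SHA-1 truncated to 16 hex characters are all assumptions.

```python
import hashlib


def stable_id(text, prefix="doc-"):
    """Derive a deterministic id from the document text."""
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    return prefix + digest[:16]


docs = ["北京的旅游景点很多", "西安有很多大学"]
ids = [stable_id(d) for d in docs]
print(ids)
# With a collection created as in step 3:
# collection.upsert(documents=docs, ids=ids)
```

The trade-off is that editing a document changes its id, so the old record has to be removed separately.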

6. Find the most similar documents

collection.query(query_texts=['西北'])  # "西北" = "the Northwest"
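The value returned by query is a dictionary whose entries are lists of lists, with one inner list per query text. The sketch below unpacks the hits for the first query text. The helper name first_query_hits is hypothetical, and sample is a hand-written stand-in with the same shape as a real result. Its distance values are invented.

```python
def first_query_hits(results):
    """Return (id, document, distance) tuples for the first query text."""
    ids = results["ids"][0]
    # Fields left out of `include` may be missing or None; pad with None
    docs = (results.get("documents") or [[None] * len(ids)])[0]
    dists = (results.get("distances") or [[None] * len(ids)])[0]
    return list(zip(ids, docs, dists))


# Stand-in for the return value of collection.query(...); distances are made up
sample = {
    "ids": [["id2", "id1"]],
    "documents": [["西安有很多大学", "北京的旅游景点很多"]],
    "distances": [[0.8, 1.1]],
}

for doc_id, doc, dist in first_query_hits(sample):
    print(doc_id, doc, dist)
```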

collection.query(query_texts=['hello'], include=['embeddings'])  # return the embeddings in the result

collection.query(query_texts=["美食"], n_results=2, where_document={"$contains": "西安"})  # set the number of results and filter by document content ("美食" = "food", "西安" = "Xi'an")
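To make the roles of n_results and where_document more concrete, here is a conceptual sketch in plain Python. It is not how Chroma is implemented: the function toy_query, the two-dimensional vectors, the squared Euclidean distance and the third document ("Xi'an's snacks are famous") are all invented for illustration. It keeps only the documents containing a substring, ranks them by distance to the query vector, and returns the closest n_results.

```python
def toy_query(records, query_vec, n_results=2, contains=None):
    """records is a list of (id, document, vector) tuples."""

    def sq_dist(a, b):
        # Squared Euclidean distance between two vectors
        return sum((x - y) ** 2 for x, y in zip(a, b))

    # Step 1: keep only documents that contain the substring, if one is given
    candidates = [r for r in records if contains is None or contains in r[1]]
    # Step 2: rank the remaining documents by distance to the query vector
    ranked = sorted(candidates, key=lambda r: sq_dist(r[2], query_vec))
    # Step 3: return the closest n_results
    return [(r[0], r[1]) for r in ranked[:n_results]]


# Made-up vectors standing in for real embeddings
records = [
    ("id1", "北京的旅游景点很多", [1.0, 0.0]),
    ("id2", "西安有很多大学", [0.0, 1.0]),
    ("id3", "西安的小吃很有名", [0.2, 0.9]),
]

print(toy_query(records, [0.1, 1.0], n_results=2, contains="西安"))
```

In this toy data id1 is dropped by the substring filter, and id2 ranks ahead of id3 because its vector is closer to the query vector.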
