Multimodal RAG in Practice: Building an Image-Text Retrieval-Augmented Generation System

少林码僧

Published 2026-05-08 08:05:38 · 210 views


Original technical walkthrough | Building a multimodal RAG system that supports images and text

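Abstract: Multimodal RAG extends retrieval-augmented generation to non-text modalities such as images and video, which opens up new application scenarios. This article covers the technical principles, architecture, and a hands-on implementation of multimodal RAG, to help developers build systems that support mixed image-text retrieval.

## 1. Multimodal RAG Overview

### 1.1 What Is Multimodal RAG?

Traditional RAG systems mainly handle text. Multimodal RAG processes and understands several modalities together:

- Text: documents, articles, instructions
- Images: pictures, charts, screenshots
- Video: tutorials, demos, surveillance footage
- Audio: speech, podcasts, meeting recordings

### 1.2 Application Scenarios

| Scenario | Description | Example |
|------|------|------|
| E-commerce search | Search products by image | Upload a photo to find similar products |
| Document understanding | Q&A over mixed text and images | Analyze the charts in a financial report |
| Tutoring | Problem walkthroughs | Photograph a problem, then search and explain it |
| Medical diagnosis | Imaging analysis | Combine medical records with imaging for diagnosis |
| Design assistance | Asset retrieval | Find design assets from a description |

## 2. Multimodal RAG Technical Architecture

### 2.1 System Architecture Diagram

```
┌─────────────────────────────────────────────────────────┐
│           Multimodal RAG System Architecture            │
├─────────────────────────────────────────────────────────┤
│                                                         │
│ ┌───────────────┐  ┌───────────────┐  ┌───────────────┐ │
│ │ Text Encoder  │  │ Image Encoder │  │ Fusion Layer  │ │
│ └───────┬───────┘  └───────┬───────┘  └───────┬───────┘ │
│         │                  │                  │         │
│         └──────────────────┼──────────────────┘         │
│                            ▼                            │
│               ┌─────────────────────────┐               │
│               │ Multimodal vector space │               │
│               │      Vector Store       │               │
│               └────────────┬────────────┘               │
│                            │                            │
│         ┌──────────────────┼──────────────────┐         │
│         ▼                  ▼                  ▼         │
│ ┌───────────────┐  ┌───────────────┐  ┌───────────────┐ │
│ │  Text query   │  │  Image query  │  │ Image + text  │ │
│ └───────┬───────┘  └───────┬───────┘  └───────┬───────┘ │
│         └──────────────────┼──────────────────┘         │
│                            ▼                            │
│               ┌─────────────────────────┐               │
│               │ Retrieval + generation  │               │
│               │     Retrieve + LLM      │               │
│               └─────────────────────────┘               │
│                                                         │
└─────────────────────────────────────────────────────────┘
```

### 2.2 Core Components

#### 2.2.1 Multimodal Embedding Model

The embedding model maps data from different modalities into one shared vector space:

```python
from typing import Optional

import torch
from PIL import Image
from sentence_transformers import SentenceTransformer


class MultimodalEmbedder:
    def __init__(self):
        # Use CLIP or a similar multimodal model
        self.model = SentenceTransformer('clip-ViT-B-32')

    def encode_text(self, text: str) -> torch.Tensor:
        """Encode text."""
        return self.model.encode(text, convert_to_tensor=True)

    def encode_image(self, image_path: str) -> torch.Tensor:
        """Encode an image loaded from a local path."""
        image = Image.open(image_path)
        return self.model.encode(image, convert_to_tensor=True)

    def encode_multimodal(self, text: Optional[str] = None,
                          image_path: Optional[str] = None) -> torch.Tensor:
        """Encode text, an image, or both."""
        embeddings = []
        if text:
            embeddings.append(self.encode_text(text))
        if image_path:
            embeddings.append(self.encode_image(image_path))
        if not embeddings:
            raise ValueError("Provide text, an image path, or both")

        # Fuse the embeddings by simple averaging
        if len(embeddings) > 1:
            return torch.mean(torch.stack(embeddings), dim=0)
        return embeddings[0]
```

#### 2.2.2 Image Understanding

Image understanding combines OCR with image captioning:

```python
import pytesseract
from PIL import Image
from transformers import pipeline


class ImageUnderstanding:
    def __init__(self):
        # Image captioning model
        self.captioner = pipeline("image-to-text", model="Salesforce/blip-image-captioning-base")

    def extract_image_content(self, image_path: str) -> dict:
        """Extract several kinds of information from an image."""
        image = Image.open(image_path)

        # 1. Image caption
        caption = self.captioner(image)[0]['generated_text']

        # 2. OCR text extraction
        ocr_text = pytesseract.image_to_string(image)

        # 3. Image metadata
        metadata = {
            "size": image.size,
            "mode": image.mode,
            "format": image.format
        }

        return {
            "caption": caption,
            "ocr_text": ocr_text,
            "metadata": metadata
        }
```

## 3. Implementing Multimodal RAG

### 3.1 Document Processing Pipeline

```python
import json
from dataclasses import dataclass, field
from typing import Any, List, Optional

import fitz  # PyMuPDF


@dataclass
class DocumentChunk:
    """Minimal chunk container (assumed here; the article does not define it)."""
    content: str
    modality: str
    embedding: Any
    page: Optional[int] = None
    metadata: dict = field(default_factory=dict)


class MultimodalDocumentProcessor:
    def __init__(self):
        self.embedder = MultimodalEmbedder()
        self.image_understanding = ImageUnderstanding()

    async def process_document(self, doc_path: str) -> List[DocumentChunk]:
        """Process a multimodal document."""
        # process_image and process_text are not shown in this article
        if doc_path.endswith('.pdf'):
            chunks = await self.process_pdf(doc_path)
        elif doc_path.endswith(('.png', '.jpg', '.jpeg')):
            chunks = await self.process_image(doc_path)
        else:
            chunks = await self.process_text(doc_path)
        return chunks

    async def process_pdf(self, pdf_path: str) -> List[DocumentChunk]:
        """Process a PDF that may contain both text and images."""
        chunks = []
        doc = fitz.open(pdf_path)

        for page_num in range(len(doc)):
            page = doc[page_num]

            # Extract text. CLIP's text encoder only accepts short inputs
            # (77 tokens), so long pages should be split before embedding.
            text = page.get_text()
            if text.strip():
                chunks.append(DocumentChunk(
                    content=text,
                    modality="text",
                    page=page_num,
                    embedding=self.embedder.encode_text(text)
                ))

            # Extract images
            for img_idx, img in enumerate(page.get_images()):
                xref = img[0]
                base_image = doc.extract_image(xref)
                image_bytes = base_image["image"]

                # Save a temporary image, using the extension PyMuPDF reports
                temp_path = f"/tmp/pdf_img_{page_num}_{img_idx}.{base_image['ext']}"
                with open(temp_path, "wb") as f:
                    f.write(image_bytes)

                # Understand the image content (caption, OCR, metadata)
                img_content = self.image_understanding.extract_image_content(temp_path)

                chunks.append(DocumentChunk(
                    content=json.dumps(img_content),
                    modality="image",
                    page=page_num,
                    embedding=self.embedder.encode_image(temp_path),
                    metadata={"type": "pdf_image", "caption": img_content["caption"]}
                ))

        doc.close()
        return chunks
```

### 3.2 Multimodal Retrieval

```python
from typing import List, Optional


class MultimodalRetriever:
    def __init__(self, vector_store):
        self.vector_store = vector_store
        self.embedder = MultimodalEmbedder()

    async def retrieve(
        self,
        query_text: Optional[str] = None,
        query_image: Optional[str] = None,
        top_k: int = 5
    ) -> List["RetrievalResult"]:  # RetrievalResult is defined by the vector store
        """Multimodal retrieval."""
        # Build the query vector
        if query_text and query_image:
            # Mixed image-text query
            query_embedding = self.embedder.encode_multimodal(
                text=query_text,
                image_path=query_image
            )
        elif query_text:
            # Text-only query
            query_embedding = self.embedder.encode_text(query_text)
        elif query_image:
            # Image-only query
            query_embedding = self.embedder.encode_image(query_image)
        else:
            raise ValueError("A text or image query is required")

        # Run the search
        results = self.vector_store.search(
            query_embedding=query_embedding,
            top_k=top_k
        )
        return results
```

### 3.3 Multimodal Generation

```python
import json
from typing import List, Optional


class MultimodalRAGGenerator:
    def __init__(self, llm, retriever):
        self.llm = llm
        self.retriever = retriever

    async def generate(
        self,
        query: str,
        query_image: Optional[str] = None
    ) -> str:
        """Multimodal RAG generation."""
        # Retrieve relevant documents
        retrieved_docs = await self.retriever.retrieve(
            query_text=query,
            query_image=query_image
        )

        # Build the multimodal context
        context = self.build_multimodal_context(retrieved_docs)

        # Build the prompt
        prompt = self.build_multimodal_prompt(query, context)

        # Call the LLM
        response = await self.llm.generate(prompt)
        return response

    def build_multimodal_context(self, docs: List[DocumentChunk]) -> str:
        """Build a text context from text and image chunks."""
        context_parts = []
        for i, doc in enumerate(docs):
            if doc.modality == "text":
                context_parts.append(f"[Text chunk {i+1}]\n{doc.content}\n")
            elif doc.modality == "image":
                # Image chunks store their caption and OCR text as JSON
                img_data = json.loads(doc.content)
                context_parts.append(
                    f"[Image {i+1}]\n"
                    f"Caption: {img_data['caption']}\n"
                    f"OCR text: {img_data['ocr_text']}\n"
                )
        return "\n".join(context_parts)

    def build_multimodal_prompt(self, query: str, context: str) -> str:
        """Build the multimodal prompt."""
        return f"""You are a multimodal information assistant. Answer the user's question based on the retrieved information below.

## Retrieved information
{context}

## User question
{query}

## Answer requirements
1. Combine the text and image information
2. If images are involved, describe their key content
3. Keep the answer accurate and complete
4. If you are unsure, say so explicitly

## Answer
"""
```

## 4. Case Study: E-commerce Product Search

### 4.1 System Implementation

```python
from typing import List, Optional

from openai import OpenAI


class EcommerceMultimodalSearch:
    def __init__(self):
        self.embedder = MultimodalEmbedder()
        # MultimodalVectorStore, Product, SearchResult and rerank_results
        # are application-specific and not shown in this article
        self.vector_store = MultimodalVectorStore()
        self.llm = OpenAI()

    async def index_product(self, product: "Product"):
        """Index a product."""
        # Text information
        text_content = "\n".join([
            f"Product name: {product.name}",
            f"Description: {product.description}",
            f"Category: {product.category}",
            f"Tags: {', '.join(product.tags)}",
        ])
        text_embedding = self.embedder.encode_text(text_content)

        # Image information. encode_image expects a local file path,
        # so remote image URLs need to be downloaded first.
        image_embeddings = []
        for img_url in product.images:
            img_embedding = self.embedder.encode_image(img_url)
            image_embeddings.append(img_embedding)

        # Store in the vector database
        await self.vector_store.add_product(
            product_id=product.id,
            text_embedding=text_embedding,
            image_embeddings=image_embeddings,
            metadata=product.to_dict()
        )

    async def search(
        self,
        query: Optional[str] = None,
        image: Optional[str] = None,
        filters: Optional[dict] = None
    ) -> List["SearchResult"]:
        """Multimodal product search."""
        # Build the query vector. An image takes precedence when given;
        # the text query is then only used for reranking.
        if image:
            query_embedding = self.embedder.encode_image(image)
        elif query:
            query_embedding = self.embedder.encode_text(query)
        else:
            raise ValueError("A text or image query is required")

        # Retrieve similar products
        results = await self.vector_store.search(
            query_embedding=query_embedding,
            filters=filters,
            top_k=20
        )

        # Rerank
        reranked = await self.rerank_results(query or "", results)
        return reranked[:10]
```

## 5. Performance Optimization

### 5.1 Vector Index Optimization

```python
import faiss
import numpy as np


# Use a modality-specific indexing strategy
class MultimodalIndex:
    def __init__(self):
        # Text index (512-dim CLIP vectors, HNSW with M=32)
        self.text_index = faiss.IndexHNSWFlat(512, 32)
        # Image index
        self.image_index = faiss.IndexHNSWFlat(512, 32)

    @staticmethod
    def _as_row(embedding) -> np.ndarray:
        """FAISS expects float32 arrays of shape (n, d).

        Accepts NumPy arrays or CPU tensors; move GPU tensors to the CPU first.
        """
        return np.asarray(embedding, dtype=np.float32).reshape(1, -1)

    def add(self, text_embedding, image_embeddings):
        """Add one item's text embedding and its image embeddings."""
        self.text_index.add(self._as_row(text_embedding))
        for img_emb in image_embeddings:
            self.image_index.add(self._as_row(img_emb))

    def search_multimodal(self, query_embedding, k=10):
        """Search the text and image indexes with the same query."""
        query = self._as_row(query_embedding)
        # IndexHNSWFlat uses L2 by default, so these "scores" are distances
        # (smaller means closer)
        text_scores, text_ids = self.text_index.search(query, k)
        image_scores, image_ids = self.image_index.search(query, k)

        # Fuse the results (fuse_results is not shown in this article)
        return self.fuse_results(text_scores, text_ids, image_scores, image_ids)
```

### 5.2 Caching Strategy

```python
class MultimodalCache:
    def __init__(self):
        self.embedding_cache = {}
        self.result_cache = {}

    def get_cached_embedding(self, content_hash: str):
        return self.embedding_cache.get(content_hash)

    def cache_embedding(self, content_hash: str, embedding):
        self.embedding_cache[content_hash] = embedding

    def get_cached_result(self, query_hash: str):
        return self.result_cache.get(query_hash)

    def cache_result(self, query_hash: str, result):
        # Counterpart to get_cached_result, which otherwise never finds anything
        self.result_cache[query_hash] = result
```

## 6. Summary

Multimodal RAG is an important direction for RAG technology. By integrating text, images, and other modalities, it supports a wider range of applications. The key techniques are:

1. Multimodal embeddings: a unified vector representation across modalities
2. Image understanding: OCR, image captioning, and related techniques
3. Hybrid retrieval: support for several query types
4. Result fusion: combining information from multiple sources effectively

As multimodal large models mature, multimodal RAG will become more capable and its range of applications will keep growing.

---

References:

- CLIP: Learning Transferable Visual Models
- BLIP: Bootstrapping Language-Image Pre-training
- Multimodal RAG Best Practices

Tags: #MultimodalRAG #ImageRetrieval #CLIP #VectorSearch #AIApplications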
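A note on the fusion step in section 5.1: `fuse_results` is called there but never defined. One common way to merge a text ranking with an image ranking is reciprocal rank fusion (RRF). It uses only each list's order, so it does not require distances from two separate indexes to be comparable. The sketch below is an assumption about how that step could look, not the author's implementation. The function name `reciprocal_rank_fusion`, the constant `k = 60`, and the product IDs are all illustrative, and it assumes hits from both indexes have already been mapped to shared product IDs.

```python
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def reciprocal_rank_fusion(
    rankings: Sequence[Sequence[Hashable]],
    k: int = 60,
) -> List[Tuple[Hashable, float]]:
    """Fuse several ranked ID lists (best hit first) into one ranking.

    Each item receives 1 / (k + rank) from every list it appears in,
    and the contributions are summed. A higher fused score is better.
    """
    scores: Dict[Hashable, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Sort by fused score, highest first
    return sorted(scores.items(), key=lambda pair: pair[1], reverse=True)


# Hypothetical hits from the text index and the image index,
# already mapped to shared product IDs
text_hits = ["p1", "p2", "p3"]
image_hits = ["p3", "p1", "p4"]

fused = reciprocal_rank_fusion([text_hits, image_hits])
print([item_id for item_id, _ in fused])  # ['p1', 'p3', 'p2', 'p4']
```

Products that rank well in both lists (`p1`, `p3`) rise above those found by only one index. A larger `k` flattens the gap between top and lower ranks, so treat it as a tuning parameter, not a fixed value.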
