Hands-on test | Comparing Qwen2.5-VL and Janus-Pro-7B on visual understanding

Python官方资料 · Published 2025-03-04 17:46:46

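Qwen and DeepSeek have both open-sourced multimodal models. Qwen released Qwen2.5-VL, which focuses on multimodal (image and video) understanding, while DeepSeek released Janus-Pro, which can both understand images and generate them.

Janus-Pro sat on the Zhihu trending list for a whole day, but when I tested its image understanding it really did not hold up. Please don't hype it blindly.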

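Disclaimer: I am not trying to put down Janus-Pro-7B, and I did not test its image generation. This evaluation covers image understanding only.

To stress this again: I have not studied image generation in depth. What I mainly want to see is how far image understanding has come.

For comparison, I ran the same tests on Qwen2.5-VL-7B and Janus-Pro-7B and compared the results.

Conclusions first:

Unlike the 72B model, Qwen2.5-VL-7B does poorly on table parsing, most likely because of its smaller parameter count.
Janus-Pro-7B frequently declines to answer, and what it does generate is a mess.

The test code for each model is the Hugging Face example code from its official page, run directly. The Janus-Pro-7B results were bad enough that at one point I thought my test itself was broken.

Results first, code afterwards. Feel free to check it if you are interested.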

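Image understanding tests

First, table recognition, with three table images. The prompt (originally in Chinese) is as follows:

    ## Role
    You are an OCR table-recognition expert with many years of experience.

    ## Goals
    Given an image, recognize the contents of the table in it and output the result as an HTML table.

    ## Constrains
    - Carefully recognize the content of the image, fully recognize the content of every table cell, and fill it into the HTML table structure;
    - Table cells in the image may contain placeholders that must be recognized, such as "-", "—", "/";
    - The output table structure must follow the structure in the image exactly;
    - Pay special attention to merged cells in the image; do not get the structure wrong;
    - For images with a lot of content, be sure to output the complete result; do not take things out of context, and above all do not make anything up;
    - The final output must be table content in HTML format.

    ## Initialization
    Think carefully, then output the HTML table result.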

Test 1:

Result: Qwen2.5-VL-7B got the table structure wrong; Janus-Pro-7B was wrong outright, with none of the content correct.

Qwen2.5-VL-7B result

Janus-Pro-7B result

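Test 2:

Result: Qwen2.5-VL-7B got the table structure wrong; Janus-Pro-7B did not answer the question directly. The image had been passed in, but the model did not understand it.

Qwen2.5-VL-7B result

Janus-Pro-7B result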

Test 3:

Result: Qwen2.5-VL-7B got the table structure wrong; Janus-Pro-7B did not answer at all.

Qwen2.5-VL-7B result

Janus-Pro-7B result

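To sum up: when I tested Qwen2.5-VL-72B yesterday, it parsed all of the tables, so I assumed the 7B model would manage too, but it did not. This suggests that table parsing still has a real barrier to entry for multimodal models: a suitable training strategy alone is not enough, and the model also has to be large enough.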
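One way to make a "structure error" verdict reproducible is to compare the row and cell layout of a model's HTML answer against a hand-written reference. The following is a minimal sketch, assuming such a reference table exists for each test image; `TableShape`, `table_shape`, and `same_structure` are hypothetical helper names introduced here, not part of either model's tooling. It compares only the layout (cells per row, `colspan`, `rowspan`), not the cell text.

```python
from html.parser import HTMLParser


class TableShape(HTMLParser):
    """Collect (colspan, rowspan) for every cell, grouped by table row."""

    def __init__(self):
        super().__init__()
        self.rows = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lowercases tag and attribute names for us.
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            attr = dict(attrs)
            colspan = int(attr.get("colspan") or 1)
            rowspan = int(attr.get("rowspan") or 1)
            self.rows[-1].append((colspan, rowspan))


def table_shape(html):
    """Return the table layout as a list of rows of (colspan, rowspan) pairs."""
    parser = TableShape()
    parser.feed(html)
    return parser.rows


def same_structure(expected_html, predicted_html):
    """True when both tables have identical rows, cells, and merged-cell spans."""
    return table_shape(expected_html) == table_shape(predicted_html)


if __name__ == "__main__":
    # Hypothetical reference and model output, for illustration only.
    reference = "<table><tr><th colspan='2'>A</th></tr><tr><td>1</td><td>2</td></tr></table>"
    answer = "<table><tr><th>A</th></tr><tr><td>1</td><td>2</td></tr></table>"
    print(same_structure(reference, answer))  # False: the merged header cell was lost
```

A check like this only flags layout mismatches; the cell contents would still need to be compared separately.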

Next, two math problems. The prompt (originally in Chinese) is simply:

    Please solve the problem.

Test 4:

Result: Qwen2.5-VL-7B was correct; Janus-Pro-7B was wrong.

Qwen2.5-VL-7B result

Janus-Pro-7B result

Test 5:

Result: Qwen2.5-VL-7B was correct, giving the equation of C as [formula not preserved]; Janus-Pro-7B was wrong.

Qwen2.5-VL-7B result

Janus-Pro-7B result

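Finally, three comprehension questions.

Test 6:

query (originally in Chinese): Analyze step by step and in detail, tell me what share the Chinese data and the English data each account for, and tell me their sum

Result: Qwen2.5-VL-7B was correct; Janus-Pro-7B failed to read the image correctly.

Qwen2.5-VL-7B result

Janus-Pro-7B result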

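Test 7:

query (originally in Chinese): Analyze step by step and in detail: there are two dogs in this picture, right?

Result: Qwen2.5-VL-7B was correct, identifying one cat and one dog. Janus-Pro-7B's analysis reached the right content, but its stated conclusion was that it did not know.

Qwen2.5-VL-7B result

Janus-Pro-7B result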

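Test 8:

query (originally in Chinese): Analyze step by step and in detail, and output the text in the image

Result: Qwen2.5-VL-7B got two characters wrong, but the hallucinations Janus-Pro-7B produced were far over the top.

Qwen2.5-VL-7B result

Janus-Pro-7B result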
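A verdict such as "two characters wrong" can be put on a numeric footing by computing the character-level edit distance between the ground-truth text and the model's transcription. The sketch below is a plain Levenshtein implementation using only the standard library; `edit_distance` and `char_error_rate` are names chosen here for illustration, and the sample strings are made up rather than taken from the test image.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance: minimum insertions, deletions, and substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            substitution = previous[j - 1] + (0 if ref_char == hyp_char else 1)
            deletion = previous[j] + 1       # drop a reference character
            insertion = current[j - 1] + 1   # add a hypothesis character
            current.append(min(substitution, deletion, insertion))
        previous = current
    return previous[-1]


def char_error_rate(reference, hypothesis):
    """Edit distance normalized by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


if __name__ == "__main__":
    # Made-up example: two substituted characters out of five.
    print(edit_distance("大模型评测", "大棋型评则"))    # 2
    print(char_error_rate("大模型评测", "大棋型评则"))  # 0.4
```

With a score like this, a transcription that is off by a couple of characters and one that is largely hallucinated end up far apart on the same scale.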

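Test code

I'm honestly afraid of getting flamed, so here is the code. The images and prompts are all above, so you can run the tests yourself. If there is a problem with my code, please point it out.

Qwen2.5-VL-7B test code:

Source: https://huggingface.co/Qwen/Qwen2.5-VL-7B-Instruct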

    from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
    from qwen_vl_utils import process_vision_info

    # Hub ID, or a local directory holding the downloaded weights
    model_path = "Qwen/Qwen2.5-VL-7B-Instruct"

    model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
        model_path, torch_dtype="auto", device_map="auto"
    )

    processor = AutoProcessor.from_pretrained(model_path)

    # Fill in the prompt and the image for the test case being run
    query = ""
    image_path = ""

    messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "image",
                    "image": image_path,
                },
                {"type": "text", "text": query},
            ],
        }
    ]

    # Render the chat template to text, then build the model inputs
    text = processor.apply_chat_template(
        messages, tokenize=False, add_generation_prompt=True
    )
    print("text:", text)
    image_inputs, video_inputs = process_vision_info(messages)
    inputs = processor(
        text=[text],
        images=image_inputs,
        videos=video_inputs,
        padding=True,
        return_tensors="pt",
    )
    inputs = inputs.to("cuda")

    # Generate, then strip the prompt tokens from each output sequence
    generated_ids = model.generate(**inputs, max_new_tokens=4096)
    generated_ids_trimmed = [
        out_ids[len(in_ids):] for in_ids, out_ids in zip(inputs.input_ids, generated_ids)
    ]
    output_text = processor.batch_decode(
        generated_ids_trimmed, skip_special_tokens=True, clean_up_tokenization_spaces=False
    )
    print("query: ", query)
    print("output: ", output_text[0])

Janus-Pro-7B test code:

Source: https://github.com/deepseek-ai/Janus#3-quick-start

    import torch
    from transformers import AutoModelForCausalLM
    from janus.models import MultiModalityCausalLM, VLChatProcessor
    from janus.utils.io import load_pil_images

    # Specify the path to the model
    model_path = "deepseek-ai/Janus-Pro-7B"
    vl_chat_processor: VLChatProcessor = VLChatProcessor.from_pretrained(model_path)
    tokenizer = vl_chat_processor.tokenizer

    vl_gpt: MultiModalityCausalLM = AutoModelForCausalLM.from_pretrained(
        model_path, trust_remote_code=True
    )
    vl_gpt = vl_gpt.to(torch.bfloat16).cuda().eval()

    # Fill in the prompt and the image for the test case being run
    query = ""
    image_path = ""

    # Role strings are exactly as used in this test; they need to match
    # the chat template that the processor expects.
    conversation = [
        {
            "role": "User",
            "content": "<image_placeholder>\n{}".format(query),
            "images": [image_path],
        },
        {"role": "Assistant", "content": ""},
    ]

    # Load images and prepare the inputs
    pil_images = load_pil_images(conversation)
    prepare_inputs = vl_chat_processor(
        conversations=conversation, images=pil_images, force_batchify=True
    ).to(vl_gpt.device)

    # Run the image encoder to get the image embeddings
    inputs_embeds = vl_gpt.prepare_inputs_embeds(**prepare_inputs)

    # Run the model to get the response.
    # Note: do_sample=True means decoding is sampled, so outputs can differ between runs.
    outputs = vl_gpt.language_model.generate(
        inputs_embeds=inputs_embeds,
        attention_mask=prepare_inputs.attention_mask,
        pad_token_id=tokenizer.eos_token_id,
        bos_token_id=tokenizer.bos_token_id,
        eos_token_id=tokenizer.eos_token_id,
        max_new_tokens=4096,
        do_sample=True,
        use_cache=True,
    )

    answer = tokenizer.decode(outputs[0].cpu().tolist(), skip_special_tokens=True)
    print(f"{prepare_inputs['sft_format'][0]}")
    print("query: ", query)
    print("output: ", answer)
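Because the Janus-Pro-7B script sets `do_sample=True`, a single run may not be representative of the model's behavior. A rough way to check is to repeat each test case several times and count how many distinct answers come back. The sketch below assumes a hypothetical `generate(image_path, query)` wrapper around either script above that returns the decoded answer as a string; neither that wrapper nor `run_repeated` exists in the original code, and the stub used in the demo stands in for a real model.

```python
from collections import Counter
from typing import Callable


def run_repeated(
    generate: Callable[[str, str], str],
    image_path: str,
    query: str,
    runs: int = 3,
) -> Counter:
    """Call generate(image_path, query) `runs` times and count distinct answers."""
    return Counter(generate(image_path, query).strip() for _ in range(runs))


if __name__ == "__main__":
    # Stub standing in for a real model wrapper (an assumption for illustration).
    def fake_generate(image_path: str, query: str) -> str:
        return "one cat and one dog"

    counts = run_repeated(fake_generate, "test7.png", "Are there two dogs?", runs=3)
    # A single key means every run agreed; several keys mean the output is unstable.
    for answer, count in counts.most_common():
        print(count, answer)
```

If the answers diverge widely across runs, comparing against a greedy-decoding run (`do_sample=False`) would help separate sampling noise from genuine model errors.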

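Finally, to stress it once more: I am just an emotionless benchmarking machine, and I only hope everyone takes a rational view of the technology.

Hype it if you like, but not blindly!!!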
