山东大学创新项目实训（7）模型训练之故事与漫画内容一致性

本周工作：我们的项目通过DeepSeek大语言模型生成漫画分镜，并将分镜描述作为Stable Diffusion模型的prompt来生成漫画，实现了一定程度的故事与漫画内容一致性。在两个模型的输入输出对接，即，有不同的处理方法，本周我对主流的方法进行了广泛的尝试，并总结出了最佳方案。

卢某人233

1026人浏览 · 2025-05-19 17:13:12

卢某人233 · 2025-05-19 17:13:12 发布

本周工作：我们的项目通过DeepSeek大语言模型生成漫画分镜，并将分镜描述作为Stable Diffusion模型的prompt来生成漫画，实现了一定程度的故事与漫画内容一致性。在两个模型的输入输出对接，即语义-视觉对齐，有不同的处理方法，本周我对主流的方法进行了广泛的尝试，并总结出了最佳方案。

一、语义约束

1.结构化输出

因为Stable Diffusion的输出严重依赖prompt的结构。结构化描述能稳定地控制生成的内容，才能让 SD 理解在干什么。因此可以规范分解描述的结构，提升promot的稳定性。使用显式的标签让每个分镜描述隔离开，并保证其都是清晰、稳定的 prompt，用于图像生成。

我设计的规范分解描述的结构如下：

val description = "## 角色\n" +
        "你是一个专业的四格漫画画家，能够根据用户提供的简单漫画故事和角色设定，丰富每一格漫画的画面描述，并使用英文输出，描述开头需带上[4koma]。\n" +
        "\n" +
        "## 技能\n" +
        "1. 细致描绘每一格漫画中的人物、场景、物品等元素。\n" +
        "2. 运用生动形象的词汇和丰富的想象力，使描述更具吸引力。\n" +
        "3. 按照给定的格式分别描述每一格漫画。\n" +
        "\n" +
        "## 示例\n" +
        "[4koma] In this delightful and charming sequence, [SCENE-1] <Sora>, a creative and observant young artist with blue eyes and short purple hair, is seated at a restaurant table, diligently sketching on a napkin while a waiter, wearing a red and white striped apron, approaches with a concerned expression, requesting more napkins; [SCENE-2] as the scene transitions, <Sora> is now in a car, eyes sparkling with inspiration, pointing excitedly at a small, round creature with a red shell on the dashboard, while musical notes float in the air, indicating a moment of creative epiphany; [SCENE-3] at the beach, <Sora> is engrossed in an intense gaze with a fish-like creature emerging from the water, the sun shining brightly in the background, casting a warm glow over the scene; [SCENE-4] back in the car, <Sora> joyfully exclaims upon finding ink, ready to capture the new inspiration, while a squid-like creature, with a mischievous expression, stands nearby, holding a pen, as if ready to assist in the creative process.\n" +
        "\n" +
        "## 要求\n" +
        "1. 描述要详细、准确，充分展现漫画的情节和氛围。\n" +
        "2. 语言生动、有趣，富有表现力。\n" +
        "3. 每个分镜严格按照[4koma]和[SCENE-1/2/3/4]的格式进行描述。\n" +
        "4. 角色名称使用<xxx>格式进行描述。\n" +
        "5. 输出一定是英文！！！！\n" +
        "\n" +
        "## 限制\n" +
        "1. 描述总长度不超过 350 个单词。\n" +
        "2. 仅围绕用户提供的漫画故事和角色设定进行描述，不添加无关内容。" +
        "3. 角色：Anon Chihaya,Jotaro Kujo \n" +

输出效果：

{ mutableStateOf("[4koma] In this lively and musical encounter:\n" +
            "[SCENE-1] <Anon Chihaya> sits cross-legged on a grassy patch beneath a cherry blossom tree, a sleek acoustic guitar resting on her lap. Her pink hair sways slightly in the breeze as her fingers strum the strings with joyful precision. Notes float through the air around her, and her gray eyes sparkle with passion. A dreamy smile is on her face as she hums along, clearly lost in her own musical world.\n" +
            "[SCENE-2] A sudden rhythm interrupts the melody—<Jotaro Kujo> appears on the next panel, sitting behind a compact drum set set up in an alleyway. His eyes are focused, blue and intense, as his sticks move with calm precision. The background is a blur of motion lines emphasizing his steady beat. A few curious cats gather around, nodding to the rhythm. Jotaro, unaware of any audience, is entirely in sync with the drums.\n" +
            "[SCENE-3] <Anon> peeks around the corner with wide eyes and a huge grin, guitar slung over her shoulder. She rushes toward Jotaro, her hands animated in excitement as cherry petals float around her. “You play drums?! That’s amazing!” she exclaims, practically bouncing. Jotaro looks up, slightly startled but composed, raising an eyebrow in mild curiosity.\n" +
            "[SCENE-4] The two now face each other—Anon striking a dynamic pose with one hand pointing at Jotaro, the other gripping her guitar, while Jotaro rests his drumsticks on his shoulders with a calm smile. A large, vibrant speech bubble hovers between them: “Let’s start a band!” Musical notes and colorful spark lines burst from the panel, capturing the moment of this new, spontaneous partnership.") }

2.结构化解析输入与生成

对于输入的解析和漫画的生成，有两种方法。

（1）一张图上生成连续的四幅漫画

构建一个描述四个分镜内容的长 Prompt，描述每一格漫画的内容和布局，将整个四格漫画生成为一张图。

这样处理的优点有：

图像中的四个分镜的风格可以保持高度一致，不会出现生成的角色形象在不同格中明显变化的问题。
一次性生成一张图，减少了多次模型调用的开销。
无需处理多张图片的拼接。

关键工作流：

  "4": {
    "inputs": {
      "ckpt_name": "动漫光影_Anishadow V2.safetensors"
    },
    "class_type": "CheckpointLoaderSimple",
    "_meta": {
      "title": "Checkpoint加载器（简易）"
    }
  },
  "5": {
    "inputs": {
      "width": 512,
      "height": 1448,
      "batch_size": 1
    },
    "class_type": "EmptyLatentImage",
    "_meta": {
      "title": "空Latent图像"
    }
  },
  "6": {
    "inputs": {
      "text": "test",
      "clip": [
        "11",
        1
      ]
    },
    "class_type": "CLIPTextEncode",
    "_meta": {
      "title": "CLIP文本编码"
    }
  },
  "7": {
    "inputs": {
      "text": "test",
      "clip": [
        "4",
        1
      ]
    },
    "class_type": "CLIPTextEncode",
    "_meta": {
      "title": "CLIP文本编码"
    }
  },
  "10": {
    "inputs": {
      "lora_name": "MYGO!!!!!（Nagasaki Soyo,Kaname Rāna,heterochromia,Takamatsu Tomori,Shiina Taki,Anon Chihaya).safetensors",
      "strength_model": 1,
      "strength_clip": 1
    },
    "class_type": "LoraLoader",
    "_meta": {
      "title": "加载LoRA"
    }
  },
  "11": {
    "inputs": {
      "lora_name": "4koma.safetensors",
      "strength_model": 1,
      "strength_clip": 1,
      "model": [
        "4",
        0
      ],
      "clip": [
        "4",
        1
      ]
    },
    "class_type": "LoraLoader",
    "_meta": {
      "title": "加载LoRA"
    }
  },
 }

生成效果：

可以看到四个分镜在风格、构图、线条、色彩等方面都保持高度一致，不会出现生成的风格和角色形象在不同格中明显变化的问题。

（2）逐格生成（基于前一张推导下一张）

暂未实现，先挖个坑...

二、角色触发

1.构建角色档案

为了防止DeepSeek生成的人物描述与Stable Diffusion训练的人物模型不一致，导致SD模型无法正确识别人物。我使用构建角色档案的方式实现“角色记忆”，并在描述过程中通过<xxx>显示指明人物名称。

角色档案：

{
  "characters": [
    {
      "name": "Anon Chihaya",
      "gender": "Female",
      "age": "High School Student",
      "personality": "cheerful and energetic",
      "appearance": "long pink hair, gray eyes"
    },
    {
      "name": "Jotaro Kujo",
      "gender": "Male",
      "age": "High School Student",
      "personality": "calm and composed",
      "appearance": "short black hair, blue eyes"
    }
  ]
}

prompt注入：

fun buildPrompt(characters: List<Character>): String {
    val charDescriptions = characters.mapIndexed { i, c ->
        "${i + 1}. ${c.name}: ${c.gender} ${c.age.toLowerCase()}, ${c.personality}, ${c.appearance}."
    }.joinToString("\n")

    return """
        xxx
        Characters:
        $charDescriptions
        xxx
    """.trimIndent()
}

2.设置角色触发词

在LoRA训练时，为每一组训练样本中的标签加上触发词，同样使用<xxx>格式：

详情见山东大学创新项目实训（6）LoRA微调之角色一致性_人物一致性lora模型-CSDN博客

这样一来，每当SD模型检测到<xxx>标签的角色后，将立即触发已经LoRA训练好的人物模型。

三、后处理：必要时重新生成

可以为每次生成的图像使用反演模型（详情见山东大学创新项目实训（2）本地部署StableDiffusion模型与基础工作流搭建_stablediffusion本地模型-CSDN博客）中的CLIP询问机，进行相似度匹配，检查图像是否“语义一致”。CLIP询问机计算生成的图像和原始 prompt的余弦值相似度，总体相似度低于一定数值时自动重试图像生成。

具体实现：

实现计算生成的图像和原始 prompt的相似度的方法：

def calculate_clip_similarity(image_path, text):
    
    image = Image.open(image_path)
    
    inputs = processor(text=[text], images=image, return_tensors="pt", padding=True)
    
    with torch.no_grad():
        outputs = model(**inputs)
    
    image_features = outputs.image_embeds
    text_features = outputs.text_embeds
    similarity = torch.cosine_similarity(image_features, text_features)
    
    return similarity.item()

实现在低相似度时自动重试的方法：

# 相似度阈值
SIMILARITY_THRESHOLD = 0.7

def generate_image_with_retry(prompt_data, max_retries=3):
   
    retries = 0
    
    while retries < max_retries:
        generated_image_path = upload_and_generate_image(prompt_data)
        
        similarity_score = calculate_clip_similarity(generated_image_path, prompt_data["positive"])
        
        print(f"相似度得分: {similarity_score}")
        
        if similarity_score >= SIMILARITY_THRESHOLD:
            print("✅ 图像符合要求，返回结果")
            return generated_image_path
        else:
            print(f"❌ 相似度低于阈值 ({SIMILARITY_THRESHOLD}), 重试第 {retries + 1} 次")
            retries += 1
            time.sleep(1)
    
    print(f"❌ 最大重试次数({max_retries})已达，生成的图像不符合要求")
    return None

这里设置了最大生成次数，防止无限生成。

修改原有的生成逻辑：

@app.route("/generate", methods=["POST"])
def generate():
    data = request.get_json()
    if not data:
        return jsonify({"error": "请求必须为 JSON 格式"}), 400

    pos = data.get("positive", "").strip()
    neg = data.get("negative", "").strip()

    if not pos:
        return jsonify({"error": "缺少正面提示词"}), 400

    print(f"✅ 收到提示词: 正面='{pos}'，反面='{neg}'")

    prompt_data = update_prompt_text(prompt_data, pos, neg)

    result = upload_prompt_to_comfyui(updated_prompt_data)
    if not result:
        return jsonify({"error": "上传工作流失败"}), 500

    generated_image = generate_image_with_retry(prompt_data)
    if generated_image:
        return jsonify({
            "message": "✅ 成功提交生成任务并生成图像",
            "image_path": generated_image
        })
    else:
        return jsonify({"error": "图像生成失败，无法满足提示词相似度要求"}), 500

在每次图像生成前进行相似度检查，如果达到最大生成次数仍然无法达到相似度要求，则返回错误。

效果：

单词：苹果

不符合相似度要求的图像：（这里是通过服务器队列查看的，因为不符合相似度要求的图像不会发送给客户端）