Claude Prompt Engineering: A Complete Solution for AI Translation of Hong Kong Workplace Documents

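To address the pain points of handling Hong Kong workplace documents that mix Chinese and English or contain colloquial Cantonese (for example, mistranslating "强积金"), this post proposes a localization approach built on the Claude API, Python, and prompt engineering. Core implementation: 1. Scenario-specific prompts: avoid literal translation, process text according to workplace context, keep English terms with the Hong Kong usage in parentheses, convert Cantonese to Mandarin, and preserve the original formatting; 2. Python wrapper: supports dedicated configurations for MPF documents, contracts, and emails, and outputs the translation plus a glossary table. Pitfalls and fixes: relaying through an AWS node for compliance; chunked processing for truncation…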

patrickstar231

359 views · 2026-05-08 20:29:26

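Use case: developers working in Hong Kong or overseas who need to process workplace documents that mix Chinese and English or are written in Traditional Chinese
Tech stack: Claude API + Python + Prompt Engineering
Pitfalls covered: recognizing colloquial Cantonese, terminology consistency, export compliance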

1. Background

Working in fintech in Hong Kong, I deal with document types every day that are more varied than the code I write:

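MPF (强积金) clauses: entirely in English, dense with legal terminology
Internal company notices: Traditional Chinese mixed with English abbreviations
Client emails: mostly English, with Cantonese expressions mixed in ("听日交" for "submit tomorrow", "俾我" for "give me", and so on)
Government forms: written in Traditional Chinese, but the wording can be ambiguous to mainland readers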

Traditional translation tools (Google Translate, DeepL) have three serious weaknesses with this kind of document:

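Hong Kong-specific terms are mistranslated (for example, MPF rendered as "强制性公积金" instead of "强积金")
Mixed Chinese-English text is handled poorly, and source text is often left untranslated
Colloquial Cantonese is not recognized at all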

Solution: use the Claude API with custom prompt engineering to handle Hong Kong workplace documents in one place.

2. Requirements

Python >= 3.10
anthropic >= 0.49.0
python-dotenv >= 1.0.0

pip install anthropic python-dotenv

3. Core implementation

3.1 Prompt template design

Core idea: instead of giving Claude a generic "translate" instruction, give it the Hong Kong workplace context.

# config.py
HONG_KONG_DOC_SYSTEM_PROMPT = """
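You are a document-processing expert familiar with the Hong Kong workplace.
Your task is not simply to "translate", but to "localize" documents for the Hong Kong workplace context.

### Core rules

#### 1. Terminology (highest priority)
Rule: on first occurrence, keep an English technical term as-is and add the Chinese term commonly used in Hong Kong in parentheses.
- ✅ MPF（强积金）: the common Hong Kong term
- ❌ MPF（强制性公积金）: the common mainland term, rarely used in Hong Kong
- ✅ IRD（税务局）: short for the Hong Kong Inland Revenue Department
- ✅ HKMA（香港金融管理局）

#### 2. Colloquial Cantonese
Rule: translate colloquial Cantonese expressions into their mainland Mandarin equivalents.
- "听日交" → "明天提交"
- "俾我" → "给我"
- "唔该" → "请问/麻烦"

#### 3. Format preservation
Rule: keep the original table structure, numbering hierarchy, and paragraph breaks.
- Tables → output as Markdown tables
- Hierarchy → preserve with # heading levels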
"""

# prompt_templates.py
def build_file_prompt(file_content: str, file_type: str = "general") -> str:
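    base_prompt = f"""Please process the following {file_type} document:

{file_content}

Requirements:
1. Output the main text in Simplified Chinese
2. Append a glossary table after the main text (format shown below)
3. The glossary columns are: 原文 | 香港通用说法 | 内地对应说法 (source term | Hong Kong usage | mainland equivalent, where they differ)

Glossary table example: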
| 原文 | 香港通用说法 | 内地对应说法 |
| --- | --- | --- |
| MPF | 强积金 | 强制性公积金 |
| Annual Leave | 年假 | 年休假 |
"""
    return base_prompt
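
The terminology rules in the system prompt are written by hand, so every new term means editing the prompt string. A minimal sketch of generating those rule lines from dictionaries instead; the helper `render_term_rules` and both sample dictionaries are hypothetical and not part of the original code:

```python
from typing import Optional


def render_term_rules(preferred: dict, avoid: Optional[dict] = None) -> str:
    """Render term rules in the same bullet style as the system prompt.

    preferred maps an English term to the Hong Kong usage to enforce;
    avoid maps an English term to a rendering the model should not use.
    """
    lines = [f"- ✅ {en}（{zh}）" for en, zh in preferred.items()]
    for en, zh in (avoid or {}).items():
        lines.append(f"- ❌ {en}（{zh}）")
    return "\n".join(lines)


# Sample terms taken from the rules listed in the system prompt
rules = render_term_rules(
    {"MPF": "强积金", "IRD": "税务局"},
    avoid={"MPF": "强制性公积金"},
)
print(rules)
```

The returned text could be interpolated into the terminology section of the system prompt, which keeps the term list in one place.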

3.2 Full calling code

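import os
from typing import Optional
from anthropic import Anthropic
from dotenv import load_dotenv

# Prompt pieces from section 3.1 (config.py and prompt_templates.py)
from config import HONG_KONG_DOC_SYSTEM_PROMPT
from prompt_templates import build_file_prompt

# Load ANTHROPIC_API_KEY from a local .env file, if present
load_dotenv()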

class HongKongDocProcessor:
    """Processor for Hong Kong workplace documents."""
    
    def __init__(self, api_key: Optional[str] = None):
        self.client = Anthropic(
            api_key=api_key or os.getenv("ANTHROPIC_API_KEY")
        )
        self.system_prompt = HONG_KONG_DOC_SYSTEM_PROMPT
    
    def process(
        self,
        content: str,
        file_type: str = "general",
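        max_tokens: int = 4096
    ) -> dict:
        """
        Process a Hong Kong workplace document.
        
        Args:
            content: text content of the document
            file_type: document type (general / mpf / contract / email)
            max_tokens: maximum number of output tokens
            
        Returns:
            dict with the translation and the glossary
        """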
        user_prompt = build_file_prompt(content, file_type)
        
        response = self.client.messages.create(
            model="claude-sonnet-4-20250514",
            max_tokens=max_tokens,
            system=self.system_prompt,
            messages=[{
                "role": "user",
                "content": user_prompt
            }]
        )
        
        return self._parse_response(response)
    
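    def _parse_response(self, response) -> dict:
        """Split the model output into translation and glossary."""
        full_text = response.content[0].text
        
        # Split at the glossary header; maxsplit=1 keeps the rest of the table intact
        parts = full_text.split("| 原文 | 香港通用说法 |", 1)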
        
        result = {
            "translation": parts[0].strip(),
            "glossary": f"| 原文 | 香港通用说法 |{parts[1]}" if len(parts) > 1 else "",
            "raw": full_text
        }
        return result
    
    def batch_process(
        self,
        file_list: list[dict],
        output_dir: str = "./output"
    ) -> list[dict]:
        """Process a batch of documents and save each result as Markdown."""
        os.makedirs(output_dir, exist_ok=True)
        results = []
        
        for item in file_list:
            result = self.process(
                content=item["content"],
                file_type=item.get("type", "general")
            )
            
            # Save the raw model output to a file
            output_path = os.path.join(
                output_dir,
                f"{item.get('name', 'doc')}_processed.md"
            )
            with open(output_path, "w", encoding="utf-8") as f:
                f.write(result["raw"])
            
            result["output_path"] = output_path
            results.append(result)
            
        return results


# Usage example
if __name__ == "__main__":
    processor = HongKongDocProcessor()
    
    # Process a single document
    mpf_content = """
    Mandatory Provident Fund (MPF) Scheme
    Employee's monthly contribution: 5% of relevant income
    Maximum contribution: HKD 1,500 per month
    Employer must enroll employee within 60 days of employment
    """
    
    result = processor.process(mpf_content, file_type="mpf")
    print("=== Translation ===")
    print(result["translation"])
    print("\n=== Glossary ===")
    print(result["glossary"])
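
`process` sends a fixed `max_tokens`, so a long reply can be cut off without raising any error. The Messages API reports why generation stopped in the response's `stop_reason` field, which is `"max_tokens"` when the limit was hit. A minimal sketch of a guard that could run before `_parse_response`; the helper name `is_truncated` is hypothetical, and `SimpleNamespace` only stands in for a real response object:

```python
from types import SimpleNamespace


def is_truncated(response) -> bool:
    """Return True when the reply stopped because it hit max_tokens."""
    return getattr(response, "stop_reason", None) == "max_tokens"


# Stand-ins for API responses, used here only to demonstrate the check
complete = SimpleNamespace(stop_reason="end_turn")
cut_off = SimpleNamespace(stop_reason="max_tokens")

print(is_truncated(complete))  # False
print(is_truncated(cut_off))   # True
```

When the check returns True, the caller can retry with a larger `max_tokens` or fall back to chunked processing rather than saving an incomplete translation.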

3.3 Per-type configuration

Different document types need different prompt configurations. Here are dedicated configurations for three common types:

# type_configs.py

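FILE_TYPE_CONFIGS = {
    "mpf": {
        "description": "MPF (强积金) documents",
        "special_rules": [
            "Keep all percentages and monetary amounts in their original format",
            "For MPF terms, use the official translations of the Mandatory Provident Fund Schemes Authority (MPFA)"
        ],
        "output_structure": "Prefer tables, with source and translation side by side"
    },
    "contract": {
        "description": "Commercial contracts and agreements",
        "special_rules": [
            "Keep legal clauses rigorous; do not paraphrase",
            "Use one consistent translation for every term in the definitions clause"
        ],
        "output_structure": "Keep the original clause numbering and align clause by clause"
    },
    "email": {
        "description": "Business emails",
        "special_rules": [
            "Keep a polite tone that fits Hong Kong business conventions",
            "Convert Cantonese idioms into their mainland business equivalents"
        ],
        "output_structure": "Output the translation directly; no comparison table needed"
    }
}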

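def get_type_config(file_type: str) -> dict:
    # "general" and unknown types fall back to a config with no special rules
    default = {"description": "General document", "special_rules": [], "output_structure": ""}
    return FILE_TYPE_CONFIGS.get(file_type, default)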
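
The configurations above are defined but `build_file_prompt` never reads them. One way to connect the two is to render a config into a text block and append it to the user prompt. A rough sketch under that assumption; `render_type_rules` and `sample_config` are hypothetical names, and the sample mirrors the shape of one `FILE_TYPE_CONFIGS` entry:

```python
def render_type_rules(config: dict) -> str:
    """Turn one type config into text that can be appended to a prompt."""
    lines = [f"Document type: {config['description']}"]
    if config.get("special_rules"):
        lines.append("Special rules:")
        lines.extend(f"- {rule}" for rule in config["special_rules"])
    if config.get("output_structure"):
        lines.append(f"Output structure: {config['output_structure']}")
    return "\n".join(lines)


# Same keys as an entry in FILE_TYPE_CONFIGS
sample_config = {
    "description": "Business emails",
    "special_rules": ["Keep a polite tone", "Convert Cantonese idioms"],
    "output_structure": "Output the translation directly",
}
rendered = render_type_rules(sample_config)
print(rendered)
```

Empty `special_rules` or `output_structure` values are skipped, so a fallback config for general documents would add only the description line.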

4. Results

MPF document before and after processing

Input (original English excerpt):

Mandatory Provident Fund (MPF) Scheme
Employee's monthly contribution: 5% of relevant income
Maximum contribution: HKD 1,500 per month
Employer must enroll employee within 60 days of employment

Output (after processing):

强制性公积金（MPF）计划
雇员每月供款：相关收入的5%
最高供款：每月1,500港元
雇主须在雇佣开始后60天内为雇员登记

Glossary table:

| 原文 | 香港通用说法 | 内地对应说法 |
| --- | --- | --- |
| MPF | 强积金 | 强制性公积金 |
| Relevant Income | 相关收入 | 核定工资基数 |
| Enroll | 登记 | 参保 |
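
`_parse_response` returns the glossary as one Markdown string, which is awkward to reuse across documents. A minimal sketch of turning that string into a list of dictionaries keyed by the header row, assuming the model follows the pipe-table format requested in the prompt; `parse_glossary` is a hypothetical helper:

```python
def parse_glossary(table_md: str) -> list:
    """Parse a Markdown pipe table into a list of row dicts."""
    header = None
    rows = []
    for line in table_md.strip().splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # ignore text around the table
        cells = [cell.strip() for cell in line.strip("|").split("|")]
        # Skip the separator row made of dashes and optional colons
        if all(cell and set(cell) <= set("-:") for cell in cells):
            continue
        if header is None:
            header = cells  # first table row holds the column names
            continue
        rows.append(dict(zip(header, cells)))
    return rows


sample = """
| 原文 | 香港通用说法 | 内地对应说法 |
| --- | --- | --- |
| MPF | 强积金 | 强制性公积金 |
| Enroll | 登记 | 参保 |
"""
glossary_rows = parse_glossary(sample)
print(glossary_rows[0]["香港通用说法"])  # 强积金
```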

5. Pitfalls

Pitfall 1: Export compliance restrictions

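Accessing the international Claude API from Hong Kong requires confirming compliance first. I suggest relaying through an AWS Tokyo node; the latency is about 200 ms, which does not affect batch processing.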

# Set a proxy (if relaying through AWS)
export HTTP_PROXY=http://your-proxy:port
export HTTPS_PROXY=http://your-proxy:port

Pitfall 2: Long documents get truncated

When processing contracts longer than 10 pages, max_tokens needs to be raised. I recommend processing the document in chunks:

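def process_long_document(processor, file_path: str, chunk_size: int = 3000) -> list[dict]:
    """Process a long document in fixed-size character chunks."""
    with open(file_path, "r", encoding="utf-8") as f:
        content = f.read()
    
    # Slice by character count; a chunk may end in the middle of a sentence
    chunks = [content[i:i+chunk_size] for i in range(0, len(content), chunk_size)]
    results = []
    
    for chunk in chunks:
        # processor is a HongKongDocProcessor instance (section 3.2)
        result = processor.process(chunk)
        results.append(result)
    
    return results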
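
Fixed-size slicing can cut a clause or a table row in half, which hurts both translation quality and format preservation. A minimal sketch of an alternative that packs whole paragraphs into each chunk and hard-splits only a paragraph that is itself too long; `split_by_paragraph` is a hypothetical helper, and it assumes paragraphs are separated by blank lines:

```python
def split_by_paragraph(content: str, chunk_size: int = 3000) -> list:
    """Group blank-line-separated paragraphs into chunks of at most chunk_size characters."""
    chunks = []
    current = ""
    for para in content.split("\n\n"):
        candidate = f"{current}\n\n{para}" if current else para
        if len(candidate) <= chunk_size:
            current = candidate  # paragraph still fits in the current chunk
            continue
        if current:
            chunks.append(current)
        # A single paragraph longer than chunk_size is split by character count
        while len(para) > chunk_size:
            chunks.append(para[:chunk_size])
            para = para[chunk_size:]
        current = para
    if current:
        chunks.append(current)
    return chunks


text = "a" * 10 + "\n\n" + "b" * 10 + "\n\n" + "c" * 10
parts = split_by_paragraph(text, chunk_size=25)
print(len(parts))  # 2
```

Each chunk can then be passed to `processor.process` in the same loop as the fixed-size version.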

Pitfall 3: Terminology consistency

Within a single document, "Employee" may be translated as either "雇员" or "员工", so the terminology ends up inconsistent.

Fix: add a "consistent throughout the document" requirement to the system prompt, and run a script as a second check after processing.

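def check_consistency(text: str, term_pairs: list[tuple[str, str]]) -> list[str]:
    """Flag English terms whose expected Chinese translation is missing."""
    issues = []
    for eng, zh in term_pairs:
        # Works when text keeps the English term next to its translation,
        # as the system prompt requests, e.g. "MPF（强积金）"
        if eng.lower() in text.lower() and zh not in text:
            issues.append(f"{eng} was not translated as {zh}")
    return issues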
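
The check above catches a missing expected translation, but it does not notice when a competing variant such as "员工" appears alongside "雇员". A rough sketch of a complementary check that counts unwanted variants in the translated text; `find_term_variants` and the sample mapping are hypothetical:

```python
def find_term_variants(translation: str, preferred: dict) -> list:
    """Report variants that should have been rendered as the preferred term.

    preferred maps the chosen Chinese term to a list of variants to flag.
    """
    issues = []
    for chosen, variants in preferred.items():
        for variant in variants:
            count = translation.count(variant)
            if count:
                issues.append(f"'{variant}' appears {count} time(s); expected '{chosen}'")
    return issues


sample_translation = "雇员须按月供款。员工可选择自愿供款。"
variant_issues = find_term_variants(sample_translation, {"雇员": ["员工"]})
print(variant_issues)
```

Plain substring counting can over-report when a variant is part of a longer word, so the results are best treated as candidates for manual review.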