Motivation

After reading the code from these several vendors, I found that the model structures are basically all the same, with some small differences in hyperparameters and a few minor structural variations. This time the model under examination is Qwen1.5-MoE-A2.7B-Chat. Since it is also an MoE model, this post is written as a set of modifications on top of the earlier "Deepseek-MoE-16B-chat: from input to output, a code walkthrough" post.

Model structure

from transformers import AutoModelForCausalLM, AutoTokenizer, GenerationConfig
device = "cuda" # the device to load the model onto
model_name = "/usr/downloads/Qwen1.5-MoE-A2.7B-Chat"  
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model.generation_config = GenerationConfig.from_pretrained(model_name)

# greedy decoding so the output is deterministic
model.generation_config.do_sample = False
prompt = "介绍一下LLM"
messages = [
    {"role": "system", "content": "你是一个有用的AI助手"},
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True
)
model_inputs = tokenizer([text], return_tensors="pt").to(device)

# generate up to 512 new tokens (greedy, since do_sample=False above)
generated_ids = model.generate(
    model_inputs.input_ids,
    max_new_tokens=512
)
# drop the prompt tokens so only the newly generated part gets decoded
generated_ids = [
    output_ids[len(input_ids):] for input_ids, output_ids in zip(model_inputs.input_ids, generated_ids)
]

response = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0]

'LLM stands for "Legal Language Model," which refers to a type of artificial intelligence (AI) system specifically designed to process and generate legal language. These systems are trained on large amounts of legal data, allowing them to understand and analyze legal texts, provide insights, and generate new legal documents or arguments.\n\nLLMs can be particularly useful in the legal field for tasks such as:\n\n1. Legal research: They can quickly scan through vast amounts of legal documents, statutes, cases, and regulations to find relevant information for a particular case or issue.\n\n2. Contract analysis: LLMs can help review contracts, identify key clauses, and flag potential issues or inconsistencies.\n\n3. Legal writing and drafting: They can assist in generating legal memos, briefs, pleadings, and other legal documents by providing suggestions based on established legal precedents and principles.\n\n4. Legal advice: Some advanced LLMs can provide general legal advice based on their training, although it\'s important to note that they should not replace human lawyers in complex or sensitive matters.\n\n5. Legal analytics: LLMs can analyze large datasets to identify trends, predict outcomes, and support strategic decision-making in litigation or regulatory compliance.\n\n6. E-discovery: They can assist in identifying relevant documents during the discovery phase of a legal case.\n\nIt\'s worth noting that while LLMs can be powerful tools, they still have limitations. They lack the ability to exercise judgment and empathy, which are essential components of the legal profession. Additionally, they may not always understand the nuances of language, cultural context, or local laws, which could lead to errors or misinterpretations. Therefore, LLMs are often used in conjunction with human expertise to ensure accuracy and ethical considerations.'
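
The structure dump below is just the standard PyTorch module repr, i.e. what you get from printing the loaded model:

print(model)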
"""
Qwen2MoeForCausalLM(
  (model): Qwen2MoeModel(
    (embed_tokens): Embedding(151936, 2048)
    (layers): ModuleList(
      (0-23): 24 x Qwen2MoeDecoderLayer(
        (self_attn): Qwen2MoeAttention(
          (q_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (k_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (v_proj): Linear(in_features=2048, out_features=2048, bias=True)
          (o_proj): Linear(in_features=2048, out_features=2048, bias=False)
          (rotary_emb): Qwen2MoeRotaryEmbedding()
        )
        (mlp): Qwen2MoeSparseMoeBlock(
          (gate): Linear(in_features=2048, out_features=60, bias=False)
          (experts): ModuleList(
            (0-59): 60 x Qwen2MoeMLP(
              (gate_proj): Linear(in_features=2048, out_features=1408, bias=False)
              (up_proj): Linear(in_features=2048, out_features=1408, bias=False)
              (down_proj): Linear(in_features=1408, out_features=2048, bias=False)
              (act_fn): SiLU()
            )
          )
          (shared_expert): Qwen2MoeMLP(
            (gate_proj): Linear(in_features=2048, out_features=5632, bias=False)
            (up_proj): Linear(in_features=2048, out_features=5632, bias=False)
            (down_proj): Linear(in_features=5632, out_features=2048, bias=False)
            (act_fn): SiLU()
          )
          (shared_expert_gate): Linear(in_features=2048, out_features=1, bias=False)
        )
        (input_layernorm): Qwen2MoeRMSNorm((2048,), eps=1e-06)
        (post_attention_layernorm): Qwen2MoeRMSNorm((2048,), eps=1e-06)
      )
    )
    (norm): Qwen2MoeRMSNorm((2048,), eps=1e-06)
    (rotary_emb): Qwen2MoeRotaryEmbedding()
  )
  (lm_head): Linear(in_features=2048, out_features=151936, bias=False)
)
"""
  • If I had not tried it once, I would never have guessed that this model is English-based (the Chinese prompt above got an English reply)
  • Layer 0 also uses the expert (MoE) MLP, unlike Deepseek-MoE, where the first layer is a dense MLP
  • The shared expert comes with its own shared_expert_gate
  • The q, k, v projection layers have a bias, while o_proj does not; a quick check of these points against the loaded model follows below
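
A minimal sketch of that check, assuming the model object from the snippet above is still in scope:

layer0 = model.model.layers[0]

# which MLP type does layer 0 use? (expect the sparse MoE block, not a dense MLP)
print(type(layer0.mlp).__name__)      # Qwen2MoeSparseMoeBlock

# the shared expert has its own scalar gate
print(layer0.mlp.shared_expert_gate)  # Linear(in_features=2048, out_features=1, bias=False)

# q/k/v projections carry a bias, o_proj does not
attn = layer0.self_attn
print(attn.q_proj.bias is not None, attn.o_proj.bias is None)  # True True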

Parameter count

A quick breakdown of the parameter count. It can be read off quite clearly from the model structure above; I am just writing it down here for the record.


# activated parameters (6 routed experts per token plus the shared expert): about 3.1B
622329856+2048+(4096+16777216+6144+8650752*6+34603008+2048+122880)*24=3104409600

# total parameters: about 14.3B
622329856+2048+(4096+16777216+6144+8650752*60+34603008+2048+122880)*24=14315784192

"""
# 注意一下,只有q、k、v三个线性层有bias
# word embedding+lm_head
2048*151936*2=622329856
# 最后一层后面的LN
2048

# 0-23层每层都有的参数
# LN
2048*2=4096
# self_attn
2048*2048*4=16777216
# bias
2048*3=6144
# 60个专家都有的
2048*1408*3=8650752
# 共享专家
2048*5632*3=34603008
# 共享专家的gate
2048*1=2048
# gate
2048*60=122880
"""

The parts that stay the same

After replacing Deepseek with Qwen2Moe, a few variables need to be renamed; just fix them one by one as the error messages come up, and the overall flow stays essentially the same (a small example of the kind of rename involved follows below).
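
Here is a hedged example; the Deepseek-side attribute names are quoted from memory and should be double-checked against its config, while the Qwen2Moe names can be read off the structure dump above:

# Deepseek-MoE's gate reads config.n_routed_experts; the Qwen2Moe config calls it num_experts
print(model.config.num_experts)                      # 60
# Deepseek derives the shared-expert width from moe_intermediate_size * n_shared_experts;
# Qwen2Moe exposes it directly as shared_expert_intermediate_size
print(model.config.shared_expert_intermediate_size)  # 5632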

Qwen2MoeSparseMoeBlock

Here I focus on Qwen2MoeSparseMoeBlock, the counterpart of the original DeepseekMoE.

import torch
import torch.nn as nn
import torch.nn.functional as F

# Qwen2MoeMLP comes from the Qwen2Moe modeling code; MoEGate is the gating module
# carried over from the earlier Deepseek-MoE walkthrough (it returns topk_idx, topk_weight, aux_loss)
class Qwen2MoeSparseMoeBlock(nn.Module):
    """
    A mixed expert module containing shared experts.
    """
    def __init__(self, config):
        super().__init__()
        self.config = config
        self.num_experts_per_tok = config.num_experts_per_tok  # 6
        self.num_experts = config.num_experts
        self.experts = nn.ModuleList([Qwen2MoeMLP(config, intermediate_size=config.moe_intermediate_size) for _ in range(config.num_experts)])
        self.gate = MoEGate(config)  # Deepseek-style gate: returns topk_idx, topk_weight, aux_loss
        self.shared_expert = Qwen2MoeMLP(config=config, intermediate_size=config.shared_expert_intermediate_size)
        self.shared_expert_gate = torch.nn.Linear(config.hidden_size, 1, bias=False)
    
    def forward(self, hidden_states):
        identity = hidden_states
        orig_shape = hidden_states.shape
        topk_idx, topk_weight, aux_loss = self.gate(hidden_states)
        hidden_states = hidden_states.view(-1, hidden_states.shape[-1])  # (bsz*seq_len, h)
        # flat_topk_idx = topk_idx.view(-1)  # (bsz*seq_len*top_k,) -- not needed in this version
        y = self.moe_infer(hidden_states, topk_idx, topk_weight).view(*orig_shape)
        # the shared expert runs on every token, scaled by a sigmoid gate
        y = y + F.sigmoid(self.shared_expert_gate(identity)) * self.shared_expert(identity)
        return y
    
    @torch.no_grad()
    def moe_infer(self, hidden_states, topk_idx, topk_weight):
        hidden_dim = hidden_states.shape[-1]
        final_hidden_states = torch.zeros_like(hidden_states)  # (bsz*seq_len, h)
        # One-hot encode the selected experts to create an expert mask;
        # this makes it easy to index which expert is going to be solicited
        expert_mask = torch.nn.functional.one_hot(topk_idx, num_classes=self.num_experts).permute(2, 1, 0)
        # Loop over all available experts and run each one on the tokens routed to it
        for expert_idx in range(self.num_experts):
            expert_layer = self.experts[expert_idx]  # the expert module for this index
            # top_x: which tokens picked this expert; idx: which top-k slot the expert occupies
            # for those tokens (used to look up the matching routing weight)
            idx, top_x = torch.where(expert_mask[expert_idx])
            # Gather the hidden states of the tokens routed to this expert and
            # scale the expert output by the corresponding routing weight
            current_state = hidden_states[None, top_x].reshape(-1, hidden_dim)
            current_hidden_states = expert_layer(current_state) * topk_weight[top_x, idx, None]
            # `index_add_` only supports tensor indexing, so the `top_x` tensor is used here
            final_hidden_states.index_add_(0, top_x, current_hidden_states.to(hidden_states.dtype))
        return final_hidden_states
  • For the shared expert there is an extra shared_expert_gate, i.e. one more (sigmoid) weight applied to its output
  • moe_infer shows yet another way to implement the dispatch: it also loops over the experts, but the logic is much simpler than the earlier Deepseek-MoE version; a tiny standalone demo of the indexing trick follows below
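
To make the one_hot / permute / torch.where bookkeeping concrete, here is a tiny standalone demo with made-up sizes (4 experts, top-2 routing, 5 tokens); it only illustrates the dispatch indexing, not the full block:

import torch

num_experts = 4
# pretend gate output: for each of 5 tokens, the indices of its top-2 experts
topk_idx = torch.tensor([[0, 2], [1, 2], [0, 3], [2, 3], [1, 0]])

# (num_tokens, top_k, num_experts) -> (num_experts, top_k, num_tokens)
expert_mask = torch.nn.functional.one_hot(topk_idx, num_classes=num_experts).permute(2, 1, 0)

for expert_idx in range(num_experts):
    # idx: which top-k slot, top_x: which token
    idx, top_x = torch.where(expert_mask[expert_idx])
    print(f"expert {expert_idx} <- tokens {top_x.tolist()} (slots {idx.tolist()})")

# expert 0 <- tokens [0, 2, 4] (slots [0, 0, 1]): token 4 picked expert 0 as its second choice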

Closing remarks

That is another piece of MoE code shared. All roads really do lead to Rome: the process differs, yet the same result comes out. One last gripe: I cannot make sense of the name. What exactly does the 2.7B correspond to? If you know, please enlighten me. Thanks.
