Hand-coding the core modules of DeepSeek-V3: DeepSeekMoE & MLA


程序猿李巡天


程序猿李巡天 · Published 2025-02-05 19:42:36

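DeepSeek's two main architectural optimizations are MLA and DeepSeekMoE. In this post we combine principle and code: the code walkthrough and the hand-written implementation of both innovations are merged into one place, since they read better together.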

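MLA:

Core principle of Multi-Head Latent Attention:

The core idea is low-rank joint compression of the attention keys and values. This shrinks the KV cache that must be kept during inference and thereby improves inference efficiency (see the formulas in the original paper).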
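To make the cache saving concrete, here is a small back-of-the-envelope sketch. It borrows the dimension choices of the MLA class shown later in this post (`kv_proj_dim = 2*d_model // 3`, `qk_rope_dim = dh // 2`); the function name and the example sizes `d_model=1024`, `n_heads=16` are our own assumptions, chosen only for illustration.

```python
def kv_cache_numbers_per_token(d_model: int, n_heads: int) -> dict:
    """Numbers cached per token and per layer: standard MHA vs. the MLA layout in this post."""
    dh = d_model // n_heads
    kv_proj_dim = (2 * d_model) // 3   # width of the compressed KV latent (as in the MLA class)
    qk_rope_dim = dh // 2              # width of the shared RoPE key (as in the MLA class)
    mha = 2 * d_model                  # full keys plus full values for all heads
    mla = kv_proj_dim + qk_rope_dim    # compressed latent plus one decoupled RoPE key
    return {"mha": mha, "mla": mla, "ratio": mla / mha}


# Illustrative sizes only (assumption, not taken from the post)
sizes = kv_cache_numbers_per_token(d_model=1024, n_heads=16)
print(sizes)
```

With these toy sizes the MLA layout caches 714 numbers per token instead of 2048, roughly a third. The exact ratio depends entirely on the chosen latent width.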

Key:
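For reference, the key/value side of MLA can be written as below. This is our transcription of the formulas in the DeepSeek-V2 and DeepSeek-V3 technical reports, so treat the notation as an assumption: $\mathbf{h}_t$ is the attention input for token $t$, $\mathbf{c}_t^{KV}$ is the compressed latent, $n_h$ is the number of heads, and the superscripts $C$ and $R$ mark the compressed (no-RoPE) part and the decoupled RoPE part.

```latex
\begin{aligned}
\mathbf{c}_t^{KV} &= W^{DKV}\,\mathbf{h}_t \\
[\mathbf{k}_{t,1}^{C};\,\mathbf{k}_{t,2}^{C};\,\dots;\,\mathbf{k}_{t,n_h}^{C}] = \mathbf{k}_t^{C} &= W^{UK}\,\mathbf{c}_t^{KV} \\
\mathbf{k}_t^{R} &= \operatorname{RoPE}\!\left(W^{KR}\,\mathbf{h}_t\right) \\
\mathbf{k}_{t,i} &= [\mathbf{k}_{t,i}^{C};\,\mathbf{k}_t^{R}] \\
[\mathbf{v}_{t,1}^{C};\,\mathbf{v}_{t,2}^{C};\,\dots;\,\mathbf{v}_{t,n_h}^{C}] = \mathbf{v}_t^{C} &= W^{UV}\,\mathbf{c}_t^{KV}
\end{aligned}
```

Only $\mathbf{c}_t^{KV}$ and $\mathbf{k}_t^{R}$ need to be cached. In the code later in this post, `W_dkv` plays the role of $W^{DKV}$ and $W^{KR}$ stacked together, and `W_ukv` plays the role of $W^{UK}$ and $W^{UV}$ stacked together.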

Query:

The query goes through a similar low-rank compression:
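Again as a reference transcription from the DeepSeek-V2 and DeepSeek-V3 technical reports (the notation is an assumption on our side): $\mathbf{h}_t$ is the attention input for token $t$, $\mathbf{c}_t^{Q}$ is the compressed query latent, and $C$ / $R$ mark the no-RoPE and RoPE parts of each head $i$.

```latex
\begin{aligned}
\mathbf{c}_t^{Q} &= W^{DQ}\,\mathbf{h}_t \\
[\mathbf{q}_{t,1}^{C};\,\mathbf{q}_{t,2}^{C};\,\dots;\,\mathbf{q}_{t,n_h}^{C}] = \mathbf{q}_t^{C} &= W^{UQ}\,\mathbf{c}_t^{Q} \\
[\mathbf{q}_{t,1}^{R};\,\mathbf{q}_{t,2}^{R};\,\dots;\,\mathbf{q}_{t,n_h}^{R}] = \mathbf{q}_t^{R} &= \operatorname{RoPE}\!\left(W^{QR}\,\mathbf{c}_t^{Q}\right) \\
\mathbf{q}_{t,i} &= [\mathbf{q}_{t,i}^{C};\,\mathbf{q}_{t,i}^{R}]
\end{aligned}
```

Unlike the key, the RoPE part of the query is per head. The query compression does not affect the KV cache; it is a separate low-rank projection on the query side.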

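Attention:

The final attention output u_t is obtained per head: the scaled dot products of the query q_t with the keys k_j are normalized with softmax and used to weight the values v_j, and the concatenated head outputs then pass through the output projection.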
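Written out, following the DeepSeek-V2 and DeepSeek-V3 technical reports (our transcription; $\mathbf{q}_{t,i}$ and $\mathbf{k}_{j,i}$ are the per-head query and key with their RoPE parts concatenated, $\mathbf{v}_{j,i}^{C}$ is the per-head value, $d_h$ is the no-RoPE head dimension and $d_h^{R}$ the RoPE head dimension):

```latex
\begin{aligned}
\mathbf{o}_{t,i} &= \sum_{j=1}^{t} \operatorname{Softmax}_j\!\left(\frac{\mathbf{q}_{t,i}^{\top}\,\mathbf{k}_{j,i}}{\sqrt{d_h + d_h^{R}}}\right)\mathbf{v}_{j,i}^{C} \\
\mathbf{u}_t &= W^{O}\,[\mathbf{o}_{t,1};\,\mathbf{o}_{t,2};\,\dots;\,\mathbf{o}_{t,n_h}]
\end{aligned}
```

The sum runs over $j \le t$, which is the causal mask; the scale uses the full concatenated head width $d_h + d_h^{R}$.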

Key code:

This walkthrough uses a DeepSeek-V2 style MLA implementation.

```python
import torch


class MLA(torch.nn.Module):
    def __init__(self, d_model, n_heads, max_len=1024, rope_theta=10000.0):
        super().__init__()
        self.d_model = d_model
        self.n_heads = n_heads
        self.dh = d_model // n_heads
        self.q_proj_dim = d_model // 2
        self.kv_proj_dim = (2*d_model) // 3

        self.qk_nope_dim = self.dh // 2
        self.qk_rope_dim = self.dh // 2

        ## Q projections
        # Lora
        self.W_dq = torch.nn.Parameter(0.01*torch.randn((d_model, self.q_proj_dim)))
        self.W_uq = torch.nn.Parameter(0.01*torch.randn((self.q_proj_dim, self.d_model)))
        self.q_layernorm = torch.nn.LayerNorm(self.q_proj_dim)

        ## KV projections
        # Lora
        self.W_dkv = torch.nn.Parameter(0.01*torch.randn((d_model, self.kv_proj_dim + self.qk_rope_dim)))
        self.W_ukv = torch.nn.Parameter(0.01*torch.randn((self.kv_proj_dim,
                                                          self.d_model + (self.n_heads * self.qk_nope_dim))))
        self.kv_layernorm = torch.nn.LayerNorm(self.kv_proj_dim)

        # output projection
        self.W_o = torch.nn.Parameter(0.01*torch.randn((d_model, d_model)))

        # RoPE
        self.max_seq_len = max_len
        self.rope_theta = rope_theta

        # https://github.com/lucidrains/rotary-embedding-torch/tree/main
        # visualize emb later to make sure it looks ok
        # we do self.dh here instead of self.qk_rope_dim because its better
        freqs = 1.0 / (rope_theta ** (torch.arange(0, self.dh, 2).float() / self.dh))
        emb = torch.outer(torch.arange(self.max_seq_len).float(), freqs)
        cos_cached = emb.cos()[None, None, :, :]
        sin_cached = emb.sin()[None, None, :, :]

        # https://pytorch.org/docs/stable/generated/torch.nn.Module.html#torch.nn.Module.register_buffer
        # This is like a parameter but its a constant so we can use register_buffer
        self.register_buffer("cos_cached", cos_cached)
        self.register_buffer("sin_cached", sin_cached)
```

Key operations in MLA:

```python
        # Excerpt from MLA.forward; B comes from `B, S, D = x.size()` in the Query excerpt.
        if kv_cache is None:
            compressed_kv = x @ self.W_dkv
            KV_for_lora, K_for_rope = torch.split(compressed_kv,
                                                  [self.kv_proj_dim, self.qk_rope_dim],
                                                  dim=-1)
            KV_for_lora = self.kv_layernorm(KV_for_lora)
        else:
            new_kv = x @ self.W_dkv
            compressed_kv = torch.cat([kv_cache, new_kv], dim=1)
            new_kv, new_K_for_rope = torch.split(new_kv,
                                                 [self.kv_proj_dim, self.qk_rope_dim],
                                                 dim=-1)
            old_kv, old_K_for_rope = torch.split(kv_cache,
                                                 [self.kv_proj_dim, self.qk_rope_dim],
                                                 dim=-1)
            new_kv = self.kv_layernorm(new_kv)
            old_kv = self.kv_layernorm(old_kv)
            KV_for_lora = torch.cat([old_kv, new_kv], dim=1)
            K_for_rope = torch.cat([old_K_for_rope, new_K_for_rope], dim=1)

        KV = KV_for_lora @ self.W_ukv
        KV = KV.view(B, -1, self.n_heads, self.dh+self.qk_nope_dim).transpose(1,2)
        K, V = torch.split(KV, [self.qk_nope_dim, self.dh], dim=-1)
        S_full = K.size(2)

        # K Rope
        K_for_rope = K_for_rope.view(B, -1, 1, self.qk_rope_dim).transpose(1,2)
        cos_k = self.cos_cached[:, :, :S_full, :self.qk_rope_dim//2].repeat(1, 1, 1, 2)
        sin_k = self.sin_cached[:, :, :S_full, :self.qk_rope_dim//2].repeat(1, 1, 1, 2)
        K_for_rope = apply_rope_x(K_for_rope, cos_k, sin_k)

        # apply position encoding to each head
        K_for_rope = K_for_rope.repeat(1, self.n_heads, 1, 1)
```
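The helper `apply_rope_x` is called above but not shown in this post. A plausible definition, assuming the common "rotate half" convention (which matches the `.repeat(1, 1, 1, 2)` layout of the cos/sin tables), could look like the sketch below. `rotate_half` is our own helper name, and the demo builds its frequency table directly from `rope_dim`, which is a simplification of what the class does.

```python
import torch


def rotate_half(x: torch.Tensor) -> torch.Tensor:
    """Map the two halves (x1, x2) of the last dimension to (-x2, x1)."""
    x1, x2 = x.chunk(2, dim=-1)
    return torch.cat((-x2, x1), dim=-1)


def apply_rope_x(x: torch.Tensor, cos: torch.Tensor, sin: torch.Tensor) -> torch.Tensor:
    """Rotate each (x1[i], x2[i]) pair by the position-dependent angle encoded in cos/sin."""
    return x * cos + rotate_half(x) * sin


# Small demo with made-up sizes (assumptions for illustration only)
torch.manual_seed(0)
rope_dim, max_len, theta = 8, 16, 10000.0
freqs = 1.0 / (theta ** (torch.arange(0, rope_dim, 2).float() / rope_dim))
emb = torch.outer(torch.arange(max_len).float(), freqs)   # [max_len, rope_dim // 2]
cos = emb.cos()[None, None].repeat(1, 1, 1, 2)            # [1, 1, max_len, rope_dim]
sin = emb.sin()[None, None].repeat(1, 1, 1, 2)

x = torch.randn(2, 1, max_len, rope_dim)                  # [batch, heads, seq, rope_dim]
x_rot = apply_rope_x(x, cos, sin)
```

Because this is a pure rotation, the vector norm at every position is unchanged, and position 0 is left untouched since its angle is zero.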

Query operations in MLA:

```python
        B, S, D = x.size()

        # Q Projections
        compressed_q = x @ self.W_dq
        compressed_q = self.q_layernorm(compressed_q)
        Q = compressed_q @ self.W_uq
        Q = Q.view(B, -1, self.n_heads, self.dh).transpose(1,2)
        Q, Q_for_rope = torch.split(Q, [self.qk_nope_dim, self.qk_rope_dim], dim=-1)

        # Q Decoupled RoPE
        cos_q = self.cos_cached[:, :, past_length:past_length+S, :self.qk_rope_dim//2].repeat(1, 1, 1, 2)
        sin_q = self.sin_cached[:, :, past_length:past_length+S, :self.qk_rope_dim//2].repeat(1, 1, 1, 2)
        Q_for_rope = apply_rope_x(Q_for_rope, cos_q, sin_q)
```

Final attention output:

```python
        # split into multiple heads
        q_heads = torch.cat([Q, Q_for_rope], dim=-1)
        k_heads = torch.cat([K, K_for_rope], dim=-1)
        v_heads = V # already reshaped before the split

        # make attention mask
        mask = torch.ones((S,S_full), device=x.device)
        mask = torch.tril(mask, diagonal=past_length)
        mask = mask[None, None, :, :]

        sq_mask = mask == 1

        # attention
        x = torch.nn.functional.scaled_dot_product_attention(
            q_heads, k_heads, v_heads,
            attn_mask=sq_mask
        )

        x = x.transpose(1, 2).reshape(B, S, D)

        # apply projection
        x = x @ self.W_o.T

        return x, compressed_kv
```
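The mask construction above relies on `torch.tril(..., diagonal=past_length)` keeping entries where `col <= row + past_length`. A tiny pure-Python sketch of the same rule (the function name `causal_mask` is ours, for illustration) makes it easy to see what the mask looks like when a KV cache is present:

```python
def causal_mask(new_len: int, past_length: int) -> list[list[bool]]:
    """Boolean mask of shape [new_len, past_length + new_len]; True means "may attend".

    Row r is the r-th new token, sitting at absolute position past_length + r,
    so it may look at columns 0 .. past_length + r (itself included).
    """
    total_len = past_length + new_len
    return [[col <= row + past_length for col in range(total_len)]
            for row in range(new_len)]


# Two new tokens arriving after three cached tokens
for row in causal_mask(new_len=2, past_length=3):
    print(["x" if allowed else "." for allowed in row])
```

With `past_length=0` this reduces to the usual lower-triangular causal mask; with a cache, every new token sees all cached positions plus the new tokens up to itself.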

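Next, we walk through the DeepSeekMoE module.

DeepSeekMoE:

Put simply: the expert networks are split into routed experts and shared experts. The shared experts all take part in the final output for every token, while among the routed experts only the top-k "most relevant" ones, as chosen by the router, take part.

With that, the goal is clear. We take the two core modules apart in turn:

1) Router 2) Expert

Router:

The router controls the output weight of each expert network. We have added comments to the key parts of this code.

It outputs the weights and indices of the top-k experts.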

```python
from typing import Tuple

import torch
from torch import nn

# ModelArgs and linear() are defined elsewhere in the DeepSeek-V3 inference code.


class Gate(nn.Module):
    """
    Gating mechanism for routing inputs in a mixture-of-experts (MoE) model.

    Attributes:
        dim (int): Dimensionality of input features.
        topk (int): Number of top experts activated for each input.
        n_groups (int): Number of groups for routing.
        topk_groups (int): Number of groups to route inputs to.
        score_func (str): Scoring function ('softmax' or 'sigmoid').
        route_scale (float): Scaling factor for routing weights.
        weight (torch.nn.Parameter): Learnable weights for the gate.
        bias (Optional[torch.nn.Parameter]): Optional bias term for the gate.
    """
    def __init__(self, args: ModelArgs):
        """
        Initializes the Gate module.

        Args:
            args (ModelArgs): Model arguments containing gating parameters.
        """
        super().__init__()
        self.dim = args.dim
        self.topk = args.n_activated_experts
        self.n_groups = args.n_expert_groups
        self.topk_groups = args.n_limited_groups
        self.score_func = args.score_func
        self.route_scale = args.route_scale
        self.weight = nn.Parameter(torch.empty(args.n_routed_experts, args.dim))
        self.bias = nn.Parameter(torch.empty(args.n_routed_experts)) if self.dim == 7168 else None

    def forward(self, x: torch.Tensor) -> Tuple[torch.Tensor, torch.Tensor]:
        """
        Forward pass for the gating mechanism.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            Tuple[torch.Tensor, torch.Tensor]: Routing weights and selected expert indices.
        """
        # Compute scores: a linear transform of the input with the gate weights.
        scores = linear(x, self.weight)

        # Squash the scores with softmax or sigmoid, depending on the scoring function.
        if self.score_func == "softmax":
            scores = scores.softmax(dim=-1, dtype=torch.float32)
        else:
            scores = scores.sigmoid()
        original_scores = scores

        # If a bias term exists, add it to the scores (used for selection only).
        if self.bias is not None:
            scores = scores + self.bias
        if self.n_groups > 1:
            scores = scores.view(x.size(0), self.n_groups, -1)
            if self.bias is None:
                group_scores = scores.amax(dim=-1)
            else:
                group_scores = scores.topk(2, dim=-1)[0].sum(dim=-1)
            # Indices of the top-scoring groups.
            indices = group_scores.topk(self.topk_groups, dim=-1)[1]
            mask = torch.zeros_like(scores[..., 0]).scatter_(1, indices, True)
            scores = (scores * mask.unsqueeze(-1)).flatten(1)

        # Select the indices of the top-k experts for each input.
        indices = torch.topk(scores, self.topk, dim=-1)[1]

        # Gather the selected experts' weights from the original (bias-free) scores.
        weights = original_scores.gather(1, indices)
        if self.score_func == "sigmoid":
            weights /= weights.sum(dim=-1, keepdim=True)

        # Apply the routing scale factor.
        weights *= self.route_scale
        return weights.type_as(x), indices
```
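To see the group-limited top-k selection in isolation, here is a minimal single-token sketch in plain Python. It mirrors the sigmoid branch of the gate above (bias-adjusted scores pick the experts, the original scores give the weights), but `route_token` and the toy logits are our own illustrative assumptions, not part of the DeepSeek code.

```python
import math


def route_token(logits, n_groups, topk_groups, topk, bias=None, route_scale=1.0):
    """Return (weights, indices) of the selected experts for one token."""
    original = [1.0 / (1.0 + math.exp(-z)) for z in logits]      # sigmoid scores
    scores = list(original)
    if bias is not None:                                          # bias only affects selection
        scores = [s + b for s, b in zip(scores, bias)]
    if n_groups > 1:
        size = len(scores) // n_groups
        groups = [scores[g * size:(g + 1) * size] for g in range(n_groups)]
        if bias is None:
            group_scores = [max(g) for g in groups]
        else:
            group_scores = [sum(sorted(g, reverse=True)[:2]) for g in groups]
        ranked = sorted(range(n_groups), key=lambda g: group_scores[g], reverse=True)
        keep = set(ranked[:topk_groups])
        # Zero out experts whose group was not selected, as the mask does above
        scores = [s if i // size in keep else 0.0 for i, s in enumerate(scores)]
    indices = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:topk]
    weights = [original[i] for i in indices]
    total = sum(weights)
    weights = [w / total * route_scale for w in weights]          # normalize, then scale
    return weights, indices


# 8 experts in 4 groups of 2; only the best 2 groups may contribute experts
logits = [2.0, -1.0, 1.0, -2.0, 1.5, 0.4, 3.0, -3.0]
weights, indices = route_token(logits, n_groups=4, topk_groups=2, topk=3)
print(indices, weights)
```

In this toy case expert 4 has the third-highest score overall, yet it is skipped because its group is not among the two best groups; the selection falls back to expert 1 from a selected group instead.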

Expert network:

A simple, basic expert network implementation with no extra processing or optimization.

```python
import torch
import torch.nn.functional as F
from torch import nn

# Linear is the custom linear layer defined elsewhere in the DeepSeek-V3 inference code.


class Expert(nn.Module):
    """
    Expert layer for Mixture-of-Experts (MoE) models.

    Attributes:
        w1 (nn.Module): Linear layer for input-to-hidden transformation.
        w2 (nn.Module): Linear layer for hidden-to-output transformation.
        w3 (nn.Module): Additional linear layer for feature transformation.
    """
    def __init__(self, dim: int, inter_dim: int):
        """
        Initializes the Expert layer.

        Args:
            dim (int): Input and output dimensionality.
            inter_dim (int): Hidden layer dimensionality.
        """
        super().__init__()
        self.w1 = Linear(dim, inter_dim)
        self.w2 = Linear(inter_dim, dim)
        self.w3 = Linear(dim, inter_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass for the Expert layer.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output tensor after expert computation.
        """
        return self.w2(F.silu(self.w1(x)) * self.w3(x))
```
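The one-line `forward` above is a gated feed-forward block: `w1` feeds a SiLU gate, `w3` gives the "up" projection, their element-wise product is projected back by `w2`. A dependency-free sketch on plain lists, with tiny hand-picked weight matrices that are purely illustrative assumptions, shows the data flow:

```python
import math


def silu(z: float) -> float:
    """SiLU activation: z * sigmoid(z)."""
    return z / (1.0 + math.exp(-z))


def matvec(W, x):
    """Multiply matrix W (list of rows) with vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def expert_forward(x, w1, w2, w3):
    """w2 @ (silu(w1 @ x) * (w3 @ x)), the same formula as Expert.forward."""
    gate = [silu(h) for h in matvec(w1, x)]        # gating branch
    up = matvec(w3, x)                             # linear "up" branch
    hidden = [g * u for g, u in zip(gate, up)]     # element-wise gating
    return matvec(w2, hidden)                      # project back to dim


# dim = 2, inter_dim = 3; weights chosen by hand for readability
w1 = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
w3 = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
w2 = [[1.0, 1.0, 1.0], [0.0, 0.0, 0.0]]
out = expert_forward([1.0, 2.0], w1, w2, w3)
print(out)
```

Here the third hidden unit contributes nothing because its gate input is 0 and `silu(0) = 0`, which illustrates how the `w1` branch can switch individual hidden features on or off.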

MoE:

Finally, we return to the overall MoE architecture.

Each selected expert's output is multiplied by its gate weight from the previous step and the results are summed; the shared experts' output is then added, giving the final output of the whole MoE layer.

```python
import torch
import torch.distributed as dist
from torch import nn

# ModelArgs, MLP, world_size and rank are defined elsewhere in the DeepSeek-V3 inference code.


class MoE(nn.Module):
    """
    Mixture-of-Experts (MoE) module.

    Attributes:
        dim (int): Dimensionality of input features.
        n_routed_experts (int): Total number of experts in the model.
        n_local_experts (int): Number of experts handled locally in distributed systems.
        n_activated_experts (int): Number of experts activated for each input.
        gate (nn.Module): Gating mechanism to route inputs to experts.
        experts (nn.ModuleList): List of expert modules.
        shared_experts (nn.Module): Shared experts applied to all inputs.
    """
    def __init__(self, args: ModelArgs):
        """
        Initializes the MoE module.

        Args:
            args (ModelArgs): Model arguments containing MoE parameters.
        """
        super().__init__()
        self.dim = args.dim
        assert args.n_routed_experts % world_size == 0
        self.n_routed_experts = args.n_routed_experts
        self.n_local_experts = args.n_routed_experts // world_size
        self.n_activated_experts = args.n_activated_experts
        self.experts_start_idx = rank * self.n_local_experts
        self.experts_end_idx = self.experts_start_idx + self.n_local_experts
        self.gate = Gate(args)
        self.experts = nn.ModuleList([Expert(args.dim, args.moe_inter_dim) if self.experts_start_idx <= i < self.experts_end_idx else None
                                      for i in range(self.n_routed_experts)])
        self.shared_experts = MLP(args.dim, args.n_shared_experts * args.moe_inter_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        """
        Forward pass for the MoE module.

        Args:
            x (torch.Tensor): Input tensor.

        Returns:
            torch.Tensor: Output tensor after expert routing and computation.
        """
        shape = x.size()
        x = x.view(-1, self.dim)
        weights, indices = self.gate(x)
        y = torch.zeros_like(x)
        counts = torch.bincount(indices.flatten(), minlength=self.n_routed_experts).tolist()
        for i in range(self.experts_start_idx, self.experts_end_idx):
            if counts[i] == 0:
                continue
            expert = self.experts[i]
            idx, top = torch.where(indices == i)
            y[idx] += expert(x[idx]) * weights[idx, top, None]
        z = self.shared_experts(x)
        if world_size > 1:
            dist.all_reduce(y)
        return (y + z).view(shape)
```
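The dispatch loop in `forward` can be hard to read at first because of the `torch.where` indexing. The following toy version on plain lists follows the same pattern (loop over experts, find the tokens routed to each, accumulate weighted outputs, then add the shared expert). All names, the scaling "experts" and the routing table are made up for illustration; there is no distributed part.

```python
def make_scaling_expert(factor: float):
    """Toy expert that just scales its input (stand-in for a real Expert module)."""
    return lambda x: [factor * v for v in x]


def moe_forward(tokens, weights, indices, routed_experts, shared_expert):
    """tokens[t] is a vector; indices[t] / weights[t] list the experts chosen for token t."""
    y = [[0.0] * len(x) for x in tokens]
    for i, expert in enumerate(routed_experts):
        for t, token_experts in enumerate(indices):
            for slot, chosen in enumerate(token_experts):
                if chosen == i:                              # token t was routed to expert i
                    out = expert(tokens[t])
                    w = weights[t][slot]
                    y[t] = [a + w * b for a, b in zip(y[t], out)]
    # The shared expert sees every token and is added without a gate weight
    return [[a + b for a, b in zip(y_t, shared_expert(x))] for y_t, x in zip(y, tokens)]


routed = [make_scaling_expert(1.0), make_scaling_expert(2.0), make_scaling_expert(3.0)]
shared = make_scaling_expert(1.0)
tokens = [[1.0, 2.0], [3.0, 4.0]]
indices = [[0, 2], [1, 2]]            # top-2 experts per token (assumed gate output)
weights = [[0.25, 0.75], [0.5, 0.5]]  # matching gate weights (assumed gate output)
result = moe_forward(tokens, weights, indices, routed, shared)
print(result)
```

The real implementation batches all tokens of one expert into a single call (`expert(x[idx])`), which is what makes the per-expert loop worthwhile; the toy version calls the expert per token only to keep the bookkeeping visible.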

