
What is the Grouped-Query Attention technique used in DeepSeek?
This article covers the concept and principle of Grouped-Query Attention, along with a Python implementation.
1. Concept
The recent surge of interest in DeepSeek has brought renewed attention to many of the techniques used in the DeepSeek model family, and Grouped-Query Attention (GQA), applied in DeepSeek LLM, is one of them. GQA was originally proposed by Google in the 2023 paper "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints". This article details the concept and principle of GQA.
The paper's abstract, translated:
Multi-query attention (MQA), which uses only a single key-value head, drastically speeds up decoder inference. However, MQA can lead to quality degradation, and moreover it may not be desirable to train a separate model just for faster inference. We (1) propose a recipe for uptraining existing multi-head language model checkpoints into models with MQA using only 5% of the original pre-training compute, and (2) introduce grouped-query attention (GQA), a generalization of multi-query attention which uses an intermediate number of key-value heads (more than one, fewer than the number of query heads). We show that uptrained GQA achieves quality close to multi-head attention at a speed comparable to MQA.
2. Core Principle
2.1 Uptraining
The authors argue that generating a multi-query model from a multi-head model takes only two steps: first, converting the checkpoint; second, additional pre-training that lets the model adapt to its new structure. For the conversion, the projection matrices of the key and value heads are mean-pooled into a single projection matrix, which works better than either selecting one of the existing key/value heads or randomly initializing new key and value heads from scratch.
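The mean-pooling conversion step can be sketched in a few lines of PyTorch. The shapes and variable names below are hypothetical, chosen only to illustrate pooling H per-head key projections into one shared projection (MQA) or one projection per group (GQA):

```python
import torch

# Hypothetical MHA checkpoint: H key heads, each with its own projection
# weight of shape (head_dim, embed_dim).
H, head_dim, embed_dim = 8, 64, 512
k_proj_per_head = torch.randn(H, head_dim, embed_dim)

# MQA conversion: mean-pool all H key projections into a single shared one.
k_proj_mqa = k_proj_per_head.mean(dim=0)  # (head_dim, embed_dim)

# GQA-G conversion: mean-pool within each of G groups instead.
G = 2
k_proj_gqa = k_proj_per_head.reshape(G, H // G, head_dim, embed_dim).mean(dim=1)
print(k_proj_mqa.shape)  # torch.Size([64, 512])
print(k_proj_gqa.shape)  # torch.Size([2, 64, 512])
```

The value projections are pooled the same way; the converted checkpoint is then uptrained briefly so the model can adapt to the new structure.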
2.2 Grouped-Query Attention
GQA divides the query heads into G groups, each of which shares a single key head and value head; GQA-G denotes grouped-query attention with G groups. For example, GQA-1 has a single group and therefore a single key and value head, which is equivalent to MQA, while GQA-H, with as many groups as heads, is equivalent to MHA.
When converting a multi-head checkpoint into a GQA checkpoint, the authors construct each group's key and value head by mean-pooling all of the original heads in that group. An intermediate number of groups G yields an interpolated model that is higher quality than MQA but faster than MHA. Going from MHA to MQA reduces H key and value heads to a single key and value head, shrinking the key-value cache and cutting the amount of data that must be loaded by a factor of H. However, larger models generally scale up the number of heads, so MQA represents a more aggressive cut in both memory bandwidth and capacity. GQA instead keeps the same proportional reduction in bandwidth and capacity as model size grows. In addition, GQA is not applied to the encoder self-attention layers, since encoder representations are computed in parallel and memory bandwidth is not the main bottleneck there.
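The H-fold cache reduction is easy to see with a back-of-the-envelope calculation. All numbers below are illustrative, not taken from any particular model:

```python
# Rough KV-cache size for a decoder, in bytes: 2x for keys and values,
# times batch, layers, sequence length, KV heads, head dim, and bytes/element.
def kv_cache_bytes(num_kv_heads, head_dim=128, seq_len=4096,
                   num_layers=32, batch=1, bytes_per_elem=2):
    return (2 * batch * num_layers * seq_len
            * num_kv_heads * head_dim * bytes_per_elem)

H = 32                   # query heads
mha = kv_cache_bytes(H)  # MHA: one KV head per query head
gqa = kv_cache_bytes(8)  # GQA-8: 8 KV heads -> 4x smaller cache
mqa = kv_cache_bytes(1)  # MQA: a single KV head -> 32x smaller cache
print(mha // gqa, mha // mqa)  # -> 4 32
```

This is why GQA with a moderate G recovers most of MQA's bandwidth savings while keeping more representational capacity in the key/value heads.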
3. Python Implementation
Below is an unofficial PyTorch implementation of GQA taken from an open-source GitHub repository.
from typing import Optional, Tuple, Union

import torch
import torch.nn.functional as F
from einops import einsum, rearrange
from torch import Tensor, nn


def scaled_dot_product_gqa(
    query: Tensor,
    key: Tensor,
    value: Tensor,
    dropout: float = 0.0,
    scale: Optional[float] = None,
    mask: Optional[Tensor] = None,
    is_causal: Optional[bool] = None,
    need_weights: bool = False,
    average_attn_weights: bool = False,
    force_grouped: bool = False,
):
    """Scaled dot product attention with support for grouped queries.

    Einstein notation:
    - b: batch size
    - n / s: sequence length
    - h: number of heads
    - g: number of groups
    - d: dimension of query/key/value

    Args:
        query: Query tensor of shape (b, n, h, d)
        key: Key tensor of shape (b, s, h, d)
        value: Value tensor of shape (b, s, h, d)
        dropout: Dropout probability (default: 0.0)
        scale: Scale factor for query (default: d_query ** 0.5)
        mask: Mask tensor of shape (b, n, s) or (b, s). If 'ndim == 2', the mask is
            applied to all 'n' rows of the attention matrix. (default: None)
        force_grouped: If True, apply grouped-query attention even if the number of
            heads is equal for query, key, and value. (default: False)

    Returns:
        2-tuple of:
        - Attention output with shape (b, n, h, d)
        - (Optional) Attention weights with shape (b, n, s, h). Only returned if
          'need_weights' is True.
    """
    if (mask is not None) and (is_causal is not None):
        raise ValueError(
            "Only one of 'mask' and 'is_causal' should be provided, but got both."
        )
    elif not query.ndim == key.ndim == value.ndim == 4:
        raise ValueError(
            f"Expected query, key, and value to be 4-dimensional, but got shapes "
            f"{query.shape}, {key.shape}, and {value.shape}."
        )

    # Move sequence length dimension to axis 2.
    # This makes the attention operations below *much* faster.
    query = rearrange(query, "b n h d -> b h n d")
    key = rearrange(key, "b s h d -> b h s d")
    value = rearrange(value, "b s h d -> b h s d")

    bq, hq, nq, dq = query.shape
    bk, hk, nk, dk = key.shape
    bv, hv, nv, dv = value.shape
    if not (bq == bk == bv and dq == dk == dv):
        raise ValueError(
            "Expected query, key, and value to have the same batch size (dim=0) and "
            f"embedding dimension (dim=3), but got query: {query.shape}, "
            f"key: {key.shape}, and value: {value.shape}."
        )
    elif (hk != hv) or (nk != nv):
        raise ValueError(
            "Expected key and value to have the same size in dimensions 1 and 2, but "
            f"got key: {key.shape} and value: {value.shape}."
        )
    elif hq % hk != 0:
        raise ValueError(
            "Expected query heads to be a multiple of key/value heads, but got "
            f"query: {query.shape} and key/value: {key.shape}."
        )

    if scale is None:
        scale = query.size(-1) ** 0.5
    query = query / scale

    num_head_groups = hq // hk
    query = rearrange(query, "b (h g) n d -> b g h n d", g=num_head_groups)
    similarity = einsum(query, key, "b g h n d, b h s d -> b g h n s")

    if is_causal:
        # Mask out the upper triangular portion of the attention matrix. This
        # prevents the model from attending to tokens in the future.
        mask = torch.ones((bq, nq, nk), device=query.device, dtype=torch.bool).tril_()

    if mask is not None:
        # Expand mask to match the shape of the attention matrix.
        # If mask is 2D, assume that it is applied to the key/value sequence dimension.
        # Else if mask is 3D, assume that it is applied to the query/key/value sequence
        # dimension for all attention heads.
        #
        # Users could also provide a 4D mask, which is applied to the query/key/value
        # sequence dimension for each attention head (though I don't have a particular
        # use case in mind for that).
        if mask.ndim == 2:
            mask = rearrange(mask, "b s -> b () () () s")
        elif mask.ndim == 3:
            mask = rearrange(mask, "b n s -> b () () n s")
        # Mask similarity values by setting them to negative infinity. This guarantees
        # that they will not contribute to the softmax computation below.
        similarity.masked_fill_(~mask, torch.finfo(similarity.dtype).min)

    attention = F.softmax(similarity, dim=-1)
    if dropout > 0.0:
        attention = F.dropout(attention, p=dropout)

    # Apply attention matrix to the value Tensor.
    out = einsum(attention, value, "b g h n s, b h s d -> b g h n d")
    # Move head dimension back to axis 2.
    out = rearrange(out, "b g h n d -> b n (h g) d")

    attn_weights: Optional[Tensor] = None
    if need_weights:
        # Move the sequence dimensions back to positions 1, 2. Move the head dimension
        # to position 3. This more closely matches the return shape of the attention
        # output: (b, n, h, d).
        attn_weights = rearrange(attention, "b g h n s -> b n s (h g)")
        if average_attn_weights:
            attn_weights = attn_weights.mean(dim=1)

    return out, attn_weights
class MultiheadGQA(nn.Module):
    """Multi-head grouped query attention (GQA) layer.

    Reference:
        "GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints"
        https://arxiv.org/pdf/2305.13245v1.pdf

    GQA is a variant of multihead attention (MHA) that uses fewer write heads
    (key / value) than query heads. GQA can be viewed as a generalization of
    multi-query attention (MQA), which uses a single write head. GQA and MQA give
    significant speedups over standard MHA in decoder layers, with minimal loss in
    accuracy. In the paper, GQA is shown to be more accurate than MQA, while still
    having a significant speedup over MHA.

    NOTE: The original authors only benchmark GQA by adapting the T5 (XL or XXL) model
    from MHA to GQA. As a result, they do not mention parameter initialization or
    layer normalization strategies. I follow the best practices laid out in the
    MAGNETO paper, which improves Transformer performance through better parameter
    initialization and layer norm placement. See:
        https://arxiv.org/pdf/2210.06423.pdf, Fig. 2
    """

    def __init__(
        self,
        embed_dim: int,
        query_heads: int,
        kv_heads: int,
        dropout: float = 0.0,
        bias: bool = True,
        layer_norm: bool = True,
        layer_norm_eps: float = 1e-5,
        gamma_init: float = 1.0,
        device: Optional[Union[torch.device, str]] = None,
        dtype: Optional[torch.dtype] = None,
    ):
        super().__init__()
        self.query_heads = query_heads
        self.kv_heads = kv_heads
        self.dropout = dropout
        self.layer_norm = layer_norm
        self.gamma_init = gamma_init

        if self.query_heads % self.kv_heads != 0:
            raise ValueError(
                f"query_heads ({query_heads}) must be divisible by "
                f"kv_heads ({kv_heads})"
            )
        elif (embed_dim % self.query_heads != 0) or (embed_dim % self.kv_heads != 0):
            raise ValueError(
                f"embed_dim ({embed_dim}) must be divisible by "
                f"query_heads ({query_heads}) and kv_heads ({kv_heads})"
            )

        head_dim = embed_dim // query_heads
        if not head_dim % 8 == 0:
            raise ValueError(
                f"head_dim (embed_dim / num_heads = {head_dim}) must be divisible by 8"
            )
        if not head_dim <= 128:
            raise ValueError(
                f"head_dim (embed_dim / num_heads = {head_dim}) must be <= 128"
            )

        # Query projection layer is the same as in vanilla MHA.
        self.q_proj = nn.Linear(
            embed_dim, embed_dim, bias=bias, device=device, dtype=dtype
        )
        # Key/value projection layers have a smaller output dimension, so that
        # we have fewer key/value attention heads after reshaping.
        kv_embed_dim = embed_dim // query_heads * kv_heads
        self.k_proj = nn.Linear(
            embed_dim, kv_embed_dim, bias=bias, device=device, dtype=dtype
        )
        self.v_proj = nn.Linear(
            embed_dim, kv_embed_dim, bias=bias, device=device, dtype=dtype
        )
        self.norm: Optional[nn.LayerNorm] = None
        if layer_norm:
            self.norm = nn.LayerNorm(
                embed_dim, eps=layer_norm_eps, device=device, dtype=dtype
            )
        # The grouped attention output folds the query heads back together, so it
        # has the same embedding dimension as the query Tensor (embed_dim), and the
        # output projection maps embed_dim -> embed_dim as in vanilla MHA.
        self.out_proj = nn.Linear(
            embed_dim, embed_dim, bias=bias, device=device, dtype=dtype
        )

        self._reset_parameters()

    def _reset_parameters(self):
        nn.init.xavier_normal_(self.q_proj.weight)
        if self.q_proj.bias is not None:
            nn.init.constant_(self.q_proj.bias, 0)
        nn.init.xavier_normal_(self.k_proj.weight)
        if self.k_proj.bias is not None:
            nn.init.constant_(self.k_proj.bias, 0)
        # NOTE: We follow the initialization strategy from MAGNETO. See:
        # https://arxiv.org/pdf/2210.06423.pdf, Fig. 2
        # Gain (self.gamma_init) should be provided as a keyword argument when
        # initializing the larger Transformer model, since it requires knowledge
        # of the number of encoder/decoder layers in the model.
        nn.init.xavier_normal_(self.v_proj.weight, gain=self.gamma_init)
        if self.v_proj.bias is not None:
            nn.init.constant_(self.v_proj.bias, 0)
        nn.init.xavier_normal_(self.out_proj.weight, gain=self.gamma_init)
        if self.out_proj.bias is not None:
            nn.init.constant_(self.out_proj.bias, 0)

    def forward(
        self,
        query: Tensor,
        key: Tensor,
        value: Tensor,
        need_weights: bool = False,
        # TODO
        # attn_mask: Optional[Tensor] = None,
        is_causal: bool = False,
        average_attn_weights: bool = False,
    ) -> Tuple[Tensor, Optional[Tensor]]:
        # Notation:
        # b - batch size
        # n - sequence length
        # h - number of heads
        # d - embedding dimension
        #
        # Input shape: (b, n, d)
        q: Tensor = self.q_proj(query)
        k: Tensor = self.k_proj(key)
        v: Tensor = self.v_proj(value)

        # Unfold 'd' dimension into 'h' separate attention heads.
        q = rearrange(q, "b n (h d) -> b n h d", h=self.query_heads)
        k = rearrange(k, "b n (h d) -> b n h d", h=self.kv_heads)
        v = rearrange(v, "b n (h d) -> b n h d", h=self.kv_heads)
        # Apply attention, then fold 'h' attention heads back into 'd'.
        x, attn = scaled_dot_product_gqa(
            query=q,
            key=k,
            value=v,
            # TODO
            # mask=attn_mask,
            is_causal=is_causal,
            need_weights=need_weights,
            average_attn_weights=average_attn_weights,
            force_grouped=False,
        )
        x = rearrange(x, "b n h d -> b n (h d)")

        # NOTE: This is different from 'nn.MultiheadAttention'! We follow the MAGNETO
        # architecture (https://arxiv.org/pdf/2210.06423.pdf), which applies an extra
        # layer norm before the linear output projection. The cross-attention layer in
        # the MAGNETO decoder does not include this layer norm, so users have the
        # option to disable it (layer_norm=False).
        if self.layer_norm:
            assert self.norm is not None
            x = self.norm(x)
        # Linear projection on attention outputs.
        x = self.out_proj(x)

        return x, attn
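As a quick, self-contained sanity sketch (independent of the implementation above), grouped-query attention can also be emulated with stock PyTorch by repeating each key/value head across its query group and running ordinary scaled-dot-product attention; the shapes here are arbitrary, chosen only for illustration:

```python
import torch
import torch.nn.functional as F

# 8 query heads sharing 2 KV heads, i.e. GQA with 2 groups of 4 query heads.
b, n, hq, hk, d = 2, 5, 8, 2, 16
q = torch.randn(b, hq, n, d)
k = torch.randn(b, hk, n, d)
v = torch.randn(b, hk, n, d)

# Repeat each of the hk KV heads (hq // hk) times so shapes line up with q,
# then run standard multi-head attention over the expanded heads.
k_rep = k.repeat_interleave(hq // hk, dim=1)
v_rep = v.repeat_interleave(hq // hk, dim=1)
out = F.scaled_dot_product_attention(q, k_rep, v_rep)
print(out.shape)  # torch.Size([2, 8, 5, 16])
```

The repeat-based formulation trades away the memory savings (the KV heads are materialized hq // hk times), so it is only useful for checking correctness, not for fast inference.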