How to Enable Gradient Checkpointing in DeepSpeed: A Bilingual Chinese-English Introduction
Chinese Version
Enabling gradient checkpointing in DeepSpeed can significantly reduce GPU memory usage when training large-scale models, especially when memory is limited. DeepSpeed provides a simple, configuration-based way to enable this feature.
How to Enable Gradient Checkpointing in DeepSpeed
Configuration file setup:
DeepSpeed enables gradient checkpointing via the gradient_checkpointing option in its configuration file (usually a JSON file). Add the following settings:
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 4,
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true
  },
  "gradient_checkpointing": true
}
Key fields explained:
- "gradient_checkpointing": true enables gradient checkpointing.
- "zero_optimization": { "stage": 2 } enables ZeRO optimization, which is commonly combined with gradient checkpointing to make better use of GPU memory.
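When the model comes from Hugging Face Transformers, checkpointing is usually also switched on at the model level. The call below is a common companion step for this kind of setup, shown here as an assumption rather than as part of the configuration above:

from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Hugging Face models expose a model-level switch that wraps each transformer
# block with activation checkpointing; combining it with the DeepSpeed config
# above is a common pattern (an assumption of this sketch, not a requirement from the article).
model.gradient_checkpointing_enable()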
How gradient checkpointing works:
- Once gradient checkpointing is enabled, DeepSpeed discards some layers' intermediate activations during the forward pass instead of keeping them in memory, and recomputes them when the backward pass needs them. This reduces memory usage but adds computation, because the discarded activations must be recomputed during backpropagation; a minimal illustration of this drop-and-recompute pattern follows below.
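The sketch below uses PyTorch's own torch.utils.checkpoint, the mechanism this technique builds on, to show the basic pattern: the wrapped block does not keep its intermediate activations and reruns its forward pass during backward. The block is a made-up example module, not part of the article's model:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

# A toy block: without checkpointing, its intermediate activations would be
# kept in memory from the forward pass until the backward pass.
block = nn.Sequential(nn.Linear(1024, 4096), nn.GELU(), nn.Linear(4096, 1024))
x = torch.randn(8, 1024, requires_grad=True)

# checkpoint() runs the block without storing its intermediate activations,
# then reruns the block's forward pass during backward() to recompute them.
y = checkpoint(block, x, use_reentrant=False)
y.sum().backward()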
Choosing which layers to checkpoint:
Which layers are checkpointed depends on how checkpointing is applied to the model; a common choice is to checkpoint each repeated block (for example, each Transformer layer). Deep networks benefit the most, because the memory savings grow with the number of layers whose activations no longer have to be stored. The sketch below shows one way to tune this granularity.
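A common way to trade memory against compute is to checkpoint only every N-th block of a deep stack. The model below is a toy stack of linear blocks, written purely for illustration:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class DeepStack(nn.Module):
    # Toy deep model that checkpoints every checkpoint_every-th block.
    def __init__(self, depth=24, width=1024, checkpoint_every=2):
        super().__init__()
        self.blocks = nn.ModuleList(
            nn.Sequential(nn.Linear(width, width), nn.ReLU()) for _ in range(depth)
        )
        self.checkpoint_every = checkpoint_every

    def forward(self, x):
        for i, block in enumerate(self.blocks):
            if self.training and i % self.checkpoint_every == 0:
                # This block's activations are dropped and recomputed during backward.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

model = DeepStack()
out = model(torch.randn(4, 1024, requires_grad=True))
out.mean().backward()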
Memory usage during training:
With gradient checkpointing enabled you will see a large drop in memory usage, but the amount of computation per step increases. Because the discarded activations have to be recomputed in every backward pass, training may take longer, especially when many recomputations are needed. The helper sketched below is one way to measure the effect.
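One way to quantify the savings is to compare peak GPU memory with and without checkpointing. The helper below is a generic measurement sketch that assumes a CUDA device; train_step is a hypothetical stand-in for your own training-step function:

import torch

def peak_memory_mb(step_fn):
    # Run one training step and report the peak GPU memory it used, in MB.
    torch.cuda.empty_cache()
    torch.cuda.reset_peak_memory_stats()
    step_fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2

# Usage idea: call peak_memory_mb(lambda: train_step(batch)) once with
# checkpointing disabled and once with it enabled, then compare the two numbers.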
Code Example
Assuming you are training a simple model with DeepSpeed, here is how to configure and enable gradient checkpointing:
import deepspeed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load the model and tokenizer
model_name = 'bert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# For Hugging Face models, also enable checkpointing at the model level
model.gradient_checkpointing_enable()

# Define the training data (a toy example with a label so that a loss is returned)
inputs = tokenizer("Hello, DeepSpeed!", return_tensors="pt")
labels = torch.tensor([1])

# Define the DeepSpeed configuration
deepspeed_config = {
    "train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 2e-5}
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True
    },
    "gradient_checkpointing": True  # Enable gradient checkpointing
}

# Initialize DeepSpeed (scripts like this are usually started with the deepspeed launcher)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=deepspeed_config
)

# One training step
batch = {k: v.to(model_engine.device) for k, v in inputs.items()}
outputs = model_engine(**batch, labels=labels.to(model_engine.device))
loss = outputs.loss
model_engine.backward(loss)
model_engine.step()
In this example, we use deepspeed.initialize to load the DeepSpeed configuration and enable gradient checkpointing by setting gradient_checkpointing to True; the optimizer section and the model-level gradient_checkpointing_enable() call make the example runnable end to end.
Pros and Cons of Gradient Checkpointing
Advantages:
- Lower memory usage: by discarding some intermediate activations and keeping only what is necessary, memory consumption drops significantly, which allows larger models to be trained.
- Ability to train larger models: under memory constraints, it lets you train much larger models (for example, models with 7B or 10B parameters) without adding GPU memory.
- Support for larger batches: because memory consumption is reduced, the batch size can be increased, which improves training efficiency.
Disadvantages:
- Extra computational overhead: the discarded activations must be recomputed during the backward pass, which adds computation and slows training down.
- Longer training time: when compute resources are limited, recomputing the dropped activations can noticeably extend total training time.
- Added complexity: with checkpointing enabled, memory has to be managed dynamically during training, which can add some complexity.
Summary
Gradient checkpointing is a very effective memory-saving technique in DeepSpeed and other deep learning frameworks, and it is especially useful when training very large models. Although it adds computational overhead, it makes it possible to train larger models in memory-constrained environments, so it is a practical technique for large-scale training on limited resources.
By configuring gradient_checkpointing and the other DeepSpeed parameters appropriately (a combined example is sketched below), you can make the most of your hardware under memory limits and train efficiently.
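As an illustration of combining gradient checkpointing with other memory-oriented DeepSpeed options, the configuration below adds fp16 training and CPU offloading of the optimizer state; the specific values are illustrative assumptions, not recommendations taken from the article:

memory_saving_config = {
    "train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "gradient_checkpointing": True,          # the flag discussed in this article
    "fp16": {"enabled": True},               # half-precision training saves additional memory
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True,
        "offload_optimizer": {"device": "cpu"}  # move optimizer state to CPU memory
    }
}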
English Version
Gradient checkpointing in DeepSpeed is a technique designed to reduce memory usage when training large models by storing only a subset of intermediate activations during the forward pass. This allows for larger models to be trained without exceeding memory limits, although it comes at the cost of additional computation, as some activations need to be recomputed during the backward pass.
How to Enable Gradient Checkpointing in DeepSpeed
Configuration File Setup:
To enable gradient checkpointing in DeepSpeed, you need to add the "gradient_checkpointing": true option in your DeepSpeed configuration file (usually a JSON file). Here is an example configuration:
{
  "train_batch_size": 16,
  "gradient_accumulation_steps": 4,
  "zero_optimization": {
    "stage": 2,
    "contiguous_gradients": true
  },
  "gradient_checkpointing": true
}
Key Fields Explanation:
- "gradient_checkpointing": true enables gradient checkpointing during training.
- "zero_optimization": { "stage": 2 } enables ZeRO optimization, typically used alongside gradient checkpointing for better memory efficiency.
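For completeness, DeepSpeed's configuration schema also groups its activation-checkpointing tuning options under an activation_checkpointing section; the dictionary below lists those options with illustrative default values and is not taken from the article's setup:

activation_checkpointing_options = {
    "activation_checkpointing": {
        "partition_activations": False,          # split checkpointed activations across model-parallel GPUs
        "cpu_checkpointing": False,              # offload checkpointed activations to CPU memory
        "contiguous_memory_optimization": False, # copy checkpointed activations into a contiguous buffer
        "synchronize_checkpoint_boundary": False,
        "profile": False                         # log timing of the recomputation
    }
}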
How Gradient Checkpointing Works:
- Gradient checkpointing allows DeepSpeed to drop some intermediate activations during the forward pass and recompute them during the backward pass. This reduces the memory required to store activations, but increases computation time, since those activations need to be recomputed when they are required for the backward pass.
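DeepSpeed also ships deepspeed.checkpointing.checkpoint as a drop-in replacement for torch.utils.checkpoint.checkpoint; the self-contained sketch below uses the plain PyTorch version to demonstrate the recompute behaviour by counting how many times a toy block's forward pass runs:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

call_count = {"forward": 0}

class CountingBlock(nn.Module):
    # Toy block that counts how many times its forward pass executes.
    def __init__(self):
        super().__init__()
        self.linear = nn.Linear(64, 64)

    def forward(self, x):
        call_count["forward"] += 1
        return torch.relu(self.linear(x))

block = CountingBlock()
x = torch.randn(2, 64, requires_grad=True)

y = checkpoint(block, x, use_reentrant=False)  # forward runs once here
y.sum().backward()                             # ...and once more during backward
print(call_count["forward"])                   # prints 2: the activations were recomputed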
Which Layers are Checkpointed?:
- Which layers end up checkpointed depends on how checkpointing is wired into the model; a common choice is one checkpoint per repeated block (for example, per transformer layer). Deeper models with more layers benefit most from this approach, and the granularity can be tuned, as in the sketch below.
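One standard way to express this granularity in PyTorch is torch.utils.checkpoint.checkpoint_sequential, which keeps activations only at segment boundaries. The 24-layer stack below is a toy example:

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A toy 24-layer stack; deeper stacks benefit more because more intermediate
# activations can be dropped between checkpoints.
layers = nn.Sequential(*[nn.Sequential(nn.Linear(512, 512), nn.ReLU()) for _ in range(24)])
x = torch.randn(4, 512, requires_grad=True)

# Split the stack into 4 segments: only the activations at segment boundaries
# are stored, and everything inside a segment is recomputed during backward.
out = checkpoint_sequential(layers, 4, x)
out.mean().backward()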
Memory Usage During Training:
- When gradient checkpointing is enabled, you will observe a reduction in memory usage because not all intermediate activations are kept in memory. However, since activations are recomputed during the backward pass, training might take longer due to the extra computation required; a rough way to measure the difference is sketched below.
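The comparison below measures peak GPU memory for one forward/backward pass of a toy stack, with and without checkpointing. It assumes a CUDA device is available and is only a rough illustration (parameters and gradients are counted in both runs):

import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

def peak_mb(fn):
    # Peak GPU memory, in MB, used while running fn().
    torch.cuda.reset_peak_memory_stats()
    fn()
    torch.cuda.synchronize()
    return torch.cuda.max_memory_allocated() / 1024 ** 2

stack = nn.Sequential(*[nn.Sequential(nn.Linear(2048, 2048), nn.ReLU()) for _ in range(32)]).cuda()
x = torch.randn(64, 2048, device="cuda", requires_grad=True)

baseline = peak_mb(lambda: stack(x).sum().backward())
checkpointed = peak_mb(lambda: checkpoint_sequential(stack, 8, x).sum().backward())
print(f"without checkpointing: {baseline:.0f} MB, with checkpointing: {checkpointed:.0f} MB")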
Code Example
Below is an example of how to configure and use gradient checkpointing with DeepSpeed in your training script:
import deepspeed
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

# Load model and tokenizer
model_name = 'bert-base-uncased'
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# For Hugging Face models, also enable checkpointing at the model level
model.gradient_checkpointing_enable()

# Define training data (a toy example with a label so that a loss is returned)
inputs = tokenizer("Hello, DeepSpeed!", return_tensors="pt")
labels = torch.tensor([1])

# Define DeepSpeed configuration
deepspeed_config = {
    "train_batch_size": 16,
    "gradient_accumulation_steps": 4,
    "optimizer": {
        "type": "Adam",
        "params": {"lr": 2e-5}
    },
    "zero_optimization": {
        "stage": 2,
        "contiguous_gradients": True
    },
    "gradient_checkpointing": True  # Enable gradient checkpointing
}

# Initialize DeepSpeed (scripts like this are usually started with the deepspeed launcher)
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=deepspeed_config
)

# One training step
batch = {k: v.to(model_engine.device) for k, v in inputs.items()}
outputs = model_engine(**batch, labels=labels.to(model_engine.device))
loss = outputs.loss
model_engine.backward(loss)
model_engine.step()
In this example, gradient_checkpointing is set to True in the deepspeed_config dictionary, which enables the checkpointing feature during training; the optimizer section and the model-level gradient_checkpointing_enable() call make the example runnable end to end. A minimal multi-step loop built on this engine is sketched below.
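The following sketch extends the single step above into a loop over batches; train_dataloader is a hypothetical DataLoader that yields dictionaries of input tensors including a "labels" key:

for step, batch in enumerate(train_dataloader):
    batch = {k: v.to(model_engine.device) for k, v in batch.items()}
    outputs = model_engine(**batch)
    loss = outputs.loss
    model_engine.backward(loss)  # DeepSpeed scales the loss for gradient accumulation
    model_engine.step()          # the optimizer only steps every gradient_accumulation_steps micro-batches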
Advantages of Gradient Checkpointing
- Reduced Memory Usage: By storing fewer intermediate activations, you can significantly reduce memory consumption, allowing you to train larger models.
- Ability to Train Larger Models: With less memory required for activations, you can train models with more parameters (e.g., 7B+ parameter models) on hardware with limited memory.
- Supports Larger Batch Sizes: Reduced memory usage allows for larger batch sizes, which can speed up training and improve model convergence.
Disadvantages of Gradient Checkpointing
- Increased Computational Overhead: Since activations are recomputed during the backward pass, this adds extra computation time, slowing down training.
- Increased Training Time: The additional computation required for recomputing activations can significantly extend the overall training time.
- Increased Complexity: While gradient checkpointing can save memory, it adds complexity to memory management during training, which could potentially lead to difficulties when debugging or fine-tuning.
Summary
Gradient checkpointing is a powerful technique that DeepSpeed offers for reducing memory usage, enabling the training of large models that would otherwise exceed GPU memory limits. However, it introduces additional computational costs due to the need to recompute activations during backpropagation. This trade-off makes gradient checkpointing particularly useful when working with very large models, allowing for a more memory-efficient training process.
By enabling gradient_checkpointing in your DeepSpeed configuration, you can achieve better memory efficiency, enabling the training of larger models on GPUs with limited memory.
Postscript
Completed in Shanghai at 13:48 on November 29, 2024, with the assistance of the GPT-4o large model.