DeepSeek-R1复现方案解读之「Open-R1」

整体grpo原理如下：‍分组采样：对每个问题，从旧策略中采样一组输出（组大小比如为16或64）。优势计算：对每个输出，计算标准化优势值：其中是输出的奖励，优势值通过组内标准化消除全局奖励偏差。目标函数：最大化以下目标，同时约束策略与参考策略的KL散度：其中控制策略更新的幅度，调节KL散度惩罚项。奖励函数的设计与应用在模型训练过程中，奖励函数扮演着至关重要的角色，它们指导模型如何优化

AGI大模型老王

3814人浏览 · 2025-02-05 21:54:02

AGI大模型老王 · 2025-02-05 21:54:02 发布

大家好，我是蘑菇先生。随着deepseek的爆火，越来越多的复现工作随之而来，号称30刀即可见证aha moment，比如：

港科大simplerl-reason，https://github.com/hkust-nlp/simpleRL-reason；
hugging face的open-r1，https://github.com/huggingface/open-r1；
UC伯克利，https://github.com/Jiayi-Pan/TinyZero，实现类似24点的游戏；
RAGEN，第一个复现deepseek-R1用于训练agent，RAGEN: A General-Purpose Reasoning Agent Training Framework，正全力押注 RL（强化学习）+ LLM（大语言模型）+ Agents（智能体）融合的未来，https://github.com/ZihanWang314/ragen。

DeepSeek-R1开源了，但仍然遗留了几个值得深思的问题：

数据收集：推理特定数据集是如何精挑细选出来的？如何制作数据集可能是最核心的工作，论文中针对R1的数据集有比较详细的描述，但是对Zero训练所使用的数据集提的比较少。
模型训练：由于没有公开DeepSeek对DeepSeek-R1进行训练的代码，因此不清楚最佳超参数是什么，以及在不同模型家族和规模下它们之间有何差异。
Scaling Law：在训练推理模型时，计算资源与数据集之间存在怎样的权衡？

本文先介绍huggingface的Open-R1项目，这是一个旨在系统性地重构DeepSeek-R1的数据集及其训练流程、验证paper里的成果、推进开源推理模型发展。通过构建Open-R1，目标是阐明强化学习如何提升推理能力、向开源社区分享可复现的洞察，并为未来基于这些技术开发的模型奠定基础。

DeepSeek是如何做到的？

这部分不做过多的展开，有非常多的解读文章。整体原理图如下所示：

本文主要关注代码实践部分。
Open-R1：弥补缺失的部分

DeepSeek-R1模型权重是开放的，但用于训练的数据集和代码没有提供。

Open-R1的目标是填补这一空白，以便整个研究和行业社区都可以利用这些资源构建类似或更好的模型。以下是复现行动计划：

Step 1: 复现通过R1蒸馏Qwen等小模型。Replicate the R1-Distill models by distilling a high-quality reasoning dataset from DeepSeek-R1。
Step 2: 复现通过纯强化学习训练R1-Zero的过程，包括如何生成推理数据集。Replicate the pure RL pipeline that DeepSeek used to create R1-Zero. This will involve curating new, large-scale datasets for math, reasoning, and code。
Step 3: 复现训练R1的完整pipeline，包括2阶段SFT、2阶段RL。Show we can go from base model → SFT → RL via multi-stage training。

可参照原论文深度理解这3个步骤，此处不做赘述。

下面将围绕step1、2、3，给大家重点展示模型的输入、输出、核心代码片段。

step1：通过DeepSeek-R1蒸馏的数据训练

使用 DeepSeek-R1 的蒸馏数据创建了 Bespoke-Stratos-17k——一个包含问题、推理轨迹和答案的推理数据集。制作好的数据集参见：https://huggingface.co/datasets/bespokelabs/Bespoke-Stratos-17k。这里面的关键是推理轨迹，通过R1得到。

该数据集包含：

5,000 条来自 APPs 和 TACO 的编程数据；
10,000 条来自 NuminaMATH 数据集的 AIME、MATH 和 Olympiads 子集的数学数据；
1,000 条来自 STILL-2 的科学和谜题数据。

生成过程：

高效数据生成：通过 Bespoke Curator(用于生成和管理高质量合成数据的项目) 和 DeepSeek-R1，仅用 1.5 小时生成高质量推理数据集，成本控制在 800 美元。
改进拒绝采样：引入 Ray 集群加速代码验证，作者后面计划直接集成代码执行验证器。
优化推理轨迹：DeepSeek-R1 的推理轨迹质量高，无需额外格式化步骤，简化了流程。
提升数学解题精度：通过 gpt-4o-mini，过滤错误的数学解决方案，显著提高了正确数学解决方案的保留率（从 25% 提升至 73%）。

import pandas as pd   df = pd.read_parquet("hf://datasets/bespokelabs/Bespoke-Stratos-17k/data/train-00000-of-00001.parquet")

system：

Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. This requires engaging in a comprehensive cycle of analysis, summarizing, exploration, reassessment, reflection, backtracing, and iteration to develop well-considered thinking process. Please structure your response into two main sections: Thought and Solution. In the Thought section, detail your reasoning process using the specified format: <|begin_of_thought|> {thought with steps separated with '\n\n'} <|end_of_thought|> Each step should include detailed considerations such as analisying questions, summarizing relevant findings, brainstorming new ideas, verifying the accuracy of the current steps, refining any errors, and revisiting previous steps. In the Solution section, based on various attempts, explorations, and reflections from the Thought section, systematically present the final solution that you deem correct. The solution should remain a logical, accurate, concise expression style and detail necessary step needed to reach the conclusion, formatted as follows: <|begin_of_solution|> {final formatted, precise, and clear solution} <|end_of_solution|> Now, try to solve the following question through the above guidelines:

conversion:

{'from': 'user', 'value': 'Return your final response within \\boxed{}. The operation $\\otimes$ is defined for all nonzero numbers by $a\\otimes b =\\frac{a^{2}}{b}$. Determine $[(1\\otimes 2)\\otimes 3]-[1\\otimes (2\\otimes 3)]$.\n$\\text{(A)}\\ -\\frac{2}{3}\\qquad\\text{(B)}\\ -\\frac{1}{4}\\qquad\\text{(C)}\\ 0\\qquad\\text{(D)}\\ \\frac{1}{4}\\qquad\\text{(E)}\\ \\frac{2}{3}$'}{'from': 'assistant', 'value': "<|begin_of_thought|>\n\nOkay, let me try to figure out this problem. So, we have this operation defined as a⊗b = a²/b. And we need to compute [(1⊗2)⊗3] - [1⊗(2⊗3)]. Then choose the correct answer from the options given. Alright, let's break it down step by step.\n\nFirst, I need to remember that the operation ⊗ is not associative, right? Because the problem is asking for the difference between two different groupings: (1⊗2)⊗3 and 1⊗(2⊗3). So, the order in which we perform the operations matters here. That's probably why there's a subtraction between them.\n\nLet me start by computing each part separately. Let's tackle the first part: (1⊗2)⊗3.\n\nStarting with the innermost operation, which is 1⊗2. According to the definition, a⊗b = a²/b. So here, a is 1 and b is 2. Plugging those in: 1² / 2 = 1/2. So, 1⊗2 equals 1/2.\n\nNow, we take that result and perform the next operation with 3. So, (1⊗2)⊗3 becomes (1/2)⊗3. Again, using the same definition: a is now 1/2 and b is 3. So, ( (1/2)² ) / 3 = (1/4) / 3 = 1/12. So, (1⊗2)⊗3 equals 1/12.\n\nAlright, that's the first part. Now let's compute the second part: 1⊗(2⊗3). Again, starting with the innermost operation, which is 2⊗3. Applying the definition: a is 2 and b is 3. So, 2² / 3 = 4/3. Therefore, 2⊗3 equals 4/3.\n\nNow, we need to compute 1⊗(4/3). Here, a is 1 and b is 4/3. Using the operation definition: 1² / (4/3) = 1 / (4/3) = 3/4. So, 1⊗(2⊗3) equals 3/4.\n\nNow, the problem asks for the difference between the two results: [(1⊗2)⊗3] - [1⊗(2⊗3)] = (1/12) - (3/4). To subtract these fractions, they need a common denominator. The denominators are 12 and 4, so 12 is the common denominator.\n\nConverting 3/4 to twelfths: 3/4 = 9/12. So, 1/12 - 9/12 = (1 - 9)/12 = -8/12. Simplifying that fraction by dividing numerator and denominator by 4: -8/12 = -2/3.\n\nHmm, looking at the answer choices, option A is -2/3. So, is that the answer? Wait, but let me double-check my calculations to make sure I didn't make a mistake somewhere.\n\nFirst, checking (1⊗2): 1² / 2 = 1/2. Correct. Then, (1/2)⊗3: (1/2)² / 3 = (1/4)/3 = 1/12. That seems right.\n\nNow, for 2⊗3: 2² / 3 = 4/3. Correct. Then, 1⊗(4/3): 1² / (4/3) = 1 / (4/3) = 3/4. Yes, that's correct.\n\nSubtracting 3/4 from 1/12: 1/12 - 3/4. Convert 3/4 to 9/12, so 1/12 - 9/12 = -8/12 = -2/3. Yes, that all checks out. So the answer should be -2/3, which is option A.\n\nWait, but let me think again. The operation is defined for all nonzero numbers, so we don't have any issues with division by zero here. 2⊗3 is 4/3, which is fine, and then 1⊗(4/3) is 3/4. Correct.\n\nAlternatively, maybe there's a different way to approach the problem? Let me try expanding both expressions using variables to see if there's a pattern.\n\nLet’s denote the first expression: (a⊗b)⊗c. Using the definition:\n\nFirst, compute a⊗b = a²/b.\n\nThen, take that result and ⊗ with c: (a²/b)⊗c = ( (a²/b)² ) / c = a⁴ / (b² c).\n\nNow, the second expression: a⊗(b⊗c). First compute b⊗c = b²/c.\n\nThen, a⊗(b²/c) = a² / (b²/c) = a² * (c / b²) = (a² c) / b².\n\nTherefore, the difference between the two expressions is:\n\n(a⁴ / (b² c)) - (a² c / b²) = (a⁴ - a² c²) / (b² c) = a² (a² - c²) / (b² c).\n\nHmm, factoring that, it's a² (a - c)(a + c) / (b² c).\n\nBut in our specific problem, a = 1, b = 2, c = 3. Plugging those values in:\n\n1² (1 - 3)(1 + 3) / (2² * 3) = 1 * (-2)(4) / (4 * 3) = (-8) / 12 = -2/3. Same result. So that confirms the answer is indeed -2/3.\n\nTherefore, I think my initial calculation was correct, and the answer is option A.\n\n**Final Answer**\n\\boxed{A}\n\n<|end_of_thought|>\n\n<|begin_of_solution|>\n\nTo determine the value of \\([(1 \\otimes 2) \\otimes 3] - [1 \\otimes (2 \\otimes 3)]\\) where the operation \\(\\otimes\\) is defined by \\(a \\otimes b = \\frac{a^2}{b}\\), we proceed as follows:\n\nFirst, compute \\(1 \\otimes 2\\):\n\\[\n1 \\otimes 2 = \\frac{1^2}{2} = \\frac{1}{2}\n\\]\nNext, use this result to compute \\((1 \\otimes 2) \\otimes 3\\):\n\\[\n\\left(\\frac{1}{2}\\right) \\otimes 3 = \\frac{\\left(\\frac{1}{2}\\right)^2}{3} = \\frac{\\frac{1}{4}}{3} = \\frac{1}{12}\n\\]\n\nNow, compute \\(2 \\otimes 3\\):\n\\[\n2 \\otimes 3 = \\frac{2^2}{3} = \\frac{4}{3}\n\\]\nThen, use this result to compute \\(1 \\otimes (2 \\otimes 3)\\):\n\\[\n1 \\otimes \\left(\\frac{4}{3}\\right) = \\frac{1^2}{\\frac{4}{3}} = \\frac{1}{\\frac{4}{3}} = \\frac{3}{4}\n\\]\n\nFinally, find the difference between the two results:\n\\[\n\\frac{1}{12} - \\frac{3}{4} = \\frac{1}{12} - \\frac{9}{12} = \\frac{1 - 9}{12} = \\frac{-8}{12} = -\\frac{2}{3}\n\\]\n\nThus, the answer is \\(\\boxed{A}\\).\n\n<|end_of_solution|>"}可视化结果如下：

可视化结果如下所示：

使用上述数据蒸馏qwen，使用huggingface的trl框架SFTTrainer训练即可。

accelerate launch --config_file=configs/zero3.yaml src/open_r1/sft.py \       --model_name_or_path Qwen/Qwen2.5-Math-1.5B-Instruct \       --dataset_name HuggingFaceH4/Bespoke-Stratos-17k \       --learning_rate 2.0e-5 \       --num_train_epochs 1 \       --packing \       --max_seq_length 4096 \       --per_device_train_batch_size 4 \       --per_device_eval_batch_size 4 \       --gradient_accumulation_steps 4 \       --gradient_checkpointing \       --bf16 \       --logging_steps 5 \       --eval_strategy steps \       --eval_steps 100 \       --output_dir data/Qwen2.5-1.5B-Open-R1-Distill

复现的宏观效果如下，基本和deepseek论文中reported的蒸馏模型效果差不多。

用于实际推理生成数据，distill-qwen-7b-r1的微观效果如下：

problem：

generation， 蒸馏后的推理模型生成的结果，包括think+solution。思考会包在<think>xxx</think>中，可视化输出如下，没有截全，只列出部分think的内容。作者对每个问题一共生成4次，每次的think的内容都不一样，但答案都是对的。

生成结果1：

生成结果2：

至此，我们可以看到R1蒸馏的Qwen模型也具备了一定的推理能力。

step2：通过推理数据集+GRPO训练R1-Zero

此处未完整复现推理数据集的构造全流程，这个可能是R1-Zero最重要的点。

Open-R1直接采用了推理数据集：AI-MO/NuminaMath-TIR，共72540条数据，['problem', 'solution', 'messages']

按照R1-Zero论文的描述，problem字段是提供给模型的输入，通过规则化的RL奖励来引导模型进行思维链(CoT)慢思考。规则化的RL奖励包括答案的准确性和格式的准确性。答案的准确性会通过比对：

数学问题：generation结构化输出(即\boxed{}中answer) vs 数学问题确定性解solution。此处AI-MO/NuminaMath-TIR数据集的solution中，不仅包含了确定性的答案，还包括了CoT。按照deepseek-R1原文，此处训练Zero只输入了确定性答案(\boxed{}中的结果 ) ，未使用CoT，通过RL激发Zero自主探索CoT。
代码问题：如LeetCode，基于预定义好的测试用例(input、output)，通过代码编译器执行generation结构化输出(如\boxed{}中的代码)，喂入input，然后比对编译器输出和测试用例的output。

    accelerate launch --config_file configs/zero3.yaml --num_processes=7 src/open_r1/grpo.py \       --output_dir DeepSeek-R1-Distill-Qwen-7B-GRPO \       --model_name_or_path deepseek-ai/DeepSeek-R1-Distill-Qwen-7B \       --dataset_name AI-MO/NuminaMath-TIR \       --max_prompt_length 512 \       --max_completion_length 1024 \       --per_device_train_batch_size 1 \       --gradient_accumulation_steps 16 \       --logging_steps 10 \       --bf16 \       --use_vllm \       --vllm_device auto \       --vllm_gpu_memory_utilization 0.7

grpo训练展开介绍

https://huggingface.co/docs/trl/main/en/grpo_trainer

整体grpo原理如下：

‍

分组采样：对每个问题，从旧策略中采样一组输出（组大小比如为16或64）。
优势计算：对每个输出，计算标准化优势值：

其中是输出的奖励，优势值通过组内标准化消除全局奖励偏差。
目标函数：最大化以下目标，同时约束策略与参考策略的KL散度：

其中控制策略更新的幅度，调节KL散度惩罚项。

奖励函数的设计与应用

在模型训练过程中，奖励函数扮演着至关重要的角色，它们指导模型如何优化其行为以适应特定的任务需求。上述代码中定义了两个主要的奖励函数：

accuracy_reward：

重要性：准确度奖励函数确保模型在训练过程中尽可能地输出正确的答案，是衡量模型性能的核心指标。

功能：计算模型完成内容与正确答案之间的匹配程度。
步骤：

提取所有生成内容。
对于每个生成和解，分别解析解和生成内容。
使用验证函数（verify）检查解析结果是否一致。
返回奖励分数，若解不可解析，则跳过该sample。
format_reward：

重要性：格式奖励函数确保模型不仅能够输出正确的答案，还能以正确的方式呈现结果，满足用户对输出形式的需求。

功能：检查完成内容是否符合特定的格式要求。
步骤：

使用正则表达式匹配特定的格式模式。
返回匹配结果作为奖励分数。

数据预处理与对话结构

为了使模型能够与人类交互并有效训练，代码中定义了以下数据预处理逻辑：

SYSTEM_PROMPT：

重要性：系统提示为对话奠定了基础，明确了用户与模型之间的互动方式。

功能：定义系统的对话结构。

内容：

A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant    first thinks about the reasoning process in the mind and then provides the user with the answer.

**make_conversation**函数：

重要性：通过对话结构的标准化处理，确保模型能够理解并生成符合规范的回答。

功能：将原始数据转换为符合系统提示的对话格式。
步骤：

对于每个示例，生成一个包含系统提示和用户问题的对话结构。
去除不必要的列（如 messages），根据前文，仅使用problem，计算reward时，会解析solution中boxed包裹的解和模型生成结果。

def accuracy_reward(completions, solution, **kwargs):       """计算用户的回答与正确解算器之间的准确性奖励。            Args:           completions: 用户提供的可能包含多个答案的内容列表。           solution: 正确的解算器内容。           **kwargs: 可选参数，包括配置信息。          Returns:           列表：每个完成的准确性奖励分数。       """       # 初始化变量       contents = [completion[0]["content"] for completion in completions]       rewards = []  # 存储每个回答的奖励分数          # 处理每个用户回答和正确解算器内容配对       for content, sol in zip(contents, solution):           # 解析正确答案到结构化数据           gold_parsed = parse(sol, extraction_mode="first_match", extraction_config=[LatexExtractionConfig()])                # 如果正确答案可以被解析，则进行比较           if len(gold_parsed) != 0:               # 正确解析用户的回答               try:                   answer_parsed = parse(                       content,                       extraction_config=[                           LatexExtractionConfig(                               normalization_config=NormalizationConfig(                                   nits=False,    # 允许错误的NITs标记                                   malformed_operators=False,  # 忽略无效运算符                                   basic_latex=True,          # 使用基本LaTeX                                   equations=True,            # 处理方程式                                   boxed=True,               # 加框                                   units=True,              # 单位处理                               ),                               boxed_match_priority=0,    # 优先提取boxed结果，意味着上述solution的cot不会解析，只会使用最后的准确答案                               try_extract_without_anchor=False,                           )                       ],                       extraction_mode="first_match",                   )                      # 验证用户回答是否正确                   reward = float(verify(answer_parsed, gold_parsed))               except:                   # 如果解析失败，奖励为1以表示通过此问题                   reward = 1.0           else:               # 正确答案无法被有效解析，则所有用户的回答视为正确               reward = 1.0              rewards.append(reward)          return rewards      # 格式化reward   def format_reward(completions, **kwargs):       """Reward function that checks if the completion has a specific format."""       pattern = r"^<think>.*?</think><answer>.*?</answer>$"       completion_contents = [completion[0]["content"] for completion in completions]       matches = [re.match(pattern, content) for content in completion_contents]       return [1.0 if match else 0.0 for match in matches]         reward_funcs_registry = {       "accuracy": accuracy_reward,       "format": format_reward,   }      SYSTEM_PROMPT = (       "A conversation between User and Assistant. The user asks a question, and the Assistant solves it. The assistant "       "first thinks about the reasoning process in the mind and then provides the user with the answer. The reasoning "       "process and answer are enclosed within <think> </think> and <answer> </answer> tags, respectively, i.e., "       "<think> reasoning process here </think><answer> answer here </answer>"   )      def make_conversation(example):           return {               "prompt": [                   {"role": "system", "content": SYSTEM_PROMPT},                   {"role": "user", "content": example["problem"]},               ],           }         trainer = GRPOTrainer(           model=model_args.model_name_or_path,           reward_funcs=reward_funcs,           args=training_args,           train_dataset=dataset[script_args.dataset_train_split],           eval_dataset=dataset[script_args.dataset_test_split] if training_args.eval_strategy != "no" else None,           peft_config=get_peft_config(model_args),           callbacks=get_callbacks(training_args, model_args),       )

‍

step3：通过多阶段pipeline训练R1

源码还未复现，待下次解读。

总结

目前，open-R1还处在迭代阶段，仅R1蒸馏QWen模型的代码和复现实验是比较完善的。通过结合报告原文、模型的输入、输出、核心代码，希望对大家更好的理解原理有一定帮助。R1-zero\R1部分还处在迭代中，后续有机会将进一步分享。下期将分享其他复现方案。

在这里插入图片描述

如何学习AI大模型？

我在一线互联网企业工作十余年里，指导过不少同行后辈。帮助很多人得到了学习和成长。

我意识到有很多经验和知识值得分享给大家，也可以通过我们的能力和经验解答大家在人工智能学习中的很多困惑，所以在工作繁忙的情况下还是坚持各种整理和分享。但苦于知识传播途径有限，很多互联网行业朋友无法获得正确的资料得到学习提升，故此将并将重要的AI大模型资料包括AI大模型入门学习思维导图、精品AI大模型学习书籍手册、视频教程、实战学习等录播视频免费分享出来。

在这里插入图片描述