news 2025/7/3 19:08:13

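Follow me on CSDN: https://spike.blog.csdn.net/
Article URL: https://spike.blog.csdn.net/article/details/146838740

Disclaimer: this article is based on personal knowledge and public materials and is intended for academic exchange only. Discussion is welcome; reposting is not permitted.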

OpenR1

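OpenR1 is an open-source reinforcement learning framework that reproduces the DeepSeek-R1 training pipeline and gives researchers and developers a complete toolchain for training reasoning models. The project was started by Hugging Face and documents, in the open, the full process from knowledge distillation through reinforcement learning to multi-stage training. OpenR1 includes scripts for training and evaluating models and for generating synthetic data, and it supports several training methods, including GRPO and supervised fine-tuning (SFT). It also wraps several open-source frameworks, such as TRL and distilabel, so users can get started quickly. By releasing code, models, and datasets, OpenR1 lays a foundation for the open-source reasoning community.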

Install the dependencies trl / lighteval / flash-attn and set up the open-r1 environment:

pip install setuptools

# Install TRL at a pinned commit
# trl @ git+https://github.com/huggingface/trl.git@69ad852e5654a77f1695eb4c608906fe0c7e8624
git clone https://github.com/huggingface/trl.git
cd trl
git checkout 69ad852e5654a77f1695eb4c608906fe0c7e8624
pip install --no-build-isolation -e "."
pip show trl
# Version: 0.16.0.dev0
cd ..

# Install lighteval at a pinned commit
# lighteval @ git+https://github.com/huggingface/lighteval.git@ed084813e0bd12d82a06d9f913291fdbee774905
git clone https://github.com/huggingface/lighteval.git
cd lighteval
git checkout ed084813e0bd12d82a06d9f913291fdbee774905
pip install --no-build-isolation -e "."
pip show lighteval
# Version: 0.6.0.dev0
cd ..

# Install flash-attn from source
git clone https://github.com/Dao-AILab/flash-attention.git
cd flash-attention
python setup.py install
pip show flash-attn
# Version: 2.7.4.post1
cd ..

# Install open-r1 itself
cd [your path]/llm/open-r1
pip install --no-build-isolation -e "."

During installation, pip may report a version conflict for accelerate. Pinning accelerate to 1.3.0 resolves it:

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
mergekit 0.0.6 requires accelerate~=1.3.0, but you have accelerate 1.4.0 which is incompatible.

pip install accelerate==1.3.0
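The `~=1.3.0` in the error message is a compatible-release pin: it accepts 1.3.x but rejects 1.4.0. The sketch below checks an installed version against such a pin using only the standard library. The helper names are hypothetical, and the parser only handles plain numeric releases (anything after the first non-numeric component, such as `.dev0`, is ignored).

```python
from importlib import metadata
from typing import Optional, Tuple


def parse_release(version: str) -> Tuple[int, ...]:
    """Turn '1.3.0' into (1, 3, 0); stop at the first non-numeric component."""
    parts = []
    for part in version.split("."):
        if not part.isdigit():
            break
        parts.append(int(part))
    return tuple(parts)


def satisfies_compatible_release(installed: str, spec: str) -> bool:
    """Check `installed` against a compatible-release pin `~=spec`.

    `~=1.3.0` means `>=1.3.0` and `==1.3.*`: every component of the spec
    except the last must match, and the installed version must not be older.
    """
    inst, want = parse_release(installed), parse_release(spec)
    prefix = want[:-1]
    return inst[: len(prefix)] == prefix and inst >= want


def installed_version(package: str) -> Optional[str]:
    """Return the installed version of `package`, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    version = installed_version("accelerate")
    if version is None:
        print("accelerate is not installed")
    else:
        ok = satisfies_compatible_release(version, "1.3.0")
        status = "ok" if ok else "conflicts with accelerate~=1.3.0"
        print(f"accelerate {version}: {status}")
```

Running the check after `pip install accelerate==1.3.0` is a quick way to confirm that the pin took effect in the active environment.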

Prepare the model Qwen/Qwen2.5-1.5B-Instruct and the training sets OpenR1-Math-220k and Bespoke-Stratos-17k (the access token is read from the HF_TOKEN environment variable rather than written into the command):

huggingface-cli download --token "$HF_TOKEN" Qwen/Qwen2.5-1.5B-Instruct --local-dir Qwen/Qwen2.5-1.5B-Instruct
huggingface-cli download --token "$HF_TOKEN" Qwen/Qwen2.5-VL-7B-Instruct --local-dir Qwen/Qwen2.5-VL-7B-Instruct
huggingface-cli download --repo-type dataset --token "$HF_TOKEN" open-r1/OpenR1-Math-220k --local-dir open-r1/OpenR1-Math-220k
huggingface-cli download --repo-type dataset --token "$HF_TOKEN" HuggingFaceH4/Bespoke-Stratos-17k --local-dir HuggingFaceH4/Bespoke-Stratos-17k

The model and datasets are stored at these paths:

[your path]/huggingface/Qwen/Qwen2.5-1.5B-Instruct/
[your path]/llm/openr1_datasets/open-r1/OpenR1-Math-220k/
[your path]/llm/openr1_datasets/HuggingFaceH4/Bespoke-Stratos-17k/

Train Open-R1 with SFT:

accelerate launch --config_file=recipes/accelerate_configs/zero3.yaml src/open_r1/sft.py \
    --model_name_or_path "[your path]/huggingface/Qwen/Qwen2.5-1.5B-Instruct" \
    --dataset_name "[your path]/llm/openr1_datasets/HuggingFaceH4/Bespoke-Stratos-17k" \
    --learning_rate 1.0e-5 \
    --num_train_epochs 1 \
    --packing \
    --max_seq_length 8096 \
    --per_device_train_batch_size 1 \
    --gradient_checkpointing \
    --bf16 \
    --output_dir data/Qwen2.5-1.5B-Open-R1-Distill

# Options for --dataset_name:
# [your path]/llm/openr1_datasets/HuggingFaceH4/Bespoke-Stratos-17k  # trains faster
# [your path]/llm/openr1_datasets/open-r1/OpenR1-Math-220k

Note: per_device_train_batch_size directly affects GPU memory usage.
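Because the per-device batch size drives memory usage, a common way to keep the effective batch size while staying within memory is to lower the per-device batch and raise gradient accumulation instead. The sketch below shows that arithmetic. The function names and the target global batch of 16 are illustrative assumptions; the 8 processes and the sequence length of 8096 come from the configuration and command in this article.

```python
def grad_accum_steps(target_global_batch: int, num_processes: int,
                     per_device_batch: int) -> int:
    """Accumulation steps needed to reach `target_global_batch` sequences
    per optimizer step, given the number of processes and per-device batch."""
    per_forward = num_processes * per_device_batch
    if target_global_batch % per_forward != 0:
        raise ValueError(
            f"target {target_global_batch} is not a multiple of "
            f"{num_processes} x {per_device_batch}"
        )
    return target_global_batch // per_forward


def tokens_per_device_step(per_device_batch: int, max_seq_length: int) -> int:
    """Upper bound on tokens held by one device per forward pass.
    With packing, each sequence is filled up to max_seq_length."""
    return per_device_batch * max_seq_length


if __name__ == "__main__":
    # 8 processes, per-device batch 1, hypothetical target of 16 sequences.
    print(grad_accum_steps(16, num_processes=8, per_device_batch=1))  # 2
    # Doubling the per-device batch doubles the tokens each GPU must hold.
    print(tokens_per_device_step(1, 8096))  # 8096
    print(tokens_per_device_step(2, 8096))  # 16192
```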

GRPO reinforcement-learning training is launched as follows:

ACCELERATE_LOG_LEVEL=info accelerate launch --config_file recipes/accelerate_configs/zero2.yaml \
    --num_processes=7 src/open_r1/grpo.py \
    --config recipes/DeepSeek-R1-Distill-Qwen-1.5B/grpo/config_demo.yaml
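The command passes --num_processes=7 even though the Accelerate config lists 8 processes. A plausible reading, which is an assumption here and not something this article states, is that one GPU is left free for vLLM generation when use_vllm is enabled. The sketch below works out how many prompts and completions one step covers under the values from the GRPO config shown later in this article (per_device_train_batch_size 16, gradient_accumulation_steps 4, num_generations 16). It also assumes that the global batch must be divisible by num_generations, which is how the pinned TRL version is understood to behave. The function name is hypothetical.

```python
from typing import Dict


def grpo_batch_layout(num_processes: int, per_device_train_batch_size: int,
                      gradient_accumulation_steps: int,
                      num_generations: int) -> Dict[str, int]:
    """Split the global batch into groups of `num_generations` completions,
    one group per prompt."""
    global_batch = num_processes * per_device_train_batch_size
    if global_batch % num_generations != 0:
        raise ValueError(
            f"global batch {global_batch} is not divisible by "
            f"num_generations={num_generations}"
        )
    prompts_per_forward = global_batch // num_generations
    return {
        "completions_per_forward": global_batch,
        "prompts_per_forward": prompts_per_forward,
        "completions_per_optimizer_step": global_batch * gradient_accumulation_steps,
        "prompts_per_optimizer_step": prompts_per_forward * gradient_accumulation_steps,
    }


if __name__ == "__main__":
    layout = grpo_batch_layout(num_processes=7, per_device_train_batch_size=16,
                               gradient_accumulation_steps=4, num_generations=16)
    for name, value in layout.items():
        print(f"{name}: {value}")
```

With these values, each forward pass covers 112 completions for 7 distinct prompts, and each optimizer step covers 448 completions for 28 prompts.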

The zero3.yaml config file, used by the SFT command above, is as follows:

compute_environment: LOCAL_MACHINE  # run on the local machine
debug: false  # debug mode off
deepspeed_config:  # DeepSpeed settings
  deepspeed_multinode_launcher: standard  # multi-node launcher (standard)
  offload_optimizer_device: none  # do not offload optimizer states to another device
  offload_param_device: none  # do not offload model parameters to another device
  zero3_init_flag: true  # enable ZeRO-3 initialization
  zero3_save_16bit_model: true  # save a 16-bit model
  zero_stage: 3  # use ZeRO stage 3
distributed_type: DEEPSPEED  # distributed backend: DeepSpeed
downcast_bf16: 'no'  # do not downcast to bf16
machine_rank: 0  # rank of this machine (0 is the main node in multi-node training)
main_training_function: main  # name of the main training function
mixed_precision: bf16  # mixed-precision type (bfloat16)
num_machines: 1  # number of machines (single node)
num_processes: 8  # total number of processes (8 on this single machine, one per GPU)
rdzv_backend: static  # rendezvous backend (static)
same_network: true  # all nodes are on the same network
tpu_env: []  # TPU environment variables (TPU not used)
tpu_use_cluster: false  # do not use a TPU cluster
tpu_use_sudo: false  # do not use sudo for TPU
use_cpu: false  # do not train on CPU
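To see what zero_stage changes, the sketch below estimates per-GPU memory for model states only, following the accounting in the ZeRO paper for mixed-precision Adam: 2 bytes per parameter for 16-bit weights, 2 for 16-bit gradients, and 12 for the fp32 optimizer states. Activations and temporary buffers are not included, and the 1.5 billion parameter count is a round-number assumption for the 1.5B model. The function name is hypothetical.

```python
def zero_model_state_bytes(num_params: int, num_gpus: int, stage: int) -> float:
    """Approximate per-GPU bytes for weights, gradients and optimizer states.

    Stage 1 partitions optimizer states, stage 2 also partitions gradients,
    and stage 3 also partitions the parameters themselves.
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2 or 3")
    params = 2.0 * num_params   # 16-bit weights
    grads = 2.0 * num_params    # 16-bit gradients
    optim = 12.0 * num_params   # fp32 weights, momentum and variance (Adam)
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus
    return params + grads + optim


if __name__ == "__main__":
    for stage in (0, 1, 2, 3):
        gb = zero_model_state_bytes(1_500_000_000, num_gpus=8, stage=stage) / 1e9
        print(f"ZeRO-{stage}: {gb:.3f} GB per GPU")
```

Under these assumptions the model states drop from 24 GB per GPU without ZeRO to 3 GB per GPU with stage 3 on 8 GPUs, which is why stage 3 leaves more room for long sequences.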

Training flow: GPU 0 processes the data first, and the remaining 7 GPUs process it afterwards:

Applying chat template to train dataset		# 01:02
Tokenizing train dataset		# 16:45
Packing train dataset  			# 27:59
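The ordering above, where one rank does the expensive preprocessing and the others follow, can be illustrated with a small thread-based simulation. This is only a sketch of the pattern, not Accelerate's or TRL's implementation; in Accelerate the analogous helper is the main_process_first() context manager. All names below are hypothetical.

```python
import threading
from typing import Callable, List, Tuple


def run_ranks(world_size: int, preprocess: Callable[[], list]
              ) -> Tuple[List[list], List[Tuple[int, str]]]:
    """Simulate `world_size` ranks where rank 0 preprocesses first and the
    other ranks wait, then reuse the cached result."""
    cache = {}
    cache_ready = threading.Event()
    log: List[Tuple[int, str]] = []
    log_lock = threading.Lock()
    results: List[list] = [[] for _ in range(world_size)]

    def worker(rank: int) -> None:
        if rank == 0:
            cache["dataset"] = preprocess()      # expensive work happens once
            with log_lock:
                log.append((rank, "computed"))
            cache_ready.set()                    # release the waiting ranks
        else:
            cache_ready.wait()                   # block until rank 0 is done
            with log_lock:
                log.append((rank, "loaded from cache"))
        results[rank] = cache["dataset"]

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(world_size)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return results, log


if __name__ == "__main__":
    def tokenize() -> list:
        # Stand-in for chat templating, tokenizing and packing.
        return [len(text) for text in ("a", "bb", "ccc")]

    results, log = run_ranks(8, tokenize)
    print(log[0])        # (0, 'computed')
    print(len(results))  # 8
```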

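Here, training is started with accelerate launch. Through the Accelerator class, it simplifies the configuration and implementation of distributed and mixed-precision training, so you can focus on model development without worrying much about hardware or the distributed environment.

Accelerator handles the distributed setup automatically: pass the model, optimizer, and data loader to accelerator.prepare(), and the code can run in any distributed setting (single GPU, multi-GPU, or TPU).
Accelerator manages device placement automatically; the prepared data loader sends each batch to the correct device, so no manual .to(device) call is needed.
An option such as --mixed_precision=fp16 enables mixed-precision training, with details such as gradient scaling handled automatically.
It provides built-in gradient accumulation via accelerator.accumulate() and gradient clipping via accelerator.clip_grad_norm_().
The accelerate config tool sets up the training environment quickly, and command-line arguments such as --num_processes and --mixed_precision allow flexible adjustment.
The same code runs on different hardware without attention to low-level details, which reduces code complexity and maintenance cost.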
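To make the accumulation and clipping features concrete, the sketch below reproduces the underlying arithmetic in plain Python for a one-parameter model y = w * x with a mean-squared-error loss. It is not Accelerate's implementation: it only shows that averaging micro-batch gradients gives the full-batch gradient, and how clipping by norm rescales gradients. The function names and the small epsilon in the clipping denominator are assumptions of this sketch.

```python
import math
from typing import List, Sequence, Tuple


def mse_grad(w: float, xs: Sequence[float], ys: Sequence[float]) -> float:
    """d/dw of mean((w*x - y)^2) over the given samples."""
    return sum(2.0 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def accumulated_grad(w: float, xs: Sequence[float], ys: Sequence[float],
                     accumulation_steps: int) -> float:
    """Sum micro-batch gradients, each scaled by 1/accumulation_steps."""
    if len(xs) % accumulation_steps != 0:
        raise ValueError("batch size must be divisible by accumulation_steps")
    micro = len(xs) // accumulation_steps
    grad = 0.0
    for step in range(accumulation_steps):
        lo, hi = step * micro, (step + 1) * micro
        grad += mse_grad(w, xs[lo:hi], ys[lo:hi]) / accumulation_steps
    return grad


def clip_grad_norm(grads: Sequence[float], max_norm: float
                   ) -> Tuple[List[float], float]:
    """Rescale `grads` so their L2 norm is at most `max_norm`.
    Returns the clipped gradients and the norm before clipping."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))
    return [g * scale for g in grads], total_norm


if __name__ == "__main__":
    xs, ys = [1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0]
    print(mse_grad(0.0, xs, ys))             # -30.0
    print(accumulated_grad(0.0, xs, ys, 2))  # -30.0
    clipped, norm = clip_grad_norm([3.0, 4.0], max_norm=1.0)
    print(norm)                              # 5.0
```

The two gradient values agree because the micro-batches have equal size, which is the situation gradient accumulation is designed for.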

The training data contains a system field and a conversations field; the user turn is the question and the assistant turn is the answer, as follows:

{
  'system': "Your role as an assistant involves thoroughly exploring questions through a systematic long thinking process before providing the final precise and accurate solutions. ...",
  'conversations': array([
    {'from': 'user',
     'value': 'Return your final response within \\boxed{}. The operation $\\otimes$ is defined for all nonzero numbers by $a\\otimes b =\\frac{a^{2}}{b}$. ...'},
    {'from': 'assistant',
     'value': "<|begin_of_thought|>\n\nOkay, let me try to figure out this problem. So, we have this operation defined as a⊗b = a²/b. And we need to compute [(1⊗2)⊗3] - [1⊗(2⊗3)]. ...<|end_of_thought|>\n\n<|begin_of_solution|>\n\nTo determine the value of \\([(1 \\otimes 2) \\otimes 3] - [1 \\otimes (2 \\otimes 3)]\\) where the operation \\(\\otimes\\) is defined by \\(a \\otimes b = \\frac{a^2}{b}\\)..., the answer is \\(\\boxed{A}\\).\n\n<|end_of_solution|>"}
  ], dtype=object)
}
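Chat templates usually expect a list of messages with role and content keys, so a record in the format above needs a small conversion. The sketch below shows one way to do it and to separate the reasoning from the final solution using the markers visible in the sample. The function names are hypothetical, and this is not necessarily how open-r1 preprocesses the dataset.

```python
import re
from typing import Dict, List, Tuple

_THOUGHT = re.compile(r"<\|begin_of_thought\|>(.*?)<\|end_of_thought\|>", re.DOTALL)
_SOLUTION = re.compile(r"<\|begin_of_solution\|>(.*?)<\|end_of_solution\|>", re.DOTALL)


def to_messages(example: dict) -> List[Dict[str, str]]:
    """Convert a {'system', 'conversations'} record into role/content messages."""
    messages = []
    if example.get("system"):
        messages.append({"role": "system", "content": example["system"]})
    for turn in example["conversations"]:
        # 'from' is already 'user' or 'assistant' in this dataset format.
        messages.append({"role": turn["from"], "content": turn["value"]})
    return messages


def split_reasoning(text: str) -> Tuple[str, str]:
    """Return (thought, solution) extracted from an assistant turn.
    A missing section is returned as an empty string."""
    thought = _THOUGHT.search(text)
    solution = _SOLUTION.search(text)
    return (
        thought.group(1).strip() if thought else "",
        solution.group(1).strip() if solution else "",
    )


if __name__ == "__main__":
    example = {
        "system": "Think step by step.",
        "conversations": [
            {"from": "user", "value": "What is 1 + 1?"},
            {"from": "assistant",
             "value": "<|begin_of_thought|>\n\n1 + 1 = 2\n\n<|end_of_thought|>\n\n"
                      "<|begin_of_solution|>\n\n2\n\n<|end_of_solution|>"},
        ],
    }
    messages = to_messages(example)
    print([m["role"] for m in messages])            # ['system', 'user', 'assistant']
    print(split_reasoning(messages[-1]["content"]))  # ('1 + 1 = 2', '2')
```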

GRPO training parameters:

# Model arguments
model_name_or_path: deepseek-ai/DeepSeek-R1-Distill-Qwen-1.5B  # model name or path
model_revision: main  # model revision (branch) to use
torch_dtype: bfloat16  # compute dtype (bfloat16)
attn_implementation: flash_attention_2  # attention implementation; FlashAttention-2 speeds up attention computation

# Data training arguments
# The DeepSeek chat template is edited so that the reasoning inside the <think> tags is included in the completion and the format reward works
chat_template: "
{% if not add_generation_prompt is defined %}{% set add_generation_prompt = false %}
{% endif %}
{% set ns = namespace(is_first=false, is_tool=false, is_output_first=true, system_prompt='') %}
{%- for message in messages %}{%- if message['role'] == 'system' %}{% set ns.system_prompt = message['content'] %}{%- endif %}
{%- endfor %}
{{bos_token}}{{ns.system_prompt}}
{%- for message in messages %}{%- if message['role'] == 'user' %}{%- set ns.is_tool = false -%}{{'<｜User｜>' + message['content']}}{%- endif %}{%- if message['role'] == 'assistant' and message['content'] is none %}{%- set ns.is_tool = false -%}{%- for tool in message['tool_calls'] %}{%- if not ns.is_first %}{{'<｜Assistant｜><｜tool▁calls▁begin｜><｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{%- set ns.is_first = true -%}{%- else %}{{'\\n' + '<｜tool▁call▁begin｜>' + tool['type'] + '<｜tool▁sep｜>' + tool['function']['name'] + '\\n' + '```json' + '\\n' + tool['function']['arguments'] + '\\n' + '```' + '<｜tool▁call▁end｜>'}}{{'<｜tool▁calls▁end｜><｜end▁of▁sentence｜>'}}{%- endif %}{%- endfor %}{%- endif %}
{%- if message['role'] == 'assistant' and message['content'] is not none %}{%- if ns.is_tool %}{{'<｜tool▁outputs▁end｜>' + message['content'] + '<｜end▁of▁sentence｜>'}}{%- set ns.is_tool = false -%}{%- else %}{% set content = message['content'] %}{{'<｜Assistant｜>' + content + '<｜end▁of▁sentence｜>'}}{%- endif %}
{%- endif %}
{%- if message['role'] == 'tool' %}{%- set ns.is_tool = true -%}{%- if ns.is_output_first %}{{'<｜tool▁outputs▁begin｜><｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- set ns.is_output_first = false %}{%- else %}{{'\\n<｜tool▁output▁begin｜>' + message['content'] + '<｜tool▁output▁end｜>'}}{%- endif %}
{%- endif %}
{%- endfor -%}
{% if ns.is_tool %}
{{'<｜tool▁outputs▁end｜>'}}
{% endif %}
{% if add_generation_prompt and not ns.is_tool %}
{{'<｜Assistant｜>'}}
{% endif %}
"
dataset_name: open-r1/OpenR1-Math-220k  # dataset to train on (OpenR1-Math-220k)
system_prompt: "You are a helpful AI Assistant that provides well-reasoned and detailed responses. You first think about the reasoning process as an internal monologue and then provide the user with the answer. Respond in the following format: <think>\n...\n</think>\n<answer>\n...\n</answer>"  # system prompt that defines the response style and format

# GRPO trainer config
bf16: true  # train in bfloat16
use_vllm: true  # generate completions with the vLLM engine
vllm_device: auto  # select the vLLM device automatically
vllm_gpu_memory_utilization: 0.7  # fraction of GPU memory vLLM may use
do_eval: false  # no evaluation during training
gradient_accumulation_steps: 4  # accumulate gradients over 4 steps before each optimizer update
gradient_checkpointing: true  # enable gradient checkpointing to reduce memory usage
gradient_checkpointing_kwargs:
  use_reentrant: false  # use the non-reentrant checkpointing implementation
hub_model_id: DeepSeek-R1-Distill-Qwen-1.5B-GRPO  # model ID on the Hugging Face Hub, used for saving and pushing
hub_strategy: every_save  # push to the Hub on every save
learning_rate: 1.0e-06  # learning rate (1e-6)
log_completions: true  # log generated completions for debugging and analysis
log_level: info  # log level
logging_first_step: true  # log the first step, useful for watching the start of training
logging_steps: 1  # log every step
logging_strategy: steps  # log by step
lr_scheduler_type: cosine_with_min_lr  # cosine schedule with a minimum learning rate
lr_scheduler_kwargs:  # arguments for the learning-rate scheduler
  min_lr_rate: 0.1  # minimum learning rate as a fraction of the initial learning rate
max_prompt_length: 512  # maximum prompt length
max_completion_length: 2048  # maximum completion length
max_steps: -1  # -1 means training length is set by num_train_epochs
num_generations: 16  # completions sampled per prompt; GRPO compares rewards within this group
num_train_epochs: 1  # number of training epochs
output_dir: data/DeepSeek-R1-Distill-Qwen-1.5B-GRPO  # output directory for training results
overwrite_output_dir: true  # overwrite the output directory when retraining
per_device_eval_batch_size: 16  # evaluation batch size per device
per_device_train_batch_size: 16  # training batch size per device
push_to_hub: true  # push the model to the Hugging Face Hub
report_to:  # where training metrics are reported
- wandb  # Weights & Biases
reward_funcs:  # reward functions
- accuracy  # accuracy reward
- format  # format reward
- tag_count  # tag-count reward
reward_weights:  # one weight per reward function, all 1.0
- 1.0
- 1.0
- 1.0
save_strategy: "epoch"  # save a checkpoint at the end of each epoch
save_total_limit: 1  # keep at most 1 checkpoint
seed: 42  # random seed for reproducibility
temperature: 0.7  # sampling temperature, controls the diversity of generated text
warmup_ratio: 0.1  # fraction of training used for learning-rate warmup
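The settings reward_funcs, reward_weights, and num_generations work together: each sampled completion is scored by every reward function, the scores are combined with the weights, and GRPO turns the combined rewards of one prompt's group into relative advantages. The sketch below illustrates that flow under loudly stated assumptions. The three reward functions are simplified stand-ins written for this example (the accuracy check is a plain string comparison, not a math verifier), and the use of the population standard deviation with a small epsilon is a choice made here, so none of this should be read as open-r1's or TRL's actual code.

```python
import re
import statistics
from typing import List, Sequence

FORMAT_PATTERN = re.compile(
    r"^<think>\n.*?\n</think>\n<answer>\n.*?\n</answer>$", re.DOTALL
)
ANSWER_PATTERN = re.compile(r"<answer>\n(.*?)\n</answer>", re.DOTALL)
TAGS = ("<think>\n", "\n</think>\n", "\n<answer>\n", "\n</answer>")


def format_reward(completion: str) -> float:
    """1.0 if the completion follows the format from the system prompt."""
    return 1.0 if FORMAT_PATTERN.match(completion) else 0.0


def tag_count_reward(completion: str) -> float:
    """Partial credit: 0.25 for each expected tag that appears exactly once."""
    return sum(0.25 for tag in TAGS if completion.count(tag) == 1)


def accuracy_reward(completion: str, answer: str) -> float:
    """Simplified stand-in: exact string match on the <answer> content."""
    match = ANSWER_PATTERN.search(completion)
    return 1.0 if match and match.group(1).strip() == answer.strip() else 0.0


def combined_rewards(completions: Sequence[str], answer: str,
                     weights: Sequence[float] = (1.0, 1.0, 1.0)) -> List[float]:
    """Weighted sum of accuracy, format and tag_count for each completion."""
    w_acc, w_fmt, w_tag = weights
    return [
        w_acc * accuracy_reward(c, answer)
        + w_fmt * format_reward(c)
        + w_tag * tag_count_reward(c)
        for c in completions
    ]


def group_advantages(rewards: Sequence[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one prompt's group: (r - mean) / (std + eps)."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


if __name__ == "__main__":
    group = [
        "<think>\n1+1=2\n</think>\n<answer>\n2\n</answer>",  # correct, well formed
        "<think>\n1+1=3\n</think>\n<answer>\n3\n</answer>",  # wrong, well formed
        "2",                                                 # no tags at all
    ]
    rewards = combined_rewards(group, answer="2")
    print(rewards)  # [3.0, 2.0, 0.0]
    print(group_advantages(rewards))
```

Completions that score above their group's mean receive a positive advantage and those below receive a negative one, so a larger num_generations gives a more stable baseline for each prompt at the cost of more generation time.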