Where are policy_lm and critic_lm? #7

Open
zhengshf opened this issue Nov 15, 2024 · 8 comments

Comments

@zhengshf

zhengshf commented Nov 15, 2024

scripts/config/main/webrl.yaml:

defaults:
  - default
  - self

save_path: /workspace/WebRL/scripts/output
run_name: "webrl"

# training
policy_lm: /workspace/WebRL/webrl-glm-4-9b? # safetensors files of parameters of the actor model
critic_lm: /workspace/WebRL/webrl-glm-4-9b? # safetensors files of parameters of the critic model

critic_epochs: 1 # number of epochs for training the critic each phase
actor_epochs: 1 # number of epochs for training the actor each phase
batch_size: 1 # batch size for training the actor and critic

critic_resume_path: /workspace/WebRL/webrl-glm-4-9b # .bin file of parameters of the critic model

offline_data_path: /workspace/WebRL/scripts/offline_data

checkpointing_steps: 400

@K-THU

K-THU commented Nov 21, 2024

I also encountered this problem. Did you solve it?

@zhengshf
Author

No, I have given up! But digiRL can be set up and runs well.

@QZH-777
Collaborator

QZH-777 commented Nov 21, 2024

Apologies for the late response. The model /workspace/WebRL/webrl-glm-4-9b is the trained actor, not the critic. The WebRL training process consists of multiple phases:

Phase 1

  • Set policy_lm to the path of the SFT-trained model.
  • Set critic_lm to the path of the SFT-trained model.
  • Leave critic_resume_path blank, as there is no trained critic during this phase.
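
For example, a minimal sketch of the relevant Phase 1 settings (the SFT checkpoint path below is a placeholder):

policy_lm: /path/to/sft-trained-model     # SFT-trained model, used as the initial actor
critic_lm: /path/to/sft-trained-model     # same SFT-trained model for the critic
critic_resume_path:                       # left empty: no trained critic exists yet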

Phase i (i > 1)

  • Set policy_lm to the actor model trained in the previous phase.
  • For critic_lm, continue using the SFT-trained model.
  • Set critic_resume_path to the path of the critic model trained during the previous phase.
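
For example, a sketch of the Phase i settings (all paths are placeholders):

policy_lm: /path/to/output/actor_from_previous_phase          # actor trained in phase i-1
critic_lm: /path/to/sft-trained-model                         # still the SFT-trained model
critic_resume_path: /path/to/output/critic_from_previous_phase.bin   # .bin checkpoint of the critic from phase i-1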

Offline Data Path
Set offline_data_path to the path of the data, which includes both new rollouts and past experiences from the replay buffer.

We recommend reviewing our paper to fully understand the complete training process. Additionally, in issue 4, we briefly introduced the entire training process.

@zhengshf
Author

Thanks a lot!

@K-THU

K-THU commented Nov 22, 2024

Thank you for such a detailed answer, but I am still a bit confused. Where is the offline data used in the paper's experiments? Is it /WebRL/LLaMA-Factory/data/web_policy_sft.json?

@QZH-777
Collaborator

QZH-777 commented Nov 22, 2024

Below is the pseudocode of the WebRL training process:
[image: WebRL training pseudocode]

LLaMA-Factory/data/web_policy_sft.json is used to perform SFT. Once the model is fine-tuned, it interacts with WebArena to collect rollout data. These rollouts, along with previously gathered experiences, are combined to create the Offline Data.

@wangjinghan666

Then which model should be used for critic_lm? Can Llama-3.1-8B be used?

@QZH-777
Collaborator

QZH-777 commented Nov 28, 2024

Set critic_lm to the path of the SFT-trained model.
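
For example (placeholder path):

critic_lm: /path/to/sft-trained-model    # point this to the SFT-trained checkpoint, as in the phase setup above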
