Where are policy_lm and critic_lm? #7

Open
zhengshf opened this issue Nov 15, 2024 · 8 comments

Comments

@zhengshf

zhengshf commented Nov 15, 2024

scripts/config/main/webrl.yaml:

defaults:
  - default
  - self

save_path: /workspace/WebRL/scripts/output
run_name: "webrl"

# training
policy_lm: /workspace/WebRL/webrl-glm-4-9b? # safetensors files of parameters of the actor model
critic_lm: /workspace/WebRL/webrl-glm-4-9b? # safetensors files of parameters of the critic model

critic_epochs: 1 # number of epochs for training the critic each phase
actor_epochs: 1 # number of epochs for training the actor each phase
batch_size: 1 # batch size for training the actor and critic

critic_resume_path: /workspace/WebRL/webrl-glm-4-9b # .bin file of parameters of the critic model

offline_data_path: /workspace/WebRL/scripts/offline_data

checkpointing_steps: 400

@K-THU

K-THU commented Nov 21, 2024

I also encountered this problem. Did you solve it?

@zhengshf
Author

No, I have given up! But digiRL can be set up and runs well.

@QZH-777
Collaborator

QZH-777 commented Nov 21, 2024

Apologies for the late response. The model /workspace/WebRL/webrl-glm-4-9b is the trained actor, not the critic. The WebRL training process consists of multiple phases:

Phase 1

  • Set policy_lm to the path of the SFT-trained model.
  • Set critic_lm to the path of the SFT-trained model.
  • Leave critic_resume_path blank, as there is no trained critic during this phase.
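
For example, a minimal sketch of the relevant Phase 1 settings (the SFT checkpoint path below is a placeholder):

policy_lm: /path/to/sft-trained-model     # SFT-trained model, used as the initial actor
critic_lm: /path/to/sft-trained-model     # same SFT-trained model for the critic
critic_resume_path:                       # left empty: no trained critic exists yet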

Phase i (i > 1)

  • Set policy_lm to the actor model trained in the previous phase.
  • For critic_lm, continue using the SFT-trained model.
  • Set critic_resume_path to the path of the critic model trained during the previous phase.
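
For example, a sketch of the Phase i settings (all paths are placeholders):

policy_lm: /path/to/output/actor_from_previous_phase          # actor trained in phase i-1
critic_lm: /path/to/sft-trained-model                         # still the SFT-trained model
critic_resume_path: /path/to/output/critic_from_previous_phase.bin   # .bin checkpoint of the critic from phase i-1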

Offline Data Path
Set offline_data_path to the path of the data, which includes both new rollouts and past experiences from the replay buffer.

We recommend reviewing our paper to fully understand the complete training process. Additionally, in issue 4, we briefly introduced the entire training process.

@zhengshf
Author

Thanks a lot!

@K-THU

K-THU commented Nov 22, 2024

Thank you for such a detailed answer, but I am still a bit confused. Where is the offline data used in the paper's experiments? Is it /WebRL/LLaMA-Factory/data/web_policy_sft.json?

@QZH-777
Collaborator

QZH-777 commented Nov 22, 2024

Below is the pseudocode of the WebRL training process:
[image: WebRL training pseudocode]

LLaMA-Factory/data/web_policy_sft.json is used to perform SFT. Once the model is fine-tuned, it interacts with WebArena to collect rollout data. These rollouts, along with previously gathered experiences, are combined to create the Offline Data.

@wangjinghan666

Then which model should be used for critic_lm? Can Llama-3.1-8B be used?

@QZH-777
Collaborator

QZH-777 commented Nov 28, 2024

Set critic_lm to the path of the SFT-trained model.
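
For example (placeholder path):

critic_lm: /path/to/sft-trained-model    # point this to the SFT-trained checkpoint, as in the phase setup above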
