feat: update reward model to support scaled and margin BT (#361)
Signed-off-by: Zhilin Wang <[email protected]>
Signed-off-by: NeMo-Aligner CI <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Terry Kong <[email protected]>
3 people authored Nov 1, 2024
1 parent b8dde4c commit d3493c7
Showing 10 changed files with 445 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/cicd-main.yml
@@ -84,7 +84,7 @@ jobs:
test_case:
- ppo-llama3-pp2-reshard
- dpo-llama3

- rm-llama3
with:
RUNNER: self-hosted-azure
# Fairly aggressive timeout that all functional tests should try to adhere to
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -34,6 +34,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
# Consume all durations and reset internal store
durations = timer.consume_durations()
```
- Add code and instructions for replicating Reward Modeling training in HelpSteer2 and HelpSteer2-Preference

### Breaking Changes
- Upgrade TRTLLM dependency from v0.10.0 to v0.12.0 and migrate from `GPTSession` cpp runtime to `ModelRunner` python runtime. Please use the latest Dockerfile.
161 changes: 152 additions & 9 deletions docs/user-guide/steerlm.rst
@@ -40,16 +40,16 @@ The two methods approach model alignment from different angles: RLHF reinforces
For details on SteerLM, please refer to our paper `SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF <https://arxiv.org/abs/2310.05344>`_.
For details about the HelpSteer dataset, please refer to our paper `HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM <https://arxiv.org/abs/2311.09528>`_.

Train a SteerLM Model
#####################

This section is a step-by-step tutorial that walks you through how to run a full SteerLM pipeline with a Llama2 70B LLM model.

.. note::
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.

Download the Llama 2 LLM Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Download the Llama 2 70B LLM model from HF <https://huggingface.co/meta-llama/Llama-2-70b-hf> into the models folder.

@@ -74,8 +74,13 @@ Download and Preprocess Data for Attribute Prediction Modeling
The prefix for the tokenizer would be different when extracted. Ensure that the correct tokenizer file is used when running the preceding command.

To follow the HelpSteer2 and HelpSteer2-Preference lines of work, you need the Llama 3 70B and Llama 3.1 70B Instruct models, respectively.

You need to obtain access to these models, download them, and then convert them to NeMo format in a similar manner.

Download and Preprocess Data for SteerLM Regression Reward Modeling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Download and convert both datasets into a common format:

@@ -85,7 +90,7 @@ Download and Preprocess Data for Attribute Prediction Modeling
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer_data.py --output_directory=data/helpsteer
#. Merge the two datasets for the train and val subsets, respectively:

.. code-block:: bash
@@ -106,10 +111,58 @@
--output-file=data/merge_val_reg.jsonl
If you are interested in replicating Reward Modeling training in HelpSteer2, please follow the steps below instead.


.. code-block:: bash
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer2_data.py --output_directory=data/helpsteer2
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2/train.jsonl \
--output-file=data/helpsteer2/train_reg.jsonl
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2/val.jsonl \
--output-file=data/helpsteer2/val_reg.jsonl
cat data/helpsteer2/train_reg.jsonl data/helpsteer2/train_reg.jsonl > data/helpsteer2/train_reg_2_epoch.jsonl
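To sanity-check the preprocessed files before training, you can count records and peek at the first one. This is a minimal snippet that only assumes the output paths used above; it makes no assumption about the exact record schema:

.. code-block:: python

    import json

    # Count records and show the keys of the first one for each preprocessed file.
    for path in ["data/helpsteer2/train_reg_2_epoch.jsonl", "data/helpsteer2/val_reg.jsonl"]:
        with open(path, encoding="utf-8") as f:
            lines = f.readlines()
        print(path, len(lines), "records; first record keys:", sorted(json.loads(lines[0])))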
If you're interested in replicating Reward Modeling training in HelpSteer2-Preference, please follow the steps below instead.

.. code-block:: bash
# for first stage of Reward Model training (i.e. SteerLM Regression)
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer2_data.py --output_directory=data/helpsteer2-only_helpfulness --only_helpfulness
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-only_helpfulness/train.jsonl \
--output-file=data/helpsteer2-only_helpfulness/train_reg.jsonl
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-only_helpfulness/val.jsonl \
--output-file=data/helpsteer2-only_helpfulness/val_reg.jsonl
cat data/helpsteer2-only_helpfulness/train_reg.jsonl data/helpsteer2-only_helpfulness/train_reg.jsonl > data/helpsteer2-only_helpfulness/train_reg_2_epoch.jsonl
# for second stage of Reward Model training (i.e. Scaled Bradley Terry)
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer2_data.py --output_directory=data/helpsteer2-pref -pref
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-pref/train.jsonl \
--output-file=data/helpsteer2-pref/train_reg.jsonl
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-pref/val.jsonl \
--output-file=data/helpsteer2-pref/val_reg.jsonl
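For reference, the ``-pref`` preprocessing (see ``preprocess_helpsteer2_data.py`` added in this commit) turns each preference pair into two records that share a ``quality`` value taken from ``preference_strength``. A rough illustration with made-up values:

.. code-block:: python

    # Illustrative only: mirrors the pairing logic in preprocess_helpsteer2_data.py.
    dp = {
        "prompt": "Summarize the article.",        # made-up example values
        "response_1": "A concise, faithful summary.",
        "response_2": "An off-topic reply.",
        "preference_strength": 2,
        "split": "train",
    }
    records = [
        {"prompt": dp["prompt"], "response": dp["response_1"], "quality": dp["preference_strength"]},
        {"prompt": dp["prompt"], "response": dp["response_2"], "quality": dp["preference_strength"]},
    ]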
Train the Regression Reward Model on OASST+HelpSteer Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this tutorial, you train the regression reward model for 800 steps.

.. note::
Depending on the type of cluster you use, you may need to set up multi-node training in your cluster env. For details, please refer to https://lightning.ai/docs/pytorch/stable/clouds/cluster.html.
@@ -125,7 +178,6 @@ For this tutorial, train the regression reward model for 800 steps.
pretrained_checkpoint.restore_from_path=/models/llama13b/llama13b.nemo \
"model.data.data_prefix={train: ["data/merge_train_reg.jsonl"], validation: ["data/merge_val_reg.jsonl"], test: ["data/merge_val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/reward_model_13b \
trainer.rm.save_interval=100 \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
@@ -135,10 +187,101 @@ For this tutorial, train the regression reward model for 800 steps.
++model.tensor_model_parallel_size=4 \
++model.pipeline_model_parallel_size=1 \
++model.activations_checkpoint_granularity="selective" \
model.optim.sched.constant_steps=0 \
model.reward_model_type="regression" \
model.regression.num_attributes=9
If you're interested in replicating Reward Modeling training in HelpSteer2, please follow the steps below instead.

.. code-block:: bash
python /opt/NeMo-Aligner/examples/nlp/gpt/train_reward_model.py \
trainer.num_nodes=8 \
trainer.devices=8 \
++model.micro_batch_size=2 \
++model.global_batch_size=128 \
++model.data.data_impl=jsonl \
pretrained_checkpoint.restore_from_path=/models/llama-3-70b.nemo \
"model.data.data_prefix={train: ["data/helpsteer2/train_reg_2_epoch.jsonl"], validation: ["data/helpsteer2/val_reg.jsonl"], test: ["data/helpsteer2/val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/reward_model_13b \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
exp_manager.wandb_logger_kwargs.name=rm_training \
trainer.rm.save_interval=10 \
trainer.rm.max_steps=317 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=2 \
++model.activations_checkpoint_method="uniform" \
++model.activations_checkpoint_num_layers=1 \
++model.sequence_parallel=False \
model.optim.sched.constant_steps=0 \
model.optim.sched.warmup_steps=10 \
model.reward_model_type="regression" \
model.optim.lr=2e-6 \
model.optim.sched.min_lr=2e-6 \
model.regression.num_attributes=9
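The ``trainer.rm.max_steps=317`` value follows from the data and batch size: the train file is duplicated into two epochs and consumed with a global batch size of 128. A back-of-the-envelope check (the HelpSteer2 train-split size used here is approximate):

.. code-block:: python

    # Rough derivation of trainer.rm.max_steps for the two-epoch HelpSteer2 run.
    num_train_records = 20_324        # approximate HelpSteer2 train-split size (assumption)
    global_batch_size = 128
    print((2 * num_train_records) // global_batch_size)   # ~317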
If you're interested in replicating Reward Modeling training in HelpSteer2-Preference, please follow the steps below instead. This training has two stages: first, train a SteerLM Regression model on the helpfulness-only data; then, continue from that checkpoint with the Scaled Bradley-Terry loss.

.. code-block:: bash
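# First stage of Reward Model training: SteerLM Regression on the helpfulness-only HelpSteer2 data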
python /opt/NeMo-Aligner/examples/nlp/gpt/train_reward_model.py \
trainer.num_nodes=8 \
trainer.devices=8 \
++model.micro_batch_size=2 \
++model.global_batch_size=128 \
++model.data.data_impl=jsonl \
pretrained_checkpoint.restore_from_path=/models/llama-3.1-70b-instruct.nemo \
"model.data.data_prefix={train: ["data/helpsteer2-only_helpfulness/train_reg_2_epoch.jsonl"], validation: ["data/helpsteer2-only_helpfulness/val_reg.jsonl"], test: ["data/helpsteer2-only_helpfulness/val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/helpsteer2-only_helpfulness-llama-3.1-70b-instruct \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
exp_manager.wandb_logger_kwargs.name=rm_training \
trainer.rm.save_interval=10 \
trainer.rm.max_steps=317 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=2 \
++model.activations_checkpoint_method="uniform" \
++model.activations_checkpoint_num_layers=1 \
++model.sequence_parallel=False \
model.optim.sched.constant_steps=0 \
model.optim.sched.warmup_steps=10 \
model.reward_model_type="regression" \
model.optim.lr=2e-6 \
model.optim.sched.min_lr=2e-6 \
model.regression.num_attributes=9
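# Second stage of Reward Model training: Scaled Bradley-Terry, continuing from the helpfulness-only checkpoint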
python /opt/NeMo-Aligner/examples/nlp/gpt/train_reward_model.py \
trainer.num_nodes=4 \
trainer.devices=8 \
++model.micro_batch_size=2 \
++model.global_batch_size=128 \
++model.data.data_impl=jsonl \
pretrained_checkpoint.restore_from_path=/results/helpsteer2-only_helpfulness-llama-3.1-70b-instruct/checkpoints/megatron_gpt.nemo \
"model.data.data_prefix={train: ["data/helpsteer2-pref/train_reg.jsonl"], validation: ["data/helpsteer2-pref/val_reg.jsonl"], test: ["data/helpsteer2-pref/val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/helpsteer2-only_helpfulness-llama-3.1-70b-instruct-then-scaled-bt \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
exp_manager.wandb_logger_kwargs.name=rm_training \
trainer.rm.save_interval=10 \
trainer.rm.max_steps=105 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=4 \
++model.activations_checkpoint_method="uniform" \
model.global_batch_size=512 \
++model.activations_checkpoint_num_layers=1 \
++model.sequence_parallel=False \
model.optim.sched.constant_steps=0 \
model.optim.sched.warmup_steps=10 \
model.reward_model_type="regression" \
trainer.rm.train_random_sampler=False \
model.regression.loss_func=scaled_bt \
model.regression.load_rm_head_weights=True \
model.optim.lr=1e-6 \
model.optim.sched.min_lr=1e-6 \
model.regression.num_attributes=9
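This commit adds ``regular_bt``, ``margin_bt``, and ``scaled_bt`` options for ``model.regression.loss_func``. As rough intuition only, the variants differ in how they use the preference strength between a chosen and a rejected response. The following is a conceptual PyTorch sketch, not the NeMo-Aligner implementation, whose exact form may differ:

.. code-block:: python

    import torch.nn.functional as F

    def bt_family_loss(r_chosen, r_rejected, strength, variant="scaled_bt"):
        # r_chosen / r_rejected: reward tensors for the preferred / rejected responses.
        # strength: tensor of preference strengths (e.g. from HelpSteer2-Preference).
        gap = r_chosen - r_rejected
        if variant == "regular_bt":
            # plain Bradley-Terry: maximize the log-sigmoid of the reward gap
            return -F.logsigmoid(gap).mean()
        if variant == "margin_bt":
            # margin Bradley-Terry: the gap must also exceed a margin tied to preference strength
            return -F.logsigmoid(gap - strength).mean()
        if variant == "scaled_bt":
            # scaled Bradley-Terry: weight each pair's term by its preference strength
            return -(strength * F.logsigmoid(gap)).mean()
        raise ValueError(f"unknown variant: {variant}")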
124 changes: 124 additions & 0 deletions examples/nlp/data/steerlm/preprocess_helpsteer2_data.py
@@ -0,0 +1,124 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script preprocesses the HelpSteer2 dataset from HuggingFace format into the Attribute-conditioned SFT training format.
"""

import argparse
import json
import os

from common import ALL_STEERLM_ATTRIBUTES, SYSTEM_PROMPT
from datasets import load_dataset


def download_helpsteer2():
ds = load_dataset("nvidia/HelpSteer2")
train = ds["train"]
val = ds["validation"]
return train, val


def download_helpsteer2_preference():
ds = load_dataset("nvidia/HelpSteer2", data_dir="preference")["train"]
train = []
val = []

for dp in ds:
new_dp1 = {"prompt": dp["prompt"], "response": dp["response_1"], "quality": dp["preference_strength"]}

new_dp2 = {"prompt": dp["prompt"], "response": dp["response_2"], "quality": dp["preference_strength"]}

if dp["split"] == "train":
train.append(new_dp1)
train.append(new_dp2)
else:
val.append(new_dp1)
val.append(new_dp2)

return train, val


def format_label(dp, only_helpfulness=False):
label_list = []
for attr in ALL_STEERLM_ATTRIBUTES:
if attr in dp:
if only_helpfulness and attr != "helpfulness":
continue
label_list.append(f"{attr}:{dp[attr]}")
return ",".join(label_list)


def process_dataset(data, only_helpfulness=False):
output = []
for dp in data:
conversation_obj = {}
conversation_obj["conversations"] = [
{"value": dp["prompt"], "from": "User", "label": None},
{
"value": dp["response"],
"from": "Assistant",
"label": format_label(dp, only_helpfulness=only_helpfulness),
},
]
conversation_obj["system"] = SYSTEM_PROMPT
conversation_obj["mask"] = "User"
conversation_obj["type"] = "VALUE_TO_TEXT"
output.append(conversation_obj)
return output


def main(output_dir, preference=False, only_helpfulness=False):
if preference:
train, val = download_helpsteer2_preference()
else:
train, val = download_helpsteer2()

os.makedirs(output_dir, exist_ok=True)
processed_train = process_dataset(train, only_helpfulness=only_helpfulness)

with open(f"{output_dir}/train.jsonl", "w", encoding="utf-8") as f:
for record in processed_train:
f.write(json.dumps(record, ensure_ascii=False) + "\n")

processed_val = process_dataset(val, only_helpfulness=only_helpfulness)
with open(f"{output_dir}/val.jsonl", "w", encoding="utf-8") as f:
for record in processed_val:
f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
parser = argparse.ArgumentParser()

parser.add_argument(
"-dir",
"--output_directory",
required=True,
help="folder to store the created train.jsonl and val.jsonl; will be created if it does not exist",
)

parser.add_argument(
"-oh", "--only_helpfulness", action="store_true", help="Use only the Helpfulness attribute",
)

parser.add_argument(
"-pref",
"--preference",
action="store_true",
help="Use HelpSteer2-preference meant for Bradley-Terry reward modelling instead of regular HelpSteer2",
)
args = parser.parse_args()

main(args.output_directory, preference=args.preference, only_helpfulness=args.only_helpfulness)
8 changes: 6 additions & 2 deletions examples/nlp/gpt/conf/training_rm.yaml
@@ -13,6 +13,8 @@ trainer:
max_steps: -1
val_check_interval: 100
save_interval: 100
train_random_sampler: True # whether to randomly shuffle the train set
val_random_sampler: False # whether to randomly shuffle the val set

# how many GBS we loop over
# set to float for a percentage
@@ -63,10 +65,12 @@ model:
merge_attributes: False # whether to merge multiple attributes into a scalar
attribute_weights: null # apply these weights to each attributes when merging them into a scalar
loss_mask_val: -100 # mask dimensions with this value when calculating MSE loss
loss_func: regression # ["regression", "regular_bt", "margin_bt", "scaled_bt"]
load_rm_head_weights: False # [False, True] False loads only the base model, while True also loads the rm_head weights (useful for initializing rm_head from a model that already contains an rm_head)
output_sequence: False # Whether to output a single scalar or a sequence of scalars.
use_avg_pool: False # Whether to use avg pool to sum across the sequence dim in reward model
force_head_dtype: bfloat16 #float32 # enforce specific dtype for the final projection in the model head
micro_batch_size: 2 # please do not adjust MBS to other values for xxx_bt implementations
global_batch_size: 64
megatron_amp_O2: True
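The merge_attributes and attribute_weights options above control whether the per-attribute rewards are combined into a single scalar. Conceptually the merge is a weighted sum over the attribute predictions; a sketch with illustrative values, not the actual implementation:

# Conceptual sketch of merging per-attribute rewards into one scalar (illustrative values only).
attribute_rewards = [4.0, 3.5, 2.0, 4.0, 1.0, 3.0, 2.5, 4.0, 3.0]   # num_attributes=9
attribute_weights = [0, 0, 0, 0, 0, 0, 0, 0, 1]                     # e.g. keep only one attribute
scalar_reward = sum(w * r for w, r in zip(attribute_weights, attribute_rewards))
print(scalar_reward)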
