feat: update reward model to support scaled and margin BT (#361)
Signed-off-by: Zhilin Wang <[email protected]>
Signed-off-by: NeMo-Aligner CI <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Terry Kong <[email protected]>
3 people authored Nov 1, 2024
1 parent b8dde4c commit d3493c7
Showing 10 changed files with 445 additions and 14 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/cicd-main.yml
@@ -84,7 +84,7 @@ jobs:
test_case:
- ppo-llama3-pp2-reshard
- dpo-llama3

- rm-llama3
with:
RUNNER: self-hosted-azure
# Fairly aggressive timeout that all functional tests should try to adhere to
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -34,6 +34,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
# Consume all durations and reset internal store
durations = timer.consume_durations()
```
- Add code and instructions for replicating Reward Modeling training in HelpSteer2 and HelpSteer2-Preference

### Breaking Changes
- Upgrade TRTLLM dependency from v0.10.0 to v0.12.0 and migrate from `GPTSession` cpp runtime to `ModelRunner` python runtime. Please use the latest Dockerfile.
161 changes: 152 additions & 9 deletions docs/user-guide/steerlm.rst
@@ -40,16 +40,16 @@ The two methods approach model alignment from different angles: RLHF reinforces
For details on SteerLM, please refer to our paper `SteerLM: Attribute Conditioned SFT as an (User-Steerable) Alternative to RLHF <https://arxiv.org/abs/2310.05344>`_.
For details about the HelpSteer dataset, please refer to our paper `HelpSteer: Multi-attribute Helpfulness Dataset for SteerLM <https://arxiv.org/abs/2311.09528>`_.

Train a SteerLM Model
#####################

This section is a step-by-step tutorial that walks you through how to run a full SteerLM pipeline with a Llama2 70B LLM model.

.. note::
Before starting this tutorial, be sure to review the :ref:`introduction <model-aligner-intro>` for tips on setting up your NeMo-Aligner environment.

Download the Llama 2 LLM Model
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Download the Llama 2 70B LLM model from HF <https://huggingface.co/meta-llama/Llama-2-70b-hf> into the models folder.

@@ -74,8 +74,13 @@ Download and Preprocess Data for Attribute Prediction Modeling
The prefix for the tokenizer would be different when extracted. Ensure that the correct tokenizer file is used when running the preceding command.

To follow the HelpSteer2 and HelpSteer2-Preference lines of work, you need the Llama 3 70B and Llama 3.1 70B Instruct models, respectively.

You need to obtain access to these models, download them, and then convert them to NeMo format in a similar manner.

Download and Preprocess Data for SteerLM Regression Reward Modeling
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

#. Download and convert both datasets into a common format:

@@ -85,7 +90,7 @@ Download and Preprocess Data for Attribute Prediction Modeling
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer_data.py --output_directory=data/helpsteer
#. Merge the two datasets for the train and val subsets, respectively:

.. code-block:: bash
@@ -106,10 +111,58 @@
--output-file=data/merge_val_reg.jsonl
If you are interested in replicating Reward Modeling training in HelpSteer2, please follow the steps below instead.


.. code-block:: bash
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer2_data.py --output_directory=data/helpsteer2
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2/train.jsonl \
--output-file=data/helpsteer2/train_reg.jsonl
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2/val.jsonl \
--output-file=data/helpsteer2/val_reg.jsonl
cat data/helpsteer2/train_reg.jsonl data/helpsteer2/train_reg.jsonl > data/helpsteer2/train_reg_2_epoch.jsonl
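To sanity-check the preprocessed files before training, you can count records and peek at the first one. This is a minimal snippet that only assumes the output paths used above; it makes no assumption about the exact record schema:

.. code-block:: python

    import json

    # Count records and show the keys of the first one for each preprocessed file.
    for path in ["data/helpsteer2/train_reg_2_epoch.jsonl", "data/helpsteer2/val_reg.jsonl"]:
        with open(path, encoding="utf-8") as f:
            lines = f.readlines()
        print(path, len(lines), "records; first record keys:", sorted(json.loads(lines[0])))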
If you're interested in replicating Reward Modeling training in HelpSteer2-Preference, please follow the steps below instead.

.. code-block:: bash
# for first stage of Reward Model training (i.e. SteerLM Regression)
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer2_data.py --output_directory=data/helpsteer2-only_helpfulness --only_helpfulness
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-only_helpfulness/train.jsonl \
--output-file=data/helpsteer2-only_helpfulness/train_reg.jsonl
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-only_helpfulness/val.jsonl \
--output-file=data/helpsteer2-only_helpfulness/val_reg.jsonl
cat data/helpsteer2-only_helpfulness/train_reg.jsonl data/helpsteer2-only_helpfulness/train_reg.jsonl > data/helpsteer2-only_helpfulness/train_reg_2_epoch.jsonl
# for second stage of Reward Model training (i.e. Scaled Bradley Terry)
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/preprocess_helpsteer2_data.py --output_directory=data/helpsteer2-pref -pref
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-pref/train.jsonl \
--output-file=data/helpsteer2-pref/train_reg.jsonl
python /opt/NeMo-Aligner/examples/nlp/data/steerlm/process_to_regression_format.py \
--input-file=data/helpsteer2-pref/val.jsonl \
--output-file=data/helpsteer2-pref/val_reg.jsonl
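For reference, the ``-pref`` preprocessing (see ``preprocess_helpsteer2_data.py`` added in this commit) turns each preference pair into two records that share a ``quality`` value taken from ``preference_strength``. A rough illustration with made-up values:

.. code-block:: python

    # Illustrative only: mirrors the pairing logic in preprocess_helpsteer2_data.py.
    dp = {
        "prompt": "Summarize the article.",        # made-up example values
        "response_1": "A concise, faithful summary.",
        "response_2": "An off-topic reply.",
        "preference_strength": 2,
        "split": "train",
    }
    records = [
        {"prompt": dp["prompt"], "response": dp["response_1"], "quality": dp["preference_strength"]},
        {"prompt": dp["prompt"], "response": dp["response_2"], "quality": dp["preference_strength"]},
    ]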
Train the Regression Reward Model on OASST+HelpSteer Data
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

In this tutorial, you train the regression reward model for 800 steps.

.. note::
Depending on the type of cluster you use, you may need to set up multi-node training in your cluster env. For details, please refer to https://lightning.ai/docs/pytorch/stable/clouds/cluster.html.
@@ -125,7 +178,6 @@ For this tutorial, train the regression reward model for 800 steps.
pretrained_checkpoint.restore_from_path=/models/llama13b/llama13b.nemo \
"model.data.data_prefix={train: ["data/merge_train_reg.jsonl"], validation: ["data/merge_val_reg.jsonl"], test: ["data/merge_val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/reward_model_13b \
trainer.rm.save_interval=100 \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
@@ -135,10 +187,101 @@ For this tutorial, train the regression reward model for 800 steps.
++model.tensor_model_parallel_size=4 \
++model.pipeline_model_parallel_size=1 \
++model.activations_checkpoint_granularity="selective" \
model.optim.sched.constant_steps=0 \
model.reward_model_type="regression" \
model.regression.num_attributes=9
If you're interested in replicating Reward Modeling training in HelpSteer2, please follow the steps below instead.

.. code-block:: bash
python /opt/NeMo-Aligner/examples/nlp/gpt/train_reward_model.py \
trainer.num_nodes=8 \
trainer.devices=8 \
++model.micro_batch_size=2 \
++model.global_batch_size=128 \
++model.data.data_impl=jsonl \
pretrained_checkpoint.restore_from_path=/models/llama-3-70b.nemo \
"model.data.data_prefix={train: ["data/helpsteer2/train_reg_2_epoch.jsonl"], validation: ["data/helpsteer2/val_reg.jsonl"], test: ["data/helpsteer2/val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/reward_model_13b \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
exp_manager.wandb_logger_kwargs.name=rm_training \
trainer.rm.save_interval=10 \
trainer.rm.max_steps=317 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=2 \
++model.activations_checkpoint_method="uniform" \
++model.activations_checkpoint_num_layers=1 \
++model.sequence_parallel=False \
model.optim.sched.constant_steps=0 \
model.optim.sched.warmup_steps=10 \
model.reward_model_type="regression" \
model.optim.lr=2e-6 \
model.optim.sched.min_lr=2e-6 \
model.regression.num_attributes=9
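The ``trainer.rm.max_steps=317`` value follows from the data and batch size: the train file is duplicated into two epochs and consumed with a global batch size of 128. A back-of-the-envelope check (the HelpSteer2 train-split size used here is approximate):

.. code-block:: python

    # Rough derivation of trainer.rm.max_steps for the two-epoch HelpSteer2 run.
    num_train_records = 20_324        # approximate HelpSteer2 train-split size (assumption)
    global_batch_size = 128
    print((2 * num_train_records) // global_batch_size)   # ~317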
If you're interested in replicating Reward Modeling training in HelpSteer2-Preference, please follow the steps below instead. This training has two stages: first, train a SteerLM Regression model on the helpfulness-only data; then, continue from that checkpoint with the Scaled Bradley-Terry loss.

.. code-block:: bash
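# First stage of Reward Model training: SteerLM Regression on the helpfulness-only HelpSteer2 data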
python /opt/NeMo-Aligner/examples/nlp/gpt/train_reward_model.py \
trainer.num_nodes=8 \
trainer.devices=8 \
++model.micro_batch_size=2 \
++model.global_batch_size=128 \
++model.data.data_impl=jsonl \
pretrained_checkpoint.restore_from_path=/models/llama-3.1-70b-instruct.nemo \
"model.data.data_prefix={train: ["data/helpsteer2-only_helpfulness/train_reg_2_epoch.jsonl"], validation: ["data/helpsteer2-only_helpfulness/val_reg.jsonl"], test: ["data/helpsteer2-only_helpfulness/val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/helpsteer2-only_helpfulness-llama-3.1-70b-instruct \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
exp_manager.wandb_logger_kwargs.name=rm_training \
trainer.rm.save_interval=10 \
trainer.rm.max_steps=317 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=2 \
++model.activations_checkpoint_method="uniform" \
++model.activations_checkpoint_num_layers=1 \
++model.sequence_parallel=False \
model.optim.sched.constant_steps=0 \
model.optim.sched.warmup_steps=10 \
model.reward_model_type="regression" \
model.optim.lr=2e-6 \
model.optim.sched.min_lr=2e-6 \
model.regression.num_attributes=9
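# Second stage of Reward Model training: Scaled Bradley-Terry, continuing from the helpfulness-only checkpoint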
python /opt/NeMo-Aligner/examples/nlp/gpt/train_reward_model.py \
trainer.num_nodes=4 \
trainer.devices=8 \
++model.micro_batch_size=2 \
++model.global_batch_size=128 \
++model.data.data_impl=jsonl \
pretrained_checkpoint.restore_from_path=/results/helpsteer2-only_helpfulness-llama-3.1-70b-instruct/checkpoints/megatron_gpt.nemo \
"model.data.data_prefix={train: ["data/helpsteer2-pref/train_reg.jsonl"], validation: ["data/helpsteer2-pref/val_reg.jsonl"], test: ["data/helpsteer2-pref/val_reg.jsonl"]}" \
exp_manager.explicit_log_dir=/results/helpsteer2-only_helpfulness-llama-3.1-70b-instruct-then-scaled-bt \
trainer.rm.val_check_interval=10 \
exp_manager.create_wandb_logger=True \
exp_manager.wandb_logger_kwargs.project=steerlm \
exp_manager.wandb_logger_kwargs.name=rm_training \
trainer.rm.save_interval=10 \
trainer.rm.max_steps=105 \
++model.tensor_model_parallel_size=8 \
++model.pipeline_model_parallel_size=4 \
++model.activations_checkpoint_method="uniform" \
model.global_batch_size=512 \
++model.activations_checkpoint_num_layers=1 \
++model.sequence_parallel=False \
model.optim.sched.constant_steps=0 \
model.optim.sched.warmup_steps=10 \
model.reward_model_type="regression" \
trainer.rm.train_random_sampler=False \
model.regression.loss_func=scaled_bt \
model.regression.load_rm_head_weights=True \
model.optim.lr=1e-6 \
model.optim.sched.min_lr=1e-6 \
model.regression.num_attributes=9
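This commit adds ``regular_bt``, ``margin_bt``, and ``scaled_bt`` options for ``model.regression.loss_func``. As rough intuition only, the variants differ in how they use the preference strength between a chosen and a rejected response. The following is a conceptual PyTorch sketch, not the NeMo-Aligner implementation, whose exact form may differ:

.. code-block:: python

    import torch.nn.functional as F

    def bt_family_loss(r_chosen, r_rejected, strength, variant="scaled_bt"):
        # r_chosen / r_rejected: reward tensors for the preferred / rejected responses.
        # strength: tensor of preference strengths (e.g. from HelpSteer2-Preference).
        gap = r_chosen - r_rejected
        if variant == "regular_bt":
            # plain Bradley-Terry: maximize the log-sigmoid of the reward gap
            return -F.logsigmoid(gap).mean()
        if variant == "margin_bt":
            # margin Bradley-Terry: the gap must also exceed a margin tied to preference strength
            return -F.logsigmoid(gap - strength).mean()
        if variant == "scaled_bt":
            # scaled Bradley-Terry: weight each pair's term by its preference strength
            return -(strength * F.logsigmoid(gap)).mean()
        raise ValueError(f"unknown variant: {variant}")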
124 changes: 124 additions & 0 deletions examples/nlp/data/steerlm/preprocess_helpsteer2_data.py
@@ -0,0 +1,124 @@
# Copyright (c) 2023, NVIDIA CORPORATION. All rights reserved.
#
# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at
#
# http://www.apache.org/licenses/LICENSE-2.0
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.

"""
This script preprocesses the HelpSteer2 dataset from HuggingFace format into the Attribute-conditioned SFT training format.
"""

import argparse
import json
import os

from common import ALL_STEERLM_ATTRIBUTES, SYSTEM_PROMPT
from datasets import load_dataset


def download_helpsteer2():
ds = load_dataset("nvidia/HelpSteer2")
train = ds["train"]
val = ds["validation"]
return train, val


def download_helpsteer2_preference():
ds = load_dataset("nvidia/HelpSteer2", data_dir="preference")["train"]
train = []
val = []

for dp in ds:
new_dp1 = {"prompt": dp["prompt"], "response": dp["response_1"], "quality": dp["preference_strength"]}

new_dp2 = {"prompt": dp["prompt"], "response": dp["response_2"], "quality": dp["preference_strength"]}

if dp["split"] == "train":
train.append(new_dp1)
train.append(new_dp2)
else:
val.append(new_dp1)
val.append(new_dp2)

return train, val


def format_label(dp, only_helpfulness=False):
label_list = []
for attr in ALL_STEERLM_ATTRIBUTES:
if attr in dp:
if only_helpfulness and attr != "helpfulness":
continue
label_list.append(f"{attr}:{dp[attr]}")
return ",".join(label_list)


def process_dataset(data, only_helpfulness=False):
output = []
for dp in data:
conversation_obj = {}
conversation_obj["conversations"] = [
{"value": dp["prompt"], "from": "User", "label": None},
{
"value": dp["response"],
"from": "Assistant",
"label": format_label(dp, only_helpfulness=only_helpfulness),
},
]
conversation_obj["system"] = SYSTEM_PROMPT
conversation_obj["mask"] = "User"
conversation_obj["type"] = "VALUE_TO_TEXT"
output.append(conversation_obj)
return output


def main(output_dir, preference=False, only_helpfulness=False):
if preference:
train, val = download_helpsteer2_preference()
else:
train, val = download_helpsteer2()

os.makedirs(output_dir, exist_ok=True)
processed_train = process_dataset(train, only_helpfulness=only_helpfulness)

with open(f"{output_dir}/train.jsonl", "w", encoding="utf-8") as f:
for record in processed_train:
f.write(json.dumps(record, ensure_ascii=False) + "\n")

processed_val = process_dataset(val, only_helpfulness=only_helpfulness)
with open(f"{output_dir}/val.jsonl", "w", encoding="utf-8") as f:
for record in processed_val:
f.write(json.dumps(record, ensure_ascii=False) + "\n")


if __name__ == "__main__":
parser = argparse.ArgumentParser()

parser.add_argument(
"-dir",
"--output_directory",
required=True,
help="folder to store the created train.jsonl and val.jsonl; will be created if it does not exist",
)

parser.add_argument(
"-oh", "--only_helpfulness", action="store_true", help="Use only the Helpfulness attribute",
)

parser.add_argument(
"-pref",
"--preference",
action="store_true",
help="Use HelpSteer2-preference meant for Bradley-Terry reward modelling instead of regular HelpSteer2",
)
args = parser.parse_args()

main(args.output_directory, preference=args.preference, only_helpfulness=args.only_helpfulness)
8 changes: 6 additions & 2 deletions examples/nlp/gpt/conf/training_rm.yaml
@@ -13,6 +13,8 @@ trainer:
max_steps: -1
val_check_interval: 100
save_interval: 100
train_random_sampler: True # whether to randomly shuffle the train set
val_random_sampler: False # whether to randomly shuffle the val set

# how many GBS we loop over
# set to float for a percentage
@@ -63,10 +65,12 @@ model:
merge_attributes: False # whether to merge multiple attributes into a scalar
attribute_weights: null # apply these weights to each attributes when merging them into a scalar
loss_mask_val: -100 # mask dimensions with this value when calculating MSE loss
loss_func: regression # ["regression", "regular_bt", "margin_bt", "scaled_bt"]
load_rm_head_weights: False # [False, True] False loads only the base model, while True also loads the rm_head weights (useful for initializing rm_head from a model that already contains an rm_head)
output_sequence: False # Whether to output a single scalar or a sequence of scalars.
use_avg_pool: False # Whether to use avg pool to sum across the sequence dim in reward model
force_head_dtype: bfloat16 #float32 # enforce specific dtype for the final projection in the model head
micro_batch_size: 2 # please do not adjust MBS to other values for xxx_bt implementations
global_batch_size: 64
megatron_amp_O2: True
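The merge_attributes and attribute_weights options above control whether the per-attribute rewards are combined into a single scalar. Conceptually the merge is a weighted sum over the attribute predictions; a sketch with illustrative values, not the actual implementation:

# Conceptual sketch of merging per-attribute rewards into one scalar (illustrative values only).
attribute_rewards = [4.0, 3.5, 2.0, 4.0, 1.0, 3.0, 2.5, 4.0, 3.0]   # num_attributes=9
attribute_weights = [0, 0, 0, 0, 0, 0, 0, 0, 1]                     # e.g. keep only one attribute
scalar_reward = sum(w * r for w, r in zip(attribute_weights, attribute_rewards))
print(scalar_reward)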
