feat: Upgrading TRTLLM to v13 (#320)

Signed-off-by: Terry Kong <[email protected]>
Signed-off-by: NeMo-Aligner CI <[email protected]>
Co-authored-by: oliver könig <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
NVIDIA · Nov 1, 2024 · b8dde4c
1 parent 29faee3
commit b8dde4c
Showing 39 changed files with 1,337 additions and 346 deletions.
diff --git a/.github/workflows/cicd-main.yml b/.github/workflows/cicd-main.yml
@@ -57,16 +57,22 @@ jobs:
     uses: ./.github/workflows/_build_container.yml
 
   Unit_Tests:
+    name: ${{ matrix.test_case }}
     needs: [build-container, pre-flight]
     uses: ./.github/workflows/_run_test.yml
     if: contains(fromJSON(needs.pre-flight.outputs.test_to_run), 'unit') || needs.pre-flight.outputs.all == 'true'
+    strategy:
+      matrix:
+        test_case:
+          - run_unit.sh
+          - run_mpi_unit.sh
     with:
       RUNNER: self-hosted-azure
       TIMEOUT: 10
       SCRIPT: |
         nvidia-smi
         cd ${ALIGNER_REPO_DIR}
-        bash tests/run_unit.sh
+        bash tests/${{ matrix.test_case }}
 
   Functional_Tests:
     name: ${{ matrix.test_case }}
@@ -76,15 +82,12 @@ jobs:
     strategy:
       matrix:
         test_case:
-          #- ppo-pp-llama3
+          - ppo-llama3-pp2-reshard
           - dpo-llama3
+
     with:
       RUNNER: self-hosted-azure
       # Fairly aggressive timeout that all functional tests should try to adhere to
-      TIMEOUT: 10
+      TIMEOUT: 8
       SCRIPT: |
-        export PYTHONPATH=${ALIGNER_REPO_DIR}:${PYTHONPATH:-}
-        nvidia-smi
-        git config --global --add safe.directory ${ALIGNER_REPO_DIR}
-        cd ${ALIGNER_REPO_DIR}
-        bash tests/functional/test_cases/${{ matrix.test_case }}.sh
+        bash /opt/NeMo-Aligner/tests/functional/test_cases/${{ matrix.test_case }}
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -11,16 +11,42 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 ### Breaking Changes
 
 ### Bug Fixes
+
+### Deprecation Notices
 -->
 
 ## [Next Version]
 
 ### New Features and Optimizations
 - Added support for Megatron’s distributed optimizer, which can be configured using `++model.optim.name=mcore_distributed_optim`.
+- Introduced `ScopedTimer` as a successor to `SyncedTimer`. `SyncedTimer` is marked for deprecation and will be removed in the next version.
+    ```python
+    from nemo_aligner.utils.distributed import ScopedTimer
+    timer = ScopedTimer()
+
+    # All durations are logged in the timer
+    with timer("step_time"):
+        with timer("fwd"):
+            model.fwd()
+        with timer("bwd"):
+            model.bwd()
+
+    # Consume all durations and reset internal store
+    durations = timer.consume_durations()
+    ```
 
 ### Breaking Changes
+- Upgrade TRTLLM dependency from v0.10.0 to v0.13.0 and migrate from `GPTSession` cpp runtime to `ModelRunner` python runtime. Please use the latest Dockerfile.
+- Using the latest TransformerEngine versions may require `++model.dist_ckpt_load_strictness=log_all` when loading from an older, pre-existing checkpoint to avoid errors.
+- NeMo-Aligner now requires Megatron-LM==0.9.0 for the APIs to calculate the microbatch sizes (API introduced `megatron.core.num_microbatches_calculator.reconfigure_num_microbatch_calculator`).
+- NeMo-Aligner now requires a version of NeMo with this change to how the MoE spec is handled: https://github.com/NVIDIA/NeMo/pull/9035 .
 
 ### Bug Fixes
+- For stability, scripts that launch the PPO training loop must now set `export NCCL_ALGO=...`. See the [RLHF docs](./docs/user-guide/rlhf.rst) for details.
+
+### Deprecation Notices
+- `SyncedTimer` is marked for deprecation and will be removed in `0.7.0`. Please switch to `ScopedTimer`.
+- `broadcast_2d_tensor` and `broadcast_2d_tensor_within_pp` are marked for deprecation and will be removed in `0.7.0`. Please switch to `broadcast_tensor` and `broadcast_tensor_within_pp`.
 
 ## NVIDIA NeMo-Aligner 0.5.0
 
@@ -32,6 +58,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 
 ### Bug Fixes
 - Change `log_prob_forward_micro_batch_size` in DPO to mean the same as the `micro_batch_size`, which is the number of samples (chosen and rejected included) processed at once.
+- PPO TensorRT-LLM acceleration no longer errors when using a tokenizer without a `pad_id`, such as the llama3 and llama3.1 tokenizers from Hugging Face.
 
 ## NVIDIA NeMo-Aligner 0.4.0
 - Implement reward-aware preference optimization.
@@ -51,7 +78,7 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 ### Breaking Changes
 - `inference.micro_batch_size` is now renamed to `inference.inference_micro_batch_size` when running reward model inference in `inference_rm.yaml`.  This is to stay consistent with the naming scheme of the PPO critic.
 - It is no longer possible to specify `add_EOS` when running reward model or critic inference.
-- NeMo-Aligner now requires Megatron-LM>=0.8.0 for the APIs to calculate the microbatch sizes.
+- NeMo-Aligner now requires Megatron-LM==0.8.0 for the APIs to calculate the microbatch sizes (API introduced `megatron.core.num_microbatches_calculator.reconfigure_microbatch_calculator`).
 
 ### Bug Fixes
 - Make `num_workers` for dataloaders 0 by default. This prevents issues when using MPI (with TRT-LLM) or more sophisticated launchers.
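
The `ScopedTimer` changelog entry in this file shows usage only. As a rough illustration of how a named, nestable timer with a `consume_durations()` method could work, here is a minimal sketch. `SketchScopedTimer` is a hypothetical stand-in and not the `nemo_aligner` implementation; summing repeated measurements under the same name is an assumption made for the example.

```python
import time
from contextlib import contextmanager


class SketchScopedTimer:
    """Hypothetical stand-in for a scoped timer; not the nemo_aligner class."""

    def __init__(self):
        # Maps a scope name to the total elapsed seconds recorded for it.
        self._durations = {}

    @contextmanager
    def __call__(self, name):
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed = time.perf_counter() - start
            # Assumption: repeated scopes with the same name are summed.
            self._durations[name] = self._durations.get(name, 0.0) + elapsed

    def consume_durations(self):
        # Return everything recorded so far and reset the internal store.
        durations, self._durations = self._durations, {}
        return durations


timer = SketchScopedTimer()
with timer("step_time"):
    with timer("fwd"):
        sum(range(1000))  # placeholder for a forward pass
    with timer("bwd"):
        sum(range(1000))  # placeholder for a backward pass

durations = timer.consume_durations()
print(sorted(durations))
```

Because each scope records its own start time, nested scopes need no shared stack, and the outer scope's duration includes the inner ones.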

diff --git a/Dockerfile b/Dockerfile
@@ -4,7 +4,7 @@
 #
 # To update NeMo-Aligner from a pre-built NeMo-Framework container:
 #
-#   docker buildx build --target=aligner-bump --build-arg=BASE_IMAGE=nvcr.io/nvidia/nemo:24.07 -t aligner:latest .
+#   docker buildx build --target=aligner-bump -t aligner:latest .
 #
 
 # Number of parallel threads for compute heavy build jobs
@@ -13,13 +13,12 @@ ARG MAX_JOBS=8
 # Git refs for dependencies
 ARG TE_TAG=7d576ed25266a17a7b651f2c12e8498f67e0baea
 ARG PYTRITON_VERSION=0.5.10
-ARG NEMO_TAG=e033481e26e6ae32764d3e2b3f16afed00dc7218  # On: r2.0.0rc1
-ARG MLM_TAG=a3fe0c75df82218901fa2c3a7c9e389aa5f53182  # On: core_r0.8.0
+ARG NEMO_TAG=19668e5320a2e2af0199b6d5e0b841993be3a634  # On: main
+ARG MLM_TAG=25059d3bbf68be0751800f3644731df12a88f3f3   # On: main
 ARG ALIGNER_COMMIT=main
-ARG TRTLLM_VERSION=v0.10.0
+ARG TRTLLM_VERSION=v0.13.0
 ARG PROTOBUF_VERSION=4.24.4
-
-ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.03-py3
+ARG BASE_IMAGE=nvcr.io/nvidia/pytorch:24.07-py3
 
 FROM ${BASE_IMAGE} AS aligner-bump
 ARG ALIGNER_COMMIT
@@ -36,13 +35,40 @@ git checkout -f $ALIGNER_COMMIT
 # case 2: ALIGNER_COMMIT is a commit, so git-pull is expected to fail
 git pull --rebase || true
 
-pip install --no-deps -e .
+pip install --no-cache-dir --no-deps -e .
 EOF
 
 FROM ${BASE_IMAGE} as final
 WORKDIR /opt
 # needed in case git complains that it can't detect a valid email, this email is fake but works
 RUN git config --global user.email "[email protected]"
+# install latest apex
+ARG APEX_TAG
+RUN pip uninstall -y apex && \
+    git clone https://github.com/NVIDIA/apex && \
+    cd apex && \
+    if [ ! -z $APEX_TAG ]; then \
+        git fetch origin $APEX_TAG && \
+        git checkout FETCH_HEAD; \
+    fi && \
+    pip install -v --no-build-isolation --disable-pip-version-check --no-cache-dir --config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam" ./
+
+# Git LFS
+RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
+    apt-get install git-lfs && \
+    git lfs install && \
+    apt-get clean
+
+# TRTLLM
+ARG TRTLLM_VERSION
+RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && \
+    cd TensorRT-LLM && \
+    git checkout ${TRTLLM_VERSION} && \
+    . docker/common/install_tensorrt.sh && \
+    python3 ./scripts/build_wheel.py --job_count $(nproc) --trt_root /usr/local/tensorrt  --python_bindings --benchmarks && \
+    pip install -e .
+ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12/compat/lib.real/
+
 # install TransformerEngine
 ARG MAX_JOBS
 ARG TE_TAG
@@ -56,17 +82,6 @@ RUN pip uninstall -y transformer-engine && \
     git submodule init && git submodule update && \
     NVTE_FRAMEWORK=pytorch NVTE_WITH_USERBUFFERS=1 MPI_HOME=/usr/local/mpi pip install .
 
-# install latest apex
-ARG APEX_TAG
-RUN pip uninstall -y apex && \
-    git clone https://github.com/NVIDIA/apex && \
-    cd apex && \
-    if [ ! -z $APEX_TAG ]; then \
-        git fetch origin $APEX_TAG && \
-        git checkout FETCH_HEAD; \
-    fi && \
-    pip install -v --no-build-isolation --disable-pip-version-check --no-cache-dir --config-settings "--build-option=--cpp_ext --cuda_ext --fast_layer_norm --distributed_adam --deprecated_fused_adam" ./
-
 # place any util pkgs here
 ARG PYTRITON_VERSION
 RUN pip install --upgrade-strategy only-if-needed nvidia-pytriton==$PYTRITON_VERSION
@@ -99,29 +114,32 @@ RUN pip uninstall -y megatron-core && \
     fi && \
     pip install -e .
 
-# Git LFS
-RUN curl -s https://packagecloud.io/install/repositories/github/git-lfs/script.deb.sh | bash && \
-    apt-get install git-lfs && \
-    git lfs install
-
 COPY --from=aligner-bump /opt/NeMo-Aligner /opt/NeMo-Aligner
 RUN cd /opt/NeMo-Aligner && \
     pip install --no-deps -e .
 
-# TRTLLM
-ARG TRTLLM_VERSION
-RUN git clone https://github.com/NVIDIA/TensorRT-LLM.git && \
-    cd TensorRT-LLM && \
-    git checkout ${TRTLLM_VERSION} && \
-    patch -p1 < ../NeMo-Aligner/setup/trtllm.patch && \
-    . docker/common/install_tensorrt.sh && \
-    python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt 
-
-RUN cd TensorRT-LLM && \
-    pip install ./build/tensorrt_llm*.whl
-ENV LD_LIBRARY_PATH=$LD_LIBRARY_PATH:/usr/local/cuda-12/compat/lib.real/
+RUN cd TensorRT-LLM && patch -p1 < ../NeMo-Aligner/setup/trtllm.patch
 
-# WAR(0.4.0): The pin of NeMo requires a higher nvidia-modelopt version than
-#             TRT-LLM allows. This installation must follow TRT-LLM and is
-#             only necessary when NeMo 2.0.0rc1 is installed with TRT-LLM v10.
-RUN pip install --upgrade-strategy only-if-needed nvidia-modelopt==0.13.0
+# TODO(terryk): This layer should be deleted ASAP after NeMo is bumped to include all of these PRs
+RUN <<"EOF" bash -exu
+cd NeMo
+# Ensures we don't cherry-pick "future" origin/main commits
+git fetch -a
+# 0c92fe17df4642ffc33d5d8c0c83fda729e3910c: [fix] Ensures disabling exp_manager with exp_manager=null does not error NeMo#10651
+# 60e677423667c029dd05875da72bf0719774f844: [feat] Update get_model_parallel_src_rank to support tp-pp-dp ordering NeMo#10652
+# 0deaf6716cb4f20766c995ce25d129795f1ae200: fix[export]: update API for disabling device reassignment in TRTLLM for Aligner NeMo#10863
+# (superseded by 10863) 148543d6e9c66ff1f8562e84484448202249811d: feat: Migrate GPTSession refit path in Nemo export to ModelRunner for Aligner NeMo#10654
+for pr_and_commit in \
+  "10651 0c92fe17df4642ffc33d5d8c0c83fda729e3910c" \
+  "10652 60e677423667c029dd05875da72bf0719774f844" \
+  "10863 0deaf6716cb4f20766c995ce25d129795f1ae200" \
+; do
+  pr=$(cut -f1 -d' ' <<<"$pr_and_commit")
+  head_pr_commit=$(cut -f2 -d' ' <<<"$pr_and_commit")
+  git fetch origin $head_pr_commit:PR-${pr}
+  # cherry-picks all commits between main and the top of the PR
+  git cherry-pick --allow-empty $(git merge-base origin/main PR-${pr})..PR-${pr}
+  # Tag cherry-picks to help
+  git tag cherry-pick-PR-${pr}
+done
+EOF
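
To make the cherry-pick loop in the new Dockerfile layer easier to follow, the sketch below lists the git commands it would issue for each PR, without running git. `cherry_pick_plan` is a hypothetical helper written for illustration; the PR numbers and commit hashes are copied from the layer itself.

```python
# PR number and head commit pairs, as listed in the Dockerfile layer.
PR_HEADS = [
    ("10651", "0c92fe17df4642ffc33d5d8c0c83fda729e3910c"),
    ("10652", "60e677423667c029dd05875da72bf0719774f844"),
    ("10863", "0deaf6716cb4f20766c995ce25d129795f1ae200"),
]


def cherry_pick_plan(pr_heads):
    """Return the shell commands the loop would run (hypothetical helper)."""
    plan = []
    for pr, head in pr_heads:
        branch = f"PR-{pr}"
        # Fetch the PR head commit into a local branch named after the PR.
        plan.append(f"git fetch origin {head}:{branch}")
        # Cherry-pick every commit between the merge base with main and the PR head.
        plan.append(
            f"git cherry-pick --allow-empty "
            f"$(git merge-base origin/main {branch})..{branch}"
        )
        # Tag the result so the applied PR is identifiable later.
        plan.append(f"git tag cherry-pick-{branch}")
    return plan


for command in cherry_pick_plan(PR_HEADS):
    print(command)
```

Each PR contributes three commands in order: fetch, cherry-pick the range, then tag.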
diff --git a/docs/user-guide/dpo.rst b/docs/user-guide/dpo.rst
@@ -184,7 +184,7 @@ For the following parameters, the ``model.dpo.ref_policy_kl_penalty`` correspond
                ++model.dpo.ref_policy_kl_penalty=0.1
             EOF
 
-            srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+            srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
             set +x
 
 The default DPO training tunes all parameters. To use LoRA, we can set ``model.peft.peft_scheme=lora`` and use different parameters in ``model.peft.lora_tuning``. Please check the parameters in `the config file <https://github.com/NVIDIA/NeMo-Aligner/blob/main/examples/nlp/gpt/conf/gpt_dpo.yaml>`__.

diff --git a/docs/user-guide/draftp.rst b/docs/user-guide/draftp.rst
@@ -164,7 +164,7 @@ To start reward model training, you need checkpoints for both the `UNet <https:/
                exp_manager.wandb_logger_kwargs.project=${PROJECT}
             EOF
 
-            srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+            srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
             set +x
 
 

diff --git a/docs/user-guide/rlhf.rst b/docs/user-guide/rlhf.rst
@@ -139,7 +139,7 @@ To launch reward model training, you must start with a pretrained or SFT-trained
                exp_manager.wandb_logger_kwargs.project=${PROJECT}
             EOF
 
-            srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+            srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
             set +x
 
 
@@ -257,6 +257,11 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH
    #SBATCH hetjob
    #SBATCH -N 1 --ntasks-per-node 8 -A <<ACCOUNT>> -p <<PARTITION>> --job-name <<JOBNAME>> -t 4:00:00 --exclusive
 
+   # To ensure determinism when calculating log probabilities between two forward-passes with identical weights, it is strongly
+   # recommended to set NCCL_ALGO. See https://github.com/NVIDIA/Megatron-LM/blob/b3375a0e38c10e2300ef4be031f7dcabab52b448/megatron/training/arguments.py#L593-L595
+   # for options.
+   export NCCL_ALGO=Tree
+
    NAME="2p_ppo"
 
    # PARAMETERS
@@ -305,7 +310,7 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH
       pretrained_checkpoint.restore_from_path=${RM_NEMO_FILE}
    EOF
 
-   srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
+   srun --no-container-mount-home --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
 
    sleep 30
 
@@ -356,7 +361,7 @@ You can use Slurm to launch both jobs and coordinate them together in a full RLH
       remote_critic_rm.critic.port=${CRITIC_PORT}
    EOF
 
-   srun --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_ppo}" &
+   srun --no-container-mount-home --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_ppo}" &
 
    wait
 

diff --git a/docs/user-guide/rs.rst b/docs/user-guide/rs.rst
@@ -160,7 +160,7 @@ You can use Slurm to launch the two jobs and get them to coordinate together in
       inference.port=${CRITIC_PORT}
    EOF
 
-   srun --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
+   srun --no-container-mount-home --het-group=0 -o $CRITIC_OUTFILE -e $CRITIC_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_critic_inference}" &
 
    sleep 30
 
@@ -216,7 +216,7 @@ You can use Slurm to launch the two jobs and get them to coordinate together in
       model.rs.top_n_rollouts=1
    EOF
 
-   srun --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_rs}" &
+   srun --no-container-mount-home --het-group=1 -o $PPO_OUTFILE -e $PPO_ERRFILE --container-image=${CONTAINER} $MOUNTS bash -c "${cmd_rs}" &
 
    wait
 

diff --git a/docs/user-guide/sft.rst b/docs/user-guide/sft.rst
@@ -227,7 +227,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner.
                exp_manager.checkpoint_callback_params.monitor=val_loss
             EOF
 
-            srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+            srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
             set +x
 
 If using sequence packing, replace the data paths with the paths to your packed datasets. For each packed dataset, you should also set ``packed_sequence=True`` in the config:
@@ -391,7 +391,7 @@ Now, you will use the data for supervised fine-tuning with NeMo-Aligner. Compare
                exp_manager.checkpoint_callback_params.monitor=validation_loss
             EOF
 
-            srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+            srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
             set +x
 
 

diff --git a/docs/user-guide/spin.rst b/docs/user-guide/spin.rst
@@ -165,7 +165,7 @@ For the following parameters, the ``model.spin.ref_policy_kl_penalty`` correspon
                model.data.train_ds.max_seq_length=4096
             EOF
 
-            srun -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
+            srun --no-container-mount-home -o $OUTFILE -e $ERRFILE --container-image=$CONTAINER $MOUNTS bash -c "${cmd}"
             set +x
 
 During SPIN training, several metrics will be recorded to WandB for you to monitor, chiefly acc (representing the percentage by which the model's implicit reward for the ground truth response exceeds that of the response generated by the reference policy).

diff --git a/examples/mm/stable_diffusion/train_sd_draftp.py b/examples/mm/stable_diffusion/train_sd_draftp.py
@@ -120,7 +120,7 @@ def main(cfg) -> None:
     ptl_model.reward_model = reward_model
 
     ckpt_callback = add_custom_checkpoint_callback(trainer, ptl_model)
-    timer = Timer(cfg.exp_manager.get("max_time_per_run", "0:12:00:00"))
+    timer = Timer(cfg.exp_manager.get("max_time_per_run") if cfg.exp_manager else None)
 
     draft_p_trainer = SupervisedTrainer(
         cfg=cfg.trainer.draftp_sd,

diff --git a/examples/mm/stable_diffusion/train_sdxl_draftp.py b/examples/mm/stable_diffusion/train_sdxl_draftp.py
@@ -243,7 +243,7 @@ def checkpoint_check_fn(module):
     torch.distributed.barrier()
 
     ckpt_callback = add_custom_checkpoint_callback(trainer, ptl_model)
-    timer = Timer(cfg.exp_manager.get("max_time_per_run", "0:24:00:00"))
+    timer = Timer(cfg.exp_manager.get("max_time_per_run") if cfg.exp_manager else None)
 
     draft_p_trainer = SupervisedTrainer(
         cfg=cfg.trainer.draftp_sd,
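
Both DRaFT+ scripts now guard against a disabled `exp_manager` (for example `exp_manager=null`) before reading `max_time_per_run`, and they no longer supply a hard-coded default duration. The lookup can be sketched as follows, assuming plain dictionaries in place of the OmegaConf config objects the scripts actually use; `resolve_max_time_per_run` is a hypothetical helper name.

```python
from typing import Any, Mapping, Optional


def resolve_max_time_per_run(cfg: Mapping[str, Any]) -> Optional[str]:
    """Return exp_manager.max_time_per_run, or None when it is unavailable."""
    exp_manager = cfg.get("exp_manager")
    # exp_manager may be None when it is disabled; the conditional avoids
    # calling .get on None, which would raise AttributeError.
    return exp_manager.get("max_time_per_run") if exp_manager else None


print(resolve_max_time_per_run({"exp_manager": None}))
print(resolve_max_time_per_run({"exp_manager": {"max_time_per_run": "0:12:00:00"}}))
```

With this pattern, a missing key and a disabled `exp_manager` both yield `None`, so whatever consumes the value must accept `None`.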