Fix LigerCrossEntropyLoss Reduction Behavior for "None" Mode (linkedin#435)

## Summary

Closes linkedin#421

This pull request addresses an issue in the `cross_entropy_forward` function where the `reduction="none"` mode did not behave as expected. Previously, the function always returned a single scalar value, even when `reduction="none"` was specified. This update ensures that when `reduction="none"` is used, the function returns the unreduced loss array (`loss_1d`) instead of summing it.

### Changes Made:

- Added a condition to handle `reduction="none"`, ensuring the function returns `loss_1d` directly (a minimal sketch of this handling appears at the end of this description).
- Updated the computation of `z_loss` to respect the `reduction="none"` mode.
- Added a test for the `reduction="none"` case.

### Why do we pass `gradient` to `output.backward()`?

#### Background on Gradients in PyTorch

- **Scalar outputs**: When a tensor is a scalar (a single number), PyTorch can compute gradients automatically by assuming the scalar has an implicit gradient of 1.0.
- **Non-scalar outputs**: For tensors that are not scalars, gradients must be provided explicitly because PyTorch cannot infer the shape or distribution of the gradients. Without them, it raises the error: "grad can be implicitly created only for scalar outputs."

#### Why `reduction="none"` Needs Explicit Gradients

When `reduction="none"`, the loss function does not reduce the per-example loss values into a single scalar. Instead, it outputs a vector of losses, with one value per example in the batch. Because the loss tensor has multiple values, PyTorch cannot assume what the gradient for each of these values should be unless it is provided explicitly.

#### The Fix

By passing `gradient=torch.ones_like(loss)` to `backward()` (see the usage example at the end of this description):

- **Gradient tensor**: `torch.ones_like(loss)` serves as the gradient tensor. It specifies that each element in the loss tensor contributes equally to the gradients during backpropagation.
- **Shape match**: The gradient tensor's shape matches the loss tensor's shape, fulfilling PyTorch's requirement for non-scalar outputs during `backward()`.

## Testing Done

`make test`

`pytest /home/jobuser/Liger-Kernel/test/transformers/test_cross_entropy.py` shows:

```
=================================== 93 passed, 1 warning in 13.18s ===================================
```

- Hardware Type: NVIDIA A100-SXM4-80GB
- [x] run `make test` to ensure correctness
- [x] run `make checkstyle` to ensure code style
- [x] run `make test-convergence` to ensure convergence
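The reduction-handling change can be illustrated with a minimal sketch. This is not the actual Liger-Kernel code (the real `cross_entropy_forward` operates on Triton kernel outputs); the helper name `finalize_loss` is hypothetical, and the variable names `loss_1d` / `z_loss_1d` follow the description above.

```python
import torch


def finalize_loss(loss_1d: torch.Tensor, z_loss_1d: torch.Tensor, reduction: str):
    """Hypothetical helper mirroring the reduction handling described above."""
    if reduction == "none":
        # New behavior: return the unreduced per-example losses (and z_loss)
        # directly instead of collapsing them to a scalar.
        return loss_1d, z_loss_1d
    # Existing behavior for the other modes: reduce to a single scalar.
    return loss_1d.sum(), z_loss_1d.sum()
```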
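For reference, here is a self-contained example of the `backward()` call pattern discussed above. It uses `torch.nn.CrossEntropyLoss` so it runs without Liger-Kernel installed; the same pattern applies to `LigerCrossEntropyLoss` with `reduction="none"`.

```python
import torch

logits = torch.randn(4, 10, requires_grad=True)  # (batch, num_classes)
targets = torch.randint(0, 10, (4,))

# Unreduced loss: one value per example, so it is not a scalar.
loss = torch.nn.CrossEntropyLoss(reduction="none")(logits, targets)
assert loss.shape == (4,)

# loss.backward() alone would raise:
#   "grad can be implicitly created only for scalar outputs"
# Passing a gradient tensor of matching shape satisfies autograd.
loss.backward(gradient=torch.ones_like(loss))
```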