Releases: linkedin/Liger-Kernel
v0.5.2: Fix Qwen2VL mrope for transformers>=4.47
What's Changed
- Disable Qwen2 VL test for with logits conv test by @ByronHsu in #463
- Fix Qwen2VL mrope for transformers 4.47.0 by @li-plus in #464
- Revert Workaround of Disabling QWEN2_VL in Convergence Tests by @austin362667 in #466
Full Changelog: v0.5.1...v0.5.2
v0.5.1: Patch Fix Import Error
What's Changed
Full Changelog: v0.5.0...v0.5.1
v0.5.0: First open source optimized Post Training Loss, AMD CI, XPU Support
Highlights
- Post Training Loss: Introducing the first open-source optimized post-training losses in Liger Kernel with ~80% memory reduction, featuring DPO, CPO, ORPO, SimPO, JSD, and more. No more OOM nightmares for post-training ML researchers!
- AMD CI: With AMD's generous sponsorship of MI300s, we've integrated them into our CI. Special thanks to Embedded LLM for building the AMD CI infrastructure. #428
- XPU Support: In collaboration with Intel, we now support XPU, demonstrating comparable performance gains with other vendors. #407
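The memory savings in these post-training losses come from chunking: the final linear projection is fused with the loss and computed a few rows at a time, so the full (N × vocab) logits tensor is never materialized at once. Below is a toy pure-Python sketch of that idea only; it is illustrative, not the actual Triton kernels, and the function name and chunk size are made up.

```python
import math

def chunked_linear_ce(hidden, weight, targets, chunk_size=2):
    """Mean cross-entropy of softmax(hidden @ weight.T) vs targets,
    processed `chunk_size` rows at a time so the full (N x V) logits
    matrix never exists all at once (the core chunking idea)."""
    n = len(hidden)
    total = 0.0
    for start in range(0, n, chunk_size):
        rows = hidden[start:start + chunk_size]
        tgts = targets[start:start + chunk_size]
        for row, t in zip(rows, tgts):
            # logits for this single row only
            logits = [sum(h * w for h, w in zip(row, wrow)) for wrow in weight]
            m = max(logits)  # stabilized log-sum-exp
            lse = m + math.log(sum(math.exp(l - m) for l in logits))
            total += lse - logits[t]  # -log softmax(logits)[t]
    return total / n
```

Because each chunk's logits are discarded after its loss contribution is accumulated, peak memory scales with the chunk size rather than the sequence length, which is where the reported ~80% reduction comes from in spirit.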
What's Changed
- Adds the CPO Alignment Loss Function by @pramodith in #382
- Qwen2-VL Training Example w/ Liger by @tyler-romero in #389
- Support Qwen2-VL's multimodal RoPE implementation by @li-plus in #384
- add xpu device support for `rms_norm` by @faaany in #379
- fix qwen2 import failure in test by @ByronHsu in #394
- Add Chunked SimPO Loss by @pramodith in #386
- Add script to reproducibly run examples on Modal by @tyler-romero in #397
- add nn.module support for chunked loss function by @shivam15s in #402
- Generalize JSD to FKL/RKL by @yundai424 in #393
- Enable keyword arguments for liger functional by @hongpeng-guo in #400
- add reference model logps to chunkedloss interface and fix dpo loss fn by @shivam15s in #405
- Optimize CE Loss by casting dtype to float32 inside kernel by @pramodith in #406
- Xpu support by @mgrabban in #407
- Fix `get_batch_loss_metrics` comments by @austin362667 in #413
- Add rebuild to CI by @ByronHsu in #415
- Fix os env by @ByronHsu in #416
- Adjust QWEN2 VL Loss `rtol` by @austin362667 in #412
- [tiny] Add QwQ to readme (same arch as Qwen2) by @tyler-romero in #424
- Enhance Cross Entropy Softcap Unit Test by @austin362667 in #423
- Add ORPO Trainer + support HF metrics directly from chunked loss functions + fixes to avoid torch compile recompilations by @shivam15s in #429
- Add Build Success/Fail Badge by @hebiao064 in #431
- Switch amd-ci to use MI300X runner. by @saienduri in #428
- [CI] rename ci and add cron job for amd by @ByronHsu in #433
- [CI] shorten ci name by @ByronHsu in #434
- update ci icon on readme by @bboyleonp666 in #440
- Introduce Knowledge Distillation Base by @austin362667 in #432
- [AMD] [CI] Clean up `amd-ci` by @tjtanaa in #436
- Add xpu in env report by @abhilash1910 in #443
- Specify scheduled CI in AMD badge by @ByronHsu in #446
- improve code quality for chunk loss by @ByronHsu in #448
- Add paper link and formula for preference loss by @ByronHsu in #449
- Make kernel doc lean by @ByronHsu in #450
- Fix LigerCrossEntropyLoss Reduction Behavior for "None" Mode by @hebiao064 in #435
- add eng blog by @ByronHsu in #452
- add chunked loss to readme by @shivam15s in #453
- change chunked readme by @shivam15s in #454
- add sponsorship and collab by @ByronHsu in #457
- version bump to 0.5.0 by @shivam15s in #455
- Add HIP (ROCm) and Liger Kernel to env report by @Comet0322 in #456
New Contributors
- @li-plus made their first contribution in #384
- @faaany made their first contribution in #379
- @hongpeng-guo made their first contribution in #400
- @mgrabban made their first contribution in #407
- @hebiao064 made their first contribution in #431
- @saienduri made their first contribution in #428
- @bboyleonp666 made their first contribution in #440
- @abhilash1910 made their first contribution in #443
- @Comet0322 made their first contribution in #456
v0.4.2: Fix 'RMSNorm' object has no attribute 'in_place'
Highlights
What's Changed
- modify readmes and create license/acknowledgement docs by @shivam15s in #377
- Add Chunked ORPO Loss by @shivam15s in #362
- Refactor `LigerFusedLinearPreferenceBase` by @pramodith in #381
- Support Chunked DPO Loss Kernel by @austin362667 in #378
- Fix flce not being patched after reverting in convergence test by @Tcc0403 in #385
- Qwen2-VL Bug / Incompatibility Fixes by @tyler-romero in #388
- Fix incomplete RMSNorm patch by @Tcc0403 in #392
Full Changelog: v0.4.1...v0.4.2
v0.4.1: Gemma 2 Support, CrossEntropy Patching Fix, and GroupNorm
Highlights
- Gemma 2 Support: The long-pending Gemma 2 is finally supported, thanks to @Tcc0403! He implemented the nasty softcapping in fused linear cross entropy (#320) and discovered the convergence issue, which was later fixed by @ByronHsu and @Tcc0403 together. (#376)
- CrossEntropy Patching Fix: If you use the monkey patch for `CrossEntropy` (not FLCE), it was actually not patched after transformers>=4.46.1. This is because `CrossEntropy` was replaced with `F.cross_entropy` in the model code. We fixed the issue in PR #375.
- GroupNorm Kernel: Our new contributor @pramodith implemented a GroupNorm kernel (#353) with a 2x speedup.
What's Changed
- BUG: Fix bug in layer norm tests. by @pramodith in #359
- Support Z Loss in CE by @Tcc0403 in #239
- Improve compatibility to access the base models by @why-in-Shanghaitech in #340
- poke test again by @ByronHsu in #360
- Kernels for GroupNorm by @pramodith in #353
- Remove trailing newline. by @ckckjw in #364
- Fix typo in the description of FusedLinearJSD by @Tcc0403 in #366
- Updates Readme to add GroupNorm by @pramodith in #365
- Support FusedLinearCrossEntropy for Gemma2 by @Tcc0403 in #320
- Rotate modal and pypi tokens by @ByronHsu in #372
- Fix release password by @ByronHsu in #373
- Support CE after grad acc fix by @ByronHsu in #375
- Support out-of-place RMSNorm to fix gemma2 by @ByronHsu in #376
New Contributors
- @pramodith made their first contribution in #359
- @why-in-Shanghaitech made their first contribution in #340
- @ckckjw made their first contribution in #364
Full Changelog: v0.4.0...v0.4.1
v0.4.0: Full AMD support, Tech Report, Modal CI, Llama-3.2-Vision!
Highlights
- AMD GPU: We have partnered with Embedded LLM to adjust the Triton configuration to fully support AMD! With version 0.4.0, you can run multi-GPU training with 26% higher speed and 60% lower memory usage on AMD. See the full blog post at https://embeddedllm.com/blog/cuda-to-rocm-portability-case-study-liger-kernel. @Edenzzzz @DocShotgun @tjtanaa
- Technical Report: We have published a technical report on arXiv (https://arxiv.org/pdf/2410.10989) with abundant details.
- Modal CI: We have moved our entire GPU CI stack to Modal! Thanks to intelligent Docker layer caching and blazingly fast container startup and scheduling, we have reduced CI overhead by over 10x (from minutes to seconds).
- LLaMA 3.2-Vision Model: We have added kernel support for the LLaMA 3.2-Vision model. You can simply use `liger_kernel.transformers.apply_liger_kernel_to_mllama` to patch the model. @tyler-romero @shivam15s
- JSD Kernel: We have added the JSD kernel for distillation, which also comes with a chunked version! @Tcc0403 @yundai424 @qingquansong
- HuggingFace Gradient Accumulation Fixes: We have fixed the notorious HuggingFace gradient accumulation issue (huggingface/transformers#34191) by carefully adjusting the cross entropy scalar. You can now safely use v0.4.0 with the latest HuggingFace gradient accumulation fixes (transformers>=4.46.2)!
What's Changed
- Acknowledgement in NOTICE file by @momochen in #287
- Add JSD kernel by @Tcc0403 in #264
- Cancel in-progress but out-of-date GPU actions by @tyler-romero in #289
- Fix assert_verbose_allclose bugs by @Tcc0403 in #261
- fix qwen2-vl: create correct rope position_ids when position_ids is None by @Sanster in #276
- Add missing Qwen2-VL monkey patch test by @tyler-romero in #283
- FIX: tl.program_id() does indeed not have a cast method in triton2.3.1 by @wizyoung in #274
- RMSNorm aggregation by @Tcc0403 in #255
- FEAT Adding experimental feature : Triton mm int8xint2 by @MekkCyber in #195
- Add beta support for jsd by @Tcc0403 in #290
- chore: update cross_entropy.py by @eltociear in #293
- Apache and MIT license reference by @momochen in #294
- Monkeypatch for Llama 3.2-Vision by @tyler-romero in #282
- Add FusedLinearJSD by @Tcc0403 in #300
- Move `logits.float()` call by @ringohoffman in #308
- Added contributors and back to top by @barbarian360 in #304
- Add ignore_index and label to jsd and fl-jsd by @Tcc0403 in #306
- Monkey patch layer norm in mllama by @shivam15s in #302
- Introducing Liger Kernel Guru on Gurubase.io by @kursataktas in #316
- Update citation and add tech report by @ByronHsu in #317
- fix FLCE AMP issue by @yundai424 in #318
- fix fused JSD with ignore index by @yundai424 in #330
- Add missing ignore_index tests by @Tcc0403 in #310
- docs(CONTRIBUTING): fix typo by @novanish in #331
- Fix huggingface GA issue for llama by @ByronHsu in #333
- Fix incorrect training of first and last Medusa heads by @chiwanpark in #325
- Fix FusedLinearJSD precision issue when using AMP by @yundai424 in #336
- Fix llama forward patch by @hiyouga in #339
- [AMD] [ROCm] Pick `num_warps` based on platform by @tjtanaa in #326
- set up modal ci by @ByronHsu in #344
- avoid duplicate ci by @ByronHsu in #345
- Aggressively trim unit test bloat by @ByronHsu in #346
- Trim conv test by @ByronHsu in #348
- merge two tests into one by @ByronHsu in #349
- broadcast grad acc fix to all models by @ByronHsu in #354
New Contributors
- @Sanster made their first contribution in #276
- @MekkCyber made their first contribution in #195
- @ringohoffman made their first contribution in #308
- @barbarian360 made their first contribution in #304
- @kursataktas made their first contribution in #316
- @novanish made their first contribution in #331
- @hiyouga made their first contribution in #339
- @tjtanaa made their first contribution in #326
Full Changelog: v0.3.1...v0.4.0
v0.3.1: Patch Release
Summary
This patch release brings important updates and fixes to Liger-Kernel. Notable changes include:
- KLDiv calculation fix: KLDiv now functions correctly with larger vocab sizes
- SwiGLU/GeGLU casting fix: Program IDs are now cast to int64 in SwiGLU/GeGLU kernels to prevent memory errors with larger dimensions.
- AutoLigerKernelForCausalLM fix: The model now properly passes through all original keyword arguments
- Post-init model patching fix: Fix to post-init model patching to ensure HF Trainer integration works correctly
- Relaxed transformers dependency: Improved compatibility with a broader range of transformers versions.
What's Changed
- Remove debug print statement by @EdoardoLuciani in #247
- [Easy] Cast program_id to int64 in SwiGLU/GeGLU kernels by @hansonw in #251
- Fix a comment typo in flce by @Tcc0403 in #256
- Fix AutoLigerKernelForCausalLM to pass through original kwargs by @shimizust in #263
- Update contributing guide for adding a new model by @shivam15s in #260
- chore: Add Qwen2.5 and Phi3.5 to Readme by @tyler-romero in #265
- rename cuda mode to gpu mode by @msaroufim in #267
- Fix sharing a ResBlock layer for each head in Medusa example by @chiwanpark in #269
- Fix/kldiv by @S1ro1 in #262
- Post-init model patching fix by @shimizust in #280
- Relaxed transformers dependency by @shimizust in #270
- Disable gemma2 and qwen2_vl tests by @shimizust in #288
- Release version 0.3.1 by @shimizust in #286
New Contributors
- @EdoardoLuciani made their first contribution in #247
- @msaroufim made their first contribution in #267
Full Changelog: v0.3.0...v0.3.1
v0.3.0 Release Note
Opening Thoughts
Thank you, everyone! Your overwhelming support continues to fuel our passion for innovation. With your engagement, we've pushed the boundaries further in this release!
We are hosting our 1st IRL event, 'Scaling AI Infra - GPUs, Kernels, LLMs and More'. We will discuss Liger-Kernel and have invited speakers from the DeepSpeed, SGLang, and TensorCore teams. Please RSVP at our event page.
What's New
Large Vision Language Model Support
Welcome Qwen2-VL, our first venture into large vision language models! This expansion allows more versatility in applying our optimizations across different AI domains.
Patch Kernels on Model Instances
Enhancing flexibility, our latest API update supports model name string and instance as input, streamlining the integration with Hugging Face's SFT trainer. This enhancement ensures that you can easily patch Liger kernels into your models, whether you're starting from scratch or adapting an existing model setup.
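The dual-input behavior can be pictured as a small dispatcher that accepts either a model type name or an already-instantiated model. Everything below is a hypothetical illustration of the idea; the function and registry names are made up and are not Liger's actual API.

```python
def apply_liger_kernels(model_or_type, registry):
    """Hypothetical dispatcher: accept a model type name (str) or a
    model instance, then route to the matching patch function."""
    if isinstance(model_or_type, str):
        key = model_or_type.lower()
        instance = None            # string form: patch the modeling code globally
    else:
        key = type(model_or_type).__name__.lower()
        instance = model_or_type   # instance form: patch this model in place
    for name, patch_fn in registry.items():
        if name in key:
            return patch_fn(instance)
    raise ValueError(f"no Liger patch registered for {key!r}")
```

Accepting both forms is what lets the kernels slot into trainers (like Hugging Face's SFT trainer) that hand you an already-constructed model.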
SWIFT Trainer Integration
We're excited to be integrated into the SWIFT Trainer Framework. This integration signifies our commitment to delivering cutting-edge tools that empower the community toward enhancing training efficiency across all supported models.
New Kernels and Features
KL Divergence Kernel: Dive deeper into model behaviors with our new KL divergence kernel, perfect for those needing model distillation, alignment, and beyond.
Experimental Kernel for Embedding: Explore acceleration possibilities with our experimental kernel that optimizes embedding operations.
Extended Cross Entropy Functionality: Now we support label smoothing and sum reduction, enabling more robust training and flexible loss calculations for neural networks.
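Label smoothing mixes the one-hot target with a uniform distribution over the vocabulary, and sum reduction returns the total rather than the mean loss. A minimal pure-Python sketch of the math (not the Triton kernel; the batch layout and parameter names are illustrative):

```python
import math

def smoothed_ce(batch_logits, targets, eps=0.1, reduction="mean"):
    """Cross entropy where the target distribution puts (1 - eps) on the
    true class and eps/V uniformly over all V classes."""
    losses = []
    for logits, t in zip(batch_logits, targets):
        v = len(logits)
        m = max(logits)  # stabilized log-sum-exp
        lse = m + math.log(sum(math.exp(l - m) for l in logits))
        logp = [l - lse for l in logits]
        losses.append(-(1 - eps) * logp[t] - (eps / v) * sum(logp))
    return sum(losses) if reduction == "sum" else sum(losses) / len(losses)
```

With `eps=0` this reduces to plain cross entropy, and `reduction="sum"` is exactly the mean times the batch size.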
Get Involved and Stay Tuned
Join us on our journey! Connect with us on our CUDA MODE server's Discord channel, and don't forget to follow our official account on X for the latest updates: https://x.com/liger_kernel.
A Look Ahead
We're not stopping here! Looking forward, we plan to expand our support to include even more model families and to explore further optimizations and innovative features. Your feedback is invaluable, so please keep it coming as we shape the future of Liger together!
Acknowledgments
Your contributions make a difference! Thanks to everyone who has starred, contributed, and provided feedback. Each contribution enriches our community and helps us grow stronger together.
What's Changed
- Skip Tests for GPUs Not Supporting `bf16` by @austin362667 in #159
- [Operators] LayerNorm Kernels + LigerLayerNorm by @AndreSlavescu in #169
- README: ensure modeling code is patched before model instantiation by @tmm1 in #170
- Updated wave snippet to use AutoLigerKernelForCausalLM by @shimizust in #181
- [Documentation] LayerNorm added to README by @AndreSlavescu in #180
- Remove torch compile from benchmark scripts by @shimizust in #183
- Update release guide by @yundai424 in #167
- Extract forward/backward core computation bits outside of torch autograd context for easy reuse by @qingquansong in #178
- custom Embedding kernel by @AndreSlavescu in #135
- Feat/functional api by @S1ro1 in #172
- [feat] FusedLinearCrossEntropy support for Mixtral by @ryankert01 in #136
- [Docs] Update README to include LigerEmbedding by @AndreSlavescu in #186
- compute quantiles for memory usage by @kvignesh1420 in #187
- TypoFixed repo_foward -> rope_forward by @LucioPalmucci in #191
- Switch Lightning 1 GPU example to Qwen2 0.5B instruct model with 1024 max seq length by @qingquansong in #193
- [BUILD] Add pyproject.toml by @AndreSlavescu in #150
- ci fix by @AndreSlavescu in #202
- Update the casting logic of RMSNorm by @lancerts in #201
- Update test_rms_norm.py by @lancerts in #203
- Refactored benchmark tests by @shimizust in #196
- Update layer_norm.py by @lancerts in #207
- Uplift kernel APIs to top level by @austin362667 in #210
- Feat: Kl Divergence kernel by @S1ro1 in #194
- minor refactor of rms and layernorm by @lancerts in #213
- Fix compatibility issue on triton=2.3.1 by @Tcc0403 in #219
- Elaborate ack section by @ByronHsu in #222
- Add license in ack section by @ByronHsu in #224
- Reference Unsloth in header by @momochen in #216
- Add label smoothing for cross entropy by @Tcc0403 in #198
- Added HF use-case benchmark script by @shimizust in #223
- (fix) fix pyproject.toml by @wizyoung in #218
- Update swiglu and geglu forward: zeros_like -> empty_like by @IvanYashchuk in #217
- add repr infomation for layer_norm and rms_norm by @wizyoung in #220
- (fix) fix pyproject.toml by @wizyoung in #226
- Refactor/benchmarking visualizer by @S1ro1 in #212
- Feat: add kl div to readme by @S1ro1 in #229
- Monkeypatch for Qwen2-VL by @tyler-romero in #175
- Optimize fused_linear_cross_entropy when weight does not require grads by @hansonw in #237
- SWIFT Trainer Integration by @tastelikefeet in #240
- Add label smoothing to FLCE and unit tests by @Tcc0403 in #244
- Restore monkey patched modules by @austin362667 in #232
- Support for patching post-model initialization by @shimizust in #199
- Reduction support for CrossEntropy and Division by 0 Fix by @shivam15s in #153
- Release Liger-Kernel version 0.3.0 by @qingquansong in #246
New Contributors
- @austin362667 made their first contribution in #159
- @tmm1 made their first contribution in #170
- @S1ro1 made their first contribution in #172
- @ryankert01 made their first contribution in #136
- @kvignesh1420 made their first contribution in #187
- @LucioPalmucci made their first contribution in #191
- @momochen made their first contribution in #216
- @wizyoung made their first contribution in #218
- @IvanYashchuk made their first contribution in #217
- @hansonw made their first contribution in #237
- @tastelikefeet made their first contribution in #240
Full Changelog: v0.2.1...v0.3.0
v0.2.1
Patch Release
Fixes a bug in the Gemma patch function where the FLCE and CE flags were both true by default. Ruh roh.
What's Changed
- Bug fix for gemma: fused_linear_cross_entropy flag and cross_entropy flag are mutual exclusive by @JasonZhu1313 in #168
- Add gemma 7b it benchmark by @JasonZhu1313 in #166
- bump patch ver by @yundai424 in #171
Full Changelog: v0.2.0...v0.2.1
v0.2.0 Release Note
Opening Thoughts
Thank You!
We'd love to take this chance to express our sincere gratitude to the community! 2,500+ stars, 10+ new contributors, 50+ PRs, plus integration into Hugging Face, axolotl, and LLaMA-Factory in less than one week since going open source is totally beyond our expectations. Working together with all the cool people in the community is a delight, and we can't wait for further collaborations down the road!
Looking Ahead
We look forward to further enhancing our collaboration with the community and working together on a lot of cool stuff: support for more model families, squeezing out every optimization opportunity for kernels, and, why not, llama.triton?
Get Involved and Stay Tuned
Please feel free to join our Discord channel hosted in the CUDA MODE server, and follow our repo's official account on X: https://x.com/liger_kernel!
Welcome Phi3 and Qwen2
This release ships with support for other popular models including Phi3 and Qwen2. All existing kernels in Liger repo can be leveraged to boost your training with models from these families now. Please refer to our API guide for how to use.
Even Easier API
Experimenting with different model families and tired of having if-else everywhere just to switch between kernel patching functions? You can now try out our new model-agnostic API to apply Liger kernels. Still a one-liner, but more elegant :) For example:
from liger_kernel.transformers import AutoLigerKernelForCausalLM
# This AutoModel wrapper class automatically monkey-patches the
# model with the optimized Liger kernels if the model is supported.
model = AutoLigerKernelForCausalLM.from_pretrained(...)
More Features
- Support optional bias term in FusedLinearCrossEntropy (#144)
- Mistral is now equipped with the humongous memory reduction from FusedLinearCrossEntropy (#93)
- Gemma is now equipped with the humongous memory reduction from FusedLinearCrossEntropy (#111)
Bug Fixes
- Fixed import error when using `triton>=3.0.0` on NGC containers (#79)
- Fixed the missing offset in Gemma RMSNorm (#85) oops
- Added back missing dataclass entries in efficiency callback (#116)
- There was some confusion about which Gemma models we support; we now support all of them! (#125)
- Fall back to torch-native linear + CrossEntropy when no label is provided (#128)
- Match the exact dtype up and downcasting in Llama & Gemma for RMSNorm (#92)
- Address the bug that RoPE gets very slow when using dynamic sequence length (#149)
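The "missing offset" fix refers to Gemma's RMSNorm variant, which scales the normalized activations by (1 + weight) rather than weight, so a zero-initialized weight acts as the identity scale. A minimal sketch of the two variants (illustrative, not the Triton kernel):

```python
import math

def rms_norm(x, weight, eps=1e-6, gemma_offset=True):
    """RMSNorm over a single vector. With gemma_offset=True, scale by
    (1 + weight) instead of weight -- the offset Gemma requires."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    scale = [(1.0 + w) if gemma_offset else w for w in weight]
    return [v / rms * s for v, s in zip(x, scale)]
```

Without the offset, a freshly zero-initialized weight silently zeroes the activations, which is exactly the kind of divergence the fix addressed.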
What's Changed
- Updated test tolerances for H100 by @shimizust in #55
- Update README.md by @lancerts in #58
- Update benchmark result of Medusa for batch size = 6 setup by @JasonZhu1313 in #59
- Add star graph by @shivam15s in #60
- Add monkey patch for Qwen2 models by @chiwanpark in #69
- Add pytest and datasets to dev dependencies by @chiwanpark in #68
- Fix typos by @pchng in #77
- Remove unused images in `examples/medusa/docs/images/` by @pchng in #78
- chore: update cross_entropy.py by @eltociear in #84
- Fix incorrect import for triton 3 by @arvindsun in #79
- update install from source guide by @yundai424 in #86
- Fix Gemma RMSNorm by @davidgonmar in #85
- Fix example bugs by @qingquansong in #88
- Make tests passing on AMD GPU with 24GB ram by @helloworld1 in #90
- modified: README.md by @leaf-soba in #91
- pytest without need to dealing with PYTHONPATH by @helloworld1 in #95
- Update test_cross_entropy.py by @lancerts in #94
- Add FusedLinerCrossEntropy support for Mistral by @Tcc0403 in #93
- Remove duplicate images by @qingquansong in #107
- Add Qwen benchmarks by @shivam15s in #108
- Fix Mixtral typo by @Tcc0403 in #109
- Explicitly add dependencies in req.txt for medusa example by @JasonZhu1313 in #110
- Add convergence tests and trainer integration test for Qwen2 by @Tcc0403 in #105
- [Bug fix] Efficiency callback missing dataclass entries by @tyler-romero in #116
- Monkeypatch for Phi3 by @tyler-romero in #76
- Add FusedLinearCrossEntropy to Gemma by @Luke-Chesley in #111
- Makefile command for env-report by @tyler-romero in #114
- [WIP] Fix confusion on Gemma by @yundai424 in #121
- [tiny] reformat code by @tyler-romero in #122
- Revert "[WIP] Fix confusion on Gemma (#121)" by @yundai424 in #123
- Fix gemma 1 and 2 support by @yundai424 in #125
- Adding AutoLigerKernelForCausalLM by @shimizust in #115
- fallback to torch native linear+CE when without label by @yundai424 in #128
- Add code to save medusa heads and model by @JasonZhu1313 in #130
- Add FusedLinerCrossEntropy support for Phi3 by @tyler-romero in #103
- Add GPU CI support by @helloworld1 in #134
- Make GPU CI optional until it is more stable by @helloworld1 in #141
- Add gemma lightning example for single L40 GPU by @qingquansong in #120
- feat: correct casts in RMSNorm to match references by @davidgonmar in #92
- Bias for fused linear cross entropy by @davidgonmar in #144
- Rerun FLCE benchmark after bias added by @ByronHsu in #148
- updated sl to be non-constexpr by @AndreSlavescu in #149
- update readme to use absolute paths by @shaoruu in #157
- fix convergence test, phi3 import and update benchmark by @yundai424 in #155
- bump lowest HF version by @yundai424 in #158
- Add missing tf_keras to req.txt by @JasonZhu1313 in #161
- Re-enable GPU CI enforce by @helloworld1 in #142
- Bump package ver by @yundai424 in #163
- Update version in setup.py to 0.2.0 by @yundai424 in #164
New Contributors
- @chiwanpark made their first contribution in #69
- @pchng made their first contribution in #77
- @eltociear made their first contribution in #84
- @arvindsun made their first contribution in #79
- @davidgonmar made their first contribution in #85
- @leaf-soba made their first contribution in #91
- @Tcc0403 made their first contribution in #93
- @tyler-romero made their first contribution in #116
- @Luke-Chesley made their first contribution in #111
- @AndreSlavescu made their first contribution in #149
- @shaoruu made their first contribution in #157
Full Changelog: v0.1.1...v0.2.0