Couple of FA optimizations #608

Merged 11 commits on Jul 19, 2024

Commits on Jun 19, 2024

  1. Add Perf Kernels

    Add Perf Kernels
    
    This is a combination of 2 commits.
    
    Add Perf Kernels
    
    Add Perf Kernels
    
    This is a combination of 6 commits.
    
    add perf-kernels
    
    fix formatting issues
    
    fix unused variables and other bugs
    
    fix other issues
    
    remove scripts
    
    save
    
    check changes
    
    format
    
    save
    
    save
    
    try
    
    pre-commit check
    
    save
    micmelesse committed Jun 19, 2024
    3788c64
  2. skip backward (#586)

    micmelesse committed Jun 19, 2024
    cc535d3
  3. Change all block pointers to tensor pointers (#585)

    Change all block pointers to tensor pointers
    
    Block pointers are intended for NVIDIA's TMA. They are also useful for regular loads, but are not well supported.
    
    Also cleaned up some code I came across along the way and updated the comment at the top.
    vgokhale authored and micmelesse committed Jun 19, 2024
    cfb231f
  4. Add support for bshd layout (#587)

    Add support for layouts commonly used by users.
    
    Add an option for the varlen / thd layout to specify equal context lengths for all batches, which is also often used by users.
    vgokhale authored and micmelesse committed Jun 19, 2024
    18930eb
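
A minimal sketch of what the layout support in "Add support for bshd layout (#587)" amounts to, assuming contiguous tensors. It shows how the same logical index [batch, head, seq, dim] maps to different strides under bhsd and bshd, and how equal context lengths could be expressed for thd. The function names and the cumulative-offset representation are illustrative assumptions, not the PR's code.

```python
def contiguous_strides(shape):
    """Row-major element strides for a contiguous tensor of the given shape."""
    strides = []
    step = 1
    for dim in reversed(shape):
        strides.append(step)
        step *= dim
    return tuple(reversed(strides))


def qkv_strides(layout, batch, heads, seqlen, head_dim):
    """Return (stride_b, stride_h, stride_s, stride_d) so that a kernel can
    always index as [b, h, s, d], whatever the storage layout is.
    Hypothetical helper; assumes the tensor is contiguous in `layout`."""
    if layout == "bhsd":
        sb, sh, ss, sd = contiguous_strides((batch, heads, seqlen, head_dim))
    elif layout == "bshd":
        # Storage order is (batch, seq, head, dim); reorder to (b, h, s, d).
        sb, ss, sh, sd = contiguous_strides((batch, seqlen, heads, head_dim))
    else:
        raise ValueError(f"unsupported layout: {layout}")
    return sb, sh, ss, sd


def equal_length_cu_seqlens(batch, seqlen):
    """Cumulative sequence offsets for a thd (varlen) tensor in which every
    batch entry has the same context length (assumed representation)."""
    return [i * seqlen for i in range(batch + 1)]


if __name__ == "__main__":
    print(qkv_strides("bhsd", 2, 4, 8, 16))
    print(qkv_strides("bshd", 2, 4, 8, 16))
    print(equal_length_cu_seqlens(3, 5))
```

Passing strides rather than transposing the data lets one kernel serve both layouts without an extra copy; whether the PR does exactly this is not stated in the commit message.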

Commits on Jun 27, 2024

  1. Couple of FA optimizations

    Make the SM scale multiplication use a constexpr. Minor asm improvement.
    
    Change the acc scaling that accounts for the softmax division
    into a multiplication by the reciprocal. ~10% perf improvement.
    vgokhale committed Jun 27, 2024
    0d1c3e1
  2. Make linter happy

    vgokhale committed Jun 27, 2024
    db3beaf
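
The reciprocal change described in "Couple of FA optimizations" can be sketched in plain Python for a single query row. This is an illustration of the idea under simplifying assumptions, not the PR's Triton kernel: attention_row and reference_row are hypothetical names, and sm_scale is an ordinary argument here, whereas in the kernel it is a compile-time constant.

```python
import math


def attention_row(q, keys, values, sm_scale):
    """Online-softmax attention for one query row.

    The final normalisation computes 1 / l_i once and multiplies the
    accumulator by it, instead of dividing every element by l_i.
    """
    m_i = float("-inf")           # running maximum of the scaled scores
    l_i = 0.0                     # running softmax denominator
    acc = [0.0] * len(values[0])  # unnormalised output accumulator
    for k, v in zip(keys, values):
        s = sm_scale * sum(qi * ki for qi, ki in zip(q, k))
        m_new = max(m_i, s)
        alpha = math.exp(m_i - m_new)  # rescales old state; 0.0 on first step
        p = math.exp(s - m_new)
        l_i = l_i * alpha + p
        acc = [a * alpha + p * vj for a, vj in zip(acc, v)]
        m_i = m_new
    l_recip = 1.0 / l_i                 # one division per row
    return [a * l_recip for a in acc]   # one multiply per element


def reference_row(q, keys, values, sm_scale):
    """Straightforward softmax attention that divides each element."""
    scores = [sm_scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
    m = max(scores)
    weights = [math.exp(s - m) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


if __name__ == "__main__":
    q = [1.0, 0.5]
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
    print(attention_row(q, keys, values, 0.5))
    print(reference_row(q, keys, values, 0.5))
```

Both functions agree up to floating-point rounding. The ~10% figure comes from the commit message and applies to the actual kernel, not to this sketch.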

Commits on Jul 8, 2024

  1. 5e2ffc6
  2. Linter

    vgokhale committed Jul 8, 2024
    8dd5404

Commits on Jul 18, 2024

  1. 4787d45
  2. Fix bug

    vgokhale committed Jul 18, 2024
    bfbc3ef

Commits on Jul 19, 2024

  1. 3557466