Barrier variant + memcpy_async benchmarks + Matmul with barrier #497

Open
wants to merge 110 commits into
base: main

louisfd (Member) commented Feb 28, 2025

Barrier variant

  • CubeCoop: the memcpy_async call is cooperative, i.e. every unit issues it and it should not depend on UNIT_POS
  • CubeManual: the memcpy_async call is per unit and can be dispatched to any unit
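
To make the difference between the two variants concrete, here is a minimal CPU-side sketch in plain Rust. It does not use the CubeCL API: `BarrierLevel`, `copy_for_unit` and `simulate` are hypothetical names, units are run one after another, and the strided split in the cooperative case is only an assumption about how the work might be shared.

```rust
/// Hypothetical stand-in for the two barrier variants (not the CubeCL types).
#[derive(Clone, Copy, Debug, PartialEq)]
enum BarrierLevel {
    /// Every unit issues the same copy; the call site must not branch on the unit index.
    CubeCoop,
    /// The copy is issued by exactly one unit, chosen by the caller.
    CubeManual { elected_unit: usize },
}

/// Simulates what one unit contributes to copying `src` into `dst`.
/// Returns the number of elements this unit wrote.
fn copy_for_unit(
    level: BarrierLevel,
    unit_pos: usize,
    num_units: usize,
    src: &[f32],
    dst: &mut [f32],
) -> usize {
    assert!(num_units > 0);
    assert_eq!(src.len(), dst.len());
    match level {
        BarrierLevel::CubeCoop => {
            // Assumption: the cooperative copy splits elements across units with a stride.
            let mut written = 0;
            let mut i = unit_pos;
            while i < src.len() {
                dst[i] = src[i];
                written += 1;
                i += num_units;
            }
            written
        }
        BarrierLevel::CubeManual { elected_unit } => {
            // Only the elected unit performs the copy; the others do nothing.
            if unit_pos == elected_unit {
                dst.copy_from_slice(src);
                src.len()
            } else {
                0
            }
        }
    }
}

/// Runs all units sequentially and returns the destination plus per-unit write counts.
fn simulate(level: BarrierLevel, num_units: usize, src: &[f32]) -> (Vec<f32>, Vec<usize>) {
    let mut dst = vec![0.0; src.len()];
    let mut counts = Vec::with_capacity(num_units);
    for unit_pos in 0..num_units {
        counts.push(copy_for_unit(level, unit_pos, num_units, src, &mut dst));
    }
    (dst, counts)
}
```

In this model, both variants produce the same destination buffer; what changes is which units do the writing, which is the property the call site has to respect.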

memcpy_async benchmarks

  • Many variants of how to dispatch memcpy_async. On some architectures the async path can beat sync loading in a double-buffering setup, but otherwise async loading seems to carry a lot of overhead.
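
For readers unfamiliar with the double-buffering setup mentioned above, the schedule can be sketched in plain Rust as follows. This is a sequential CPU illustration only: `double_buffered_sum` and `load` are hypothetical names, the "compute" step is just a sum, and on a GPU the load of the next tile would be asynchronous and waited on through a barrier rather than completing immediately.

```rust
/// Copies `src` into the front of `dst` and returns how many elements were loaded.
/// On a GPU this would be the asynchronous copy.
fn load(src: &[f32], dst: &mut [f32]) -> usize {
    dst[..src.len()].copy_from_slice(src);
    src.len()
}

/// Sums `data` tile by tile using two staging buffers.
/// While tile `k` is consumed from one buffer, tile `k + 1` is loaded into the other.
fn double_buffered_sum(data: &[f32], tile: usize) -> f32 {
    assert!(tile > 0);
    let tiles: Vec<&[f32]> = data.chunks(tile).collect();
    if tiles.is_empty() {
        return 0.0;
    }

    let mut buffers = [vec![0.0f32; tile], vec![0.0f32; tile]];
    let mut lens = [0usize; 2];

    // Prologue: fill the first buffer before entering the loop.
    lens[0] = load(tiles[0], &mut buffers[0]);

    let mut total = 0.0;
    for k in 0..tiles.len() {
        let cur = k % 2;
        let next = 1 - cur;

        // Issue the load of the next tile into the other buffer.
        if k + 1 < tiles.len() {
            lens[next] = load(tiles[k + 1], &mut buffers[next]);
        }

        // Consume the current buffer (on a GPU: after waiting on its barrier).
        total += buffers[cur][..lens[cur]].iter().sum::<f32>();
    }
    total
}
```

The benefit on real hardware comes from the load and the compute step overlapping in time, which this sequential sketch cannot show; it only fixes the order in which buffers are filled and read.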

Matmul with barrier

  • Refactored matmul around a copy mechanism that works with pipelines as well as barriers for async loading. There's also a sync dummy one for testing correctness if you don't have CUDA, but it's not well coalesced, so it's very slow.
  • WindowCooperativeLoading: loads all the needed slices cooperatively
  • Other loading strategies using CubeManual are coming soon.
  • Async loading no longer supports bounds checks, because the way it was done was naive and slow. A better solution is coming soon.
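
As a rough illustration of what a shared copy mechanism could look like, here is a small sketch in plain Rust. It is not the code in this PR: `CopyMechanism`, `SyncCopy`, `CountingBarrierCopy` and `load_slices` are hypothetical names, and the barrier-style implementation copies eagerly and only counts pending requests to show where a wait would happen.

```rust
/// Hypothetical abstraction over how a slice reaches the staging buffer.
trait CopyMechanism {
    /// Requests a copy of `src` into `dst`.
    fn copy(&mut self, src: &[f32], dst: &mut [f32]);
    /// Blocks until every requested copy is visible.
    fn wait(&mut self);
}

/// Synchronous dummy: the copy is complete as soon as `copy` returns.
struct SyncCopy;

impl CopyMechanism for SyncCopy {
    fn copy(&mut self, src: &[f32], dst: &mut [f32]) {
        dst.copy_from_slice(src);
    }
    fn wait(&mut self) {}
}

/// Barrier-like stand-in: tracks outstanding copies until `wait` is called.
#[derive(Default)]
struct CountingBarrierCopy {
    pending: usize,
    waits: usize,
}

impl CopyMechanism for CountingBarrierCopy {
    fn copy(&mut self, src: &[f32], dst: &mut [f32]) {
        // A real async copy would only be issued here; this sketch copies eagerly.
        dst.copy_from_slice(src);
        self.pending += 1;
    }
    fn wait(&mut self) {
        self.pending = 0;
        self.waits += 1;
    }
}

/// Loads all needed slices back to back into `stage`, then waits once.
fn load_slices<C: CopyMechanism>(mech: &mut C, slices: &[&[f32]], stage: &mut [f32]) {
    let mut offset = 0;
    for &s in slices {
        mech.copy(s, &mut stage[offset..offset + s.len()]);
        offset += s.len();
    }
    mech.wait();
}
```

Because the loader only depends on the trait, the sync dummy and the barrier-style mechanism can be swapped without touching `load_slices`.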


#[cube]
impl<MP: MatmulPrecision, SMM, LL, RL> GlobalMatmul<MP>
for SimpleBarrierDummyMatmul<MP, SMM, LL, RL>
Member

If it's the same as the SimpleBarrierMatmul, I would simply make it generic on the copy mechanism instead of forking the first one.

Member Author

I actually tried making it generic in my upcoming follow-up PR, but it turned out to be a pain. In the end I deleted the dummy one; it's not that useful.

2 participants