[RFC] Multithread packing #7545

Open · wants to merge 7 commits into master from multithread_packing

Conversation

mcr229 (Contributor) commented Dec 2, 2024

Enable multithreaded packing routines in xnn_create. This allows us to multithread our packing routines at initialization, which can improve first-time model load performance.
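For illustration, here is a minimal sketch of the general idea, not the PR's actual code: packing is split over tiles of output channels and dispatched through pthreadpool, which XNNPACK already uses at inference time. The pack_context struct and hypothetical_pack_tile routine are assumptions for the example, not real XNNPACK symbols.

#include <stddef.h>
#include <pthreadpool.h>

struct pack_context {
  const void* weights;   // unpacked weights, nc x kc
  void* packed_weights;  // output buffer, tiled by nr output channels
  size_t nc;             // number of output channels
  size_t kc;             // number of input channels
  size_t nr;             // output-channel tile size of the GEMM microkernel
  size_t packed_stride;  // bytes per packed tile of nr channels
};

// Hypothetical per-tile packer: each call packs one independent tile of
// nr output channels, so tiles can be packed concurrently.
static void hypothetical_pack_tile(void* context, size_t tile_index) {
  struct pack_context* ctx = (struct pack_context*) context;
  const size_t n_start = tile_index * ctx->nr;
  // ... pack channels [n_start, min(n_start + nr, nc)) into
  // (char*) ctx->packed_weights + tile_index * ctx->packed_stride ...
  (void) n_start;
}

void pack_weights_multithreaded(struct pack_context* ctx,
                                pthreadpool_t threadpool) {
  const size_t num_tiles = (ctx->nc + ctx->nr - 1) / ctx->nr;  // round up
  // Runs hypothetical_pack_tile(ctx, i) for each i in [0, num_tiles);
  // a NULL threadpool runs everything on the calling thread.
  pthreadpool_parallelize_1d(threadpool, hypothetical_pack_tile, ctx,
                             num_tiles, /*flags=*/0);
}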

mcr229 (Contributor, Author) commented Dec 2, 2024

@alankelly @dsharlet @fbarchard Putting up an RFC for multithreaded packing at subgraph create. Please take a look at the commits labeled [MultiThreaded Packing][RFC].

mcr229 changed the title from Multithread packing to [RFC] Multithread packing on Dec 2, 2024
mcr229 (Contributor, Author) commented Dec 2, 2024

We see roughly a 4.5x performance win on first-time model load.

@@ -0,0 +1,9 @@

// Copyright 2024 Google LLC
Contributor commented:
😅

Comment on lines +27 to +28
size_t extra_bytes_bl,
size_t extra_bytes_n,
Contributor commented:

why do we need these from the API point of view?

src/packw.c Outdated
/*extra_bytes=*/context->extra_bytes_bl,
/*extra_bytes_n=*/context->extra_bytes_n,
/*params=*/context->params);

Collaborator commented:

rm empty line

const uint8_t* w0 = (const uint8_t*) weights;
const uint16_t* s0 = (const uint16_t*) scale;
size_t n = nc;
for (;n >= ${NR}; n -= ${NR}) {
Contributor commented:

indent 4


// KC/2 bytes is KC Nibbles
$for N in range(1, NR):
const uint8_t* w${N} = w${N-1} + (kc >> 1);
Contributor commented:

This kernel does not support odd KC? How do we ensure convolutions don't call this when kc is odd?

For the x4-pack scalar kernel, it still supports any KC, inefficiently; then from avxvnni I check if KC is odd and call the scalar kernel to handle it.
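A minimal sketch of that dispatch pattern, with assumed kernel names for illustration (in the real code the selection may live inside the kernels themselves):

#include <stddef.h>
#include <stdint.h>

// Hypothetical packer signatures for the example.
void pack_x4_avxvnni(size_t nc, size_t kc, const uint8_t* w, uint8_t* out);
void pack_x4_scalar(size_t nc, size_t kc, const uint8_t* w, uint8_t* out);

void pack_x4(size_t nc, size_t kc, const uint8_t* w, uint8_t* out) {
  if (kc & 1) {
    // Odd kc: fall back to the scalar kernel, which handles any kc,
    // just less efficiently.
    pack_x4_scalar(nc, kc, w, out);
  } else {
    pack_x4_avxvnni(nc, kc, w, out);
  }
}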

mcr229 (Contributor, Author) replied:

This kernel isn't used for convolutions, only for fully connected layers.

A constraint applied to blockwise quantization is that kc must be divisible by the block size, and the block size is a multiple of 32; since that makes kc always even, the kc >> 1 nibble stride is exact.

Contributor replied:

Code pointer? Let's make sure these are not just asserts and that we fail with a proper error message, if we're not doing that already.
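As a sketch of what such a check could look like at operator creation, using XNNPACK's xnn_log_error and xnn_status conventions (the surrounding context, variable names, and message text are assumptions):

if (block_size % 32 != 0) {
  xnn_log_error(
    "failed to create fully connected operator: block size (%zu) must be a multiple of 32",
    block_size);
  return xnn_status_invalid_parameter;
}
if (kc % block_size != 0) {
  xnn_log_error(
    "failed to create fully connected operator: input channels (%zu) must be divisible by block size (%zu)",
    kc, block_size);
  return xnn_status_invalid_parameter;
}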

mcr229 force-pushed the multithread_packing branch from bbb8982 to eb8384b on December 9, 2024 at 17:44
mcr229 force-pushed the multithread_packing branch from eb8384b to 30bd2ad on December 9, 2024 at 17:51
fbarchard (Contributor) left a comment:

I'd prefer the scalar kernel be in its own PR, but the kernel part of this looks OK.
