Shortfin generation with full batches finishes 10x faster #656

Open · renxida opened this issue Dec 6, 2024 · 1 comment

renxida (Contributor) commented Dec 6, 2024

Very strange behavior: when a batch is only partially filled, generation runs about 10x slower than when the batch is completely full.

I suspect it's because all the filler slots in the batch are using page 0 of the cache, causing contention / clobbering.

To confirm this, test by allocating a unique page for each filler slot.

To fix it, either allocate dedicated pages for the filler slots in the batch, or treat page_index == 0 as a special case in the prefill / decode functions. A sketch of the first option follows.
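
For illustration, here is a minimal sketch of the first option. The names below (PagePool, pad_page_ids) are invented for this example and are not actual shortfin APIs; a real fix would go through whatever page-allocation path prefill / decode already use.

class PagePool:
    """Toy free-list of KV-cache page indices; page 0 is reserved."""

    def __init__(self, num_pages):
        self.free = list(range(1, num_pages))

    def acquire(self):
        return self.free.pop()

    def release(self, page_id):
        self.free.append(page_id)


def pad_page_ids(page_ids, batch_size, pool):
    # Suspected current behavior: filler slots are padded with page 0, so every
    # filler write lands on the same cache page and they contend / clobber:
    #   padded = page_ids + [0] * (batch_size - len(page_ids))
    # Sketch of the fix: give each filler slot its own scratch page so no two
    # slots alias the same memory.
    padded = list(page_ids)
    while len(padded) < batch_size:
        padded.append(pool.acquire())
    return padded


if __name__ == "__main__":
    pool = PagePool(num_pages=16)
    print(pad_page_ids([3, 7, 9], batch_size=4, pool=pool))  # e.g. [3, 7, 9, 15]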

See logs below:

xidaren2@sharkmi300x-3:~/shark-ai$ cat concurr_test.py
import concurrent.futures
import requests
import sys

url = "http://localhost:8003/generate"
payload = {
    "text": "1 2 3 4 5 6 7 ",
    "sampling_params": {"max_completion_tokens": 50},
}

def fetch(url, payload):
    return requests.post(url, json=payload)

if __name__ == "__main__":
    # Get number of workers from command line, default to 2 if not provided
    num_workers = int(sys.argv[1]) if len(sys.argv) > 1 else 2

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(fetch, url, payload) for _ in range(num_workers)]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            print(result.status_code, result.text)
xidaren2@sharkmi300x-3:~/shark-ai$ time python concurr_test.py 4
200 data: 8 9 10 11 12 13 14 15 16! 17 18 19 20 21 22 23 ! 24 25 26 27 28 29 30 !


200 data: 8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34


200 data: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 26 27 28 29 30 31 32 33 34


200 data: 8 9 10 11 12 13 14 15 17 18 19 20 21 22 24 25 24 25 26 27 28 29 32 33 34 35



real    0m2.040s
user    0m0.055s
sys     0m0.031s
xidaren2@sharkmi300x-3:~/shark-ai$ time python concurr_test.py 3
200 data: 8 10 11 12 13 14 15 16 18! 18 19 20 21 22 23 24 ! 25 26 27 28 29 30 31 !


200 data: 8 9 10 11 12 13 14 15 16 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35


200 data: 8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 32 33 34 35



real    0m25.215s
user    0m0.076s
sys     0m0.012s
xidaren2@sharkmi300x-3:~/shark-ai$ time python concurr_test.py 2
200 data: 8 9 10 11 12 13 14 15 16!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


200 data: 8 9 10 11 12 13 14 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35



real    0m25.124s
user    0m0.073s
sys     0m0.013s
xidaren2@sharkmi300x-3:~/shark-ai$
stellaraccident (Contributor) commented:

A thing to try: assign a unique page to each slot, even if not used. What we really want is a masked scatter, but padding so that slots do not alias the same empty page is much easier on the memory system.
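
To make the tradeoff concrete, here is a toy NumPy model (not shortfin code) of the two write strategies for a half-empty batch of 4 slots:

import numpy as np

num_pages, batch_size = 8, 4
cache = np.zeros((num_pages, 1), dtype=np.float32)   # rows stand in for KV-cache pages
values = np.arange(1, batch_size + 1, dtype=np.float32).reshape(-1, 1)
active = np.array([True, True, False, False])        # only 2 of the 4 slots hold real requests

# Option A: masked scatter -- write only the active slots.
page_ids = np.array([3, 5, 0, 0])
cache[page_ids[active]] = values[active]

# Option B (suggested above): write every slot, but pad the unused slots with
# unique throwaway pages so no two slots ever alias the same page.
page_ids = np.array([3, 5, 6, 7])                    # 6 and 7 are dedicated scratch pages
cache[page_ids] = values

Option B writes a little garbage into scratch pages, but every slot touches a distinct page, which is what makes it easier on the memory system than having every filler slot alias page 0.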
