Shortfin generation with full batches finishes 10x faster #656

Open · renxida opened this issue Dec 6, 2024 · 1 comment

renxida (Contributor) commented Dec 6, 2024

Very strange behavior: when a batch is only partially filled, generation runs about 10x slower than when the batch is completely full.

I suspect it's because all the filler slots in the batch are using page 0 of the cache, causing contention / clobbering.

To confirm this, test by allocating a unique page for each filler slot.

To fix it, either allocate dedicated pages for the filler slots in the batch, or treat page_index == 0 as a special case in the prefill / decode functions. A sketch of the first option follows.
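
For illustration, here is a minimal sketch of the first option. The names below (PagePool, pad_page_ids) are invented for this example and are not actual shortfin APIs; a real fix would go through whatever page-allocation path prefill / decode already use.

class PagePool:
    """Toy free-list of KV-cache page indices; page 0 is reserved."""

    def __init__(self, num_pages):
        self.free = list(range(1, num_pages))

    def acquire(self):
        return self.free.pop()

    def release(self, page_id):
        self.free.append(page_id)


def pad_page_ids(page_ids, batch_size, pool):
    # Suspected current behavior: filler slots are padded with page 0, so every
    # filler write lands on the same cache page and they contend / clobber:
    #   padded = page_ids + [0] * (batch_size - len(page_ids))
    # Sketch of the fix: give each filler slot its own scratch page so no two
    # slots alias the same memory.
    padded = list(page_ids)
    while len(padded) < batch_size:
        padded.append(pool.acquire())
    return padded


if __name__ == "__main__":
    pool = PagePool(num_pages=16)
    print(pad_page_ids([3, 7, 9], batch_size=4, pool=pool))  # e.g. [3, 7, 9, 15]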

See logs below:

xidaren2@sharkmi300x-3:~/shark-ai$ cat concurr_test.py
import concurrent.futures
import requests
import sys

url = "http://localhost:8003/generate"
payload = {
    "text": "1 2 3 4 5 6 7 ",
    "sampling_params": {"max_completion_tokens": 50},
}

def fetch(url, payload):
    return requests.post(url, json=payload)

if __name__ == "__main__":
    # Get number of workers from command line, default to 2 if not provided
    num_workers = int(sys.argv[1]) if len(sys.argv) > 1 else 2

    with concurrent.futures.ThreadPoolExecutor(max_workers=num_workers) as executor:
        futures = [executor.submit(fetch, url, payload) for _ in range(num_workers)]
        for future in concurrent.futures.as_completed(futures):
            result = future.result()
            print(result.status_code, result.text)
xidaren2@sharkmi300x-3:~/shark-ai$ time python concurr_test.py 4
200 data: 8 9 10 11 12 13 14 15 16! 17 18 19 20 21 22 23 ! 24 25 26 27 28 29 30 !


200 data: 8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34


200 data: 8 9 10 11 12 13 14 15 16 17 18 19 20 21 23 24 25 26 27 28 29 30 31 32 33 34


200 data: 8 9 10 11 12 13 14 15 17 18 19 20 21 22 24 25 24 25 26 27 28 29 32 33 34 35



real    0m2.040s
user    0m0.055s
sys     0m0.031s
xidaren2@sharkmi300x-3:~/shark-ai$ time python concurr_test.py 3
200 data: 8 10 11 12 13 14 15 16 18! 18 19 20 21 22 23 24 ! 25 26 27 28 29 30 31 !


200 data: 8 9 10 11 12 13 14 15 16 18 19 20 21 22 23 25 26 27 28 29 30 31 32 33 34 35


200 data: 8 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 28 29 30 31 32 33 34 35



real    0m25.215s
user    0m0.076s
sys     0m0.012s
xidaren2@sharkmi300x-3:~/shark-ai$ time python concurr_test.py 2
200 data: 8 9 10 11 12 13 14 15 16!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!


200 data: 8 9 10 11 12 13 14 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35



real    0m25.124s
user    0m0.073s
sys     0m0.013s
xidaren2@sharkmi300x-3:~/shark-ai$
stellaraccident (Contributor) commented:

A thing to try: assign a unique page to each slot, even if not used. What we really want is a masked scatter, but padding so that slots do not alias the same empty page is much easier on the memory system.
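
To make the tradeoff concrete, here is a toy NumPy model (not shortfin code) of the two write strategies for a half-empty batch of 4 slots:

import numpy as np

num_pages, batch_size = 8, 4
cache = np.zeros((num_pages, 1), dtype=np.float32)   # rows stand in for KV-cache pages
values = np.arange(1, batch_size + 1, dtype=np.float32).reshape(-1, 1)
active = np.array([True, True, False, False])        # only 2 of the 4 slots hold real requests

# Option A: masked scatter -- write only the active slots.
page_ids = np.array([3, 5, 0, 0])
cache[page_ids[active]] = values[active]

# Option B (suggested above): write every slot, but pad the unused slots with
# unique throwaway pages so no two slots ever alias the same page.
page_ids = np.array([3, 5, 6, 7])                    # 6 and 7 are dedicated scratch pages
cache[page_ids] = values

Option B writes a little garbage into scratch pages, but every slot touches a distinct page, which is what makes it easier on the memory system than having every filler slot alias page 0.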
