Update TP example
turboderp committed Aug 22, 2024
1 parent 4117daa commit 555c360
Showing 1 changed file with 14 additions and 1 deletion.
15 changes: 14 additions & 1 deletion examples/inference_tp.py
@@ -10,9 +10,22 @@
 config.arch_compat_overrides()
 config.no_graphs = True
 model = ExLlamaV2(config)
-model.load_tp(progress = True)
 
+# Load the model in tensor-parallel mode. With no gpu_split specified, the model will attempt to split across
+# all visible devices according to the currently available VRAM on each. expect_cache_tokens is necessary for
+# balancing the split in case the GPUs are of uneven sizes, or if the number of GPUs doesn't evenly divide
+# the number of KV heads in the model.
+#
+# The cache type for a TP model is always ExLlamaV2Cache_TP and should be allocated after the model. To use a
+# quantized cache, add a `base = ExLlamaV2Cache_Q6` etc. argument to the cache constructor. It's advisable
+# to also add `expect_cache_base = ExLlamaV2Cache_Q6` to load_tp() so the size can be correctly accounted
+# for when splitting the model.
+
+model.load_tp(progress = True, expect_cache_tokens = 16384)
+cache = ExLlamaV2Cache_TP(model, max_seq_len = 16384)
+
+# After loading the model, all other functions should work the same
 
 print("Loading tokenizer...")
 tokenizer = ExLlamaV2Tokenizer(config)
 
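For reference, here is a minimal standalone sketch of the quantized-cache variant described in the new comments. It only uses the arguments the comments name (expect_cache_tokens, expect_cache_base, and the cache constructor's base argument); the import list and the placeholder model path are assumptions for illustration, not part of this commit.

from exllamav2 import (
    ExLlamaV2,
    ExLlamaV2Config,
    ExLlamaV2Cache_TP,
    ExLlamaV2Cache_Q6,   # assumed top-level export; Q4/Q8 variants would be used the same way
)

config = ExLlamaV2Config("/path/to/model")   # placeholder model directory
config.arch_compat_overrides()
config.no_graphs = True
model = ExLlamaV2(config)

# Declare the expected cache size and quantization so the per-GPU split can account for them
model.load_tp(
    progress = True,
    expect_cache_tokens = 16384,
    expect_cache_base = ExLlamaV2Cache_Q6,
)

# Allocate the TP cache after the model, backed by the quantized cache type via `base`
cache = ExLlamaV2Cache_TP(model, max_seq_len = 16384, base = ExLlamaV2Cache_Q6)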
