I've made a partial fix for device mapping issues on Phi3. Previously, device mapping didn't work across various models, including Phi2, Phi3, Mistral, and Llama.
The fix moves tensors that are used together in an operation onto the same device. I've chosen the device where the cache lives, on the assumption that moving the cache itself would be slower. This change allows Phi3 to be loaded across devices, and I've tested it with 2 GPUs and with 1 GPU + 1 CPU.
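As a rough illustration of the idea only (this is not the code in this PR), the approach can be sketched with a toy tensor type that tracks nothing but a device label. `Tensor`, `to_device`, `align_to_cache`, and `attend` are hypothetical names made up for this sketch, and the device strings are placeholders.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tensor:
    """Toy stand-in for a real tensor: it only tracks a name and a device."""
    name: str
    device: str

    def to_device(self, device: str) -> "Tensor":
        # A real framework would copy data here; skip the copy if already there.
        if self.device == device:
            return self
        return Tensor(self.name, device)


def align_to_cache(cache: Tensor, *operands: Tensor) -> list[Tensor]:
    """Move every operand to the cache's device, leaving the cache in place."""
    return [t.to_device(cache.device) for t in operands]


def attend(q: Tensor, k_cache: Tensor) -> Tensor:
    # Mimic a framework that rejects ops on tensors living on different devices.
    if q.device != k_cache.device:
        raise RuntimeError(f"device mismatch: {q.device} vs {k_cache.device}")
    return Tensor(f"attn({q.name},{k_cache.name})", k_cache.device)


# The cache sits on the second device; the query arrives from the first.
k_cache = Tensor("k_cache", "cuda:1")
q = Tensor("q", "cuda:0")

(q_aligned,) = align_to_cache(k_cache, q)
out = attend(q_aligned, k_cache)  # succeeds: both operands share the cache's device
```

The point of the sketch is the direction of the move: the (presumably larger) cache stays where it is and the smaller operands travel to it.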
The fix is only partial: Phi3 can now be loaded without issues, but the other models still hit a CUDA_ERROR_ILLEGAL_ADDRESS error that prevents them from loading.
The CUDA_ERROR_ILLEGAL_ADDRESS error occurs in different scenarios for each model. For example, in the Mistral model, calling contiguous() on a tensor causes this error, and moving a tensor across devices also triggers it. I found it unusual that Phi3 is the only model that works with this fix, and certain operations like contiguous() work fine on Phi3 but not on other models.
However, one thing is still broken: sending a second request with the same prompt results in gibberish output. Notably, this behavior is currently the same as when running with --no-paged-attn (using only 1 device), so the issue is not introduced by this fix. I suspect it's a bug in the cache manager; paged attention (PA), on the other hand, does not have this issue.
I appreciate any feedback you can provide. I look forward to your review!