Notes on vLLM memory issues
matthewcoole committed Sep 27, 2024
1 parent 26e1051 commit f0a16ed
Showing 2 changed files with 13 additions and 0 deletions.
4 changes: 4 additions & 0 deletions README.md
@@ -7,3 +7,7 @@ This repository is set up to work with [DVC](https://dvc.org/) backed by a [JASMI
Notes on the use of Data Version Control and Continuous Machine Learning:
- [DVC](dvc.md)
- [CML](cml.md)

## vLLM
Notes on running models with vLLM:
- [vLLM](vllm.md)
9 changes: 9 additions & 0 deletions vllm.md
@@ -0,0 +1,9 @@
# Out of memory issues
[vLLM](https://docs.vllm.ai/en/latest/) can download models directly from HuggingFace repositories. Unfortunately, the library pre-allocates vRAM on the GPU for the model it is downloading, so if the model is too large for the available vRAM you will receive an out-of-memory error (before the model has even finished downloading).
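
The failure mode looks like the following; a minimal sketch (the unquantized model name is an assumption, chosen to match the quantized variant used below):

```python
from vllm import LLM

# Assumed example: an unquantized ~12B model on a GPU with limited vRAM.
# vLLM reserves GPU memory up front, so this can raise a CUDA out-of-memory
# error before the weights have even finished downloading.
llm = LLM(model="mistralai/Mistral-Nemo-Instruct-2407")
```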

The easiest way to avoid this issue is to download models that have been pre-quantized and are therefore more likely to fit in the available vRAM. [UnslothAI](https://docs.unsloth.ai/get-started/all-our-models) provides many popular models in pre-quantized form. These are easy to download and use, but you must specify the quantization and load format when doing so (here, [`bitsandbytes`](https://github.com/bitsandbytes-foundation/bitsandbytes)):

```python
from vllm import LLM

llm = LLM(model="unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit", quantization="bitsandbytes", load_format="bitsandbytes", max_model_len=4096)
```
> Note: vLLM will also reserve enough memory to hold the context for queries run on the model. If the model has a very large context window, this can easily trigger another out-of-memory error. Set `max_model_len` to a reasonably small value to avoid further memory issues.
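
For reference, here is a short end-to-end sketch of querying the capped-context model (the prompt and sampling values are illustrative assumptions, not from the notes above):

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="unsloth/Mistral-Nemo-Instruct-2407-bnb-4bit",
    quantization="bitsandbytes",
    load_format="bitsandbytes",
    max_model_len=4096,  # keeps the pre-allocated context memory small
)

# Keep prompt plus output within max_model_len to avoid context-related OOMs.
params = SamplingParams(temperature=0.8, max_tokens=256)
outputs = llm.generate(["Explain what vLLM does in one paragraph."], params)
print(outputs[0].outputs[0].text)
```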
