In this section, we will explore how to download a Large Language Model (LLM) from Hugging Face and use that model to run inference on different hardware platforms.
Model Used: mistralai/Mistral-7B-v0.1
We will do our setup and testing on AWS EC2 instances. Please click the link for each scenario below to navigate to the details of the setup, the code, and the test run.
Here are the different scenarios:
In this tutorial, we will:
- Create a GPU instance in AWS (a p3.8xlarge instance that contains four NVIDIA Tesla V100 GPUs)
- Install a compatible version of the CUDA toolkit and verify that it is installed, visible, and ready to use
- Install a version of PyTorch that is compatible with that CUDA toolkit version
- Run inference on the GPU instance using the Mistral 7B model and the provided Jupyter notebook (a minimal sketch of this step follows the list)
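The exact steps live in the linked notebook for this scenario; the sketch below is only an illustration of what the final inference step can look like, assuming the transformers and accelerate packages are installed. The prompt text and generation settings are placeholders, and device_map="auto" is used here as one way to spread the fp16 weights across the four V100s.

```python
# Minimal inference sketch (not the notebook itself): verify the GPUs are
# visible to PyTorch, then load Mistral-7B-v0.1 from Hugging Face and generate.
# Assumes: pip install torch transformers accelerate
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

assert torch.cuda.is_available(), "CUDA toolkit / driver not visible to PyTorch"
print("Visible GPUs:", torch.cuda.device_count())  # expect 4 on p3.8xlarge

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.float16,  # fp16 weights (~14 GB) shard comfortably over 4 x 16 GB
    device_map="auto",          # let Accelerate place layers across the V100s
)

prompt = "Explain in one paragraph what GPU memory is used for during LLM inference."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=100)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```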
In the previous tutorial, we ran the inference on a P3.8xlarge GPU-based EC2 instance. That instance costs more than $12 per hour at on-demand pricing. It has four NVIDIA Tesla V100 GPUs, each with 16 GB of GPU memory, for a total of 64 GB of GPU memory.
The next smaller EC2 instance in the P3 family is the P3.2xlarge. This machine has one V100 GPU with 16 GB of memory, and we will run the inference on this P3.2xlarge EC2 instance.
The P3.2xlarge machine costs a little over $3 per hour at on-demand pricing, roughly one quarter of the price of the P3.8xlarge instance.
We will see how LLM inference performance for the Mistral 7B model holds up on this smaller, less expensive GPU machine.
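One reason a single 16 GB V100 is worth trying: a rough back-of-the-envelope estimate (an assumption based on the roughly 7.2 billion parameters in Mistral-7B-v0.1, not a measured figure) puts the 16-bit weights at about 13.5 GiB, which just fits, leaving only a small margin for activations and the KV cache.

```python
# Rough weight-memory estimate for Mistral 7B at different precisions.
# The parameter count is approximate; actual memory use also includes
# activations and the KV cache, which grow with batch size and sequence length.
params = 7.24e9  # approximate parameter count of Mistral-7B-v0.1

for precision, bytes_per_param in [("fp32", 4), ("fp16/bf16", 2), ("int8", 1), ("int4", 0.5)]:
    gib = params * bytes_per_param / 1024**3
    print(f"{precision:>9}: ~{gib:.1f} GiB of weights")

# fp16/bf16 comes out to roughly 13.5 GiB, close to the 16 GB on one V100.
```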
In this tutorial, we will:
- Create a GPU instance in AWS (a p3.2xlarge instance that contains one NVIDIA Tesla V100 GPU)
- Install a compatible version of the CUDA toolkit and verify that it is installed, visible, and ready to use
- Install a version of PyTorch that is compatible with that CUDA toolkit version
- Run inference on the GPU instance using the Mistral 7B model and the provided Jupyter notebook (a single-GPU loading sketch follows the list)
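Again, the details are in the linked notebook; the sketch below only illustrates how the single-GPU case differs from the four-GPU run, assuming the same transformers setup. Loading the fp16 weights directly onto the one V100 is the approach assumed here, and the prompt and token limit are placeholders.

```python
# Single-GPU sketch for p3.2xlarge (one 16 GB V100), an illustration only.
# fp16 weights take roughly 13.5 GiB, so headroom for the KV cache is tight;
# a short prompt and a small max_new_tokens are assumed here.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype=torch.float16).to("cuda:0")
model.eval()

prompt = "What is the difference between a p3.2xlarge and a p3.8xlarge instance?"
inputs = tokenizer(prompt, return_tensors="pt").to("cuda:0")

with torch.no_grad():
    output_ids = model.generate(**inputs, max_new_tokens=64)

print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
print(f"Peak GPU memory used: {torch.cuda.max_memory_allocated(0) / 1024**3:.1f} GiB")
```

Comparing the generation latency and the peak memory reported here against the four-GPU run is essentially what the two scenarios in this section are about.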