From 8c9f877171bdb10e14f10e0bd5f6aa048efe055c Mon Sep 17 00:00:00 2001 From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com> Date: Thu, 20 Jun 2024 18:43:23 +0800 Subject: [PATCH] Update part of Quickstart guide in mddocs (1/2) * Quickstart index.rst -> index.md * Update for Linux Install Quickstart * Update md docs for Windows Install QuickStart * Small fix * Add blank lines * Update mddocs for llama cpp quickstart * Update mddocs for llama3 llama-cpp and ollama quickstart * Update mddocs for ollama quickstart * Update mddocs for openwebui quickstart * Update mddocs for privateGPT quickstart * Update mddocs for vllm quickstart * Small fix * Update mddocs for text-generation-webui quickstart * Update for video links --- docs/mddocs/Quickstart/index.md | 26 ++ docs/mddocs/Quickstart/index.rst | 33 -- docs/mddocs/Quickstart/install_linux_gpu.md | 131 +++--- docs/mddocs/Quickstart/install_windows_gpu.md | 385 ++++++++---------- .../llama3_llamacpp_ollama_quickstart.md | 193 ++++----- .../mddocs/Quickstart/llama_cpp_quickstart.md | 171 ++++---- docs/mddocs/Quickstart/ollama_quickstart.md | 203 ++++----- .../open_webui_with_ollama_quickstart.md | 134 +++--- .../Quickstart/privateGPT_quickstart.md | 48 +-- docs/mddocs/Quickstart/vLLM_quickstart.md | 48 +-- docs/mddocs/Quickstart/webui_quickstart.md | 59 +-- 11 files changed, 607 insertions(+), 824 deletions(-) create mode 100644 docs/mddocs/Quickstart/index.md delete mode 100644 docs/mddocs/Quickstart/index.rst diff --git a/docs/mddocs/Quickstart/index.md b/docs/mddocs/Quickstart/index.md new file mode 100644 index 00000000000..efbaa868e31 --- /dev/null +++ b/docs/mddocs/Quickstart/index.md @@ -0,0 +1,26 @@ +# IPEX-LLM Quickstart + +> [!NOTE] +> We are adding more Quickstart guide. + +This section includes efficient guide to show you how to: + +- [`bigdl-llm` Migration Guide](./bigdl_llm_migration.md) +- [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md) +- [Install IPEX-LLM on Windows with Intel GPU](./install_windows_gpu.md) +- [Install IPEX-LLM in Docker on Windows with Intel GPU](./docker_windows_gpu.md) +- [Run PyTorch Inference on Intel GPU using Docker (on Linux or WSL)](./docker_benchmark_quickstart.md) +- [Run Performance Benchmarking with IPEX-LLM](./benchmark_quickstart.md) +- [Run Local RAG using Langchain-Chatchat on Intel GPU](./chatchat_quickstart.md) +- [Run Text Generation WebUI on Intel GPU](./webui_quickstart.md) +- [Run Open WebUI on Intel GPU](./open_webui_with_ollama_quickstart.md) +- [Run PrivateGPT with IPEX-LLM on Intel GPU](./privateGPT_quickstart.md) +- [Run Coding Copilot (Continue) in VSCode with Intel GPU](./continue_quickstart.md) +- [Run Dify on Intel GPU](./dify_quickstart.md) +- [Run llama.cpp with IPEX-LLM on Intel GPU](./llama_cpp_quickstart.md) +- [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.md) +- [Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM](./llama3_llamacpp_ollama_quickstart.md) +- [Run IPEX-LLM Serving with FastChat](./fastchat_quickstart.md) +- [Run IPEX-LLM Serving with vLLM on Intel GPU](./vLLM_quickstart.md) +- [Finetune LLM with Axolotl on Intel GPU](./axolotl_quickstart.md) +- [Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi](./deepspeed_autotp_fastapi_quickstart.md) diff --git a/docs/mddocs/Quickstart/index.rst b/docs/mddocs/Quickstart/index.rst deleted file mode 100644 index 2e82acde52a..00000000000 --- a/docs/mddocs/Quickstart/index.rst +++ /dev/null @@ -1,33 +0,0 @@ -IPEX-LLM Quickstart 
-================================ - -.. note:: - - We are adding more Quickstart guide. - -This section includes efficient guide to show you how to: - - -* |bigdl_llm_migration_guide|_ -* `Install IPEX-LLM on Linux with Intel GPU <./install_linux_gpu.html>`_ -* `Install IPEX-LLM on Windows with Intel GPU <./install_windows_gpu.html>`_ -* `Install IPEX-LLM in Docker on Windows with Intel GPU <./docker_windows_gpu.html>`_ -* `Run PyTorch Inference on Intel GPU using Docker (on Linux or WSL) <./docker_benchmark_quickstart.html>`_ -* `Run Performance Benchmarking with IPEX-LLM <./benchmark_quickstart.html>`_ -* `Run Local RAG using Langchain-Chatchat on Intel GPU <./chatchat_quickstart.html>`_ -* `Run Text Generation WebUI on Intel GPU <./webui_quickstart.html>`_ -* `Run Open WebUI on Intel GPU <./open_webui_with_ollama_quickstart.html>`_ -* `Run PrivateGPT with IPEX-LLM on Intel GPU <./privateGPT_quickstart.html>`_ -* `Run Coding Copilot (Continue) in VSCode with Intel GPU <./continue_quickstart.html>`_ -* `Run Dify on Intel GPU <./dify_quickstart.html>`_ -* `Run llama.cpp with IPEX-LLM on Intel GPU <./llama_cpp_quickstart.html>`_ -* `Run Ollama with IPEX-LLM on Intel GPU <./ollama_quickstart.html>`_ -* `Run Llama 3 on Intel GPU using llama.cpp and ollama with IPEX-LLM <./llama3_llamacpp_ollama_quickstart.html>`_ -* `Run IPEX-LLM Serving with FastChat <./fastchat_quickstart.html>`_ -* `Run IPEX-LLM Serving with vLLM on Intel GPU <./vLLM_quickstart.html>`_ -* `Finetune LLM with Axolotl on Intel GPU <./axolotl_quickstart.html>`_ -* `Run IPEX-LLM serving on Multiple Intel GPUs using DeepSpeed AutoTP and FastApi <./deepspeed_autotp_fastapi_quickstart.html>`_ - - -.. |bigdl_llm_migration_guide| replace:: ``bigdl-llm`` Migration Guide -.. _bigdl_llm_migration_guide: bigdl_llm_migration.html diff --git a/docs/mddocs/Quickstart/install_linux_gpu.md b/docs/mddocs/Quickstart/install_linux_gpu.md index 47d8f4a3eeb..d4442b0babb 100644 --- a/docs/mddocs/Quickstart/install_linux_gpu.md +++ b/docs/mddocs/Quickstart/install_linux_gpu.md @@ -2,7 +2,7 @@ This guide demonstrates how to install IPEX-LLM on Linux with Intel GPUs. It applies to Intel Data Center GPU Flex Series and Max Series, as well as Intel Arc Series GPU. -IPEX-LLM currently supports the Ubuntu 20.04 operating system and later, and supports PyTorch 2.0 and PyTorch 2.1 on Linux. This page demonstrates IPEX-LLM with PyTorch 2.1. Check the [Installation](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#linux) page for more details. +IPEX-LLM currently supports the Ubuntu 20.04 operating system and later, and supports PyTorch 2.0 and PyTorch 2.1 on Linux. This page demonstrates IPEX-LLM with PyTorch 2.1. Check the [Installation](../Overview/install_gpu.md#linux) page for more details. ## Install Prerequisites @@ -98,7 +98,7 @@ IPEX-LLM currently supports the Ubuntu 20.04 operating system and later, and sup For Intel Core™ Ultra integrated GPU, please make sure level_zero version >= 1.3.28717. The level_zero version can be checked with `sycl-ls`, and verison will be tagged behind `[ext_oneapi_level_zero:gpu]`. 
Here are the sample output of `sycl-ls`: -``` +```bash [opencl:acc:0] Intel(R) FPGA Emulation Platform for OpenCL(TM), Intel(R) FPGA Emulation Device OpenCL 1.2 [2023.16.12.0.12_195853.xmain-hotfix] [opencl:cpu:1] Intel(R) OpenCL, Intel(R) Core(TM) Ultra 5 125H OpenCL 3.0 (Build 0) [2023.16.12.0.12_195853.xmain-hotfix] [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) Graphics OpenCL 3.0 NEO [24.09.28717.12] @@ -118,7 +118,7 @@ sudo dpkg -i *.deb ``` ### Install oneAPI - ``` + ```bash wget -O- https://apt.repos.intel.com/intel-gpg-keys/GPG-PUB-KEY-INTEL-SW-PRODUCTS.PUB | gpg --dearmor | sudo tee /usr/share/keyrings/oneapi-archive-keyring.gpg > /dev/null echo "deb [signed-by=/usr/share/keyrings/oneapi-archive-keyring.gpg] https://apt.repos.intel.com/oneapi all main" | sudo tee /etc/apt/sources.list.d/oneAPI.list @@ -163,43 +163,38 @@ Download and install the Miniforge as follows if you don't have conda installed You can use `conda --version` to verify you conda installation. After installation, create a new python environment `llm`: -```cmd +```bash conda create -n llm python=3.11 ``` Activate the newly created environment `llm`: -```cmd +```bash conda activate llm ``` ## Install `ipex-llm` -With the `llm` environment active, use `pip` to install `ipex-llm` for GPU. -Choose either US or CN website for `extra-index-url`: - -```eval_rst -.. tabs:: - .. tab:: US - - .. code-block:: cmd +With the `llm` environment active, use `pip` to install `ipex-llm` for GPU. Choose either US or CN website for `extra-index-url`: - pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +- For **US**: - .. tab:: CN + ```bash + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + ``` - .. code-block:: cmd +- For **CN**: - pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ -``` + ```bash + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ + ``` -```eval_rst -.. note:: +> [!NOTE] +> If you encounter network issues while installing IPEX, refer to [this guide](../Overview/install_gpu.md#install-ipex-llm-from-wheel-1) for troubleshooting advice. - If you encounter network issues while installing IPEX, refer to `this guide `_ for troubleshooting advice. -``` ## Verify Installation -* You can verify if `ipex-llm` is successfully installed by simply importing a few classes from the library. For example, execute the following import command in the terminal: +- You can verify if `ipex-llm` is successfully installed by simply importing a few classes from the library. For example, execute the following import command in the terminal: + ```bash source /opt/intel/oneapi/setvars.sh @@ -210,61 +205,59 @@ Choose either US or CN website for `extra-index-url`: ## Runtime Configurations -To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example. +To use GPU acceleration on Linux, several environment variables are required or recommended before running a GPU example. Choose corresponding configurations based on your GPU device: -```eval_rst -.. tabs:: - .. 
tab:: Intel Arc™ A-Series and Intel Data Center GPU Flex +- For **Intel Arc™ A-Series and Intel Data Center GPU Flex**: - For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend: + For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series, we recommend: - .. code-block:: bash - - # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. - # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. - source /opt/intel/oneapi/setvars.sh - - # Recommended Environment Variables for optimal performance - export USE_XETLA=OFF - export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - export SYCL_CACHE_PERSISTENT=1 - - .. tab:: Intel Data Center GPU Max + ```bash + # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. + # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. + source /opt/intel/oneapi/setvars.sh - For Intel Data Center GPU Max Series, we recommend: + # Recommended Environment Variables for optimal performance + export USE_XETLA=OFF + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + export SYCL_CACHE_PERSISTENT=1 + ``` - .. code-block:: bash +- For **Intel Data Center GPU Max**: - # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. - # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. - source /opt/intel/oneapi/setvars.sh + For Intel Data Center GPU Max Series, we recommend: - # Recommended Environment Variables for optimal performance - export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so - export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - export SYCL_CACHE_PERSISTENT=1 - export ENABLE_SDP_FUSION=1 + ```bash + # Configure oneAPI environment variables. Required step for APT or offline installed oneAPI. + # Skip this step for PIP-installed oneAPI since the environment has already been configured in LD_LIBRARY_PATH. + source /opt/intel/oneapi/setvars.sh - Please note that ``libtcmalloc.so`` can be installed by ``conda install -c conda-forge -y gperftools=2.10`` + # Recommended Environment Variables for optimal performance + export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so + export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 + export SYCL_CACHE_PERSISTENT=1 + export ENABLE_SDP_FUSION=1 + ``` -``` + Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10` - ```eval_rst - .. seealso:: +> [!NOTE] +> Please refer to [this guide](../Overview/install_gpu.md#runtime-configuration-1) for more details regarding runtime configuration. - Please refer to `this guide <../Overview/install_gpu.html#id5>`_ for more details regarding runtime configuration. - ``` ## A Quick Example Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface.co/microsoft/phi-1_5) model, a 1.3 billion parameter LLM for this demostration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?". -* Step 1: Activate the Python environment `llm` you previously created: +- Step 1: Activate the Python environment `llm` you previously created: + ```bash conda activate llm ``` -* Step 2: Follow [Runtime Configurations Section](#runtime-configurations) above to prepare your runtime environment. 
-* Step 3: Create a new file named `demo.py` and insert the code snippet below. + +- Step 2: Follow [Runtime Configurations Section](#runtime-configurations) above to prepare your runtime environment. + +- Step 3: Create a new file named `demo.py` and insert the code snippet below. + ```python # Copy/Paste the contents to a new file demo.py import torch @@ -290,21 +283,23 @@ Now let's play with a real LLM. We'll be using the [phi-1.5](https://huggingface output_str = tokenizer.decode(output[0], skip_special_tokens=True) print(output_str) ``` - > Note: when running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function. + + > **Note**: When running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function. > This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU. -* Step 5. Run `demo.py` within the activated Python environment using the following command: +- Step 5. Run `demo.py` within the activated Python environment using the following command: + ```bash python demo.py ``` - ### Example output - - Example output on a system equipped with an 11th Gen Intel Core i7 CPU and Iris Xe Graphics iGPU: - ``` - Question:What is AI? - Answer: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines. - ``` +### Example output + +Example output on a system equipped with an 11th Gen Intel Core i7 CPU and Iris Xe Graphics iGPU: +``` +Question:What is AI? +Answer: AI stands for Artificial Intelligence, which is the simulation of human intelligence in machines. +``` ## Tips & Troubleshooting diff --git a/docs/mddocs/Quickstart/install_windows_gpu.md b/docs/mddocs/Quickstart/install_windows_gpu.md index fe94002f7fe..feddbed2ecb 100644 --- a/docs/mddocs/Quickstart/install_windows_gpu.md +++ b/docs/mddocs/Quickstart/install_windows_gpu.md @@ -8,35 +8,16 @@ It applies to Intel Core Ultra and Core 11 - 14 gen integrated GPUs (iGPUs), as ### (Optional) Update GPU Driver -```eval_rst -.. tip:: - - It is recommended to update your GPU driver, if you have driver version lower than ``31.0.101.5122``. Refer to `here <../Overview/install_gpu.html#prerequisites>`_ for more information. -``` +> [!TIP] +> It is recommended to update your GPU driver, if you have driver version lower than `31.0.101.5122`. Refer to [here](../Overview/install_gpu.md#prerequisites) for more information. Download and install the latest GPU driver from the [official Intel download page](https://www.intel.com/content/www/us/en/download/785597/intel-arc-iris-xe-graphics-windows.html). A system reboot is necessary to apply the changes after the installation is complete. -```eval_rst -.. note:: - - The process could take around 10 minutes. After reboot, check for the **Intel Arc Control** application to verify the driver has been installed correctly. If the installation was successful, you should see the **Arc Control** interface similar to the figure below -``` +> [!NOTE] +> The process could take around 10 minutes. After reboot, check for the **Intel Arc Control** application to verify the driver has been installed correctly. 
If the installation was successful, you should see the **Arc Control** interface similar to the figure below - - - - - ### Setup Python Environment Visit [Miniforge installation page](https://conda-forge.org/download/), download the **Miniforge installer for Windows**, and follow the instructions to complete the installation. @@ -58,68 +39,55 @@ conda activate llm With the `llm` environment active, use `pip` to install `ipex-llm` for GPU. Choose either US or CN website for `extra-index-url`: -```eval_rst -.. tabs:: - .. tab:: US - - .. code-block:: cmd - - pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - - .. tab:: CN +- For **US**: - .. code-block:: cmd + ```bash + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + ``` - pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ -``` +- For **CN**: -```eval_rst -.. note:: + ```bash + pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/ + ``` - If you encounter network issues while installing IPEX, refer to `this guide `_ for troubleshooting advice. -``` +> [!NOTE] +> If you encounter network issues while installing IPEX, refer to [this guide](../Overview/install_gpu.md#install-ipex-llm-from-wheel) for troubleshooting advice. ## Verify Installation You can verify if `ipex-llm` is successfully installed following below steps. ### Step 1: Runtime Configurations -* Open the **Miniforge Prompt** and activate the Python environment `llm` you previously created: +- Open the **Miniforge Prompt** and activate the Python environment `llm` you previously created: + ```cmd conda activate llm ``` - -* Set the following environment variables according to your device: - ```eval_rst - .. tabs:: - .. tab:: Intel iGPU - - .. code-block:: cmd - - set SYCL_CACHE_PERSISTENT=1 - set BIGDL_LLM_XMX_DISABLED=1 - - .. tab:: Intel Arc™ A770 - - .. code-block:: cmd - - set SYCL_CACHE_PERSISTENT=1 - ``` - - ```eval_rst - .. seealso:: +- Set the following environment variables according to your device: - For other Intel dGPU Series, please refer to `this guide <../Overview/install_gpu.html#runtime-configuration>`_ for more details regarding runtime configuration. - ``` + - For **Intel iGPU**: + + ```cmd + set SYCL_CACHE_PERSISTENT=1 + set BIGDL_LLM_XMX_DISABLED=1 + ``` + + - For **Intel Arc™ A770**: + + ```cmd + set SYCL_CACHE_PERSISTENT=1 + ``` + +> [!TIP] +> For other Intel dGPU Series, please refer to [this guide](../Overview/install_gpu.md#runtime-configuration) for more details regarding runtime configuration. ### Step 2: Run Python Code -* Launch the Python interactive shell by typing `python` in the Miniforge Prompt window and then press Enter. +- Launch the Python interactive shell by typing `python` in the Miniforge Prompt window and then press Enter. + +- Copy following code to Miniforge Prompt **line by line** and press Enter **after copying each line**. -* Copy following code to Miniforge Prompt **line by line** and press Enter **after copying each line**. ```python import torch from ipex_llm.transformers import AutoModel,AutoModelForCausalLM @@ -127,17 +95,16 @@ You can verify if `ipex-llm` is successfully installed following below steps. 
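  tensor_1 = torch.randn(1, 1, 40, 128).to('xpu')  # first matmul operand; shape chosen so the result below is [1, 1, 40, 40]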
tensor_2 = torch.randn(1, 1, 128, 40).to('xpu') print(torch.matmul(tensor_1, tensor_2).size()) ``` + It will output following content at the end: + ``` torch.Size([1, 1, 40, 40]) ``` - ```eval_rst - .. seealso:: + > **Tip**: If you encounter any problem, please refer to [here](../Overview/install_gpu.md#troubleshooting) for help. - If you encounter any problem, please refer to `here `_ for help. - ``` -* To exit the Python interactive shell, simply press Ctrl+Z then press Enter (or input `exit()` then press Enter). +- To exit the Python interactive shell, simply press Ctrl+Z then press Enter (or input `exit()` then press Enter). ## Monitor GPU Status To monitor your GPU's performance and status (e.g. memory consumption, utilization, etc.), you can use either the **Windows Task Manager (in `Performance` Tab)** (see the left side of the figure below) or the **Arc Control** application (see the right side of the figure below) @@ -148,156 +115,150 @@ To monitor your GPU's performance and status (e.g. memory consumption, utilizati Now let's play with a real LLM. We'll be using the [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model, a 1.8 billion parameter LLM for this demonstration. Follow the steps below to setup and run the model, and observe how it responds to a prompt "What is AI?". -* Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment. -* Step 2: Install additional package required for Qwen-1.8B-Chat to conduct: +- Step 1: Follow [Runtime Configurations Section](#step-1-runtime-configurations) above to prepare your runtime environment. + +- Step 2: Install additional package required for Qwen-1.8B-Chat to conduct: + ```cmd pip install tiktoken transformers_stream_generator einops ``` -* Step 3: Create code file. IPEX-LLM supports loading model from Hugging Face or ModelScope. Please choose according to your requirements. - ```eval_rst - .. tabs:: - .. tab:: Hugging Face - Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat `_ model with IPEX-LLM optimizations. - - .. code-block:: python - - # Copy/Paste the contents to a new file demo.py - import torch - from ipex_llm.transformers import AutoModelForCausalLM - from transformers import AutoTokenizer, GenerationConfig - generation_config = GenerationConfig(use_cache=True) - - print('Now start loading Tokenizer and optimizing Model...') - tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", - trust_remote_code=True) - - # Load Model using ipex-llm and load it to GPU - model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", - load_in_4bit=True, - cpu_embedding=True, - trust_remote_code=True) - model = model.to('xpu') - print('Successfully loaded Tokenizer and optimized Model!') - - # Format the prompt - question = "What is AI?" - prompt = "user: {prompt}\n\nassistant:".format(prompt=question) - - # Generate predicted tokens - with torch.inference_mode(): - input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') - - print('--------------------------------------Note-----------------------------------------') - print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |') - print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |') - print('| Please be patient until it finishes warm-up... 
|') - print('-----------------------------------------------------------------------------------') - - # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. - # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience. - output = model.generate(input_ids, - do_sample=False, - max_new_tokens=32, - generation_config=generation_config) # warm-up - - print('Successfully finished warm-up, now start generation...') - - output = model.generate(input_ids, - do_sample=False, - max_new_tokens=32, - generation_config=generation_config).cpu() - output_str = tokenizer.decode(output[0], skip_special_tokens=True) - print(output_str) - - .. tab:: ModelScope - - Please first run following command in Miniforge Prompt to install ModelScope: - - .. code-block:: cmd - - pip install modelscope==1.11.0 - - Create a new file named ``demo.py`` and insert the code snippet below to run `Qwen-1.8B-Chat `_ model with IPEX-LLM optimizations. - - .. code-block:: python - - # Copy/Paste the contents to a new file demo.py - import torch - from ipex_llm.transformers import AutoModelForCausalLM - from transformers import GenerationConfig - from modelscope import AutoTokenizer - generation_config = GenerationConfig(use_cache=True) - - print('Now start loading Tokenizer and optimizing Model...') - tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", - trust_remote_code=True) - - # Load Model using ipex-llm and load it to GPU - model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", - load_in_4bit=True, - cpu_embedding=True, - trust_remote_code=True, - model_hub='modelscope') - model = model.to('xpu') - print('Successfully loaded Tokenizer and optimized Model!') - - # Format the prompt - question = "What is AI?" - prompt = "user: {prompt}\n\nassistant:".format(prompt=question) - - # Generate predicted tokens - with torch.inference_mode(): - input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') - - print('--------------------------------------Note-----------------------------------------') - print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |') - print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |') - print('| Please be patient until it finishes warm-up... |') - print('-----------------------------------------------------------------------------------') - - # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. - # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience. - output = model.generate(input_ids, - do_sample=False, - max_new_tokens=32, - generation_config=generation_config) # warm-up - - print('Successfully finished warm-up, now start generation...') - - output = model.generate(input_ids, - do_sample=False, - max_new_tokens=32, - generation_config=generation_config).cpu() - output_str = tokenizer.decode(output[0], skip_special_tokens=True) - print(output_str) - - - .. tip:: - - Please note that the repo id on ModelScope may be different from Hugging Face for some models. - - ``` - ```eval_rst - .. 
note:: - - When running LLMs on Intel iGPUs with limited memory size, we recommend setting ``cpu_embedding=True`` in the ``from_pretrained`` function. - This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU. - ``` +- Step 3: Create code file. IPEX-LLM supports loading model from Hugging Face or ModelScope. Please choose according to your requirements. + + - For **loading model from Hugging Face**: + + Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://huggingface.co/Qwen/Qwen-1_8B-Chat) model with IPEX-LLM optimizations. + + ```python + # Copy/Paste the contents to a new file demo.py + import torch + from ipex_llm.transformers import AutoModelForCausalLM + from transformers import AutoTokenizer, GenerationConfig + generation_config = GenerationConfig(use_cache=True) + + print('Now start loading Tokenizer and optimizing Model...') + tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", + trust_remote_code=True) + + # Load Model using ipex-llm and load it to GPU + model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", + load_in_4bit=True, + cpu_embedding=True, + trust_remote_code=True) + model = model.to('xpu') + print('Successfully loaded Tokenizer and optimized Model!') + + # Format the prompt + question = "What is AI?" + prompt = "user: {prompt}\n\nassistant:".format(prompt=question) + + # Generate predicted tokens + with torch.inference_mode(): + input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') + + print('--------------------------------------Note-----------------------------------------') + print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |') + print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |') + print('| Please be patient until it finishes warm-up... |') + print('-----------------------------------------------------------------------------------') + + # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. + # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience. + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config) # warm-up + + print('Successfully finished warm-up, now start generation...') + + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config).cpu() + output_str = tokenizer.decode(output[0], skip_special_tokens=True) + print(output_str) + ``` + - For **loading model ModelScopee**: + + Please first run following command in Miniforge Prompt to install ModelScope: + ```cmd + pip install modelscope==1.11.0 + ``` + + Create a new file named `demo.py` and insert the code snippet below to run [Qwen-1.8B-Chat](https://www.modelscope.cn/models/qwen/Qwen-1_8B-Chat/summary) model with IPEX-LLM optimizations. 
+ + ```python + + # Copy/Paste the contents to a new file demo.py + import torch + from ipex_llm.transformers import AutoModelForCausalLM + from transformers import GenerationConfig + from modelscope import AutoTokenizer + generation_config = GenerationConfig(use_cache=True) + + print('Now start loading Tokenizer and optimizing Model...') + tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen-1_8B-Chat", + trust_remote_code=True) + + # Load Model using ipex-llm and load it to GPU + model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen-1_8B-Chat", + load_in_4bit=True, + cpu_embedding=True, + trust_remote_code=True, + model_hub='modelscope') + model = model.to('xpu') + print('Successfully loaded Tokenizer and optimized Model!') + + # Format the prompt + question = "What is AI?" + prompt = "user: {prompt}\n\nassistant:".format(prompt=question) + + # Generate predicted tokens + with torch.inference_mode(): + input_ids = tokenizer.encode(prompt, return_tensors="pt").to('xpu') + + print('--------------------------------------Note-----------------------------------------') + print('| For the first time that each model runs on Intel iGPU/Intel Arc™ A300-Series or |') + print('| Pro A60, it may take several minutes for GPU kernels to compile and initialize. |') + print('| Please be patient until it finishes warm-up... |') + print('-----------------------------------------------------------------------------------') + + # To achieve optimal and consistent performance, we recommend a one-time warm-up by running `model.generate(...)` an additional time before starting your actual generation tasks. + # If you're developing an application, you can incorporate this warm-up step into start-up or loading routine to enhance the user experience. + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config) # warm-up + + print('Successfully finished warm-up, now start generation...') + + output = model.generate(input_ids, + do_sample=False, + max_new_tokens=32, + generation_config=generation_config).cpu() + output_str = tokenizer.decode(output[0], skip_special_tokens=True) + print(output_str) + ``` + > **Note**: Please note that the repo id on ModelScope may be different from Hugging Face for some models. + +> [!NOTE] +> When running LLMs on Intel iGPUs with limited memory size, we recommend setting `cpu_embedding=True` in the `from_pretrained` function. +> This will allow the memory-intensive embedding layer to utilize the CPU instead of GPU. + +- Step 4. Run `demo.py` within the activated Python environment using the following command: -* Step 4. Run `demo.py` within the activated Python environment using the following command: ```cmd python demo.py ``` - ### Example output - - Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU: - ``` - user: What is AI? +### Example output - assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, - ``` +Example output on a system equipped with an Intel Core Ultra 5 125H CPU and Intel Arc Graphics iGPU: +``` +user: What is AI? 
+ +assistant: AI stands for Artificial Intelligence, which refers to the development of computer systems that can perform tasks that typically require human intelligence, such as visual perception, speech recognition, +``` ## Tips & Troubleshooting diff --git a/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md b/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md index 0576cc98d8a..6130916044d 100644 --- a/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md +++ b/docs/mddocs/Quickstart/llama3_llamacpp_ollama_quickstart.md @@ -6,7 +6,7 @@ Now, you can easily run Llama 3 on Intel GPU using `llama.cpp` and `Ollama` with See the demo of running Llama-3-8B-Instruct on Intel Arc GPU using `Ollama` below. - +[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/ollama-llama3-linux-arc.png)](https://llm-assets.readthedocs.io/en/latest/_images/ollama-llama3-linux-arc.mp4) ## Quick Start This quickstart guide walks you through how to run Llama 3 on Intel GPU using `llama.cpp` / `Ollama` with IPEX-LLM. @@ -15,7 +15,7 @@ This quickstart guide walks you through how to run Llama 3 on Intel GPU using `l #### 1.1 Install IPEX-LLM for llama.cpp and Initialize -Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with llama.cpp binaries, then follow the instructions in section [Initialize llama.cpp with IPEX-LLM](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#initialize-llama-cpp-with-ipex-llm) to initialize. +Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](./llama_cpp_quickstart.md), and follow the instructions in section [Prerequisites](./llama_cpp_quickstart.md#0-prerequisites) to setup and section [Install IPEX-LLM for llama.cpp](./llama_cpp_quickstart.md#1-install-ipex-llm-for-llamacpp) to install the IPEX-LLM with llama.cpp binaries, then follow the instructions in section [Initialize llama.cpp with IPEX-LLM](./llama_cpp_quickstart.md#initialize-llamacpp-with-ipex-llm) to initialize. **After above steps, you should have created a conda environment, named `llm-cpp` for instance and have llama.cpp binaries in your current directory.** @@ -33,73 +33,61 @@ Suppose you have downloaded a [Meta-Llama-3-8B-Instruct-Q4_K_M.gguf](https://hug To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`. -```eval_rst -.. tabs:: - .. tab:: Linux +- For **Linux users**: + + ```bash + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 + ``` - .. code-block:: bash +- For **Windows users**: - source /opt/intel/oneapi/setvars.sh - export SYCL_CACHE_PERSISTENT=1 + Please run the following command in Miniforge Prompt. - .. tab:: Windows - - .. code-block:: bash + ```cmd + set SYCL_CACHE_PERSISTENT=1 + ``` - set SYCL_CACHE_PERSISTENT=1 - -``` - -```eval_rst -.. tip:: - - If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance: - - .. 
code-block:: bash - - export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - -``` +> [!TIP] +> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance: +> +> ```bash +> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +> ``` ##### Run llama3 Under your current directory, exceuting below command to do inference with Llama3: -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash +- For **Linux users**: + + ```bash + ./main -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -t 8 -e -ngl 33 --color --no-mmap + ``` - ./main -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -t 8 -e -ngl 33 --color --no-mmap +- For **Windows users**: - .. tab:: Windows + Please run the following command in Miniforge Prompt. - Please run the following command in Miniforge Prompt. - - .. code-block:: bash - - main -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -e -ngl 33 --color --no-mmap -``` + ```cmd + main -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun doing something" -e -ngl 33 --color --no-mmap + ``` Under your current directory, you can also execute below command to have interactive chat with Llama3: -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash +- For **Linux users**: + + ```bash + ./main -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf + ``` - ./main -ngl 33 --interactive-first --color -e --in-prefix '<|start_header_id|>user<|end_header_id|>\n\n' --in-suffix '<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n' -r '<|eot_id|>' -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf +- For **Windows users**: - .. tab:: Windows + Please run the following command in Miniforge Prompt. - Please run the following command in Miniforge Prompt. - - .. 
code-block:: bash - - main -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf -``` + ```cmd + main -ngl 33 --interactive-first --color -e --in-prefix "<|start_header_id|>user<|end_header_id|>\n\n" --in-suffix "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n" -r "<|eot_id|>" -m /Meta-Llama-3-8B-Instruct-Q4_K_M.gguf + ``` Below is a sample output on Intel Arc GPU: @@ -108,7 +96,7 @@ Below is a sample output on Intel Arc GPU: #### 2.1 Install IPEX-LLM for Ollama and Initialize -Visit [Run Ollama with IPEX-LLM on Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html), and follow the instructions in section [Install IPEX-LLM for llama.cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binary, then follow the instructions in section [Initialize Ollama](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/ollama_quickstart.html#initialize-ollama) to initialize. +Visit [Run Ollama with IPEX-LLM on Intel GPU](./ollama_quickstart.md), and follow the instructions in section [Install IPEX-LLM for llama.cpp](./llama_cpp_quickstart.md#1-install-ipex-llm-for-llamacpp) to install the IPEX-LLM with Ollama binary, then follow the instructions in section [Initialize Ollama](./ollama_quickstart.md#2-initialize-ollama) to initialize. **After above steps, you should have created a conda environment, named `llm-cpp` for instance and have ollama binary file in your current directory.** @@ -122,80 +110,65 @@ Visit [Run Ollama with IPEX-LLM on Intel GPU](https://ipex-llm.readthedocs.io/en Launch the Ollama service: -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash +- For **Linux users**: + + ```bash + export no_proxy=localhost,127.0.0.1 + export ZES_ENABLE_SYSMAN=1 + export OLLAMA_NUM_GPU=999 + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 - export no_proxy=localhost,127.0.0.1 - export ZES_ENABLE_SYSMAN=1 - export OLLAMA_NUM_GPU=999 - source /opt/intel/oneapi/setvars.sh - export SYCL_CACHE_PERSISTENT=1 + ./ollama serve + ``` - ./ollama serve +- For **Windows users**: - .. tab:: Windows + Please run the following command in Miniforge Prompt. - Please run the following command in Miniforge Prompt. + ```cmd + set no_proxy=localhost,127.0.0.1 + set ZES_ENABLE_SYSMAN=1 + set OLLAMA_NUM_GPU=999 + set SYCL_CACHE_PERSISTENT=1 - .. code-block:: bash + ollama serve + ``` - set no_proxy=localhost,127.0.0.1 - set ZES_ENABLE_SYSMAN=1 - set OLLAMA_NUM_GPU=999 - set SYCL_CACHE_PERSISTENT=1 +> [!TIP] +> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`: +> +> ```bash +> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +> ``` - ollama serve - -``` - -```eval_rst -.. tip:: - - If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`: - - .. code-block:: bash - - export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - -``` - -```eval_rst -.. 
note:: - - To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`. -``` +> [!NOTE] +> +> To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`. ##### 2.2.2 Using Ollama Run Llama3 Keep the Ollama service on and open another terminal and run llama3 with `ollama run`: -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash - - export no_proxy=localhost,127.0.0.1 - ./ollama run llama3:8b-instruct-q4_K_M - - .. tab:: Windows - - Please run the following command in Miniforge Prompt. +- For **Linux users**: + + ```bash + export no_proxy=localhost,127.0.0.1 + ./ollama run llama3:8b-instruct-q4_K_M + ``` - .. code-block:: bash +- For **Windows users**: - set no_proxy=localhost,127.0.0.1 - ollama run llama3:8b-instruct-q4_K_M -``` + Please run the following command in Miniforge Prompt. -```eval_rst -.. note:: + ```cmd + set no_proxy=localhost,127.0.0.1 + ollama run llama3:8b-instruct-q4_K_M + ``` - Here we just take `llama3:8b-instruct-q4_K_M` for example, you can replace it with any other Llama3 model you want. -``` +> [!NOTE] +> +> Here we just take `llama3:8b-instruct-q4_K_M` for example, you can replace it with any other Llama3 model you want. Below is a sample output on Intel Arc GPU : diff --git a/docs/mddocs/Quickstart/llama_cpp_quickstart.md b/docs/mddocs/Quickstart/llama_cpp_quickstart.md index 1373a781489..455d96f4eae 100644 --- a/docs/mddocs/Quickstart/llama_cpp_quickstart.md +++ b/docs/mddocs/Quickstart/llama_cpp_quickstart.md @@ -4,15 +4,12 @@ See the demo of running LLaMA2-7B on Intel Arc GPU below. - +[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/llama-cpp-arc.png)](https://llm-assets.readthedocs.io/en/latest/_images/llama-cpp-arc.mp4) -```eval_rst -.. note:: - - `ipex-llm[cpp]==2.5.0b20240527` is consistent with `c780e75 `_ of llama.cpp. - - Our current version is consistent with `62bfef5 `_ of llama.cpp. -``` +> [!NOTE] +> `ipex-llm[cpp]==2.5.0b20240527` is consistent with [c780e75](https://github.com/ggerganov/llama.cpp/commit/c780e75305dba1f67691a8dc0e8bc8425838a452) of llama.cpp. +> +> Our latest version is consistent with [62bfef5](https://github.com/ggerganov/llama.cpp/commit/62bfef5194d5582486d62da3db59bf44981b7912) of llama.cpp. ## Quick Start This quickstart guide walks you through installing and running `llama.cpp` with `ipex-llm`. @@ -23,41 +20,35 @@ IPEX-LLM's support for `llama.cpp` now is available for Linux system and Windows #### Linux For Linux system, we recommend Ubuntu 20.04 or later (Ubuntu 22.04 is preferred). -Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.html), follow [Install Intel GPU Driver](./install_linux_gpu.html#install-intel-gpu-driver) and [Install oneAPI](./install_linux_gpu.html#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0. +Visit the [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md), follow [Install Intel GPU Driver](./install_linux_gpu.md#install-gpu-driver) and [Install oneAPI](./install_linux_gpu.md#install-oneapi) to install GPU driver and Intel® oneAPI Base Toolkit 2024.0. #### Windows (Optional) IPEX-LLM backend for llama.cpp only supports the more recent GPU drivers. Please make sure your GPU driver version is equal or newer than `31.0.101.5333`, otherwise you might find gibberish output. 
-If you have lower GPU driver version, visit the [Install IPEX-LLM on Windows with Intel GPU Guide](./install_windows_gpu.html), and follow [Update GPU driver](./install_windows_gpu.html#optional-update-gpu-driver). +If you have lower GPU driver version, visit the [Install IPEX-LLM on Windows with Intel GPU Guide](./install_windows_gpu.md), and follow [Update GPU driver](./install_windows_gpu.md#optional-update-gpu-driver). ### 1 Install IPEX-LLM for llama.cpp To use `llama.cpp` with IPEX-LLM, first ensure that `ipex-llm[cpp]` is installed. -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash +- For **Linux users**: + + ```bash + conda create -n llm-cpp python=3.11 + conda activate llm-cpp + pip install --pre --upgrade ipex-llm[cpp] + ``` - conda create -n llm-cpp python=3.11 - conda activate llm-cpp - pip install --pre --upgrade ipex-llm[cpp] +- For **Windows users**: - .. tab:: Windows + Please run the following command in Miniforge Prompt. - .. note:: - - Please run the following command in Miniforge Prompt. - - .. code-block:: cmd - - conda create -n llm-cpp python=3.11 - conda activate llm-cpp - pip install --pre --upgrade ipex-llm[cpp] - -``` + ```cmd + conda create -n llm-cpp python=3.11 + conda activate llm-cpp + pip install --pre --upgrade ipex-llm[cpp] + ``` **After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `llama.cpp` commands with IPEX-LLM.** @@ -72,43 +63,34 @@ cd llama-cpp #### Initialize llama.cpp with IPEX-LLM Then you can use following command to initialize `llama.cpp` with IPEX-LLM: -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash - - init-llama-cpp - After ``init-llama-cpp``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory. +- For **Linux users**: + + ```bash + init-llama-cpp + ``` - .. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image.png + After `init-llama-cpp`, you should see many soft links of `llama.cpp`'s executable files and a `convert.py` in current directory. - .. tab:: Windows + ![init_llama_cpp_demo_image](https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image.png) - Please run the following command with **administrator privilege in Miniforge Prompt**. +- For **Windows users**: - .. code-block:: bash - - init-llama-cpp.bat + Please run the following command with **administrator privilege in Miniforge Prompt**. - After ``init-llama-cpp.bat``, you should see many soft links of ``llama.cpp``'s executable files and a ``convert.py`` in current directory. + ```cmd + init-llama-cpp.bat + ``` - .. image:: https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image_windows.png + After `init-llama-cpp.bat`, you should see many soft links of `llama.cpp`'s executable files and a `convert.py` in current directory. -``` + ![init_llama_cpp_demo_image_windows](https://llm-assets.readthedocs.io/en/latest/_images/init_llama_cpp_demo_image_windows.png) -```eval_rst -.. note:: +> [!TIP] +> `init-llama-cpp` will create soft links of llama.cpp's executable files to current directory, if you want to use these executable files in other places, don't forget to run above commands again. - ``init-llama-cpp`` will create soft links of llama.cpp's executable files to current directory, if you want to use these executable files in other places, don't forget to run above commands again. -``` - -```eval_rst -.. 
note:: - - If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your binary file, don't forget to remove old binary files first and initialize again with ``init-llama-cpp`` or ``init-llama-cpp.bat``. -``` +> [!NOTE] +> If you have installed higher version `ipex-llm[cpp]` and want to upgrade your binary file, don't forget to remove old binary files first and initialize again with `init-llama-cpp` or `init-llama-cpp.bat`. **Now you can use these executable files by standard llama.cpp's usage.** @@ -116,35 +98,27 @@ Then you can use following command to initialize `llama.cpp` with IPEX-LLM: To use GPU acceleration, several environment variables are required or recommended before running `llama.cpp`. -```eval_rst -.. tabs:: - .. tab:: Linux +- For **Linux users**: + + ```bash + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 + ``` - .. code-block:: bash +- For **Windows users**: - source /opt/intel/oneapi/setvars.sh - export SYCL_CACHE_PERSISTENT=1 + Please run the following command in Miniforge Prompt. - .. tab:: Windows + ```cmd + set SYCL_CACHE_PERSISTENT=1 + ``` - Please run the following command in Miniforge Prompt. - - .. code-block:: bash - - set SYCL_CACHE_PERSISTENT=1 - -``` - -```eval_rst -.. tip:: - - If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance: - - .. code-block:: bash - - export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - -``` +> [!TIP] +> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance: +> +> ```bash +> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +> ``` ### 3 Example: Running community GGUF models with IPEX-LLM @@ -155,30 +129,23 @@ Before running, you should download or copy community GGUF model to your current #### Run the quantized model -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash - - ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color - - .. note:: +- For **Linux users**: + + ```bash + ./main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color + ``` - For more details about meaning of each parameter, you can use ``./main -h``. + > **Note**: For more details about meaning of each parameter, you can use `./main -h`. - .. tab:: Windows +- For **Windows users**: - Please run the following command in Miniforge Prompt. + Please run the following command in Miniforge Prompt. - .. code-block:: bash + ```cmd + main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color + ``` - main -m mistral-7b-instruct-v0.1.Q4_K_M.gguf -n 32 --prompt "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun" -t 8 -e -ngl 33 --color - - .. note:: - - For more details about meaning of each parameter, you can use ``main -h``. 
-``` + > **Note**: For more details about meaning of each parameter, you can use `main -h`. #### Sample Output ``` @@ -325,7 +292,7 @@ If `-ngl` is set to 0, it means that the entire model will run on CPU. If `-ngl` #### How to specificy GPU If your machine has multi GPUs, `llama.cpp` will default use all GPUs which may slow down your inference for model which can run on single GPU. You can add `-sm none` in your command to use one GPU only. -Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select device before excuting your command, more details can refer to [here](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/KeyFeatures/multi_gpus_selection.html#oneapi-device-selector). +Also, you can use `ONEAPI_DEVICE_SELECTOR=level_zero:[gpu_id]` to select device before excuting your command, more details can refer to [here](../Overview/KeyFeatures/multi_gpus_selection.md#2-oneapi-device-selector). #### Program crash with Chinese prompt If you run the llama.cpp program on Windows and find that your program crashes or outputs abnormally when accepting Chinese prompts, you can open `Region->Administrative->Change System locale..`, check `Beta: Use Unicode UTF-8 for worldwide language support` option and then restart your computer. diff --git a/docs/mddocs/Quickstart/ollama_quickstart.md b/docs/mddocs/Quickstart/ollama_quickstart.md index fa81d73a24e..4760f6a2c32 100644 --- a/docs/mddocs/Quickstart/ollama_quickstart.md +++ b/docs/mddocs/Quickstart/ollama_quickstart.md @@ -4,15 +4,12 @@ See the demo of running LLaMA2-7B on Intel Arc GPU below. - +[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/ollama-linux-arc.png)](https://llm-assets.readthedocs.io/en/latest/_images/ollama-linux-arc.mp4) -```eval_rst -.. note:: - - `ipex-llm[cpp]==2.5.0b20240527` is consistent with `v0.1.34 `_ of ollama. - - Our current version is consistent with `v0.1.39 `_ of ollama. -``` +> [!NOTE] +> `ipex-llm[cpp]==2.5.0b20240527` is consistent with [v0.1.34](https://github.com/ollama/ollama/releases/tag/v0.1.34) of ollama. +> +> Our current version is consistent with [v0.1.39](https://github.com/ollama/ollama/releases/tag/v0.1.39) of ollama. ## Quickstart @@ -20,7 +17,7 @@ See the demo of running LLaMA2-7B on Intel Arc GPU below. IPEX-LLM's support for `ollama` now is available for Linux system and Windows system. -Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html), and follow the instructions in section [Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#prerequisites) to setup and section [Install IPEX-LLM cpp](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html#install-ipex-llm-for-llama-cpp) to install the IPEX-LLM with Ollama binaries. +Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](./llama_cpp_quickstart.md), and follow the instructions in section [Prerequisites](./llama_cpp_quickstart.md#0-prerequisites) to setup and section [Install IPEX-LLM cpp](./llama_cpp_quickstart.md#1-install-ipex-llm-for-llamacpp) to install the IPEX-LLM with Ollama binaries. **After the installation, you should have created a conda environment, named `llm-cpp` for instance, for running `ollama` commands with IPEX-LLM.** @@ -28,31 +25,24 @@ Visit [Run llama.cpp with IPEX-LLM on Intel GPU Guide](https://ipex-llm.readthed Activate the `llm-cpp` conda environment and initialize Ollama by executing the commands below. 
A symbolic link to `ollama` will appear in your current directory. -```eval_rst -.. tabs:: - .. tab:: Linux +- For **Linux users**: + + ```bash + conda activate llm-cpp + init-ollama + ``` - .. code-block:: bash - - conda activate llm-cpp - init-ollama +- For **Windows users**: - .. tab:: Windows + Please run the following command with **administrator privilege in Miniforge Prompt**. - Please run the following command with **administrator privilege in Miniforge Prompt**. + ```cmd + conda activate llm-cpp + init-ollama.bat + ``` - .. code-block:: bash - - conda activate llm-cpp - init-ollama.bat - -``` - -```eval_rst -.. note:: - - If you have installed higher version ``ipex-llm[cpp]`` and want to upgrade your ollama binary file, don't forget to remove old binary files first and initialize again with ``init-ollama`` or ``init-ollama.bat``. -``` +> [!NOTE] +> If you have installed higher version `ipex-llm[cpp]` and want to upgrade your ollama binary file, don't forget to remove old binary files first and initialize again with `init-ollama` or `init-ollama.bat`. **Now you can use this executable file by standard ollama's usage.** @@ -60,57 +50,43 @@ Activate the `llm-cpp` conda environment and initialize Ollama by executing the You may launch the Ollama service as below: -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash +- For **Linux users**: - export OLLAMA_NUM_GPU=999 - export no_proxy=localhost,127.0.0.1 - export ZES_ENABLE_SYSMAN=1 - source /opt/intel/oneapi/setvars.sh - export SYCL_CACHE_PERSISTENT=1 + ```bash + export OLLAMA_NUM_GPU=999 + export no_proxy=localhost,127.0.0.1 + export ZES_ENABLE_SYSMAN=1 + source /opt/intel/oneapi/setvars.sh + export SYCL_CACHE_PERSISTENT=1 - ./ollama serve + ./ollama serve + ``` - .. tab:: Windows +- For **Windows users**: - Please run the following command in Miniforge Prompt. + Please run the following command in Miniforge Prompt. - .. code-block:: bash + ```cmd + set OLLAMA_NUM_GPU=999 + set no_proxy=localhost,127.0.0.1 + set ZES_ENABLE_SYSMAN=1 + set SYCL_CACHE_PERSISTENT=1 - set OLLAMA_NUM_GPU=999 - set no_proxy=localhost,127.0.0.1 - set ZES_ENABLE_SYSMAN=1 - set SYCL_CACHE_PERSISTENT=1 + ollama serve + ``` - ollama serve +> [!NOTE] +> Please set environment variable `OLLAMA_NUM_GPU` to `999` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU. -``` - -```eval_rst -.. note:: - - Please set environment variable ``OLLAMA_NUM_GPU`` to ``999`` to make sure all layers of your model are running on Intel GPU, otherwise, some layers may run on CPU. -``` - -```eval_rst -.. tip:: +> [!TIP] +> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`: +> +> ```bash +> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +> ``` - If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before executing `ollama serve`: - - .. code-block:: bash - - export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - -``` - -```eval_rst -.. note:: - - To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`. 
-``` +> [!NOTE] +> To allow the service to accept connections from all IP addresses, use `OLLAMA_HOST=0.0.0.0 ./ollama serve` instead of just `./ollama serve`. The console will display messages similar to the following: @@ -134,34 +110,29 @@ Keep the Ollama service on and open another terminal and run `./ollama pull with your pulled model**, e.g. `dolphin-phi`. -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash - - curl http://localhost:11434/api/generate -d ' - { - "model": "", - "prompt": "Why is the sky blue?", - "stream": false - }' - - .. tab:: Windows - - Please run the following command in Miniforge Prompt. - - .. code-block:: bash - - curl http://localhost:11434/api/generate -d " - { - \"model\": \"\", - \"prompt\": \"Why is the sky blue?\", - \"stream\": false - }" - -``` - +- For **Linux users**: + + ```bash + curl http://localhost:11434/api/generate -d ' + { + "model": "", + "prompt": "Why is the sky blue?", + "stream": false + }' + ``` + +- For **Windows users**: + + Please run the following command in Miniforge Prompt. + + ```cmd + curl http://localhost:11434/api/generate -d " + { + \"model\": \"\", + \"prompt\": \"Why is the sky blue?\", + \"stream\": false + }" + ``` #### Using Ollama Run GGUF models @@ -175,27 +146,23 @@ PARAMETER num_predict 64 Then you can create the model in Ollama by `ollama create example -f Modelfile` and use `ollama run` to run the model directly on console. -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash +- For **Linux users**: + + ```bash + export no_proxy=localhost,127.0.0.1 + ./ollama create example -f Modelfile + ./ollama run example + ``` - export no_proxy=localhost,127.0.0.1 - ./ollama create example -f Modelfile - ./ollama run example +- For **Windows users**: - .. tab:: Windows + Please run the following command in Miniforge Prompt. - Please run the following command in Miniforge Prompt. - - .. code-block:: bash - - set no_proxy=localhost,127.0.0.1 - ollama create example -f Modelfile - ollama run example - -``` + ```cmd + set no_proxy=localhost,127.0.0.1 + ollama create example -f Modelfile + ollama run example + ``` An example process of interacting with model with `ollama run example` looks like the following: diff --git a/docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md b/docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md index 1eb2ec05418..d143e7a6664 100644 --- a/docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md +++ b/docs/mddocs/Quickstart/open_webui_with_ollama_quickstart.md @@ -4,7 +4,7 @@ *See the demo of running Mistral:7B on Intel Arc A770 below.* - +[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/open_webui_demo.png)](https://llm-assets.readthedocs.io/en/latest/_images/open_webui_demo.mp4) ## Quickstart @@ -13,80 +13,70 @@ This quickstart guide walks you through setting up and using [Open WebUI](https: ### 1 Run Ollama with Intel GPU -Follow the instructions on the [Run Ollama with Intel GPU](ollama_quickstart.html) to install and run "Ollama Serve". Please ensure that the Ollama server continues to run while you're using the Open WebUI. +Follow the instructions on the [Run Ollama with Intel GPU](./ollama_quickstart.md) to install and run "Ollama Serve". Please ensure that the Ollama server continues to run while you're using the Open WebUI. ### 2 Install the Open-Webui #### Install Node.js & npm -```eval_rst -.. 
note:: - - Package version requirements for running Open WebUI: Node.js (>= 20.10) or Bun (>= 1.0.21), Python (>= 3.11) -``` +> [!NOTE] +> Package version requirements for running Open WebUI: Node.js (>= 20.10) or Bun (>= 1.0.21), Python (>= 3.11) Please install Node.js & npm as below: -```eval_rst -.. tabs:: - .. tab:: Linux - - Run below commands to install Node.js & npm. Once the installation is complete, verify the installation by running ```node -v``` and ```npm -v``` to check the versions of Node.js and npm, respectively. +- For **Linux users**: - .. code-block:: bash + Run below commands to install Node.js & npm. Once the installation is complete, verify the installation by running `node -v` and `npm -v` to check the versions of Node.js and npm, respectively. - sudo apt update - sudo apt install nodejs - sudo apt install npm - - .. tab:: Windows + ```bash + sudo apt update + sudo apt install nodejs + sudo apt install npm + ``` - You may download Node.js installation package from https://nodejs.org/dist/v20.12.2/node-v20.12.2-x64.msi, which will install both Node.js & npm on your system. +- For **Windows users**: - Once the installation is complete, verify the installation by running ```node -v``` and ```npm -v``` to check the versions of Node.js and npm, respectively. -``` + You may download Node.js installation package from https://nodejs.org/dist/v20.12.2/node-v20.12.2-x64.msi, which will install both Node.js & npm on your system. + Once the installation is complete, verify the installation by running `node -v` and `npm -v` to check the versions of Node.js and npm, respectively. #### Download the Open-Webui Use `git` to clone the [open-webui repo](https://github.com/open-webui/open-webui.git), or download the open-webui source code zip from [this link](https://github.com/open-webui/open-webui/archive/refs/heads/main.zip) and unzip it to a directory, e.g. `~/open-webui`. - #### Install Dependencies You may run below commands to install Open WebUI dependencies: -```eval_rst -.. tabs:: - .. tab:: Linux - .. code-block:: bash +- For **Linux users**: - cd ~/open-webui/ - cp -RPp .env.example .env # Copy required .env file + ```bash + cd ~/open-webui/ + cp -RPp .env.example .env # Copy required .env file - # Build frontend - npm i - npm run build + # Build frontend + npm i + npm run build - # Install Dependencies - cd ./backend - pip install -r requirements.txt -U + # Install Dependencies + cd ./backend + pip install -r requirements.txt -U + ``` - .. tab:: Windows - - .. code-block:: bash +- For **Windows users**: - cd ~\open-webui\ - copy .env.example .env + ```cmd + cd ~\open-webui\ + copy .env.example .env - # Build frontend - npm install - npm run build + :: Build frontend + npm install + npm run build - # Install Dependencies - cd .\backend - pip install -r requirements.txt -U -``` + :: Install Dependencies + cd .\backend + pip install -r requirements.txt -U + ``` ### 3. Start the Open-WebUI @@ -94,46 +84,31 @@ You may run below commands to install Open WebUI dependencies: Run below commands to start the service: -```eval_rst -.. tabs:: - .. tab:: Linux +- For **Linux users**: - .. code-block:: bash + ```bash + export no_proxy=localhost,127.0.0.1 + bash start.sh + ``` - export no_proxy=localhost,127.0.0.1 - bash start.sh - - .. note: - - If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add `export HF_ENDPOINT=https://hf-mirror.com` before running `bash start.sh`. 
+ If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add `export HF_ENDPOINT=https://hf-mirror.com` before running `bash start.sh`. +- For **Windows users**: - .. tab:: Windows - - .. code-block:: bash - - set no_proxy=localhost,127.0.0.1 - start_windows.bat - - .. note: - - If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add `set HF_ENDPOINT=https://hf-mirror.com` before running `start_windows.bat`. -``` + ```cmd + set no_proxy=localhost,127.0.0.1 + start_windows.bat + ``` + If you have difficulty accessing the huggingface repositories, you may use a mirror, e.g. add `set HF_ENDPOINT=https://hf-mirror.com` before running `start_windows.bat`. #### Access the WebUI Upon successful launch, URLs to access the WebUI will be displayed in the terminal. Open the provided local URL in your browser to interact with the WebUI, e.g. http://localhost:8080/. - - ### 4. Using the Open-Webui -```eval_rst -.. note:: - - For detailed information about how to use Open WebUI, visit the README of `open-webui official repository `_. - -``` +> [!NOTE] +> For detailed information about how to use Open WebUI, visit the README of [open-webui official repository](https://github.com/open-webui/open-webui). #### Log-in @@ -163,11 +138,8 @@ If the connection is successful, you will see a message stating `Service Connect -```eval_rst -.. note:: - - If you want to use an Ollama server hosted at a different URL, simply update the **Ollama Base URL** to the new URL and press the **Refresh** button to re-confirm the connection to Ollama. -``` +> [!NOTE] +> If you want to use an Ollama server hosted at a different URL, simply update the **Ollama Base URL** to the new URL and press the **Refresh** button to re-confirm the connection to Ollama. #### Pull Model @@ -205,4 +177,4 @@ To shut down the open-webui server, use **Ctrl+C** in the terminal where the ope ##### Error `No module named 'torch._C` -When you encounter the error ``ModuleNotFoundError: No module named 'torch._C'`` after executing ```bash start.sh```, you can resolve it by reinstalling PyTorch. First, use ```pip uninstall torch``` to remove the existing PyTorch installation, and then reinstall it along with its dependencies by running ```pip install torch torchvision torchaudio```. +When you encounter the error `ModuleNotFoundError: No module named 'torch._C'` after executing `bash start.sh`, you can resolve it by reinstalling PyTorch. First, use `pip uninstall torch` to remove the existing PyTorch installation, and then reinstall it along with its dependencies by running `pip install torch torchvision torchaudio`. diff --git a/docs/mddocs/Quickstart/privateGPT_quickstart.md b/docs/mddocs/Quickstart/privateGPT_quickstart.md index 0d605068005..2e2f18f4566 100644 --- a/docs/mddocs/Quickstart/privateGPT_quickstart.md +++ b/docs/mddocs/Quickstart/privateGPT_quickstart.md @@ -4,12 +4,10 @@ *See the demo of privateGPT running Mistral:7B on Intel Arc A770 below.* - - +[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/PrivateGPT-ARC.png)](https://llm-assets.readthedocs.io/en/latest/_images/PrivateGPT-ARC.mp4) ## Quickstart - ### 1. Install and Start `Ollama` Service on Intel GPU Follow the steps in [Run Ollama on Intel GPU Guide](./ollama_quickstart.md) to install and run Ollama on Intel GPU. Ensure that `ollama serve` is running correctly and can be accessed through a local URL (e.g., `https://127.0.0.1:11434`) or a remote URL (e.g., `http://your_ip:11434`). 
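As a quick sanity check before wiring PrivateGPT to the server, you can query the Ollama endpoint directly. The commands below are a minimal sketch assuming the default local URL from the example above; adjust the host and port to match your own setup:

```bash
# Verify the Ollama server is reachable.
curl http://127.0.0.1:11434           # expected reply: "Ollama is running"

# List the models you have already pulled on this server.
curl http://127.0.0.1:11434/api/tags
```

If either command fails, return to the Ollama quickstart and confirm that `ollama serve` is still running before continuing with the PrivateGPT configuration below.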
@@ -56,13 +54,8 @@ Below is an example of how `settings-ollama.yaml` should look.

-```eval_rst - -.. note:: - - `settings-ollama.yaml` is loaded when the Ollama profile is specified in the PGPT_PROFILES environment variable. This can override configurations from the default `settings.yaml`. - -``` +> [!NOTE] +> `settings-ollama.yaml` is loaded when the Ollama profile is specified in the PGPT_PROFILES environment variable. This can override configurations from the default `settings.yaml`. For more information on configuring PrivateGPT, please visit the [PrivateGPT Main Concepts](https://docs.privategpt.dev/installation/getting-started/main-concepts) page. @@ -72,31 +65,24 @@ Please ensure that the Ollama server continues to run in a terminal while you're Run below commands to start the service in another terminal: -```eval_rst -.. tabs:: - .. tab:: Linux - - .. code-block:: bash - - export no_proxy=localhost,127.0.0.1 - PGPT_PROFILES=ollama make run +- For **Linux users**: + + ```bash + export no_proxy=localhost,127.0.0.1 + PGPT_PROFILES=ollama make run + ``` - .. note: + > **Note**: Setting `PGPT_PROFILES=ollama` will load the configuration from `settings.yaml` and `settings-ollama.yaml`. - Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``. +- For **Windows users**: - .. tab:: Windows - - .. code-block:: bash - - set no_proxy=localhost,127.0.0.1 - set PGPT_PROFILES=ollama - make run + ```cmd + set no_proxy=localhost,127.0.0.1 + set PGPT_PROFILES=ollama + make run + ``` - .. note: - - Setting ``PGPT_PROFILES=ollama`` will load the configuration from ``settings.yaml`` and ``settings-ollama.yaml``. -``` + > **Note**: Setting `PGPT_PROFILES=ollama` will load the configuration from `settings.yaml` and `settings-ollama.yaml`. Upon successful deployment, you will see logs in the terminal similar to the following: diff --git a/docs/mddocs/Quickstart/vLLM_quickstart.md b/docs/mddocs/Quickstart/vLLM_quickstart.md index 71e34834c17..155fd321c6e 100644 --- a/docs/mddocs/Quickstart/vLLM_quickstart.md +++ b/docs/mddocs/Quickstart/vLLM_quickstart.md @@ -20,9 +20,9 @@ This quickstart guide walks you through installing and running `vLLM` with `ipex IPEX-LLM's support for `vLLM` now is available for only Linux system. -Visit [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html) and follow the instructions in section [Install Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-prerequisites) to isntall prerequisites that are needed for running code on Intel GPUs. +Visit [Install IPEX-LLM on Linux with Intel GPU](./install_linux_gpu.md) and follow the instructions in section [Install Prerequisites](./install_linux_gpu.md#install-prerequisites) to isntall prerequisites that are needed for running code on Intel GPUs. -Then,follow instructions in section [Install ipex-llm](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-ipex-llm) to install `ipex-llm[xpu]` and setup the recommended runtime configurations. +Then, follow instructions in section [Install ipex-llm](./install_linux_gpu.md#install-ipex-llm) to install `ipex-llm[xpu]` and setup the recommended runtime configurations. 
**After the installation, you should have created a conda environment, named `ipex-vllm` for instance, for running `vLLM` commands with IPEX-LLM.** @@ -54,12 +54,10 @@ pip install transformers_stream_generator einops tiktoken To run offline inference using vLLM for a quick impression, use the following example. -```eval_rst -.. note:: - - Please modify the MODEL_PATH in offline_inference.py to use your chosen model. - You can try modify load_in_low_bit to different values in **[sym_int4, fp6, fp8, fp8_e4m3, fp16]** to use different quantization dtype. -``` +> [!NOTE] +> Please modify the MODEL_PATH in offline_inference.py to use your chosen model. +> +> You can try modify load_in_low_bit to different values in **[sym_int4, fp6, fp8, fp8_e4m3, fp16]** to use different quantization dtype. ```bash #!/bin/bash @@ -91,11 +89,8 @@ Prompt: 'The future of AI is', Generated text: " bright, but it's not without ch ### Service -```eval_rst -.. note:: - - Because of using JIT compilation for kernels. We recommend to send a few requests for warmup before using the service for the best performance. -``` +> [!NOTE] +> Because of using JIT compilation for kernels. We recommend to send a few requests for warmup before using the service for the best performance. To fully utilize the continuous batching feature of the `vLLM`, you can send requests to the service using `curl` or other similar methods. The requests sent to the engine will be batched at token level. Queries will be executed in the same `forward` step of the LLM and be removed when they are finished instead of waiting for all sequences to be finished. @@ -168,20 +163,17 @@ Below shows an example output using `Qwen1.5-7B-Chat` with low-bit format `sym_i -```eval_rst -.. tip:: - - If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before starting the service: - - .. code-block:: bash - - export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - -``` +> [!TIP] +> If your local LLM is running on Intel Arc™ A-Series Graphics with Linux OS (Kernel 6.2), it is recommended to additionaly set the following environment variable for optimal performance before starting the service: +> +> ```bash +> export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 +> ``` ## 4. About Tensor parallel -> Note: We recommend to use docker for tensor parallel deployment. Check our serving docker image `intelanalytics/ipex-llm-serving-xpu`. +> [!NOTE] +> We recommend to use docker for tensor parallel deployment. Check our serving docker image `intelanalytics/ipex-llm-serving-xpu`. We have also supported tensor parallel by using multiple Intel GPU cards. To enable tensor parallel, you will need to install `libfabric-dev` in your environment. In ubuntu, you can install it by: @@ -268,9 +260,5 @@ The following figure shows the result of benchmarking `Llama-2-7b-chat-hf` using - -```eval_rst -.. tip:: - - To find the best config that fits your workload, you may need to start the service and use tools like `wrk` or `jmeter` to perform a stress tests. -``` +> [!TIP] +> To find the best config that fits your workload, you may need to start the service and use tools like `wrk` or `jmeter` to perform a stress tests. 
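As a lightweight alternative to `wrk` or `jmeter`, a small shell loop that fires concurrent requests at the OpenAI-compatible `/v1/completions` endpoint is often enough for warm-up and a first impression of batching behaviour. The sketch below is illustrative only: it assumes the service listens on `localhost:8000` and that `Qwen1.5-7B-Chat` is the served model name; replace both with the values you actually used when starting the service.

```bash
#!/bin/bash
# Warm-up / smoke-stress sketch for the OpenAI-compatible vLLM endpoint.
# SERVER and MODEL are placeholders -- set them to the address and served
# model name used when the service was started.
SERVER=http://localhost:8000
MODEL=Qwen1.5-7B-Chat

# Send 8 requests in parallel so several sequences land in the same batch.
for i in $(seq 1 8); do
  curl -s "$SERVER/v1/completions" \
    -H "Content-Type: application/json" \
    -d "{\"model\": \"$MODEL\", \"prompt\": \"San Francisco is a\", \"max_tokens\": 64}" > /dev/null &
done
wait
echo "Sent 8 concurrent requests."
```

Running this a few times before measuring also covers the JIT warm-up mentioned earlier, so later benchmark numbers reflect steady-state performance.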
diff --git a/docs/mddocs/Quickstart/webui_quickstart.md b/docs/mddocs/Quickstart/webui_quickstart.md index 3aab958928f..e5e486ee469 100644 --- a/docs/mddocs/Quickstart/webui_quickstart.md +++ b/docs/mddocs/Quickstart/webui_quickstart.md @@ -4,7 +4,7 @@ The [oobabooga/text-generation-webui](https://github.com/oobabooga/text-generati See the demo of running LLaMA2-7B on an Intel Core Ultra laptop below. - +[![Demo video](https://llm-assets.readthedocs.io/en/latest/_images/webui-mtl.png)](https://llm-assets.readthedocs.io/en/latest/_images/webui-mtl.mp4) ## Quickstart This quickstart guide walks you through setting up and using the [Text Generation WebUI](https://github.com/intel-analytics/text-generation-webui) with `ipex-llm`. @@ -18,13 +18,12 @@ A preview of the WebUI in action is shown below: ### 1 Install IPEX-LLM -To use the WebUI, first ensure that IPEX-LLM is installed. Follow the instructions on the [IPEX-LLM Installation Quickstart for Windows with Intel GPU](install_windows_gpu.html). +To use the WebUI, first ensure that IPEX-LLM is installed. Follow the instructions on the [IPEX-LLM Installation Quickstart for Windows with Intel GPU](./install_windows_gpu.md). **After the installation, you should have created a conda environment, named `llm` for instance, for running `ipex-llm` applications.** ### 2 Install the WebUI - #### Download the WebUI Download the `text-generation-webui` with IPEX-LLM integrations from [this link](https://github.com/intel-analytics/text-generation-webui/archive/refs/heads/ipex-llm.zip). Unzip the content into a directory, e.g.,`C:\text-generation-webui`. @@ -41,26 +40,21 @@ pip install -r requirements_cpu_only.txt pip install -r extensions/openai/requirements.txt ``` -```eval_rst -.. note:: - - `extensions/openai/requirements.txt` is for API service. If you don't need the API service, you can omit this command. -``` +> [!NOTE] +> `extensions/openai/requirements.txt` is for API service. If you don't need the API service, you can omit this command. ### 3 Start the WebUI Server #### Set Environment Variables Configure oneAPI variables by running the following command in **Miniforge Prompt**: -```eval_rst -.. note:: - - For more details about runtime configurations, refer to `this guide `_ -``` +> [!NOTE] +> For more details about runtime configurations, refer to [this guide](../Overview/install_gpu.md#runtime-configuration). ```cmd set SYCL_CACHE_PERSISTENT=1 ``` + If you're running on iGPU, set additional environment variables by running the following commands: ```cmd set BIGDL_LLM_XMX_DISABLED=1 @@ -70,31 +64,21 @@ set BIGDL_LLM_XMX_DISABLED=1 In **Miniforge Prompt** with the conda environment `llm` activated, navigate to the `text-generation-webui` folder and execute the following commands (You can optionally lanch the server with or without the API service): ##### without API service - ```cmd - python server.py --load-in-4bit - ``` -##### with API service - ``` - python server.py --load-in-4bit --api --api-port 5000 --listen - ``` -```eval_rst -.. note:: - - with ``--load-in-4bit`` option, the models will be optimized and run at 4-bit precision. For configuration for other formats and precisions, refer to `this link `_ +```cmd +python server.py --load-in-4bit ``` - -```eval_rst -.. note:: - - The API service allows user to access models using OpenAI-compatible API. 
For usage examples, refer to [this link](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples) +##### with API service +```cmd +python server.py --load-in-4bit --api --api-port 5000 --listen ``` +> [!TIP] +> With ``--load-in-4bit`` option, the models will be optimized and run at 4-bit precision. For configuration for other formats and precisions, refer to [this link](https://github.com/intel-analytics/text-generation-webui?tab=readme-ov-file#32-optimizations-for-other-percisions). -```eval_rst -.. note:: - - The API server will by default use port ``5000``. To change the port, use ``--api-port 1234`` in the command above. You can also specify using SSL or API Key in the command. Please see `this guide `_ for the full list of arguments. -``` +> [!NOTE] +> The API service allows user to access models using OpenAI-compatible API. For usage examples, refer to [this link](https://github.com/oobabooga/text-generation-webui/wiki/12-%E2%80%90-OpenAI-API#examples) +> [!NOTE] +> The API server will by default use port ``5000``. To change the port, use ``--api-port 1234`` in the command above. You can also specify using SSL or API Key in the command. Please see `this guide `_ for the full list of arguments. #### Access the WebUI Upon successful launch, URLs to access the WebUI will be displayed in the terminal as shown below. Open the provided local URL in your browser to interact with the WebUI. @@ -129,11 +113,8 @@ If everything goes well, you will get a message as shown below. -```eval_rst -.. note:: - - Model loading might take a few minutes as it includes a **warm-up** phase. This `warm-up` step is used to improve the speed of subsequent model uses. -``` +> [!NOTE] +> Model loading might take a few minutes as it includes a **warm-up** phase. This `warm-up` step is used to improve the speed of subsequent model uses. #### Chat with the Model