Enable Langchain on Intel GPU #16

Merged
merged 16 commits into from
Jun 4, 2024
250 changes: 250 additions & 0 deletions docs/docs/integrations/llms/ipex_llm_gpu.ipynb
@@ -0,0 +1,250 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# IPEX-LLM on Intel GPU\n",
"\n",
"> [IPEX-LLM](https://github.com/intel-analytics/ipex-llm) is a PyTorch library for running LLM on Intel CPU and GPU (e.g., local PC with iGPU, discrete GPU such as Arc, Flex and Max) with very low latency.\n",
"\n",
"This example goes over how to use LangChain to interact with `ipex-llm` for text generation.\n",
"\n",
"> **Note**\n",
">\n",
"> Running this Jupyter notebook directly is recommended only for Windows users with an Intel Arc A-Series GPU (except Intel Arc A300-Series or Pro A60). In other cases (e.g. Linux users, Intel iGPU), run the code as Python scripts in a terminal for the best experience.\n",
"\n",
"## Install Prerequisites\n",
"To benefit from IPEX-LLM on Intel GPUs, first install the required tools and prepare the environment.\n",
"\n",
"If you are a Windows user, visit the [Install IPEX-LLM on Windows with Intel GPU Guide](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html), and follow [Install Prerequisites](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_windows_gpu.html#install-prerequisites) to update the GPU driver (optional) and install Conda.\n",
"\n",
"If you are a Linux user, visit the [Install IPEX-LLM on Linux with Intel GPU](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html), and follow [**Install Prerequisites**](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/install_linux_gpu.html#install-prerequisites) to install the GPU driver, Intel® oneAPI Base Toolkit 2024.0, and Conda.\n",
"\n",
"## Setup\n",
"\n",
"After installing the prerequisites, you should have a conda environment with all of them in place. **Start the Jupyter service in this conda environment**, then install LangChain:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"%pip install -qU langchain langchain-community"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Install IPEX-LLM to run LLMs locally on an Intel GPU."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"%pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**\n",
">\n",
"> You can also use `https://pytorch-extension.intel.com/release-whl/stable/xpu/cn/` as the `--extra-index-url`.\n",
"\n",
"## Runtime Configuration\n",
"\n",
"For optimal performance, it is recommended to set several environment variables based on your device:\n",
"\n",
"### For Windows Users with Intel Core Ultra integrated GPU"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"SYCL_CACHE_PERSISTENT\"] = \"1\"\n",
"os.environ[\"BIGDL_LLM_XMX_DISABLED\"] = \"1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### For Windows Users with Intel Arc A-Series GPU"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {
"vscode": {
"languageId": "plaintext"
}
},
"outputs": [],
"source": [
"import os\n",
"\n",
"os.environ[\"SYCL_CACHE_PERSISTENT\"] = \"1\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"> **Note**\n",
">\n",
"> The first time each model runs on Intel iGPU, Intel Arc A300-Series, or Pro A60, it may take several minutes to compile.\n",
">\n",
"> For other GPU types, refer to the runtime configuration for [Windows users](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#runtime-configuration) and for [Linux users](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Overview/install_gpu.html#id5).\n",
"\n",
"\n",
"## Basic Usage\n",
"\n",
"Setting `device_map` to `\"xpu\"` when initializing `IpexLLM` places the LLM on the Intel GPU, where it benefits from IPEX-LLM optimizations:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import warnings\n",
"\n",
"from langchain_community.llms import IpexLLM\n",
"from langchain_core.prompts import PromptTemplate\n",
"\n",
"warnings.filterwarnings(\"ignore\", category=UserWarning, message=\".*padding_mask.*\")\n",
"template = \"USER: {question}\\nASSISTANT:\"\n",
"prompt = PromptTemplate(template=template, input_variables=[\"question\"])\n",
"\n",
"llm = IpexLLM.from_model_id(\n",
" model_id=\"lmsys/vicuna-7b-v1.5\",\n",
" model_kwargs={\"temperature\": 0, \"max_length\": 64, \"trust_remote_code\": True},\n",
" device_map=\"xpu\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use it in Chains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm_chain = prompt | llm\n",
"\n",
"question = \"What is AI?\"\n",
"output = llm_chain.invoke(question)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Save/Load Low-bit Model\n",
"Alternatively, you can save the low-bit model to disk once and reload it later with `from_model_id_low_bit` instead of `from_model_id`, even on a different machine. This is space-efficient, because the low-bit model takes significantly less disk space than the original model. `from_model_id_low_bit` is also faster and uses less memory than `from_model_id`, because it skips the model conversion step. As before, set `device_map` to `\"xpu\"` to load the LLM onto the Intel GPU."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"To save the low-bit model, use `save_low_bit` as follows."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"saved_lowbit_model_path = \"./vicuna-7b-1.5-low-bit\" # path to save low-bit model\n",
"llm.model.save_low_bit(saved_lowbit_model_path)\n",
"del llm"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Load the model from the saved low-bit model path as follows.\n",
"> Note that the saved low-bit model directory contains only the model itself, not the tokenizer. If you wish to have everything in one place, manually download or copy the tokenizer files from the original model's directory to the location where the low-bit model is saved."
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm_lowbit = IpexLLM.from_model_id_low_bit(\n",
" model_id=saved_lowbit_model_path,\n",
" tokenizer_id=\"lmsys/vicuna-7b-v1.5\",\n",
" # tokenizer_id=saved_lowbit_model_path, # copy the tokenizer files to the saved path if you want to use it this way\n",
" model_kwargs={\"temperature\": 0, \"max_length\": 64, \"trust_remote_code\": True},\n",
" device_map=\"xpu\",\n",
")"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"Use the loaded model in Chains:"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"llm_chain = prompt | llm_lowbit\n",
"\n",
"\n",
"question = \"What is AI?\"\n",
"output = llm_chain.invoke(question)"
]
}
],
"metadata": {
"language_info": {
"name": "python"
}
},
"nbformat": 4,
"nbformat_minor": 2
}
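
The notebook's "Runtime Configuration" cells set environment variables per device. For readers who follow the note's advice and run a plain Python script instead of the notebook, the two documented cases could be gathered into one helper. This is only a minimal sketch: the function name `apply_runtime_env` and the device labels used as dictionary keys are hypothetical, and only the variable names and values are taken from the notebook.

```python
import os

# Variable names and values come from the notebook's "Runtime Configuration"
# cells. The dictionary keys are hypothetical labels chosen for this sketch.
RUNTIME_ENV = {
    "windows-core-ultra-igpu": {
        "SYCL_CACHE_PERSISTENT": "1",
        "BIGDL_LLM_XMX_DISABLED": "1",
    },
    "windows-arc-a-series": {
        "SYCL_CACHE_PERSISTENT": "1",
    },
}


def apply_runtime_env(device: str) -> dict:
    """Set the documented variables for `device` in os.environ and return them."""
    try:
        settings = RUNTIME_ENV[device]
    except KeyError:
        # Other devices are covered by the linked IPEX-LLM guides, not here.
        raise ValueError(f"No documented settings for device: {device!r}") from None
    os.environ.update(settings)
    return dict(settings)
```

As in the notebook, the call would go at the top of the script, before `IpexLLM.from_model_id` creates the model.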
20 changes: 17 additions & 3 deletions libs/community/langchain_community/llms/ipex_llm.py
@@ -1,11 +1,11 @@
import logging
-from typing import Any, List, Mapping, Optional
+from typing import Any, List, Mapping, Optional, Literal

from langchain_core.callbacks import CallbackManagerForLLMRun
from langchain_core.language_models.llms import LLM
from langchain_core.pydantic_v1 import Extra

DEFAULT_MODEL_ID = "gpt2"

Check failure on line 8 in libs/community/langchain_community/llms/ipex_llm.py
GitHub Actions / cd libs/community / make lint #3.8 and #3.11
Ruff (I001)
langchain_community/llms/ipex_llm.py:1:1: I001 Import block is un-sorted or un-formatted


logger = logging.getLogger(__name__)
@@ -46,6 +46,7 @@
tokenizer_id: Optional[str] = None,
load_in_4bit: bool = True,
load_in_low_bit: Optional[str] = None,
device_map: Literal['cpu','xpu'] = 'cpu',
**kwargs: Any,
) -> LLM:
"""
@@ -75,6 +76,7 @@
low_bit_model=False,
load_in_4bit=load_in_4bit,
load_in_low_bit=load_in_low_bit,
device_map=device_map,
model_kwargs=model_kwargs,
kwargs=kwargs,
)
@@ -86,6 +88,7 @@
model_kwargs: Optional[dict] = None,
*,
tokenizer_id: Optional[str] = None,
device_map: Literal['cpu','xpu'] = 'cpu',
**kwargs: Any,
) -> LLM:
"""
@@ -109,6 +112,7 @@
low_bit_model=True,
load_in_4bit=False, # not used for low-bit model
load_in_low_bit=None, # not used for low-bit model
device_map=device_map,
model_kwargs=model_kwargs,
kwargs=kwargs,
)
@@ -121,6 +125,7 @@
load_in_4bit: bool = False,
load_in_low_bit: Optional[str] = None,
low_bit_model: bool = False,
device_map: Literal['cpu','xpu'] = "cpu",
model_kwargs: Optional[dict] = None,
kwargs: Optional[dict] = None,
) -> Any:
@@ -189,6 +194,15 @@
model_kwargs=_model_kwargs,
)

        # Set "cpu" as default device

        if device_map not in ["cpu", "xpu"]:
            raise ValueError(
                "IpexLLM currently only supports device to be "
                f"'cpu' or 'xpu', but you have: {device_map}."
            )
        model.to(device_map)

        return cls(
            model_id=model_id,
            model=model,
@@ -237,7 +251,7 @@
if self.streaming:
from transformers import TextStreamer

-input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
+input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.model.device)

Check failure on line 254 in libs/community/langchain_community/llms/ipex_llm.py
GitHub Actions / cd libs/community / make lint #3.8 and #3.11
Ruff (E501)
langchain_community/llms/ipex_llm.py:254:89: E501 Line too long (96 > 88)
streamer = TextStreamer(
self.tokenizer, skip_prompt=True, skip_special_tokens=True
)
@@ -263,7 +277,7 @@
text = self.tokenizer.decode(output[0], skip_special_tokens=True)
return text
else:
-input_ids = self.tokenizer.encode(prompt, return_tensors="pt")
+input_ids = self.tokenizer.encode(prompt, return_tensors="pt").to(self.model.device)

Check failure on line 280 in libs/community/langchain_community/llms/ipex_llm.py
GitHub Actions / cd libs/community / make lint #3.8 and #3.11
Ruff (E501)
langchain_community/llms/ipex_llm.py:280:89: E501 Line too long (96 > 88)
if stop is not None:
from transformers.generation.stopping_criteria import (
StoppingCriteriaList,
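
The `Literal['cpu','xpu']` annotation in the new signatures documents the allowed values, but Python does not enforce annotations at runtime, which is presumably why `_load_model` also checks the value explicitly. A minimal standalone sketch of that check follows. The alias `DeviceMap` and the function `check_device_map` are hypothetical names that are not part of this PR, and deriving the allowed values with `typing.get_args` is a variation on the hard-coded list in the diff.

```python
from typing import Literal, get_args

# Hypothetical alias; the PR writes the Literal inline in each signature.
DeviceMap = Literal["cpu", "xpu"]


def check_device_map(device_map: str) -> str:
    """Return device_map unchanged if it is supported, else raise ValueError."""
    # get_args(DeviceMap) yields the tuple of literal values: ("cpu", "xpu").
    if device_map not in get_args(DeviceMap):
        raise ValueError(
            "IpexLLM currently only supports device to be "
            f"'cpu' or 'xpu', but you have: {device_map}."
        )
    return device_map
```

Deriving the tuple from the alias keeps the annotation and the runtime check from drifting apart if another device is added later.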
20 changes: 20 additions & 0 deletions libs/community/tests/integration_tests/llms/test_ipex_llm.py
@@ -86,3 +86,23 @@ def test_save_load_lowbit(model_id: str) -> None:
    )
    output = loaded_llm.invoke("Hello!")
    assert isinstance(output, str)

@skip_if_no_model_ids
@pytest.mark.parametrize(
    "model_id",
    model_ids_to_test,
)
def test_load_generate_gpu(model_id: str) -> None:
    """Test valid call."""
    llm = IpexLLM.from_model_id(
        model_id=model_id,
        model_kwargs={
            "temperature": 0,
            "max_length": 16,
            "trust_remote_code": True,
        },
        device_map="xpu",
    )
    output = llm.generate(["Hello!"])
    assert isinstance(output, LLMResult)
    assert isinstance(output.generations, list)