Native Intel IPEX-LLM Support #7190
Comments
@iamhumanipromise As I understand this issue, you would like to use IPEX-LLM as a backend to support Intel GPUs. If so, would it be quicker than TensorFlow/PyTorch? Why not use TensorFlow/PyTorch directly?
IPEX-LLM already supports llama.cpp, I think: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html Also, PyTorch's IPEX and OpenXLA both use Intel oneAPI SYCL, which is what llama.cpp's SYCL backend uses. So it is already supported.
What IPEX-LLM has is a fork of llama.cpp and some other projects, with optimizations that have not been upstreamed here for one reason or another. I'm a current user of it, and typically it doubles the speed of upstream. However, it can't handle mixed GPU + CPU scenarios, which is the main issue, and new model support may take a while to filter over. Hence why I keep both upstream and the fork for my use cases.
The SYCL backend is still focusing on the missing functions needed to support more features and models.
Which issue/PR would you recommend we follow for the latest info about the SYCL branch, @NeoZhangJianyu? EDIT: I take it that [SYCL] Refactor would be it?
They seem to be keeping it reasonably up to date, as their published version of LlamaCPP-IPEX is using a week-old version as its baseline so far. I do hope they provide a bit of clarity about how to actually pull new versions of their IPEX branch, though. Also interesting about the lack of GPU overflow/partial offload capability; I was not aware of that.
Going to respond to this since the other comment, from another person at Intel, was deleted. I think it should be working, but for some reason it fails: it forces you to fully offload in the case of something like Llama 3 8B, or faults on an illegal instruction for something bigger like Llama 3 70B or Command-R, which had support added just recently from what I tested. I haven't upgraded in a while, so I'll probably recheck this before opening a ticket in the other repository to fix it, since upstream works but the fork doesn't in this situation.
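For context, "mixed GPU + CPU" here just means a partial offload, where only some of the model's layers go to the GPU via `-ngl` and the rest run on the CPU. A minimal sketch (the model path and layer count are placeholders, not the exact setup described above):

```bash
# Partial offload sketch: -ngl / --n-gpu-layers sets how many layers are offloaded
# to the GPU; the remaining layers are evaluated on the CPU.
# Model path and layer count are illustrative only.
./main -m ./models/llama-3-70b-instruct.Q4_K_M.gguf -c 4096 -ngl 40 -p "Hello"
```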
Do they have a fork of llama.cpp on GitHub? I actually haven't found it; I just installed from the readthedocs site that I linked to. Hell, I don't actually know how to go about updating the install; I just have a hypothesis about what I need to do.
Most of the stuff from IPEX-LLM has been upstreamed into llama.cpp. IPEX-LLM llama.cpp vs. upstream llama.cpp is basically the same perf at this point. I think the question shouldn't be about IPEX-LLM support, but about SYCL support using upstream llama.cpp (which the IPEX-LLM team is already upstreaming into llama.cpp). Also note that this doesn't require IPEX itself; IPEX-LLM does, but the native SYCL support does not. And yes, I work for Intel, and yes, I'm talking to the IPEX-LLM team and others :)
With a Q6_K quant of Llama 3 that had been quantized from a BF16 GGUF with the correct pre-tokenizer and EOS token, I get 30 tokens per second at the beginning of context with the IPEX branch, compared to 17 tokens per second with the llama.cpp SYCL version b2885. That's quite a stark difference in performance as I see it, and if it's possible, it'd be awesome to see the performance of the IPEX branch become generally available from the standard SYCL branch of llama.cpp, as installing the IPEX branch was troublesome. So I'll be waiting with bated breath, I guess.
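For anyone trying to reproduce this kind of comparison, a rough sketch is to run the same llama-bench invocation against both builds and compare the reported t/s (the model path and parameters below are illustrative, not the exact ones behind the numbers above):

```bash
# Illustrative benchmark run; repeat once with the upstream SYCL build and once
# with the IPEX-LLM fork's binaries, keeping the model and settings identical.
./llama-bench -m ./models/llama-3-8b-instruct.Q6_K.gguf -ngl 99 -p 512 -n 128
```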
Yeah, that's fair. It definitely depends on model size etc. Will work with the team to try to upstream as soon as we can.
I suggest using the latest code in the master branch.
Can also attest to differences between the SYCL build (as outlined in https://github.com/ggerganov/llama.cpp/blob/master/README-sycl.md) and the IPEX-LLM branch. Intel Arc A770M, Llama 3 8B Q8_0, full offload with the prompt
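For reference, the upstream SYCL build referenced above looked roughly like this on Linux at the time (a sketch based on the linked README; the exact CMake option names may have changed in newer versions):

```bash
# Sketch of the Linux SYCL build per README-sycl.md of that era; option names
# (e.g. LLAMA_SYCL) may differ in later revisions of the repository.
source /opt/intel/oneapi/setvars.sh
cmake -B build -DLLAMA_SYCL=ON -DCMAKE_C_COMPILER=icx -DCMAKE_CXX_COMPILER=icpx
cmake --build build --config Release -j
```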
This issue was closed because it has been inactive for 14 days since being marked as stale.
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Feature Description
I have found this closed issue where someone manually (how?) implemented IPEX-LLM. However, I'm looking forward to native IPEX-LLM support for Intel Xe iGPUs and Intel Arc dGPUs on Windows and Linux:
#7042
TL;DR: IPEX-LLM now provides a C++ interface, which can be used as a backend for running llama.cpp on Intel GPUs. Incorporating this interface into llama.cpp would allow it to leverage the optimized performance of IPEX-LLM.
Motivation
Intel Xe graphics launched in 2020, and the Flex and Max datacenter cards and Arc consumer cards for laptop and desktop launched in 2022. That is a lot of devices in production/circulation.
This would "permit" llama.cpp users to utilize their integrated Xe GPUs, dedicated Arc GPUs, and datacenter Flex and Max cards with llama.cpp on BOTH Windows and Linux natively (without a confusing manual build).
Possible Implementation
The implementation of native Intel IPEX-LLM support would be something like... Integrate --> Test --> Document --> Release.
Full manual/guide: https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/llama_cpp_quickstart.html
Full verified model list: https://ipex-llm.readthedocs.io/en/latest/#verified-models
Github: https://github.com/intel-analytics/ipex-llm
The "owners" of this process will be the devs and engineers here; in this Github (simple nerds such as myself do not have the expertise to tackle something like this... even locally)
For example, from the documentation it looks like this would be: create a new conda environment --> set up the environment --> configure oneAPI variables --> update CMakeLists.txt or the Makefile with paths to the IPEX-LLM library and headers --> then map llama.cpp functionality to IPEX APIs (which Intel has already done).
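As a rough sketch of what the user-facing side of those steps looks like today, loosely following the linked IPEX-LLM quickstart (package and command names are taken from that guide and may change, so treat this as illustrative rather than authoritative):

```bash
# Loose sketch of the IPEX-LLM llama.cpp setup per the linked quickstart;
# names and flags here are illustrative, not a definitive recipe.
conda create -n llm-cpp python=3.11
conda activate llm-cpp
pip install --pre --upgrade "ipex-llm[cpp]"

# Populate a working directory with the IPEX-LLM build of the llama.cpp binaries.
mkdir llama-cpp && cd llama-cpp
init-llama-cpp

# Configure the oneAPI environment, then run as usual on the Intel GPU.
source /opt/intel/oneapi/setvars.sh
./main -m ./models/model.gguf -ngl 33 -p "Hello"
```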
The "owners" of this step would be wide-ranging overall.
Documentation and Examples: Someone would have to "own" updating the documentation to guide users on how to enable and use the new IPEX-LLM support. Providing examples and quickstart guides would help significantly, but ultimately it will be up to independent users, and GUI and TUI/CLI frontends will need to update their own documentation.
Release: after all of this has been done, move forward to launch. Woot woot.
I'm sure there are many, many steps I am missing here. Just wanted to "kick off" the process.