Update pipeline parallel serving for more model support #11428

plusbang · 2024-06-25T10:24:12Z

Description

As a follow-up to #11319:

Move ModelRunner (the implementation of pipeline parallel multi-stage serving) into the library source code
Update ModelRunner.load_model to support different models
Align ModelRunner.model_step with pipeline_parallel_generate in how the model forward pass and past key values are handled
Small fixes to the serving example (remove an unused function, add the license header)
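
To illustrate the forward-pass alignment mentioned above: in a pipeline parallel setup, the first stage typically consumes token IDs while later stages consume the hidden states produced by the previous stage. The sketch below shows that selection under stated assumptions. `select_stage_inputs` and its `rank == 0` condition are hypothetical and are not taken from the ipex-llm source; only the `input_ids = None` / `inputs_embeds = input` pairing appears in the reviewed diff.

```python
from typing import Any, Dict, Optional


def select_stage_inputs(rank: int, stage_input: Any) -> Dict[str, Optional[Any]]:
    """Pick forward() keyword arguments for one pipeline stage (hypothetical helper)."""
    if rank == 0:
        # Assumption: the first stage owns the embedding layer and receives token IDs.
        return {"input_ids": stage_input, "inputs_embeds": None}
    # Later stages receive hidden states from the previous stage,
    # so token IDs are unset and the tensor is passed as embeddings.
    return {"input_ids": None, "inputs_embeds": stage_input}
```

The returned dictionary could then be unpacked into the model call, for example `model(**select_stage_inputs(rank, x))`, assuming the model accepts both keyword arguments.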

1. Why the change?

Align PP serving with PP inference, so that models already verified for PP inference can be used directly in serving.

2. User API changes

from ipex_llm.transformers import ModelRunner

4. How to test?

Unit test
Local test (run llama2-7b, baichuan2-13b, and chatglm3-6b with the PP serving example)

xiangyuT · 2024-06-26T05:59:06Z

python/llm/src/ipex_llm/transformers/pipeline_parallel.py

+            input_ids = None
+            inputs_embeds = input
+
+        output = self.model(input_ids=input_ids,


Need to pass attention_mask = make_attention_mask(cur_batch.prompt_lengths).to(input.device) here for batch serving.

Have updated :)
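
For readers unfamiliar with why batch serving needs the mask: prompts of different lengths are padded to a common length, and the attention mask tells the model which positions are padding. The sketch below is a minimal, dependency-free illustration. `build_padding_mask` is a hypothetical stand-in and not the repository's `make_attention_mask`. The left-padding layout is also an assumption, and a real implementation would return a tensor moved to the input's device as the review comment suggests.

```python
from typing import List


def build_padding_mask(prompt_lengths: List[int]) -> List[List[int]]:
    """Return one 0/1 mask row per prompt (hypothetical helper).

    Assumption: prompts are left-padded, so padding zeros come first
    and the real tokens (ones) sit at the end of each row.
    """
    max_len = max(prompt_lengths)
    return [[0] * (max_len - n) + [1] * n for n in prompt_lengths]


# Two prompts of length 2 and 4, padded to a common length of 4.
mask = build_padding_mask([2, 4])
print(mask)  # [[0, 0, 1, 1], [1, 1, 1, 1]]
```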

xiangyuT

LGTM


plusbang requested review from xiangyuT and sgwhat June 25, 2024 10:25

update

7fea5dc

xiangyuT reviewed Jun 26, 2024


hzjane mentioned this pull request Jun 27, 2024

Add pp_serving example to serving image #11433

Merged

7 tasks

plusbang requested a review from hzjane June 27, 2024 06:20

xiangyuT approved these changes Jun 27, 2024


plusbang merged commit 987017e into intel-analytics:main Jun 27, 2024
31 checks passed

plusbang added 2 commits June 27, 2024 22:01

fix attention mask and mem usage

0c72348

fix

130606b

RyuKosei pushed a commit to RyuKosei/ipex-llm that referenced this pull request Jul 19, 2024

Update pipeline parallel serving for more model support (intel-analyt…

e5b0697

…ics#11428)
