
Optimizations for Pipeline Parallel Serving #11702

Merged: 6 commits into intel:main on Aug 2, 2024

Conversation

@xiangyuT (Contributor) commented on Aug 1, 2024

3. Summary of the change

Optimize the stream_output() method in Pipeline Parallel Serving.
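For context, a minimal sketch of what an async stream_output() loop can look like, assuming each request drains an asyncio.Queue of decoded text deltas and a None sentinel marks the end of generation (the queue and attribute names here are illustrative, not the PR's actual ones):

import asyncio

class StreamerSketch:
    def __init__(self):
        # One queue of decoded text deltas per in-flight request.
        self.token_queues: dict[str, asyncio.Queue] = {}

    async def stream_output(self, request_id: str):
        queue = self.token_queues[request_id]
        while True:
            delta = await queue.get()
            if delta is None:  # sentinel: generation for this request finished
                break
            yield delta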

4. How to test?

@@ -179,6 +180,11 @@ def pipeline_parallel(model, pipeline_parallel_stages, torch_dtype=torch.float32
layer_start = slice_size * local_rank
layer_end = layer_start + min(slice_size, num_layers - layer_start)

# if local_rank == 0:
Review comment (Contributor): Remove these lines if they are not necessary.
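The hunk above splits the model's layers across pipeline stages. A standalone sketch of that arithmetic, assuming slice_size is computed as ceil(num_layers / num_stages) as the surrounding hunk suggests:

import math

def stage_bounds(num_layers: int, num_stages: int, local_rank: int):
    """Return the [start, end) layer range owned by this pipeline stage."""
    slice_size = math.ceil(num_layers / num_stages)
    layer_start = slice_size * local_rank
    layer_end = layer_start + min(slice_size, num_layers - layer_start)
    return layer_start, layer_end

# e.g. 32 layers over 4 stages -> (0, 8), (8, 16), (16, 24), (24, 32)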

@@ -738,14 +744,73 @@ def clear_batch(self, cur_id):
self.is_finish.pop(cur_id, None)
self.partial_output_dict.pop(cur_id, None)

async def finish_stream_output(self, cur_id):
Review comment (Contributor): A `wait_xxx` name might be better.
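The diff only shows the new method's signature. As a sketch of what such a finish step might do (the body below is illustrative; only the name finish_stream_output and the clear_batch bookkeeping come from the diff), it could wait until the request is marked finished before releasing its state:

import asyncio

async def finish_stream_output(self, cur_id):
    # Method sketch for the serving class: yield to the event loop
    # until this batch is flagged done, then drop its bookkeeping.
    while not self.is_finish.get(cur_id, False):
        await asyncio.sleep(0)
    self.clear_batch(cur_id)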

torch.xpu.synchronize(self.device)
self.send_buff.wait()
if output is not None:
self.send_buff = dist.isend(output, dst=self.next_rank)
Review comment (Contributor):

# remain = 0
# self.is_finish[request_id] = True
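The snippet above keeps at most one in-flight send per rank: the previous isend must complete before its buffer is reused. A minimal sketch of that pattern, assuming torch.distributed is initialized and next_rank is this stage's successor (class and method names are illustrative):

import torch
import torch.distributed as dist

class PipeSenderSketch:
    def __init__(self, device, next_rank):
        self.device = device
        self.next_rank = next_rank
        self.send_buff = None  # handle of the in-flight isend, if any

    def send(self, output):
        if self.send_buff is not None:
            torch.xpu.synchronize(self.device)  # flush pending XPU kernels
            self.send_buff.wait()               # prior transfer must finish
        if output is not None:
            # Non-blocking send lets the next forward pass overlap the transfer.
            self.send_buff = dist.isend(output, dst=self.next_rank)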

text = cur_text[cached_index]
Review comment (Contributor): `text` is not a good name.

@qiyuangong changed the title from "[WIP] Optimizations for Pipeline Parallel Serving" to "Optimizations for Pipeline Parallel Serving" on Aug 2, 2024.

text = tokenizer.decode(self.token_cache[request_id])

if text.endswith("\n"):
Review comment (Contributor): This block is text-related; maybe we could move it into a function.
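One way to factor it out, as a sketch assuming token_cache holds the not-yet-emitted token ids per request (the helper name is hypothetical):

def _pop_printable_text(self, tokenizer, request_id):
    """Decode cached tokens; emit them only once the chunk is stable."""
    text = tokenizer.decode(self.token_cache[request_id])
    if text.endswith("\n"):
        # A newline ends the chunk: flush it and reset the cache.
        self.token_cache[request_id] = []
        return text
    # Otherwise hold the text back; it may still change as tokens arrive.
    return None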

@xiangyuT marked this pull request as ready for review on Aug 2, 2024, 03:11.
@qiyuangong (Contributor) left a review comment: LGTM

@xiangyuT merged commit 1baa3ef into intel:main on Aug 2, 2024. 1 check passed.