
Optimizations for Pipeline Parallel Serving #11702

Merged: 6 commits into intel:main on Aug 2, 2024

Conversation

@xiangyuT (Contributor) commented on Aug 1, 2024

3. Summary of the change

Optimize the stream_output() method in Pipeline Parallel Serving.
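For context, a minimal sketch of what an async stream_output() loop can look like, assuming each request drains an asyncio.Queue of decoded text deltas and a None sentinel marks the end of generation (the queue and attribute names here are illustrative, not the PR's actual ones):

import asyncio

class StreamerSketch:
    def __init__(self):
        # One queue of decoded text deltas per in-flight request.
        self.token_queues: dict[str, asyncio.Queue] = {}

    async def stream_output(self, request_id: str):
        queue = self.token_queues[request_id]
        while True:
            delta = await queue.get()
            if delta is None:  # sentinel: generation for this request finished
                break
            yield delta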

4. How to test?

@@ -179,6 +180,11 @@ def pipeline_parallel(model, pipeline_parallel_stages, torch_dtype=torch.float32
layer_start = slice_size * local_rank
layer_end = layer_start + min(slice_size, num_layers - layer_start)

# if local_rank == 0:
Review comment (Contributor): Remove these lines if they are not necessary.
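The hunk above splits the model's layers across pipeline stages. A standalone sketch of that arithmetic, assuming slice_size is computed as ceil(num_layers / num_stages) as the surrounding hunk suggests:

import math

def stage_bounds(num_layers: int, num_stages: int, local_rank: int):
    """Return the [start, end) layer range owned by this pipeline stage."""
    slice_size = math.ceil(num_layers / num_stages)
    layer_start = slice_size * local_rank
    layer_end = layer_start + min(slice_size, num_layers - layer_start)
    return layer_start, layer_end

# e.g. 32 layers over 4 stages -> (0, 8), (8, 16), (16, 24), (24, 32)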

@@ -738,14 +744,73 @@ def clear_batch(self, cur_id):
self.is_finish.pop(cur_id, None)
self.partial_output_dict.pop(cur_id, None)

async def finish_stream_output(self, cur_id):
Review comment (Contributor): A `wait_xxx` name might be better.
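The diff only shows the new method's signature. As a sketch of what such a finish step might do (the body below is illustrative; only the name finish_stream_output and the clear_batch bookkeeping come from the diff), it could wait until the request is marked finished before releasing its state:

import asyncio

async def finish_stream_output(self, cur_id):
    # Method sketch for the serving class: yield to the event loop
    # until this batch is flagged done, then drop its bookkeeping.
    while not self.is_finish.get(cur_id, False):
        await asyncio.sleep(0)
    self.clear_batch(cur_id)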

torch.xpu.synchronize(self.device)
self.send_buff.wait()
if output is not None:
self.send_buff = dist.isend(output, dst=self.next_rank)
Review comment (Contributor):

# remain = 0
# self.is_finish[request_id] = True
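The snippet above keeps at most one in-flight send per rank: the previous isend must complete before its buffer is reused. A minimal sketch of that pattern, assuming torch.distributed is initialized and next_rank is this stage's successor (class and method names are illustrative):

import torch
import torch.distributed as dist

class PipeSenderSketch:
    def __init__(self, device, next_rank):
        self.device = device
        self.next_rank = next_rank
        self.send_buff = None  # handle of the in-flight isend, if any

    def send(self, output):
        if self.send_buff is not None:
            torch.xpu.synchronize(self.device)  # flush pending XPU kernels
            self.send_buff.wait()               # prior transfer must finish
        if output is not None:
            # Non-blocking send lets the next forward pass overlap the transfer.
            self.send_buff = dist.isend(output, dst=self.next_rank)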

text = cur_text[cached_index]
Review comment (Contributor): `text` is not a good name.

@qiyuangong changed the title from "[WIP] Optimizations for Pipeline Parallel Serving" to "Optimizations for Pipeline Parallel Serving" on Aug 2, 2024.

text = tokenizer.decode(self.token_cache[request_id])

if text.endswith("\n"):
Review comment (Contributor): This block is text-related; maybe we could move it into a function.
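One way to factor it out, as a sketch assuming token_cache holds the not-yet-emitted token ids per request (the helper name is hypothetical):

def _pop_printable_text(self, tokenizer, request_id):
    """Decode cached tokens; emit them only once the chunk is stable."""
    text = tokenizer.decode(self.token_cache[request_id])
    if text.endswith("\n"):
        # A newline ends the chunk: flush it and reset the cache.
        self.token_cache[request_id] = []
        return text
    # Otherwise hold the text back; it may still change as tokens arrive.
    return None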

@xiangyuT marked this pull request as ready for review on Aug 2, 2024, 03:11.
@qiyuangong (Contributor) left a review comment: LGTM

@xiangyuT merged commit 1baa3ef into intel:main on Aug 2, 2024. 1 check passed.