
Saving results from each process and aggregating distributed output #92

Closed
IlyasMoutawwakil opened this issue Dec 2, 2023 · 1 comment

@IlyasMoutawwakil
Member

With distributed inference/training (torchrun launcher), results are currently saved only on rank 0. This can be wrong, misleading, or ambiguous, especially when processes are not necessarily synchronized (DP/TP inference) and one process/device might be affected by comms. A better approach would be for each process to send its results to the launcher, which then merges them appropriately (sum throughputs, average latencies?). This would also remove the world_size logic from the benchmarks.
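A minimal sketch of what launcher-side aggregation could look like, assuming a hypothetical per-rank `ProcessResult` container and an `aggregate()` helper (neither is part of the actual codebase): throughputs add up across data-parallel ranks, while latencies are averaged.

```python
from dataclasses import dataclass
from statistics import mean
from typing import List

# Hypothetical per-process measurement container; field names are assumptions,
# not the actual report schema used by the benchmarks.
@dataclass
class ProcessResult:
    throughput_samples_per_s: float  # samples processed per second on this rank
    latency_s: float                 # mean per-step latency observed on this rank

def aggregate(results: List[ProcessResult]) -> dict:
    """Merge per-rank results in the launcher: throughputs are summed,
    latencies are averaged."""
    return {
        "throughput_samples_per_s": sum(r.throughput_samples_per_s for r in results),
        "latency_s": mean(r.latency_s for r in results),
    }

# Example: two ranks report their own numbers and the launcher merges them,
# so no world_size logic is needed inside the benchmark itself.
per_rank = [ProcessResult(120.0, 0.033), ProcessResult(118.5, 0.034)]
print(aggregate(per_rank))
```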

@IlyasMoutawwakil IlyasMoutawwakil self-assigned this Dec 7, 2023
@IlyasMoutawwakil IlyasMoutawwakil added the enhancement New feature or request label Dec 7, 2023
@IlyasMoutawwakil IlyasMoutawwakil changed the title Saving results from each process and an aggregated output Saving results from each process and aggregating distributed output Jan 12, 2024
@IlyasMoutawwakil
Member Author

done in #122
