Fixed gather method to work with distributed training in metrics.py and loss.py #450
Conversation
Hi @mstapelberg! Not sure if this is a better way of implementing it, though. @Linux-cpp-lisp will know!
Hi @kavanase, thanks for your reply and help! I'll give that a go; it will likely be a much better implementation than what I hacked together. Will try the new branch for pytorch_runstats now.
Hi @kavanase, I gave the updated pytorch_runstats a go; it works (I've updated my initial pull request). However, when I try it on a more realistic problem, the following error happens (full log here -
I'm fairly new to GitHub, so would it be best to open another issue for this? Or is this something you have seen in the past as well? Thanks!
Hmm ok, I haven't seen that error yet with my tests for the [...]. Maybe worth trying that with the un-edited [...]?
Description
In the DDP branch, the gather method uses a .get_state() method that does not exist in the torch_runstats package. I implemented a simple workaround that directly accesses the state of the running statistics (_state) and the number of samples (_n).
In the metrics.py version of gather, I also explicitly made sure the tensors are on the same device prior to accumulating the state.
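For reference, here is a minimal sketch of the kind of workaround described above (not the actual PR diff). It assumes a torch_runstats RunningStats-like object whose `_state` holds the current per-bin mean and `_n` the corresponding sample counts; only the attribute names `_state` and `_n` come from the PR text, while the shapes, the count-weighted averaging, and the function name `gather_running_stats` are illustrative.

```python
# Hedged sketch of the workaround described above; not the exact PR code.
# Assumes `_state` stores a per-bin running mean and `_n` the per-bin
# sample counts, with shapes that broadcast against each other.
import torch
import torch.distributed as dist


def gather_running_stats(running_stats, device: torch.device):
    """All-reduce a RunningStats-like object so every rank holds the global statistics."""
    # Move both private buffers onto the same (communication) device,
    # mirroring the explicit device move added in metrics.py.
    state = running_stats._state.to(device)
    n = running_stats._n.to(device)

    # Count-weighted average of the per-rank states: weight each rank's
    # mean by its sample count, sum across ranks, then renormalize.
    weighted = state * n
    dist.all_reduce(weighted, op=dist.ReduceOp.SUM)
    dist.all_reduce(n, op=dist.ReduceOp.SUM)

    running_stats._state = weighted / n.clamp(min=1)
    running_stats._n = n
    return running_stats
```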
Motivation and Context
Resolves: #449
How Has This Been Tested?
I tested my changes by comparing the output metrics from minimal.yaml and minimal_distributed.yaml.
aspirin_pr_test.txt
distributed_aspirin_pr_test.txt
Training time with 2 GPUs (Quadro RTX6000) was 248 seconds with minimal_distributed.txt.
Training time with 1 GPU (Quadro RTX6000) was 566 seconds with minimal.txt.
I tried another training attempt with 1 GPU: aspirin_pr_test_t2.txt.
Using the state-run runstats branch: minimal_distributed_fixed_runstats.txt.
The configs were set up such that distributed_batch_size / world_size = batch_size. I'm not sure what happened between the first and second single-GPU tests, but perhaps I should repeat the trainings a few times to compare the loss and wall times.
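For concreteness, the batch-size relationship above can be written as a quick sanity check; the numbers below are hypothetical, not taken from minimal.yaml or minimal_distributed.yaml.

```python
# Illustrative sanity check of distributed_batch_size / world_size == batch_size;
# the concrete values are made up for the example.
world_size = 2                                     # number of GPUs / DDP ranks
batch_size = 5                                     # per-rank batch size (hypothetical)
distributed_batch_size = batch_size * world_size   # global batch size for the DDP run

assert distributed_batch_size / world_size == batch_size
```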
Types of changes
Checklist:
- My code has been formatted with black.
- The option documentation (docs/options) has been updated with new or changed options.
- CHANGELOG.md has been updated.