adding pytorch DDP script of detection task #1777
Conversation
Hi @sarjil77 👋,
Thanks for your PR, I left some comments here :D
Before you push your code, please run `make style` to fix the formatting 👍
Once all changes are applied, it would be great if you could also update the corresponding README.
File: https://github.com/mindee/doctr/blob/main/references/detection/README.md
Content to add:
## Multi-GPU support (PyTorch only - Experimental)

Multi-GPU support on the detection task with PyTorch has been added. It will probably be added for other tasks as well.
Arguments are the same as the ones for single GPU, except:

- `--devices`: by default, if you do not pass `--devices`, it will use all GPUs on your computer. You can use specific GPUs by passing a list of ids (e.g. `0 1 2`). To find them, you can use the following snippet:

```python
import torch

devices = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
device_names = [torch.cuda.get_device_name(d) for d in devices]
```

- `--backend`: you can specify another `backend` for `DistributedDataParallel` if the default one is not available on your operating system. The fastest one is `nccl`, according to the PyTorch documentation.

```shell
python references/detection/train_pytorch_ddp.py db_resnet50 --train_path path/to/your/train_set --val_path path/to/your/val_set --epochs 5 --devices 0 1 --backend nccl
```

Same as you can find here:
https://github.com/mindee/doctr/blob/main/references/recognition/README.md
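For readers new to DDP, the wiring behind a `train_pytorch_ddp.py` entry point typically looks like the sketch below. This is a generic illustration, not the actual doctr script: `main_worker` and `per_device_batch_size` are hypothetical names, and the backend defaults to `gloo` so the sketch also makes sense on CPU-only machines.

```python
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def per_device_batch_size(global_batch_size: int, world_size: int) -> int:
    # Hypothetical helper: each DDP process sees an equal share of the batch.
    if global_batch_size % world_size != 0:
        raise ValueError("global batch size must be divisible by world_size")
    return global_batch_size // world_size


def main_worker(rank: int, world_size: int, backend: str = "gloo") -> None:
    # One process per device: `rank` identifies this process within the group.
    os.environ.setdefault("MASTER_ADDR", "localhost")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group(backend, rank=rank, world_size=world_size)
    # model = torch.nn.parallel.DistributedDataParallel(model, device_ids=[rank])
    # ... build the dataloader with a DistributedSampler and run training ...
    dist.destroy_process_group()


# Typical launch, one process per visible GPU:
# world_size = torch.cuda.device_count()
# mp.spawn(main_worker, args=(world_size, "nccl"), nprocs=world_size)
```

The launch lines are left commented because they only make sense on a multi-GPU host; `mp.spawn` forks one Python process per device and passes each its `rank` as the first argument.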
```python
import datetime
import hashlib
import logging
import multiprocessing
```
```python
import multiprocessing as mp
```
That's required for the worker count if it is not passed directly.
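The reason the import matters: a common pattern is to default the dataloader's worker count to the machine's CPU count when `--workers` is not passed on the CLI. A minimal sketch (`default_num_workers` is a hypothetical helper; it uses the stdlib module, while the script itself later uses `torch.multiprocessing as mp`, which re-exports the same API):

```python
import multiprocessing as mp


def default_num_workers(requested=None):
    # Fall back to the machine's CPU count when --workers is not passed.
    return requested if requested is not None else mp.cpu_count()
```

It would then be used as something like `num_workers=default_num_workers(args.workers)` when building the `DataLoader`.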
""" | ||
rank(int) : it is the unique identifier to each process and you can also say that it is your device id | ||
world_size(int) : total number of processes |
->
```python
"""
Args:
----
    rank (int): device id to put the model on
    world_size (int): number of processes participating in the job
    args: other arguments passed through the CLI
"""
```
All the changes you suggested are done, along with the updated README file as well.
Hi @sarjil77 👋
Thanks for the updates, only a few small things left and we are good to merge 👍
After the changes are applied, please run:
```shell
make style
```
or
```shell
pip3 install ruff
ruff format .
ruff check --fix .
```
```python
import time
import numpy as np
import torch
import wandb
```
Missing import:
```python
import multiprocessing as mp
```
""" | ||
Args: | ||
---- |
Please remove the `----`:
```python
"""
Args:
    rank (int): device id to put the model on
    world_size (int): number of processes participating in the job
    args: other arguments passed through the CLI
"""
```
We changed this in the meanwhile :)
```python
with open(os.path.join(args.train_path, "labels.json"), "rb") as f:
    train_hash = hashlib.sha256(f.read()).hexdigest()
```

```python
if args.show_samples:
```
@sarjil77 the `if rank == 0` check is missing here :)
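The point of this comment: in DDP every process executes the full script, so side effects like displaying samples or logging should be guarded so that only the rank-0 process performs them. A minimal sketch (`should_show_samples` is a hypothetical helper, not part of the doctr script):

```python
def should_show_samples(rank: int, show_samples: bool) -> bool:
    # Only the rank-0 process should display samples or write logs;
    # otherwise every DDP worker would repeat the same side effect.
    return show_samples and rank == 0


# In the training script this becomes:
# if rank == 0 and args.show_samples:
#     plot_samples(x, target)
```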
I have made all the changes as required :). I am not adding `import multiprocessing as mp`, because `mp` is already defined via `import torch.multiprocessing as mp`; I have taken this reference from the DDP script of the recognition task, and I think it is good to merge now. :)
Codecov Report
All modified and coverable lines are covered by tests ✅
Additional details and impacted files:
```
@@            Coverage Diff             @@
##             main    #1777      +/-   ##
==========================================
- Coverage   96.56%   96.55%   -0.02%
==========================================
  Files         164      165       +1
  Lines        7895     7901       +6
==========================================
+ Hits         7624     7629       +5
- Misses        271      272       +1
```
Flags with carried forward coverage won't be shown. View full report in Codecov by Sentry.
Hi @sarjil77 👋,
Thanks for the updates 👍
Two small things left, but don't worry, I can do this. Thanks for the PR :)
In this script, I have implemented DDP training for the detection task.