Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

adding pytorch DDP script of detection task #1777

Merged
merged 3 commits into from
Nov 22, 2024

Conversation

sarjil77
Copy link
Contributor

in this script i have implemented the DDP training for detection task.

Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sarjil77 👋,

Thanks for your PR i left some comments here :D

Before you push your code please run make style to fix the formatting 👍

If all changes are applied it would be create if you could also update the corresponding readme please ?

File: https://github.com/mindee/doctr/blob/main/references/detection/README.md

content to add:

Multi-GPU support (PyTorch only - Experimental)

Multi-GPU support on recognition task with PyTorch has been added. It'll be probably added for other tasks.
Arguments are the same than the ones from single GPU, except:

  • --devices: by default, if you do not pass --devices, it will use all GPUs on your computer.
    You can use specific GPUs by passing a list of ids (ex: 0 1 2). To find them, you can use the following snippet:
import torch
devices = [torch.cuda.device(i) for i in range(torch.cuda.device_count())]
device_names = [torch.cuda.get_device_name(d) for d in devices]
  • --backend: you can specify another backend for DistribuedDataParallel if the default one is not available on
    your operating system. Fastest one is nccl according to PyTorch Documentation.
python references/detection/train_pytorch_ddp.py db_resnet50 --train_path path/to/your/train_set --val_path path/to/your/val_set --epochs 5 --devices 0 1 --backend nccl

Same as you can find here:

https://github.com/mindee/doctr/blob/main/references/recognition/README.md

import datetime
import hashlib
import logging
import multiprocessing
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

import multiprocessing as mp

That's required for the worker count if not directly passed

references/detection/train_pytorch_DDP.py Outdated Show resolved Hide resolved

"""
rank(int) : it is the unique identifier to each process and you can also say that it is your device id
world_size(int) : total number of processes
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

->

"""
Args:
----
    rank (int): device id to put the model on
    world_size (int): number of processes participating in the job
    args: other arguments passed through the CLI
"""

references/detection/train_pytorch_DDP.py Outdated Show resolved Hide resolved
references/detection/train_pytorch_DDP.py Outdated Show resolved Hide resolved
references/detection/train_pytorch_DDP.py Outdated Show resolved Hide resolved
references/detection/train_pytorch_DDP.py Outdated Show resolved Hide resolved
references/detection/train_pytorch_DDP.py Show resolved Hide resolved
references/detection/train_pytorch_DDP.py Outdated Show resolved Hide resolved
references/detection/train_pytorch_DDP.py Outdated Show resolved Hide resolved
@felixdittrich92 felixdittrich92 self-assigned this Nov 15, 2024
@felixdittrich92 felixdittrich92 added this to the 0.11.0 milestone Nov 15, 2024
@felixdittrich92 felixdittrich92 added topic: documentation Improvements or additions to documentation type: enhancement Improvement ext: references Related to references folder framework: pytorch Related to PyTorch backend topic: text detection Related to the task of text detection labels Nov 15, 2024
@sarjil77
Copy link
Contributor Author

all the changes as you suggested are done along with its updated README file as well

Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sarjil77 👋

Thanks for the updates only a few small things left and we are good to merge 👍

After changes are applied please run:

make style

or

pip3 install ruff
ruff format .
ruff check --fix .

import time
import numpy as np
import torch
import wandb
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

missing import

import multiprocessing as mp


"""
Args:
----
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove the

----

-->

"""
Args:
    rank (int): device id to put the model on
    world_size (int): number of processes participating in the job
    args: other arguments passed through the CLI
"""

We changed this in the meanwhile :)

with open(os.path.join(args.train_path, "labels.json"), "rb") as f:
train_hash = hashlib.sha256(f.read()).hexdigest()

if args.show_samples:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@sarjil77 if rank == 0 missing :)

@sarjil77
Copy link
Contributor Author

i have made all the changes a s required :), and i am not adding "import multiprocessing as mp" as mp bacause mp is already defined using "pytorch.multiprocessing as mp", i have taken this reference from DDP script of Recognition task and i think it is good to merge now. :)

Copy link

codecov bot commented Nov 21, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 96.55%. Comparing base (83f1bc5) to head (a522f0e).
Report is 10 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1777      +/-   ##
==========================================
- Coverage   96.56%   96.55%   -0.02%     
==========================================
  Files         164      165       +1     
  Lines        7895     7901       +6     
==========================================
+ Hits         7624     7629       +5     
- Misses        271      272       +1     
Flag Coverage Δ
unittests 96.55% <ø> (-0.02%) ⬇️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.


🚨 Try these New Features:

Copy link
Contributor

@felixdittrich92 felixdittrich92 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hi @sarjil77 👋,

Thanks for the updates 👍

2 small things left but don't worry i can do this thanks for the PR :)

@felixdittrich92 felixdittrich92 merged commit c3ec3cb into mindee:main Nov 22, 2024
66 of 70 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
ext: references Related to references folder framework: pytorch Related to PyTorch backend topic: documentation Improvements or additions to documentation topic: text detection Related to the task of text detection type: enhancement Improvement
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants