>>> vcjobacc
[September 26, 2019, 10:12am]
Hello everyone!
I have access to a server with a V100 GPU, so I tried to train a model there with a batch size of 32 for training and 16 for evaluation. Unfortunately, the GPU is not being used at 100%: on average the load is 30-40%, occasionally it rises to 85-90% for a short time, then drops back down to 14%. My question is: if I increase the batch size to, say, 48 (with it set to 32 it uses 11 GB of GPU RAM), will the GPU load go up (so training runs faster) while the final model quality is not hurt?
After the tests on a single V100 I want to do distributed training. Should I just use distributed.py instead of train.py and provide the same config as for single-GPU training? Should I keep the same batch size?
Thanks a lot!
[This is an archived TTS discussion thread from discourse.mozilla.org/t/distributed-training-optimal-parameters]
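On the "same batch size" question, a minimal sketch may help. This is not Mozilla TTS code and not necessarily how its distributed.py is wired internally; it only illustrates the usual PyTorch DistributedDataParallel pattern, where the configured batch size is consumed per process/GPU, so the effective global batch becomes batch_size × number of GPUs. The worker function, dummy model, and dummy data below are made up for illustration.

```python
# Sketch: per-GPU batching with DistributedDataParallel (DDP).
# Assumption: one spawned process per GPU; `run_worker` and the dummy
# Linear model/data are hypothetical, not part of Mozilla TTS.
import os
import torch
import torch.distributed as dist
import torch.multiprocessing as mp
from torch.nn.parallel import DistributedDataParallel as DDP
from torch.utils.data import DataLoader, TensorDataset, DistributedSampler


def run_worker(rank: int, world_size: int, per_gpu_batch_size: int) -> None:
    # Each process drives exactly one GPU.
    os.environ.setdefault("MASTER_ADDR", "127.0.0.1")
    os.environ.setdefault("MASTER_PORT", "29500")
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    # Dummy data/model just to show the wiring; the sampler gives each
    # rank a disjoint shard of the dataset.
    dataset = TensorDataset(torch.randn(1024, 80), torch.randn(1024, 1))
    sampler = DistributedSampler(dataset, num_replicas=world_size, rank=rank)
    loader = DataLoader(dataset, batch_size=per_gpu_batch_size, sampler=sampler)

    model = DDP(torch.nn.Linear(80, 1).cuda(rank), device_ids=[rank])
    optim = torch.optim.Adam(model.parameters(), lr=1e-4)

    for x, y in loader:
        optim.zero_grad()
        loss = torch.nn.functional.mse_loss(model(x.cuda(rank)), y.cuda(rank))
        loss.backward()  # gradients are all-reduced across ranks here
        optim.step()

    dist.destroy_process_group()


if __name__ == "__main__":
    world_size = torch.cuda.device_count()
    # batch_size=32 per GPU -> effective global batch = 32 * world_size.
    mp.spawn(run_worker, args=(world_size, 32), nprocs=world_size)
```

Under this pattern, keeping batch_size at 32 on, say, 4 GPUs gives an effective batch of 128, which changes the optimization dynamics (and may warrant adjusting the learning rate), so it is worth checking how the specific distributed.py interprets the config before reusing it unchanged.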