-
Notifications
You must be signed in to change notification settings - Fork 333
working prototype of wandb #271
base: main
Are you sure you want to change the base?
Conversation
also cc @QuentinDuval and @min-xu-ai :) |
Right now a critical issue is that for some errors, wandb stalls instead of crashing. When testing this, if a run stalls, rerun it with wandb disabled (by running |
This is exciting. Can you share a bit more info on how to test this out? I can check out your branch and I suppose I need to install some wandb pip packages and etc.? Can you share the commands for folks to test it out and to see how it is used in action? |
Hi, to install wandb
If you don't have a wandb account you should start by creating one. Once that's done, wandb will give you instructions on what to set up on your remote server. In terms of reproducing the error, in my end if you do distributed training (num_nodes > 1) and put a super large batch size leading to OOM, wandb currently stalls and I'm not sure why. It may also be an issue specific to me, I will also investigate. |
This is the exact command I am running :
running after
With
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't see how this hook can cause the stall when OOM should have been raised. Maybe it wasn't really a stall? How about the case without an OOM, does the training go on as normal? Doesn't get slow down when wandb is enabled? I'd suggest put some prints in the hook and the train loop to see where exactly the stall happens if it indeed stalls.
vissl/utils/wandb.py
Outdated
|
||
wandb_available = True | ||
except ImportError: | ||
logging.info("Tensorboard is not available") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
stale msg
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes
import wandb | ||
from vissl.hooks import SSLWandbHook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
move these to the top or is it due to circular imports?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I want to only import wandb if the program is calling the hook. This way it will still work if a user has not installed wandb yet.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense to me to do local imports .
from classy_vision.generic.distributed_util import is_primary | ||
from classy_vision.hooks.classy_hook import ClassyHook | ||
|
||
if is_primary(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why do this check? Importing on all ranks should be fine too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
wandb import has some overhead to it, so I was trying to limit it to the main worker since it's the only one using it. This can be changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to me to do it on primary_rank only. :)
Ok, I will try this tomorrow and let you know. Note that this is not specific to an OOM error, I saw the same thing with another error on the data side. |
Hi, quick update : I'm unable to use the current logger when launching jobs with SLURM. If I ask for an interactive session and launch a job it works ok, however launching directly with slurm I get the following errror :
If anyone with distributed programming knowledge knows why this is happening, please let me know |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you @pclucas14 :)
from classy_vision.generic.distributed_util import is_primary | ||
from classy_vision.hooks.classy_hook import ClassyHook | ||
|
||
if is_primary(): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it makes sense to me to do it on primary_rank only. :)
|
||
def is_wandb_available(): | ||
""" | ||
Check whether wandb is available or not. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can we add a link here to wandb documentation that open source users can read/access? :)
Check whether wandb is available or not. | ||
|
||
Returns: | ||
wandb_available (bool): based on wandb imports, returns whether tensboarboard |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: tensboarboard -> wandb
import wandb | ||
from vissl.hooks import SSLWandbHook |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
makes sense to me to do local imports .
f.write(wandb_id) | ||
|
||
name = cfg.HOOKS.WANDB_SETUP.EXP_NAME | ||
if name == "??": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: let's set name = "" instead of ??
|
||
BYTE_TO_MiB = 2 ** 20 | ||
|
||
class SSLWandbHook(ClassyHook): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A question on this: is this hook similar to Tensorboard hook with the only difference being in logging to "wandb" instead of tensorboard?
Is it possible that we can inherit the TensorboardHook
? Or alternatively, does it make sense to extend the TensorboardHook
directly to optionally log to wandb as well is user is using WandB ?
regarding this, we need to understand what wandb requires in order to run via SLURM. Are there specific requirements or documentation on configuring wandb when using slurm? |
I emailed the support team at Weights and Biases, will keep you posted. |
After speaking with people from WandB and having them look at the log, it seems that the following line gets called more than once, which is problematic. I need it to be called exactly once. Any tips on how to make that happen ? I though wrapping it in a |
Indeed, this is because @prigoyal I think we need to rework this a bit, and either:
What you do both think? |
indeed. THere is already a |
I should specify that I am indeed using |
It will not help: it's the very function to avoid in that case because @pclucas14 Could you check that by replacing If it does not work, could you try this instead?
The function we should create would look like this:
The other option is not to check the initialisation of
|
thank you so much @pclucas14 for clarification and fully agree @QuentinDuval . I think we should go for the above function and implement it in |
Hi @pclucas14! Thank you for your pull request. We require contributors to sign our Contributor License Agreement, and yours needs attention. You currently have a record in our system, but the CLA is no longer valid, and will need to be resubmitted. ProcessIn order for us to review and merge your suggested changes, please sign at https://code.facebook.com/cla. If you are contributing on behalf of someone else (eg your employer), the individual CLA may not be sufficient and your employer may need to sign the corporate CLA. Once the CLA is signed, our tooling will perform checks and validations. Afterwards, the pull request will be tagged with If you have received this in error or have any questions, please contact us at [email protected]. Thanks! |
This is a working prototype of a wandb logger. I'm filing the PR now so the team can have a look, but there are probably still a few things to iron out first.