Hi,
When using the pipeline module/engine, checkpoints are saved per layer within each module/stage. But when using one of the FP16 optimizers, each layer's checkpoint is the full size of the whole pipeline stage (all layers in the stage), so every one of a stage's layers writes a file the size of the whole stage and the total checkpoint size grows quadratically in the number of layers per stage.
The issue seems to be related to the FP16 optimizers flattening the whole param group into one large contiguous buffer: PyTorch serializes the entire underlying storage even when only a view of it is saved.
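This is easy to see outside of DeepSpeed. The snippet below is just my own minimal repro (the buffer sizes are arbitrary): saving a small view of a large flat buffer writes the full storage, while saving a clone of the view writes only the slice.

```python
import io
import torch

# Stand-in for a stage-wide flattened buffer and one layer's slice of it.
flat = torch.zeros(10_000_000)
layer_view = flat.narrow(0, 0, 1_000)

view_buf, clone_buf = io.BytesIO(), io.BytesIO()
torch.save(layer_view, view_buf)           # serializes the whole 10M-element storage
torch.save(layer_view.clone(), clone_buf)  # serializes only the 1k-element slice

print(view_buf.getbuffer().nbytes)   # on the order of 40 MB
print(clone_buf.getbuffer().nbytes)  # on the order of 4 KB
```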
Currently, I can work around this by passing per-layer param groups to the optimizer:
```python
params = [{"params": [p for p in layer.parameters() if p.requires_grad]} for layer in net.forward_funcs]
```
But this doesn't seem very clean. Is there a cleaner way to handle this that I missed? And if not, maybe it's worth adding a method to PipelineModule that returns the parameters grouped per layer?
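For concreteness, something along these lines is what I have in mind. It's only a sketch against the existing `forward_funcs` attribute, not an actual DeepSpeed API, and the `nn.Module` check is my assumption for skipping stages that hold bare forward callables:

```python
from typing import Dict, List

import torch.nn as nn


def params_by_layer(pipe_module) -> List[Dict]:
    """Return one optimizer param group per layer owned by this pipeline stage."""
    groups = []
    for layer in pipe_module.forward_funcs:
        if not isinstance(layer, nn.Module):
            continue  # a stage can also contain plain callables with no parameters
        params = [p for p in layer.parameters() if p.requires_grad]
        if params:
            groups.append({"params": params})
    return groups
```

The returned groups could then be passed straight to the optimizer, e.g. `torch.optim.AdamW(params_by_layer(net), lr=1e-4)`, so the FP16 wrapper flattens each layer separately instead of the whole stage.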