nnUNetv2 problem with changing patch_size (hc-leipzig-7t-mp2rage) #70

Closed
KaterinaKrejci231054 opened this issue Jul 25, 2024 · 4 comments

Comments

@KaterinaKrejci231054
Contributor

KaterinaKrejci231054 commented Jul 25, 2024

nnUNetv2 problem with changing patch_size

Based on the information from the Ivadomed meeting, I took the following steps with the hc-leipzig-7t-mp2rage dataset:

  • I ran nnUNetv2_plan_and_preprocess, then changed the patch_size parameter in the nnUNetPlans.json file from its original value of [128, 192, 96] to the value of median_image_size_in_voxels.

Screenshot from 2024-07-25 16-46-04

  • Then I ran nnUNetv2_train with the modified nnUNetPlans.json file. This raised the error shown below.

  • Then I changed patch_size back to its original value and training started correctly, so the problem is probably caused by the changed patch_size.
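
For reference, the manual edit described in the steps above can be sketched in Python. This is only a sketch: the "configurations" key and the "3d_fullres" configuration name are my assumptions about the nnUNetPlans.json layout, the dict is a minimal stand-in for the real file, and the new patch size is purely illustrative.

```python
import copy
import json


def with_patch_size(plans, configuration, patch_size):
    """Return a copy of the plans dict with patch_size replaced for one configuration."""
    updated = copy.deepcopy(plans)  # leave the original plans untouched
    updated["configurations"][configuration]["patch_size"] = list(patch_size)
    return updated


# Minimal stand-in for the contents of nnUNetPlans.json (assumed layout).
plans = {"configurations": {"3d_fullres": {"patch_size": [128, 192, 96]}}}

# Illustrative new value only; it is not the value used in this issue.
new_plans = with_patch_size(plans, "3d_fullres", [128, 192, 128])
print(json.dumps(new_plans, indent=2))
```

Working on a copy keeps the original plans available, which makes it easy to revert to the patch_size that is known to train.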

@naga-karthik and @valosekj, have you run into something similar with nnUNet training? Do you have any suggestions for how to handle this error?

error
`Traceback (most recent call last):
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/bin/nnUNetv2_train", line 8, in <module>
    sys.exit(run_training_entry())
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 274, in run_training_entry
    run_training(args.dataset_name_or_id, args.configuration, args.fold, args.tr, args.p, args.pretrained_weights,
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/run/run_training.py", line 210, in run_training
    nnunet_trainer.run_training()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 1295, in run_training
    train_outputs.append(self.train_step(next(self.dataloader_train)))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/nnunetv2/training/nnUNetTrainer/nnUNetTrainer.py", line 922, in train_step
    output = self.network(data)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
    return fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 921, in catch_errors
    return callback(frame, cache_entry, hooks, frame_state, skip=1)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 786, in _convert_frame
    result = inner_convert(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 400, in _convert_frame_assert
    return _compile(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/contextlib.py", line 79, in inner
    return func(*args, **kwds)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 676, in _compile
    guarded_code = compile_inner(code, one_graph, hooks, transform)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 262, in time_wrapper
    r = func(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 535, in compile_inner
    out_code = transform_code_object(code, transform)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/bytecode_transformation.py", line 1036, in transform_code_object
    transformations(instructions, code_options)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 165, in _fn
    return fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/convert_frame.py", line 500, in transform
    tracer.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2149, in run
    super().run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 489, in wrapper
    return inner_fn(self, inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1219, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 674, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/nn_module.py", line 336, in call_function
    return tx.inline_user_function_return(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 680, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2285, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2399, in inline_call_
    tracer.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 489, in wrapper
    return inner_fn(self, inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1260, in CALL_FUNCTION_EX
    self.call_function(fn, argsvars.items, kwargsvars)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 674, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 335, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 289, in call_function
    return super().call_function(tx, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/functions.py", line 90, in call_function
    return tx.inline_user_function_return(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 680, in inline_user_function_return
    return InliningInstructionTranslator.inline_call(self, fn, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2285, in inline_call
    return cls.inline_call_(parent, func, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 2399, in inline_call_
    tracer.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 810, in run
    and self.step()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 773, in step
    getattr(self, inst.opname)(inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 489, in wrapper
    return inner_fn(self, inst)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 1219, in CALL_FUNCTION
    self.call_function(fn, args, {})
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/symbolic_convert.py", line 674, in call_function
    self.push(fn.call_function(self, args, kwargs))
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/torch.py", line 679, in call_function
    tensor_variable = wrap_fx_proxy(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1330, in wrap_fx_proxy
    return wrap_fx_proxy_cls(target_cls=TensorVariable, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/variables/builder.py", line 1415, in wrap_fx_proxy_cls
    example_value = get_fake_value(proxy.node, tx, allow_non_graph_fake=True)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1714, in get_fake_value
    raise TorchRuntimeError(str(e)).with_traceback(e.__traceback__) from None
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1656, in get_fake_value
    ret_val = wrap_fake_exception(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1190, in wrap_fake_exception
    return fn()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1657, in <lambda>
    lambda: run_node(tx.output, node, args, kwargs, nnmodule)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1782, in run_node
    raise RuntimeError(make_error_message(e)).with_traceback(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_dynamo/utils.py", line 1764, in run_node
    return node.target(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/utils/_stats.py", line 20, in wrapper
    return fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 896, in __torch_dispatch__
    return self.dispatch(func, types, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1241, in dispatch
    return self._cached_dispatch_impl(func, types, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 966, in _cached_dispatch_impl
    output = self._dispatch_impl(func, types, args, kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_subclasses/fake_tensor.py", line 1458, in _dispatch_impl
    r = func(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_ops.py", line 594, in __call__
    return self_._op(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_prims_common/wrappers.py", line 252, in _fn
    result = fn(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_prims_common/wrappers.py", line 137, in _fn
    result = fn(**bound.arguments)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_refs/__init__.py", line 2799, in cat
    return prims.cat(filtered, dim).clone(memory_format=memory_format)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_ops.py", line 594, in __call__
    return self_._op(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/_prims/__init__.py", line 1917, in _cat_meta
    torch._check(
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/__init__.py", line 1140, in _check
    _check_with(RuntimeError, cond, message)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/__init__.py", line 1123, in _check_with
    raise error_type(message_evaluated)
torch._dynamo.exc.TorchRuntimeError: Failed running call_function <built-in method cat of type object at 0x7f7996797760>(*((FakeTensor(..., device='cuda:0', size=(2, 320, 16, 24, 12),
           dtype=torch.float16, grad_fn=<ConvolutionBackward0>), FakeTensor(..., device='cuda:0', size=(2, 320, 16, 23, 12),
           dtype=torch.float16, grad_fn=<ViewBackward0>)), 1), **{}):
Sizes of tensors must match except in dimension 1. Expected 24 but got 23 for tensor number 1 in the list

from user code:
   File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/dynamic_network_architectures/architectures/unet.py", line 62, in forward
    return self.decoder(skips)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/dynamic_network_architectures/building_blocks/unet_decoder.py", line 110, in forward
    x = torch.cat((x, skips[-(s+2)]), 1)

Set TORCH_LOGS="+dynamo" and TORCHDYNAMO_VERBOSE=1 for more information


You can suppress this exception and fall back to eager by setting:
    import torch._dynamo
    torch._dynamo.config.suppress_errors = True

Exception in thread Thread-2:
Traceback (most recent call last):
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message
Exception in thread Thread-1:
Traceback (most recent call last):
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 980, in _bootstrap_inner
    self.run()
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/threading.py", line 917, in run
    self._target(*self._args, **self._kwargs)
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 125, in results_loop
    raise e
  File "/home/ge.polymtl.ca/p120942/.conda/envs/nnunet/lib/python3.9/site-packages/batchgenerators/dataloading/nondet_multi_threaded_augmenter.py", line 103, in results_loop
    raise RuntimeError("One or more background workers are no longer alive. Exiting. Please check the "
RuntimeError: One or more background workers are no longer alive. Exiting. Please check the print statements above for the actual error message` 
@valosekj
Member

> tried to modify the patch_size parameter (original: [128, 192, 96]) in the nnUNetPlans.json file to median_image_size_in_voxels.
>
> Screenshot from 2024-07-25 16-46-04

Okay. Based on our in-person discussion today, I thought you were modifying the patch size only along the S-I axis (to ensure that the model always has context about all the rootlet levels). But based on the screenshot, you're actually changing all the axes. So the problem may indeed be a memory issue.

@naga-karthik
Member

I agree with Jan's comment about the memory issue: the patch size might be too big! AND, more importantly, the patch size you chose is not divisible by 2**x, where x = 3, 4, or 5. During training, the patch size is usually halved several times, depending on the number of downsampling layers in nnUNet (maybe 4 or 5), so it's good to ensure that the patch size you choose is divisible by 2**4 (= 16) or 2**5 (= 32) along every axis.
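
To make the divisibility point concrete, here is a rough sketch that follows the size of one axis down the encoder and back up the decoder. It assumes each downsampling step maps n to ceil(n / 2) and each upsampling step doubles the size; those rules, the five downsampling steps, and the value 184 are assumptions chosen for illustration, not values read from this issue's plans file.

```python
import math


def skip_sizes(size, num_downsamples):
    """Sizes along one axis at each encoder stage, assuming n -> ceil(n / 2) per step."""
    sizes = [size]
    for _ in range(num_downsamples):
        sizes.append(math.ceil(sizes[-1] / 2))
    return sizes


def first_mismatch(size, num_downsamples):
    """Return (upsampled, skip) for the first decoder stage where the sizes differ, else None."""
    sizes = skip_sizes(size, num_downsamples)
    x = sizes[-1]  # size at the bottleneck
    for skip in reversed(sizes[:-1]):
        x = 2 * x  # assumed: each upsampling step exactly doubles the size
        if x != skip:
            return (x, skip)
        x = skip
    return None


print(first_mismatch(192, 5))  # 192 is divisible by 2**5 -> None
print(first_mismatch(184, 5))  # hypothetical size, not divisible by 16 -> (24, 23)
```

Under these assumptions an odd size at any encoder stage produces the same kind of "Expected 24 but got 23" mismatch when the decoder concatenates the upsampled tensor with the skip connection.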

@valosekj
Member

valosekj commented Jul 26, 2024

FYI, I manually modified the patch_size for the lumbar model training and training has started; details: #67 (comment)

@KaterinaKrejci231054
Contributor Author

Thanks for the suggestions and the help, @valosekj and @naga-karthik. I tried modifying only the SI patch size. With the value 368 (23 * 16) in SI, training crashed again because of memory, so I tried the next smaller multiple, 352 (22 * 16), and with that training started correctly.
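
The trial-and-error above can be sketched as a small helper that lists multiples of 16 in descending order, so each one can be tried until training fits in memory. The function name and the default of three candidates are hypothetical choices for illustration.

```python
def candidate_sizes(start, factor=16, count=3):
    """List up to `count` positive multiples of `factor`, descending from `start`."""
    top = (start // factor) * factor  # largest multiple of factor not above start
    candidates = [top - i * factor for i in range(count)]
    return [c for c in candidates if c > 0]


print(candidate_sizes(368))  # [368, 352, 336]
```

Whether a given candidate fits in GPU memory still has to be checked by actually starting the training.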

image
