ZeroDivisionError during training #41
Comments
The problem is actually here.
You may need to adjust the protobuf version first so that training can run successfully.
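The version conflict can be read straight out of the libprotobuf FATAL message in the issue log. Below is a minimal sketch, not code from the repository: the helper names `protobuf_mismatch` and `installed_protobuf_version` are hypothetical, and the installed-version lookup assumes Python 3.8+ for `importlib.metadata` (on an older interpreter it simply returns `None`). Which protobuf version to pin is not settled in this thread.

```python
import re
from typing import Optional, Tuple

# Matches the two versions quoted in the libprotobuf FATAL message.
_MISMATCH = re.compile(
    r"compiled against version (\d+(?:\.\d+)*) of the Protocol Buffer runtime library, "
    r"which is not compatible with the installed version \((\d+(?:\.\d+)*)\)"
)


def protobuf_mismatch(log_text: str) -> Optional[Tuple[str, str]]:
    """Return (compiled_against, installed) if the log reports a mismatch, else None."""
    match = _MISMATCH.search(log_text)
    return (match.group(1), match.group(2)) if match else None


def installed_protobuf_version() -> Optional[str]:
    """Return the installed protobuf package version, or None if it cannot be determined."""
    try:
        from importlib.metadata import PackageNotFoundError, version  # Python 3.8+
    except ImportError:
        return None
    try:
        return version("protobuf")
    except PackageNotFoundError:
        return None


sample = (
    "[libprotobuf FATAL google/protobuf/stubs/common.cc:83] This program was "
    "compiled against version 3.9.2 of the Protocol Buffer runtime library, "
    "which is not compatible with the installed version (3.19.0)."
)
print(protobuf_mismatch(sample))
print("installed protobuf:", installed_protobuf_version())
```

Running this inside the training environment before launching `scripts/train_xx.sh` shows whether the installed package still differs from the version named in the error.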
The problem may be at "RuntimeError: Sizes of tensors must match except in dimension 1. Expected size 1 but got size 8 for tensor number 1 in the list."
bash scripts/train_xx.sh data/girl output/girl 0
[ITER 2000] Evaluating train: L1 0.26527885198593143 PSNR 6.75002088546753 [01/11 02:22:20]
[ITER 2000] Evaluating train: L1 0.030914516001939774 PSNR 21.484372329711917 [01/11 02:24:34]
[ITER 4000] Evaluating train: L1 0.032561696320772174 PSNR 18.908275985717776 [01/11 02:25:06]
[ITER 6000] Evaluating train: L1 0.03408423215150833 PSNR 18.4128719329834 [01/11 02:25:48]
[ITER 8000] Evaluating train: L1 0.034265464171767235 PSNR 18.355935668945314 [01/11 02:26:29]
[ITER 10000] Evaluating train: L1 0.034051183983683585 PSNR 18.44828567504883 [01/11 02:27:11]
[ITER 10000] Saving Gaussians [01/11 02:27:11]
[ITER 10000] Saving Checkpoint [01/11 02:27:11]
[ITER 12000] Evaluating train: L1 0.033628519624471664 PSNR 18.511128234863282 [01/11 02:27:53]
[ITER 14000] Evaluating train: L1 0.03344566896557808 PSNR 18.616846466064455 [01/11 02:28:34]
[ITER 16000] Evaluating train: L1 0.03363056853413582 PSNR 18.59799690246582 [01/11 02:29:16]
[ITER 18000] Evaluating train: L1 0.03328476175665856 PSNR 18.66069221496582 [01/11 02:29:58]
[ITER 20000] Evaluating train: L1 0.03308735378086567 PSNR 18.7146484375 [01/11 02:30:40]
[ITER 20000] Saving Gaussians [01/11 02:30:40]
[ITER 20000] Saving Checkpoint [01/11 02:30:40]
[ITER 22000] Evaluating train: L1 0.033412421122193336 PSNR 18.71449546813965 [01/11 02:31:22]
[ITER 24000] Evaluating train: L1 0.033169005438685416 PSNR 18.720870590209962 [01/11 02:32:04]
[ITER 26000] Evaluating train: L1 0.03327131196856499 PSNR 18.703716278076172 [01/11 02:32:46]
[ITER 28000] Evaluating train: L1 0.033213584497570996 PSNR 18.74058723449707 [01/11 02:33:28]
[ITER 30000] Evaluating train: L1 0.03314627334475517 PSNR 18.737094497680665 [01/11 02:34:09]
[ITER 30000] Saving Gaussians [01/11 02:34:09]
[ITER 30000] Saving Checkpoint [01/11 02:34:10]
[ITER 32000] Evaluating train: L1 0.033325108140707015 PSNR 18.667801666259766 [01/11 02:34:52]
[ITER 34000] Evaluating train: L1 0.0330487035214901 PSNR 18.781600952148438 [01/11 02:35:34]
[ITER 36000] Evaluating train: L1 0.03348513059318066 PSNR 18.725484085083007 [01/11 02:36:16]
[ITER 38000] Evaluating train: L1 0.03272325024008751 PSNR 18.83039894104004 [01/11 02:36:59]
[ITER 40000] Evaluating train: L1 0.03270529806613922 PSNR 18.863579940795898 [01/11 02:37:42]
[ITER 40000] Saving Gaussians [01/11 02:37:42]
[ITER 40000] Saving Checkpoint [01/11 02:37:42]
[ITER 42000] Evaluating train: L1 0.032594759762287144 PSNR 18.860652923583984 [01/11 02:38:24]
[ITER 44000] Evaluating train: L1 0.032557861506938936 PSNR 18.859840393066406 [01/11 02:39:06]
[ITER 46000] Evaluating train: L1 0.03249254301190376 PSNR 18.854668807983398 [01/11 02:39:48]
[ITER 48000] Evaluating train: L1 0.023388715833425524 PSNR 22.124017333984376 [01/11 02:40:45]
[ITER 50000] Evaluating test: L1 0.029450798015061175 PSNR 19.591847971866006 [01/11 02:41:53]
[ITER 50000] Evaluating train: L1 0.02219633013010025 PSNR 22.390225982666017 [01/11 02:41:56]
[ITER 50000] Saving Gaussians [01/11 02:41:56]
[ITER 50000] Saving Checkpoint [01/11 02:41:56]
Training complete. [01/11 02:41:57]
Optimizing output/may_cut
Output folder: output/may_cut [15/09 18:06:45]
[libprotobuf FATAL google/protobuf/stubs/common.cc:83] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.0). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.0). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
scripts/train_xx.sh: line 8: 7139 Aborted (core dumped) python train_mouth.py -s $dataset -m $workspace --audio_extractor $audio_extractor
Optimizing output/may_cut
Output folder: output/may_cut [15/09 18:06:46]
[libprotobuf FATAL google/protobuf/stubs/common.cc:83] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.0). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.0). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
scripts/train_xx.sh: line 9: 7167 Aborted (core dumped) python train_face.py -s $dataset -m $workspace --init_num 2000 --densify_grad_threshold 0.0005 --audio_extractor $audio_extractor
Optimizing output/may_cut
Output folder: output/may_cut [15/09 18:06:47]
[libprotobuf FATAL google/protobuf/stubs/common.cc:83] This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.0). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
terminate called after throwing an instance of 'google::protobuf::FatalException'
what(): This program was compiled against version 3.9.2 of the Protocol Buffer runtime library, which is not compatible with the installed version (3.19.0). Contact the program author for an update. If you compiled the program yourself, make sure that your headers are from the same version of Protocol Buffers as your link-time library. (Version verification failed in "bazel-out/k8-opt/bin/tensorflow/core/framework/tensor_shape.pb.cc".)
scripts/train_xx.sh: line 10: 7194 Aborted (core dumped) python train_fuse.py -s $dataset -m $workspace --opacity_lr 0.001 --audio_extractor $audio_extractor
Looking for config file in output/may_cut/cfg_args
Config file found: output/may_cut/cfg_args
Rendering output/may_cut
Found transforms_train.json file, assuming Blender data set! [15/09 18:06:48]
Reading Test Transforms [15/09 18:06:48]
137it [00:00, 4946.11it/s]
137it [00:01, 72.23it/s]
Generating random point cloud (10000)... [15/09 18:06:50]
Loading Training Cameras [15/09 18:06:50]
Loading Test Cameras [15/09 18:06:51]
Number of points at initialisation : 10000 [15/09 18:06:51]
Traceback (most recent call last):
File "synthesize_fuse.py", line 125, in <module>
render_sets(model.extract(args), args.iteration, pipeline.extract(args), args.use_train, args.fast, args.dilate)
File "synthesize_fuse.py", line 93, in render_sets
(model_params, motion_params, model_mouth_params, motion_mouth_params) = torch.load(os.path.join(dataset.model_path, "chkpnt_fuse_latest.pth"))
File "/home/xxx/miniconda3/envs/talking_gaussian/lib/python3.7/site-packages/torch/serialization.py", line 699, in load
with _open_file_like(f, 'rb') as opened_file:
File "/home/xxx/miniconda3/envs/talking_gaussian/lib/python3.7/site-packages/torch/serialization.py", line 230, in _open_file_like
return _open_file(name_or_buffer, mode)
File "/home/xxx/miniconda3/envs/talking_gaussian/lib/python3.7/site-packages/torch/serialization.py", line 211, in __init__
super(_open_file, self).__init__(open(name, mode))
FileNotFoundError: [Errno 2] No such file or directory: 'output/may_cut/chkpnt_fuse_latest.pth'
Setting up [LPIPS] perceptual loss: trunk [alex], v[0.1], spatial [off]
/home/xxx/miniconda3/envs/talking_gaussian/lib/python3.7/site-packages/torchvision/models/_utils.py:209: UserWarning: The parameter 'pretrained' is deprecated since 0.13 and will be removed in 0.15, please use 'weights' instead.
f"The parameter '{pretrained_param}' is deprecated since 0.13 and will be removed in 0.15, "
/home/xxx/miniconda3/envs/talking_gaussian/lib/python3.7/site-packages/torchvision/models/_utils.py:223: UserWarning: Arguments other than a weight enum or `None` for 'weights' are deprecated since 0.13 and will be removed in 0.15. The current behavior is equivalent to passing `weights=AlexNet_Weights.IMAGENET1K_V1`. You can also use `weights=AlexNet_Weights.DEFAULT` to get the most up-to-date weights.
warnings.warn(msg)
Loading model from: /home/xxx/miniconda3/envs/talking_gaussian/lib/python3.7/site-packages/lpips/weights/v0.1/alex.pth
Traceback (most recent call last):
File "metrics.py", line 215, in <module>
print(lmd_meter.report())
File "metrics.py", line 102, in report
return f'LMD ({self.backend}) = {self.measure():.6f}'
File "metrics.py", line 96, in measure
return self.V / self.N
ZeroDivisionError: division by zero
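The final traceback divides an accumulated value `self.V` by a count `self.N`, so the exception means the count was still zero when `report()` ran. This is consistent with the earlier failures in the same log: the three training steps aborted, `chkpnt_fuse_latest.pth` was never written, and rendering stopped before producing anything to evaluate. The sketch below only illustrates how such a meter could report the empty case explicitly. `LandmarkDistanceMeter` and its `update` method are hypothetical stand-ins; only the attribute names `V`, `N`, `backend` and the methods `measure` and `report` come from the traceback.

```python
import math


class LandmarkDistanceMeter:
    """Hypothetical stand-in for the LMD meter seen in the traceback."""

    def __init__(self, backend: str = "unknown"):
        self.backend = backend
        self.V = 0.0  # running sum of per-frame distances
        self.N = 0    # number of frames accumulated

    def update(self, distance: float) -> None:
        """Add one per-frame distance (assumed interface)."""
        self.V += float(distance)
        self.N += 1

    def measure(self) -> float:
        # Return NaN instead of raising when nothing was accumulated.
        if self.N == 0:
            return math.nan
        return self.V / self.N

    def report(self) -> str:
        if self.N == 0:
            return f"LMD ({self.backend}) = n/a (no samples were accumulated)"
        return f"LMD ({self.backend}) = {self.measure():.6f}"


meter = LandmarkDistanceMeter("fan")
print(meter.report())
```

A guard like this does not fix training. It only makes it obvious that the metric step had no input, which points back to the earlier aborted steps.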
Someone reported this issue before, and the author suggested that memory size might be the cause. I cut the May video down to 1 minute, but I still hit the ZeroDivisionError during training. My machine has 32 GB of memory. Any suggestions? Thank you so much.
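In the log above, `scripts/train_xx.sh` kept going after each training command aborted, so the first visible Python exception appears far from the real failure. One way to surface the first failing step is to run the commands in sequence and stop on the first non-zero exit code. This is a minimal sketch under that assumption: `run_steps` is a hypothetical helper, not part of the repository, and the command list in the comment is copied from the script lines quoted in the log.

```python
import subprocess
import sys
from typing import List, Sequence


def run_steps(steps: Sequence[List[str]]) -> int:
    """Run each command in order and stop at the first non-zero exit code."""
    for cmd in steps:
        print("Running:", " ".join(cmd))
        result = subprocess.run(cmd)
        if result.returncode != 0:
            print(
                f"Step failed with exit code {result.returncode}: {' '.join(cmd)}",
                file=sys.stderr,
            )
            return result.returncode
    return 0


# Example command list, taken from the lines of scripts/train_xx.sh quoted in the log
# (dataset, workspace and audio_extractor are placeholders for the script's variables):
# steps = [
#     ["python", "train_mouth.py", "-s", dataset, "-m", workspace,
#      "--audio_extractor", audio_extractor],
#     ["python", "train_face.py", "-s", dataset, "-m", workspace, "--init_num", "2000",
#      "--densify_grad_threshold", "0.0005", "--audio_extractor", audio_extractor],
#     ["python", "train_fuse.py", "-s", dataset, "-m", workspace,
#      "--opacity_lr", "0.001", "--audio_extractor", audio_extractor],
# ]
# sys.exit(run_steps(steps))
```

With this behavior the run would have stopped at `train_mouth.py` with the protobuf error, instead of continuing to the rendering and metric steps.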