Hi, I get an error message like this: #19

ross-Hr · 2022-10-12T07:57:08Z

2022-10-12 15:43:57.254005: W tensorflow/core/grappler/optimizers/data/slack.cc:103] Could not find a final `prefetch` in the input pipeline to which to introduce slack.
I1012 15:43:57.996680 140468541171456 api.py:459] train_step begins...
I1012 15:44:07.279798 140468532778752 api.py:459] train_step begins...
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:10.852259 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:17.169317 140468541171456 api.py:446] Trainable variables:
I1012 15:44:17.426999 140468541171456 api.py:446] vit/stem_conv/kernel:0 (16, 16, 3, 768)
I1012 15:44:17.432081 140468541171456 api.py:446] vit/stem_conv/bias:0 (768,)
I1012 15:44:17.436969 140468541171456 api.py:446] vit/stem_ln/gamma:0 (768,)
....
INFO:tensorflow:batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:31.484436 140499206152832 cross_device_ops.py:897] batch_all_reduce: 369 all-reduces with algorithm = nccl, num_packs = 1
I1012 15:44:37.695064 140468532778752 api.py:459] train_step ends...
I1012 15:44:38.920633 140468541171456 api.py:459] train_step ends...
2022-10-12 15:45:08.671253: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 351529
Traceback (most recent call last):

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
ret = func(*args)

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in get_area
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in <listcomp>
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

KeyError: 351529
2022-10-12 15:45:08.671413: W tensorflow/core/framework/op_kernel.cc:1768] UNKNOWN: KeyError: 415619
Traceback (most recent call last):

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/ops/script_ops.py", line 271, in __call__
ret = func(*args)

File "/root/anaconda3/envs/pix2seq/lib/python3.9/site-packages/tensorflow/python/autograph/impl/api.py", line 642, in wrapper
return func(*args, **kwargs)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in get_area
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

File "/tmp/autograph_generated_filecefzj46v.py", line 22, in <listcomp>
retval__1 = ag__.converted_call(ag__.ld(np).asarray, ([ag__.ld(id_to_ann)[ag__.ld(i)]['area'] for i in ag__.ld(ids)],), dict(dtype=ag__.ld(np).float32), fscope_1)

KeyError: 415619

My GPUs are 2 × RTX 3070 with 8 GB.

ross-Hr · 2022-10-12T08:02:52Z

Is the GPU memory too small?

chentingpc · 2022-10-12T16:11:34Z

This looks like a data issue: the complaint is a KeyError, probably related to an image ID.
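
One way to check for this kind of mismatch is to compare the IDs the pipeline asks for against those actually present in the annotation file. The following is a minimal sketch, assuming a COCO-style JSON with an `annotations` list whose entries carry an `id` field. The names `find_missing_annotation_ids` and `load_coco` are hypothetical helpers, not part of the repo.

```python
import json


def find_missing_annotation_ids(coco, requested_ids):
    """Return the requested annotation IDs that are absent from a COCO-style dict."""
    # Build the same kind of id -> annotation lookup that raised the KeyError.
    id_to_ann = {ann["id"]: ann for ann in coco.get("annotations", [])}
    return sorted(i for i in set(requested_ids) if i not in id_to_ann)


def load_coco(path):
    """Load a COCO-style annotation file (the path is supplied by the caller)."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)


# Tiny in-memory example; real data would come from load_coco(...).
example = {"annotations": [{"id": 1, "area": 10.0}, {"id": 2, "area": 4.5}]}
missing = find_missing_annotation_ids(example, [1, 2, 351529])
print(missing)
```

A non-empty result would suggest that the annotation file and the dataset records come from different splits or versions.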

ross-Hr · 2022-10-18T01:30:46Z

It was an annotations error. Reloading the annotations solved it.
But now I get a new error:

W1018 09:27:13.350448 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense1.bias
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
W1018 09:27:13.350490 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.kernel
WARNING:tensorflow:Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias
W1018 09:27:13.350531 139820555069248 checkpoint.py:213] Value in checkpoint could not be found in the restored object: (root).optimizer's state 'v' for (root).model.decoder.decoder.dec_layers.5.mlp.mlp_layers.0.dense2.bias

I am using tf==2.10.0.

chentingpc · 2022-10-18T02:46:40Z

This looks like the specified checkpoint (either the pretrained checkpoint, or a checkpoint restored from the last training run in the same model directory) does not match the configured architecture/encoder. Please check that the architecture/encoder variant, depth, dim, etc. match.
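
A rough way to check this is to list the variable shapes stored in the checkpoint and compare them with what the configured model expects. The sketch below assumes both sides can be expressed as name-to-shape dictionaries under the same naming convention (object-based checkpoints use their own key format, so some mapping may be needed). `checkpoint_shapes` and `diff_shapes` are hypothetical helpers, and the shapes in the example are illustrative only.

```python
def checkpoint_shapes(ckpt_path):
    """Map variable name -> shape for a TF checkpoint (requires TensorFlow)."""
    import tensorflow as tf  # imported lazily so the comparison below runs without TF
    return {name: tuple(shape) for name, shape in tf.train.list_variables(ckpt_path)}


def diff_shapes(ckpt, model):
    """Compare two name -> shape dicts and report what differs."""
    only_in_ckpt = sorted(set(ckpt) - set(model))
    only_in_model = sorted(set(model) - set(ckpt))
    mismatched = sorted(
        name for name in set(ckpt) & set(model)
        if tuple(ckpt[name]) != tuple(model[name])
    )
    return only_in_ckpt, only_in_model, mismatched


# Illustrative shapes only: a 768-dim checkpoint against a 1024-dim config.
ckpt = {"vit/stem_conv/kernel": (16, 16, 3, 768), "vit/stem_ln/gamma": (768,)}
model = {"vit/stem_conv/kernel": (16, 16, 3, 1024), "vit/stem_conv/bias": (1024,)}
only_in_ckpt, only_in_model, mismatched = diff_shapes(ckpt, model)
print(only_in_ckpt, only_in_model, mismatched)
```

Entries in the third list would point to a dim or variant mismatch, while entries in the first two would point to a different depth or layer layout.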

ross-Hr · 2022-10-18T05:32:15Z

I cloned the repo with git and did not change anything.
Which version of TF are you using?
I put the Objects365 checkpoint into model_dir, and the command is:

python3 run.py --mode=train --model_dir=/data/c/Objects365-vitb-640/ --config=configs/config_det_finetune.py --config.dataset.data_dir=/data/c/pix2seq --config.dataset.coco_annotations_dir=/data/c/annotations --config.train.batch_size=8 --config.train.epochs=20 --config.optimization.learning_rate=3e-5

but I get the error above.
config.dataset.data_dir points to my offline COCO TFDS data.

By the way, I wonder whether this is wrong.

ross-Hr · 2022-10-18T09:18:04Z

Well, I changed the code in model.py. In the call
latest_ckpt, ckpt, self._verify_restored = utils.restore_from_checkpoint(model_dir, False, model=model, global_step=optimizer.iterations, optimizer=optimizer)
changing False to True, i.e. using
checkpoint.restore(latest_ckpt).expect_partial()
avoids the error. But I am still confused about why.
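
For context, `expect_partial()` silences warnings about checkpoint values that were never matched to an object; it does not change what gets restored. Here is a minimal sketch of such a switch, assuming the boolean selects between a lenient and a strict restore. The function name `restore_latest` and the parameter name `allow_partial` are illustrative and not confirmed from the repo.

```python
def restore_latest(checkpoint, ckpt_path, allow_partial):
    """Restore ckpt_path into a tf.train.Checkpoint-like object.

    Returns None for a lenient restore, or a callable that verifies the
    restore was complete once all variables have been created.
    """
    status = checkpoint.restore(ckpt_path)
    if allow_partial:
        # Suppress "Value in checkpoint could not be found" warnings.
        status.expect_partial()
        return None
    # Strict mode: the caller invokes this later; it raises if anything is unmatched.
    return status.assert_consumed
```

With TensorFlow, `checkpoint` would be something like `tf.train.Checkpoint(model=model, optimizer=optimizer)` and `ckpt_path` the result of `tf.train.latest_checkpoint(model_dir)`.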

ross-Hr · 2022-11-08T09:40:29Z

@chentingpc
Hi, do you know how to debug code run through strategy.run(...) in the train_multiple_steps function?
I cannot step into the train_step function.

chentingpc · 2022-11-08T19:44:48Z

You should be able to use pdb in the code when running in eager mode.
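
One possible way to get eager execution is `tf.config.run_functions_eagerly(True)`, which makes `tf.function` bodies run as ordinary Python so that breakpoints inside them are hit. This is a minimal sketch with a toy `train_step`, not the repo's function. Whether this alone is enough for `strategy.run` in this codebase is an assumption that has not been verified here.

```python
def enable_eager_debugging():
    """Ask TensorFlow to run tf.function bodies eagerly; returns False if TF is absent."""
    try:
        import tensorflow as tf
    except ImportError:
        return False
    tf.config.run_functions_eagerly(True)
    return True


if enable_eager_debugging():
    import tensorflow as tf

    @tf.function
    def train_step(x):
        # With eager execution forced, a breakpoint() or pdb.set_trace() placed
        # here would stop on every call instead of only during tracing.
        return x * 2

    print(train_step(tf.constant(3)))
```

Eager mode is much slower, so it is usually switched on only while debugging.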
