Training rushes through all epochs after error while decoding to model/output_dev #156
Comments
When you say 'I also tried to decrease the vocab-size to 15k', did you re-run prepare_data.py? If not, then you need to. The error you are experiencing is indeed due to your GPU running out of memory. Reducing the batch size will fix this. What is the size of your model?
I always ran prepare_data.py after changing any settings, and as I said, the error occurred even with a batch size of 4, just a few external evaluations later than with higher ones.
HWiNFO is very good, quite accurate too. What is the size of your model, and what GPU are you training it on?
I train on a GTX 1070 Ti. What exactly do you mean by the size of the model? Everything is default except vocab-size=75,000, and I have 10.7 million training pairs (3.8 GB of text files in total).
I mean the number of neurons and layers you have in your network. You should easily be able to fit the model you have described with a batch size of 4 in 8 GB of RAM. Are you updating the settings in settings.py, and do you have override existing settings set to true?
I use the standard model size of num_layers=2 with num_units=512 and override_loaded_hparams=True, and I set these in settings.py.
Small update:
How much system RAM do you have?
I have 16 GB of DDR4-3000 RAM; training uses about 4 GB, and total RAM usage is at about 70% during training.
So it seems like I fixed the issue, but the solution is not perfect:
First of all, my specs:
GTX 1070 Ti (8 GB VRAM)
16 GB RAM
Ryzen 7 2700
training on an M.2 SSD
My issue is that the model sometimes fails while decoding to the model/output_dev file during training (at different steps each time, most often after 5k or 10k steps), which causes it to rush through all remaining epochs with the same error instantly and then finish training. I've read about someone who had the same issue and solved it by decreasing the batch size, but I tried that as well and nothing helped:
decoding to output model/output_dev_5000
2020-05-08 18:17:53.781721: W tensorflow/core/common_runtime/bfc_allocator.cc:419] Allocator (GPU_0_bfc) ran out of memory trying to allocate 64.0KiB (rounded to 65536). Current allocation summary follows.
2020-05-08 18:17:53.786389: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (256): Total Chunks: 25, Chunks in use: 24. 6.3KiB allocated for chunks. 6.0KiB in use in bin. 118B client-requested in use in bin.
2020-05-08 18:17:53.791567: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (512): Total Chunks: 1, Chunks in use: 0. 768B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.796353: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1024): Total Chunks: 3, Chunks in use: 3. 3.8KiB allocated for chunks. 3.8KiB in use in bin. 3.0KiB client-requested in use in bin.
2020-05-08 18:17:53.802074: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2048): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.806886: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4096): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.811956: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8192): Total Chunks: 20, Chunks in use: 20. 160.0KiB allocated for chunks. 160.0KiB in use in bin. 160.0KiB client-requested in use in bin.
2020-05-08 18:17:53.816760: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16384): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.821948: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (32768): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.827046: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (65536): Total Chunks: 7, Chunks in use: 7. 539.5KiB allocated for chunks. 539.5KiB in use in bin. 494.0KiB client-requested in use in bin.
2020-05-08 18:17:53.831938: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (131072): Total Chunks: 556, Chunks in use: 556. 86.89MiB allocated for chunks. 86.89MiB in use in bin. 59.73MiB client-requested in use in bin.
2020-05-08 18:17:53.837532: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (262144): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.842321: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (524288): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.847328: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (1048576): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.851795: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (2097152): Total Chunks: 9, Chunks in use: 9. 22.00MiB allocated for chunks. 22.00MiB in use in bin. 22.00MiB client-requested in use in bin.
2020-05-08 18:17:53.857609: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (4194304): Total Chunks: 1, Chunks in use: 1. 5.97MiB allocated for chunks. 5.97MiB in use in bin. 3.00MiB client-requested in use in bin.
2020-05-08 18:17:53.862976: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (8388608): Total Chunks: 576, Chunks in use: 576. 5.16GiB allocated for chunks. 5.16GiB in use in bin. 5.15GiB client-requested in use in bin.
2020-05-08 18:17:53.868138: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (16777216): Total Chunks: 1, Chunks in use: 1. 16.14MiB allocated for chunks. 16.14MiB in use in bin. 9.16MiB client-requested in use in bin.
2020-05-08 18:17:53.873656: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (33554432): Total Chunks: 1, Chunks in use: 1. 55.00MiB allocated for chunks. 55.00MiB in use in bin. 55.00MiB client-requested in use in bin.
2020-05-08 18:17:53.878707: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (67108864): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.883832: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (134217728): Total Chunks: 6, Chunks in use: 6. 885.64MiB allocated for chunks. 885.64MiB in use in bin. 842.45MiB client-requested in use in bin.
2020-05-08 18:17:53.889042: I tensorflow/core/common_runtime/bfc_allocator.cc:869] Bin (268435456): Total Chunks: 0, Chunks in use: 0. 0B allocated for chunks. 0B in use in bin. 0B client-requested in use in bin.
2020-05-08 18:17:53.894022: I tensorflow/core/common_runtime/bfc_allocator.cc:885] Bin for 64.0KiB was 64.0KiB, Chunk State:
2020-05-08 18:17:53.897269: I tensorflow/core/common_runtime/bfc_allocator.cc:898] Next region of size 6667798272
2020-05-08 18:17:57.863482: I tensorflow/core/common_runtime/bfc_allocator.cc:905] InUse at 00000008926C5600 next 18446744073709551615 of size 16920832
2020-05-08 18:17:57.866990: I tensorflow/core/common_runtime/bfc_allocator.cc:914] Summary of in-use Chunks by size:
2020-05-08 18:17:57.869504: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 24 Chunks of size 256 totalling 6.0KiB
2020-05-08 18:17:57.872427: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 3 Chunks of size 1280 totalling 3.8KiB
2020-05-08 18:17:57.874727: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 20 Chunks of size 8192 totalling 160.0KiB
2020-05-08 18:17:57.877055: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 65536 totalling 320.0KiB
2020-05-08 18:17:57.879407: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 112128 totalling 109.5KiB
2020-05-08 18:17:57.882387: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 112640 totalling 110.0KiB
2020-05-08 18:17:57.884772: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 277 Chunks of size 131072 totalling 34.63MiB
2020-05-08 18:17:57.887408: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 149504 totalling 146.0KiB
2020-05-08 18:17:57.890274: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 278 Chunks of size 196608 totalling 52.13MiB
2020-05-08 18:17:57.892768: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 2097152 totalling 10.00MiB
2020-05-08 18:17:57.895113: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 4 Chunks of size 3145728 totalling 12.00MiB
2020-05-08 18:17:57.898162: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 6257152 totalling 5.97MiB
2020-05-08 18:17:57.900500: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 14 Chunks of size 8388608 totalling 112.00MiB
2020-05-08 18:17:57.903097: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 553 Chunks of size 9600512 totalling 4.94GiB
2020-05-08 18:17:57.905463: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 12258048 totalling 11.69MiB
2020-05-08 18:17:57.908372: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 12467456 totalling 11.89MiB
2020-05-08 18:17:57.910728: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 6 Chunks of size 12582912 totalling 72.00MiB
2020-05-08 18:17:57.913084: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 16633088 totalling 15.86MiB
2020-05-08 18:17:57.915528: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 16920832 totalling 16.14MiB
2020-05-08 18:17:57.918616: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 57671680 totalling 55.00MiB
2020-05-08 18:17:57.920980: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 5 Chunks of size 153606144 totalling 732.45MiB
2020-05-08 18:17:57.923380: I tensorflow/core/common_runtime/bfc_allocator.cc:917] 1 Chunks of size 160628736 totalling 153.19MiB
2020-05-08 18:17:57.926221: I tensorflow/core/common_runtime/bfc_allocator.cc:921] Sum Total of in-use chunks: 6.21GiB
2020-05-08 18:17:57.928590: I tensorflow/core/common_runtime/bfc_allocator.cc:923] total_region_allocated_bytes_: 6667798272 memory_limit_: 6667798446 available bytes: 174 curr_region_allocation_bytes_: 13335597056
2020-05-08 18:17:57.932841: I tensorflow/core/common_runtime/bfc_allocator.cc:929] Stats:
Limit: 6667798446
InUse: 6667797248
MaxInUse: 6667798016
NumAllocs: 666955
MaxAllocSize: 489619712
2020-05-08 18:17:57.938305: W tensorflow/core/common_runtime/bfc_allocator.cc:424] ****************************************************************************************************
Exception in thread Thread-5:
Traceback (most recent call last):
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1365, in _do_call
return fn(*args)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1350, in _run_fn
target_list, run_metadata)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1443, in _call_tf_sessionrun
run_metadata)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Dst tensor is not initialized.
[[{{node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup}}]]
(1) Internal: Dst tensor is not initialized.
[[{{node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup}}]]
[[dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/All/_221]]
0 successful operations.
0 derived errors ignored.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 88, in nmt_train
tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 701, in main
run_main(FLAGS, default_hparams, train_fn, inference_fn)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 694, in run_main
train_fn(hparams, target_session=target_session, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 518, in train
sample_tgt_data, avg_ckpts, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 351, in run_full_eval
summary_writer, avg_ckpts, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 288, in run_internal_and_external_eval
summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 177, in run_external_eval
avg_ckpts=avg_ckpts)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 740, in _external_eval
infer_mode=hparams.infer_mode)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\utils\nmt_utils.py", line 60, in decode_and_evaluate
nmt_outputs, _ = model.decode(sess)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 692, in decode
output_tuple = self.infer(sess)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 680, in infer
return sess.run(output_tuple)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 956, in run
run_metadata_ptr)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1180, in _run
feed_dict_tensor, options, run_metadata)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1359, in _do_run
run_metadata)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\client\session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Dst tensor is not initialized.
[[node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup (defined at C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
(1) Internal: Dst tensor is not initialized.
[[node dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup (defined at C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py:1748) ]]
[[dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/All/_221]]
0 successful operations.
0 derived errors ignored.
Original stack trace for 'dynamic_seq2seq/decoder/decoder/while/BasicDecoderStep/cond/embedding_lookup':
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 890, in _bootstrap
self._bootstrap_inner()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 926, in _bootstrap_inner
self.run()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "train.py", line 88, in nmt_train
tf.app.run(main=nmt.main, argv=[os.getcwd() + '\nmt\nmt\nmt.py'] + unparsed)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\platform\app.py", line 40, in run
_run(main=main, argv=argv, flags_parser=_parse_flags_tolerate_undef)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 299, in run
_run_main(main, args)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\absl\app.py", line 250, in _run_main
sys.exit(main(argv))
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 701, in main
run_main(FLAGS, default_hparams, train_fn, inference_fn)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\nmt.py", line 694, in run_main
train_fn(hparams, target_session=target_session, summary_callback=summary_callback)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\train.py", line 477, in train
infer_model = model_helper.create_infer_model(model_creator, hparams, scope)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model_helper.py", line 228, in create_infer_model
extra_args=extra_args)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\attention_model.py", line 64, in init
extra_args=extra_args)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 95, in init
res = self.build_graph(hparams, scope=scope)
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 393, in build_graph
self._build_decoder(self.encoder_outputs, encoder_state, hparams))
File "C:\Users\Der Gerät\Desktop\nmt-chatbot (Small Dataset)/nmt\nmt\model.py", line 587, in _build_decoder
scope=decoder_scope)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\decoder.py", line 469, in dynamic_decode
swap_memory=swap_memory)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2753, in while_loop
return_same_structure)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2245, in BuildLoop
pred, body, original_loop_vars, loop_vars, shape_invariants)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2170, in _BuildLoop
body_result = body(*packed_vars_for_body)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 2705, in
body = lambda i, lv: (i + 1, orig_body(*lv))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\decoder.py", line 412, in body
decoder_finished) = decoder.step(time, inputs, state)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\basic_decoder.py", line 145, in step
sample_ids=sample_ids)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 627, in next_inputs
lambda: self._embedding_fn(sample_ids))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 1235, in cond
orig_res_f, res_f = context_f.BuildCondBranch(false_fn)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\control_flow_ops.py", line 1061, in BuildCondBranch
original_result = fn()
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 627, in
lambda: self._embedding_fn(sample_ids))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\contrib\seq2seq\python\ops\helper.py", line 579, in
lambda ids: embedding_ops.embedding_lookup(embedding, ids))
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\embedding_ops.py", line 317, in embedding_lookup
transform_fn=None)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\embedding_ops.py", line 135, in _embedding_lookup_and_transform
array_ops.gather(params[0], ids, name=name), ids, max_norm)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\dispatch.py", line 180, in wrapper
return target(*args, **kwargs)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\array_ops.py", line 3956, in gather
params, indices, axis, name=name)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\ops\gen_array_ops.py", line 4082, in gather_v2
batch_dims=batch_dims, name=name)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\op_def_library.py", line 794, in _apply_op_helper
op_def=op_def)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\util\deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3357, in create_op
attrs, op_def, compute_device)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 3426, in _create_op_internal
op_def=op_def)
File "C:\Users\Der Gerät\AppData\Local\Programs\Python\Python37\lib\site-packages\tensorflow_core\python\framework\ops.py", line 1748, in init
self._traceback = tf_stack.extract_stack()
I have a vocab size of 75k, and I am trying to train a model with ~10.7 million pairs. I previously trained a smaller model with around 800k pairs with no issues. The guy from the other issue report says it's a memory issue, the fault of too big a batch size, and he mentions that a batch size of 16 worked for him; but even a batch size of 4 causes this error (at step 40k) in my case, which is interesting because he has the same graphics card. I also tried to decrease the vocab size to 15k, but again the same error. Can someone help me?
Thanks.
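For context on the traceback above: "Dst tensor is not initialized" at session-run time is a common TensorFlow 1.x symptom of the GPU allocator being exhausted, matching the BFC allocator warnings in the log. A generic TF 1.x mitigation is to stop TensorFlow from grabbing all GPU memory up front, leaving headroom for the inference graph used during external evaluation. Whether nmt-chatbot exposes a hook to pass this session config is an assumption; this sketch only shows the standard TF 1.x mechanism (a config fragment, not a runnable fix):

```python
# Sketch, TF 1.x only: limit GPU memory grabbing so the separate inference
# session created for decoding model/output_dev has headroom.
# How to wire this into nmt-chatbot's session creation is an assumption.
import tensorflow as tf

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand
# Alternative: hard-cap the training process at 80% of VRAM instead:
# config.gpu_options.per_process_gpu_memory_fraction = 0.8

sess = tf.Session(config=config)
```

Either option trades a little allocation speed for the fragmentation headroom that the logged BFC summary shows is missing (InUse 6667797248 of a 6667798446-byte limit).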