
'CUDA error: an illegal memory access was encountered' in forward #308

Open

gongwei-130 opened this issue Aug 7, 2020 · 7 comments

gongwei-130 commented Aug 7, 2020

Hi, I'm running into the following error when trying to train BERT with ds_train_bert_bsz64k_seq128_m.sh. I printed out all of the tensor shapes in the batch and they look fine: I set train_micro_batch_size_per_gpu=8 and train_batch_size=64, since I have 8 cards.

This error occurs during the forward pass of the first training step.
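For reference, the batch sizes are consistent under the usual DeepSpeed relation; a minimal check, assuming gradient_accumulation_steps=1 (my assumption, it is not set explicitly here):

```python
# Minimal check of the DeepSpeed batch-size relation (assumed values):
#   train_batch_size = train_micro_batch_size_per_gpu
#                      * gradient_accumulation_steps * world_size
train_micro_batch_size_per_gpu = 8
gradient_accumulation_steps = 1   # assumption, not stated above
world_size = 8                    # 8 cards

assert (train_micro_batch_size_per_gpu
        * gradient_accumulation_steps
        * world_size) == 64       # matches train_batch_size=64
```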

08/07/2020 15:02:47 - INFO - turing.logger -   worker-0: begin epoch 1 current_sample_count 0 shard_length 1000 global_data_samples 0
  0%|                                                                                                                              | 0/1000 [00:00<?, ?it/s]>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py:1061: UserWarning: This overload of nonzero is deprecated:
        nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
        nonzero(Tensor input, *, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  (masked_lm_labels + 1).view(-1)).view(-1)
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
!!!! kernel execution error.
  0%|                                                                                                                              | 0/1000 [00:03<?, ?it/s]
Traceback (most recent call last):
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 598, in <module>
    main()
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 591, in main
    run(args, model, optimizer, start_epoch)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 555, in run
    train_tfrecords(args, index, model, optimizer, train_data)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 238, in train_tfrecords
    loss = model.network(batch)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_light.py", line 691, in forward
    loss = self.module(*inputs, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py", line 1056, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py", line 977, in forward
    checkpoint_activations=checkpoint_activations)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py", line 594, in forward
    hidden_states = layer_module(hidden_states, attention_mask)
  File "/usr/local/lib/python3.7/dist-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_cuda.py", line 520, in forward
    self.config)
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_cuda.py", line 196, in forward
    config.gelu_checkpoint)
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f74df4931e2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f74df6e1f92 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f74df4819cd in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x541322 (0x7f74f041a322 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5413c6 (0x7f74f041a3c6 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: /usr/bin/python3() [0x5968b1]
frame #6: /usr/bin/python3() [0x5bc890]
frame #7: /usr/bin/python3() [0x4d2ece]
frame #8: /usr/bin/python3() [0x5bcae8]
frame #9: /usr/bin/python3() [0x59688d]
frame #10: /usr/bin/python3() [0x5bc890]
frame #11: /usr/bin/python3() [0x4d2ece]
frame #12: /usr/bin/python3() [0x5bcb01]
frame #13: /usr/bin/python3() [0x59688d]
frame #14: /usr/bin/python3() [0x5bc890]
frame #15: /usr/bin/python3() [0x4d2ece]
frame #16: /usr/bin/python3() [0x5bcb01]
frame #17: /usr/bin/python3() [0x59688d]
frame #18: /usr/bin/python3() [0x5bc890]
frame #19: /usr/bin/python3() [0x4d2ece]
frame #20: /usr/bin/python3() [0x5bcb01]
frame #21: /usr/bin/python3() [0x59688d]
frame #22: /usr/bin/python3() [0x5bc890]
frame #23: /usr/bin/python3() [0x4d2ece]
frame #24: /usr/bin/python3() [0x5bcb01]
frame #25: /usr/bin/python3() [0x59688d]
frame #26: _PyTrash_thread_destroy_chain + 0x35 (0x647675 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x5d6108]
frame #28: /usr/bin/python3() [0x5d64c3]
frame #29: PyDict_SetItem + 0x337 (0x5b8ca7 in /usr/bin/python3)
frame #30: _PyModule_ClearDict + 0x107 (0x5aaae7 in /usr/bin/python3)
frame #31: PyImport_Cleanup + 0x354 (0x5386b4 in /usr/bin/python3)
frame #32: Py_FinalizeEx + 0x6e (0x633f9e in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x653fcd]
frame #34: _Py_UnixMain + 0x2e (0x65420e in /usr/bin/python3)
frame #35: __libc_start_main + 0xeb (0x7f74f4fe109b in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5df66a in /usr/bin/python3)
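Note that CUDA errors surface asynchronously, so the frame shown in the traceback above is not necessarily where the illegal access actually happened. A minimal sketch (an assumption on my part, not part of the original script) for forcing synchronous kernel launches so the error points at the faulting op:

```python
# Force synchronous CUDA launches so the traceback points at the kernel
# that actually faulted. Must be set before CUDA is initialized.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

import torch  # import (and anything that touches CUDA) after setting the variable
# ...then run the same training step; the RuntimeError should now be raised
# at the offending call rather than at a later allocator/event call.
```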

tjruwase (Contributor) commented Aug 7, 2020

@gongwei-130 Thanks for reporting this issue.
To help with triage, could you please rerun with the custom kernels disabled by removing the --deepspeed_transformer_kernel flag from the script?
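For context, a rough sketch of what that flag usually controls (hypothetical names, not the exact bing_bert code): when it is set, the model is built with DeepSpeed's fused transformer kernels; without it, the stock PyTorch layers are used.

```python
# Hypothetical sketch of how such a flag is typically wired up; the real
# script's argument handling may differ.
import argparse

parser = argparse.ArgumentParser()
parser.add_argument("--deepspeed_transformer_kernel", action="store_true",
                    help="use DeepSpeed's fused CUDA transformer layer")
args = parser.parse_args([])  # flag omitted -> custom kernels disabled

if args.deepspeed_transformer_kernel:
    print("building the model with the fused DeepSpeed transformer layer")
else:
    print("building the model with the stock PyTorch transformer layers")
```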

gongwei-130 (Author) commented:

> @gongwei-130 Thanks for reporting this issue.
> To help with triage, could you please rerun with the custom kernels disabled by removing the --deepspeed_transformer_kernel flag from the script?

Sure. The error changes after I removed the --deepspeed_transformer_kernel flag.

0%|                                                                                                                              | 0/1000 [00:14<?, ?it/s]
Traceback (most recent call last):
 File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 598, in <module>
   main()
 File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 591, in main
   run(args, model, optimizer, start_epoch)
 File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 555, in run
   train_tfrecords(args, index, model, optimizer, train_data)
 File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 258, in train_tfrecords
   model.network.step()
 File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_light.py", line 812, in step
   self.optimizer.step()
 File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/fp16_unfused_optimizer.py", line 158, in step
   return self.step_fused_lamb()
 File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/fp16_unfused_optimizer.py", line 132, in step_fused_lamb
   norm_groups.append(get_weight_norm(grads_groups_flat[i], mpu=self.mpu))
 File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_utils.py", line 234, in get_weight_norm
   total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)])
RuntimeError: CUDA error: an illegal memory access was encountered
terminate called after throwing an instance of 'c10::Error'
 what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff4a28221e2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7ff4a2a70f92 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7ff4a28109cd in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x541322 (0x7ff493010322 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5413c6 (0x7ff4930103c6 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: /usr/bin/python3() [0x5968b1]
frame #6: /usr/bin/python3() [0x5bc890]
frame #7: /usr/bin/python3() [0x4d2ece]
frame #8: /usr/bin/python3() [0x5bcae8]
frame #9: /usr/bin/python3() [0x59688d]
frame #10: /usr/bin/python3() [0x5bc890]
frame #11: /usr/bin/python3() [0x4d2ece]
frame #12: /usr/bin/python3() [0x5bcb01]
frame #13: /usr/bin/python3() [0x59688d]
frame #14: /usr/bin/python3() [0x5bc890]
frame #15: /usr/bin/python3() [0x4d2ece]
frame #16: /usr/bin/python3() [0x5bcb01]
frame #17: /usr/bin/python3() [0x59688d]
frame #18: /usr/bin/python3() [0x5bc890]
frame #19: /usr/bin/python3() [0x4d2ece]
frame #20: /usr/bin/python3() [0x5bcb01]
frame #21: /usr/bin/python3() [0x59688d]
frame #22: /usr/bin/python3() [0x5bc890]
frame #23: /usr/bin/python3() [0x4d2ece]
frame #24: /usr/bin/python3() [0x5bcb01]
frame #25: /usr/bin/python3() [0x59688d]
frame #26: _PyTrash_thread_destroy_chain + 0x35 (0x647675 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x5d6108]
frame #28: /usr/bin/python3() [0x5d64c3]
frame #29: PyDict_SetItem + 0x337 (0x5b8ca7 in /usr/bin/python3)
frame #30: _PyModule_ClearDict + 0x107 (0x5aaae7 in /usr/bin/python3)
frame #31: PyImport_Cleanup + 0x354 (0x5386b4 in /usr/bin/python3)
frame #32: Py_FinalizeEx + 0x6e (0x633f9e in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x653fcd]
frame #34: _Py_UnixMain + 0x2e (0x65420e in /usr/bin/python3)
frame #35: __libc_start_main + 0xeb (0x7ff4a920f09b in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5df66a in /usr/bin/python3)

RezaYazdaniAminabadi (Contributor) commented:

Hi @gongwei-130

Thanks for trying out DeepSpeed. Which versions of CUDA and PyTorch are you using here, and which GPU architecture are you training on?
We can verify that the CUDA part of DeepSpeed works by running the unit tests. Could you please run "pytest tests/unit/test_cuda_forward.py -sv"?
If you can also share the full training log and the result of this test, we can understand the problem better.
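A quick way to capture those environment details (standard PyTorch calls, shown here as a sketch):

```python
# Print the PyTorch / CUDA / GPU details requested above.
import torch

print("torch:", torch.__version__)
print("CUDA (torch built with):", torch.version.cuda)
print("cuDNN:", torch.backends.cudnn.version())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
    print("compute capability:", torch.cuda.get_device_capability(0))
```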
Thanks.

Best regards,
Reza

gongwei-130 (Author) commented:

> Hi @gongwei-130
>
> Thanks for trying out DeepSpeed. Which versions of CUDA and PyTorch are you using here, and which GPU architecture are you training on?
> We can verify that the CUDA part of DeepSpeed works by running the unit tests. Could you please run "pytest tests/unit/test_cuda_forward.py -sv"?
> If you can also share the full training log and the result of this test, we can understand the problem better.
> Thanks.
>
> Best regards,
> Reza

Hi Reza, my CUDA version is 10.2 and my PyTorch version is 1.6.0. The GPUs are Tesla V100-SXM2 32GB. The full training log and the result of this test are as follows.

nvcc --version
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Wed_Oct_23_19:24:38_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
$ pytest tests/unit/test_cuda_forward.py -sv
==================================================================== test session starts ====================================================================
platform linux -- Python 3.7.3, pytest-6.0.1, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /opt/tiger/workspace/DeepSpeed
plugins: forked-1.3.0
collected 26 items                                                                                                                                          

tests/unit/test_cuda_forward.py::test_forward[64-1024-128-16-3-True-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[64-1024-128-16-3-True-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-384-16-3-True-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-384-16-3-True-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-512-16-3-True-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-512-16-3-True-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[64-1024-128-16-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[64-1024-128-16-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-384-16-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-384-16-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 384, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-512-16-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1024-512-16-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1536-128-24-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1536, 'max_seq_length': 128, 'heads': 24, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1536, 'max_seq_length': 128, 'heads': 24, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1536, 'max_seq_length': 128, 'heads': 24, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-1536-128-24-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1536, 'max_seq_length': 128, 'heads': 24, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1536, 'max_seq_length': 128, 'heads': 24, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1536, 'max_seq_length': 128, 'heads': 24, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-2048-128-32-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 2048, 'max_seq_length': 128, 'heads': 32, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 2048, 'max_seq_length': 128, 'heads': 32, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 2048, 'max_seq_length': 128, 'heads': 32, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-2048-128-32-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 2048, 'max_seq_length': 128, 'heads': 32, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 2048, 'max_seq_length': 128, 'heads': 32, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 2048, 'max_seq_length': 128, 'heads': 32, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-2560-128-40-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 2560, 'max_seq_length': 128, 'heads': 40, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 2560, 'max_seq_length': 128, 'heads': 40, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 2560, 'max_seq_length': 128, 'heads': 40, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward[8-2560-128-40-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 2560, 'max_seq_length': 128, 'heads': 40, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 2560, 'max_seq_length': 128, 'heads': 40, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 2560, 'max_seq_length': 128, 'heads': 40, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward_with_small_bsz[8-3-1024-512-16-3-True-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward_with_small_bsz[8-7-1024-512-16-3-True-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward_with_small_bsz[8-3-1024-512-16-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward_with_small_bsz[8-7-1024-512-16-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 8, 'hidden_size': 1024, 'max_seq_length': 512, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': False}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward_stochastic[64-1024-128-16-3-True-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward_stochastic[64-1024-128-16-3-True-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': True, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #2 is created with date type [half].
PASSED
tests/unit/test_cuda_forward.py::test_forward_stochastic[64-1024-128-16-3-False-False] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #0 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #1 is created with date type [float].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': False, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #2 is created with date type [float].
PASSED
tests/unit/test_cuda_forward.py::test_forward_stochastic[64-1024-128-16-3-False-True] DeepSpeed Transformer config is  {'layer_id': 0, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #0 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 1, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #1 is created with date type [half].
DeepSpeed Transformer config is  {'layer_id': 2, 'batch_size': 64, 'hidden_size': 1024, 'max_seq_length': 128, 'heads': 16, 'attn_dropout_ratio': 0.0, 'hidden_dropout_ratio': 0.0, 'num_hidden_layers': 3, 'initializer_range': 0.02, 'fp16': True, 'pre_layer_norm': False, 'local_rank': -1, 'seed': -1, 'normalize_invertible': False, 'gelu_checkpoint': False, 'adjust_init_range': True, 'test_gemm': False, 'training': True, 'is_grad_enabled': True, 'attn_dropout_checkpoint': False, 'stochastic_mode': True}
layer #2 is created with date type [half].
PASSED
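(For reference, each of the kernel forward tests above builds its layers from a `DeepSpeedTransformerConfig` whose fields are exactly the dict printed in the log. A minimal sketch of that config, assuming the constructor accepts keyword arguments matching the printed keys; the exact signature, and `DeepSpeedTransformerLayer`'s arguments, may differ between DeepSpeed releases, so treat this as illustrative only:)

```python
from deepspeed import DeepSpeedTransformerConfig

# Sketch only: keyword names mirror the config dict printed by test_cuda_forward.py.
config = DeepSpeedTransformerConfig(batch_size=64,
                                    max_seq_length=128,
                                    hidden_size=1024,
                                    heads=16,
                                    attn_dropout_ratio=0.0,
                                    hidden_dropout_ratio=0.0,
                                    num_hidden_layers=3,
                                    initializer_range=0.02,
                                    local_rank=-1,
                                    seed=-1,
                                    fp16=True,
                                    pre_layer_norm=True,
                                    stochastic_mode=True)
```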

===================================================================== warnings summary ======================================================================
deepspeed/pt/zero_optimizer_stage1.py:43
  /opt/tiger/workspace/DeepSpeed/deepspeed/pt/zero_optimizer_stage1.py:43: SyntaxWarning: assertion is always true, perhaps remove parentheses?
    assert (max_elements_per_comm >= dp,
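(Unrelated to the crash, but the SyntaxWarning above points at a real bug: asserting a parenthesized `(condition, "message")` pair always passes, because a non-empty tuple is truthy. A minimal sketch of the difference; the variable values here are made up:)

```python
max_elements_per_comm, dp = 4, 8   # hypothetical values; dp would be the data-parallel size

# Buggy pattern from the warning: the tuple is truthy, so this never fires
# even though 4 < 8.
assert (max_elements_per_comm >= dp, "max_elements_per_comm should be >= dp")

# Intended pattern: condition first, message as the assert's second operand.
# With the values above this one would raise AssertionError, so it is left commented.
# assert max_elements_per_comm >= dp, "max_elements_per_comm should be >= dp"
```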

deepspeed/pt/deepspeed_fused_lamb.py:46
  /opt/tiger/workspace/DeepSpeed/deepspeed/pt/deepspeed_fused_lamb.py:46: DeprecationWarning: invalid escape sequence \:
    """

-- Docs: https://docs.pytest.org/en/stable/warnings.html
============================================================== 26 passed, 2 warnings in 22.66s =============================================================
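(The DeprecationWarning is similar housekeeping: a literal `\:` sits inside a normal, non-raw docstring in deepspeed_fused_lamb.py. Making the docstring a raw string, or doubling the backslash, silences it. A generic sketch, not the actual docstring text:)

```python
def fused_lamb_doc_sketch():
    r"""Hypothetical docstring containing a backslash sequence such as \: .

    The leading r makes the backslash literal, so Python no longer warns
    about an invalid escape sequence.
    """
```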
[2020-08-08 00:50:30,445] [WARNING] [deepspeed_run.py:90:fetch_hostfile] Unable to find hostfile, will proceed with training with local resources only.
[2020-08-08 00:50:30,925] [INFO] [deepspeed_run.py:333:main] cmd=['/usr/bin/python3', '-u', '-m', 'deepspeed.pt.deepspeed_launch', '--world_info=eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119', '--master_addr=127.0.0.1', '--master_port=29500', '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py', '--use_tfrecords', '--train_data=hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', '--cf', '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', '--max_seq_length', '128', '--output_dir', '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', '--deepspeed', '--print_steps', '100', '--lr_schedule', 'EE', '--lr_offset', '10e-4', '--job_name', 'lamb_64k_seq128', '--deepspeed_config', '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', '--max_predictions_per_seq=20', '--data_path_prefix', '/data/bert']
[2020-08-08 00:50:31,478] [INFO] [deepspeed_launch.py:64:main] 0 NCCL_TREE_THRESHOLD 0
[2020-08-08 00:50:31,478] [INFO] [deepspeed_launch.py:71:main] WORLD INFO DICT: {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
[2020-08-08 00:50:31,478] [INFO] [deepspeed_launch.py:80:main] nnodes=1, num_local_procs=8, node_rank=0
[2020-08-08 00:50:31,478] [INFO] [deepspeed_launch.py:92:main] global_rank_mapping=defaultdict(<class 'list'>, {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]})
[2020-08-08 00:50:31,478] [INFO] [deepspeed_launch.py:93:main] dist_world_size=8
[2020-08-08 00:50:31,478] [INFO] [deepspeed_launch.py:96:main] Setting CUDA_VISIBLE_DEVICES=0,1,2,3,4,5,6,7
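(As an aside, the `--world_info` blob in the launch command above is just base64-encoded JSON; decoding it reproduces the WORLD INFO DICT line:)

```python
import base64
import json

world_info = "eyJsb2NhbGhvc3QiOiBbMCwgMSwgMiwgMywgNCwgNSwgNiwgN119"
print(json.loads(base64.urlsafe_b64decode(world_info)))
# {'localhost': [0, 1, 2, 3, 4, 5, 6, 7]}
```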
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=0, logger=<turing.logger.Logger object at 0x7fe04a677eb8>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=6, logger=<turing.logger.Logger object at 0x7fead62d9eb8>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=1, logger=<turing.logger.Logger object at 0x7ff4a89feeb8>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
VOCAB SIZE: 30528
08/08/2020 00:50:32 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
VOCAB SIZE: 30528
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=5, logger=<turing.logger.Logger object at 0x7f734cddaeb8>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
VOCAB SIZE: 30528
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=3, logger=<turing.logger.Logger object at 0x7f0492febeb8>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=4, logger=<turing.logger.Logger object at 0x7f71a8280eb8>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:32 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:33 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
VOCAB SIZE: 30528
08/08/2020 00:50:33 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
VOCAB SIZE: 30528
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=2, logger=<turing.logger.Logger object at 0x7fb4f4056eb8>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:33 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
Running Config File:  lamb_64k_seq128
Args = Namespace(attention_dropout_checkpoint=False, ckpt_to_save=None, config={'name': 'bing_bert_large_lamb_seq', 'bert_token_file': '/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased', 'bert_model_file': 'bert-large-uncased', 'bert_model_config': {'vocab_size_or_config_json_file': 119547, 'hidden_size': 1024, 'num_hidden_layers': 24, 'num_attention_heads': 16, 'intermediate_size': 4096, 'hidden_act': 'gelu', 'hidden_dropout_prob': 0.1, 'attention_probs_dropout_prob': 0.1, 'max_position_embeddings': 512, 'type_vocab_size': 2, 'initializer_range': 0.02}, 'data': {'flags': {'pretrain_dataset': True, 'pretrain_type': 'wiki_bc'}, 'datasets': {'wiki_pretrain_dataset': 'bnorick_format/128/wiki_pretrain', 'bc_pretrain_dataset': 'bnorick_format/128/bookcorpus_pretrain'}}, 'validation': {'path': 'validation_set/'}, 'training': {'num_epochs': 150, 'warmup_proportion': 0.06, 'learning_rate': 0.011, 'num_workers': 0, 'async_worker': True, 'decay_rate': 0.9, 'decay_step': 250, 'total_training_steps': 7500}}, config_file='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_large_lamb.json', data_path_prefix='/data/bert', deepscale=False, deepscale_config=None, deepspeed=True, deepspeed_config='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_bsz64k_lamb_config_seq128.json', deepspeed_mpi=False, deepspeed_transformer_kernel=False, do_lower_case=True, finetune=False, gelu_checkpoint=False, job_name='lamb_64k_seq128', load_checkpoint_id=None, load_training_checkpoint=None, local_rank=7, logger=<turing.logger.Logger object at 0x7f98af1e5e80>, lr_offset=0.001, lr_schedule='EE', max_predictions_per_seq=20, max_seq_length=128, max_steps=9223372036854775807, max_steps_per_epoch=9223372036854775807, no_cuda=False, normalize_invertible=False, output_dir='/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert_model_outputs', print_steps=100, refresh_bucket_size=1, rewarmup=False, seed=42, stochastic_mode=False, train_data='hdfs://haruna/home/byte_arnold_lq/lab/yuchen/bert_en/tfrecord/lower_case_1_seq_len_128_max_pred_20_masked_lm_prob_0.15_random_seed_12345_dupe_factor_5_shard_1472_test_split_10/books_wiki_en_corpus/training/books*', use_nvidia_dataset=False, use_pretrain=False, use_tfrecords=True, validation_data_path_prefix=None)
>>>> vocab_file: /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:33 - INFO - pytorch_pretrained_bert.tokenization -   loading vocabulary file /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/bert-large-uncased/vocab.txt
08/08/2020 00:50:33 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
VOCAB SIZE: 30528
08/08/2020 00:50:33 - WARNING - root -   Skipping validation because validation_data_path_prefix is unspecified
VOCAB SIZE: 30528
VOCAB SIZE: 30528
Accounting for accumulation on the residual path
(this message repeats many times across all 8 ranks while the models are being built)
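(The "Accounting for accumulation on the residual path" message comes from the pre-LayerNorm model's weight initialization. Assuming it refers to the usual GPT-2-style scaled init, where projections that write into the residual stream get a smaller standard deviation so the summed activations do not grow with depth, the adjustment would look roughly like this; the formula is an assumption, not copied from modelingpreln.py:)

```python
import math

# Hypothetical sketch of the scaled initialization the message likely refers to.
initializer_range = 0.02      # from bert_model_config
num_hidden_layers = 24        # BERT-large

output_std = initializer_range / math.sqrt(2.0 * num_hidden_layers)
print(output_std)  # ~0.00289: std for weights feeding the residual path
```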
08/08/2020 00:50:36 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
08/08/2020 00:50:36 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
08/08/2020 00:50:36 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
08/08/2020 00:50:37 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
08/08/2020 00:50:37 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
08/08/2020 00:50:37 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
08/08/2020 00:50:37 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
08/08/2020 00:50:37 - INFO - nvidia.modelingpreln -   Init BERT pretrain model
[2020-08-08 00:50:38,786] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:38,786] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:38,825] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 6 within node.
[2020-08-08 00:50:38,995] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:38,995] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:39,088] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 1 within node.
[2020-08-08 00:50:39,500] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:39,501] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:39,543] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 0 within node.
[2020-08-08 00:50:39,916] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:39,917] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:39,960] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 5 within node.
[2020-08-08 00:50:40,035] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:40,036] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:40,070] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:40,070] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:40,096] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 3 within node.
[2020-08-08 00:50:40,123] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 4 within node.
[2020-08-08 00:50:40,193] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:40,193] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:40,300] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 2 within node.
[2020-08-08 00:50:40,311] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=96c4daa, git-branch=0.2
[2020-08-08 00:50:40,312] [INFO] [deepspeed_config.py:411:_set_batch_related_parameters]  After Train batch 64 micro_batch 8 and grad_acc 1
[2020-08-08 00:50:40,495] [INFO] [deepspeed_light.py:403:_init_distributed] Set device to local rank 7 within node.
[2020-08-08 00:50:42,836] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
[2020-08-08 00:50:43,026] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
[2020-08-08 00:50:43,335] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[2020-08-08 00:50:43,966] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[2020-08-08 00:50:44,007] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[2020-08-08 00:50:44,047] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[2020-08-08 00:50:44,270] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[2020-08-08 00:50:44,279] [INFO] [deepspeed_light.py:74:_initialize_parameter_parallel_groups] data_parallel_size: 8, parameter_parallel_size: 8
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[2020-08-08 00:50:45,862] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,862] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,863] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:45,878] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,878] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,878] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,878] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,878] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:45,878] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:45,879] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,879] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,879] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:45,881] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,881] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,881] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:45,889] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,889] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,890] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,890] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,890] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:45,890] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:45,890] [INFO] [fp16_unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
[2020-08-08 00:50:45,890] [INFO] [deepspeed_light.py:484:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 00:50:45,890] [INFO] [deepspeed_light.py:486:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.01

Parameter Group 1
    betas: (0.9, 0.999)
    bias_correction: False
    eps: 1e-08
    lr: 0.011
    max_coeff: 0.3
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 00:50:45,890] [INFO] [deepspeed_light.py:547:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 00:50:46,103] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,103] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,103] [INFO] [deepspeed_light.py:912:_report_progress] rank:0 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:424:print] DeepSpeedLight configuration:
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:428:print]   activation_checkpointing_config  <deepspeed.pt.deepspeed_checkpointing_config.DeepSpeedActivationCheckpointingConfig object at 0x7fe037969048>
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:428:print]   allgather_size ............... 500000000
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:428:print]   allreduce_always_fp32 ........ False
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:428:print]   disable_allgather ............ False
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:428:print]   dump_state ................... False
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:428:print]   dynamic_loss_scale_args ...... None
[2020-08-08 00:50:46,103] [INFO] [deepspeed_config.py:428:print]   fp16_enabled ................. True
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   global_rank .................. 0
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   gradient_accumulation_steps .. 1
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   gradient_clipping ............ 1.0
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   gradient_predivide_factor .... 1.0
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   initial_dynamic_scale ........ 4294967296
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   loss_scale ................... 0
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   memory_breakdown ............. False
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   optimizer_legacy_fusion ...... False
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   optimizer_name ............... lamb
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   optimizer_params ............. {'lr': 0.011, 'weight_decay': 0.01, 'bias_correction': False, 'max_coeff': 0.3, 'min_coeff': 0.01}
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   prescale_gradients ........... False
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   scheduler_name ............... None
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   scheduler_params ............. None
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   sparse_gradients_enabled ..... False
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   steps_per_print .............. 1000
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   tensorboard_enabled .......... False
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   tensorboard_output_path ...... 
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   train_batch_size ............. 64
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   train_micro_batch_size_per_gpu  8
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   wall_clock_breakdown ......... False
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   world_size ................... 8
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   zero_allow_untested_optimizer  False
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   zero_config .................. <deepspeed.pt.deepspeed_zero_config.DeepSpeedZeroConfig object at 0x7fe037969320>
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   zero_enabled ................. False
[2020-08-08 00:50:46,104] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:428:print]   zero_optimization_stage ...... 0
[2020-08-08 00:50:46,104] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,104] [INFO] [deepspeed_light.py:912:_report_progress] rank:2 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
[2020-08-08 00:50:46,104] [INFO] [deepspeed_config.py:434:print]   json = {
    "fp16":{
        "enabled":true,
        "loss_scale":0
    },
    "gradient_clipping":1.0,
    "optimizer":{
        "params":{
            "bias_correction":false,
            "lr":0.011,
            "max_coeff":0.3,
            "min_coeff":0.01,
            "weight_decay":0.01
        },
        "type":"Lamb"
    },
    "prescale_gradients":false,
    "steps_per_print":1000,
    "train_batch_size":64,
    "train_micro_batch_size_per_gpu":8,
    "wall_clock_breakdown":false
}
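(The batch-size settings in this JSON are self-consistent with the 8-GPU launch, since DeepSpeed requires train_batch_size = train_micro_batch_size_per_gpu x gradient_accumulation_steps x world_size, which is where the "Train batch 64 micro_batch 8 and grad_acc 1" lines above come from:)

```python
train_micro_batch_size_per_gpu = 8
gradient_accumulation_steps = 1   # "grad_acc 1" in the log
world_size = 8                    # one process per local GPU

assert (train_micro_batch_size_per_gpu
        * gradient_accumulation_steps
        * world_size) == 64       # matches train_batch_size in the JSON above
```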
[2020-08-08 00:50:46,104] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:912:_report_progress] rank:4 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
[2020-08-08 00:50:46,105] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,105] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:912:_report_progress] rank:3 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,105] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:912:_report_progress] rank:7 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:912:_report_progress] rank:1 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
[2020-08-08 00:50:46,105] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,105] [WARNING] [deepspeed_light.py:359:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:912:_report_progress] rank:6 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:361:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 00:50:46,105] [INFO] [deepspeed_light.py:912:_report_progress] rank:5 step=0, skipped=0, lr=[0.011, 0.011], mom=[(0.9, 0.999), (0.9, 0.999)]
08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:21: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.

08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:21: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.

08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:30: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:30: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:44: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:44: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.
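(The TensorFlow warnings here are harmless deprecation notices from tf_dl.py; the renames they ask for are mechanical. A sketch only, with an arbitrary seed value:)

```python
import tensorflow as tf

tf.compat.v1.enable_eager_execution()              # was tf.enable_eager_execution()
tf.compat.v1.set_random_seed(1234)                 # was tf.set_random_seed(...)
feature = tf.io.FixedLenFeature([128], tf.int64)   # was tf.FixedLenFeature(...)
```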

08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:21: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.

2020-08-08 00:50:46.973280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-08 00:50:46.973280: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:30: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:44: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-08-08 00:50:46.973864: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-08 00:50:46.981569: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1a:00.0
2020-08-08 00:50:46.981664: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1a:00.0
2020-08-08 00:50:46.982259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1a:00.0
2020-08-08 00:50:46.989876: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1b:00.0
2020-08-08 00:50:46.989975: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1b:00.0
2020-08-08 00:50:46.990564: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1b:00.0
2020-08-08 00:50:46.998237: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:3d:00.0
2020-08-08 00:50:46.998336: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:3d:00.0
2020-08-08 00:50:46.998929: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 2 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:3d:00.0
08/08/2020 00:50:46 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:21: The name tf.enable_eager_execution is deprecated. Please use tf.compat.v1.enable_eager_execution instead.

08/08/2020 00:50:47 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:30: The name tf.set_random_seed is deprecated. Please use tf.compat.v1.set_random_seed instead.

08/08/2020 00:50:47 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:44: The name tf.FixedLenFeature is deprecated. Please use tf.io.FixedLenFeature instead.

2020-08-08 00:50:47.000751: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcuda.so.1
2020-08-08 00:50:47.007122: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:3e:00.0
2020-08-08 00:50:47.007225: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:3e:00.0
2020-08-08 00:50:47.007829: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 3 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:3e:00.0
2020-08-08 00:50:47.011539: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 0 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1a:00.0
2020-08-08 00:50:47.017710: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:88:00.0
2020-08-08 00:50:47.017805: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:88:00.0
2020-08-08 00:50:47.018407: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 4 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:88:00.0
2020-08-08 00:50:47.022123: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 1 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:1b:00.0
2020-08-08 00:50:47.028286: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 5 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:89:00.0
2020-08-08 00:50:47.038845: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 6 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:b1:00.0
2020-08-08 00:50:47.049917: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1639] Found device 7 with properties: 
name: Tesla V100-SXM2-32GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:b2:00.0
2020-08-08 00:50:47.050089: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcudart.so.10.0'; dlerror: libcudart.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/tiger/yarn_deploy/jdk/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/opt/tiger/cuda/compat:/opt/tiger/cuda/lib64:
2020-08-08 00:50:47.050184: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcublas.so.10.0'; dlerror: libcublas.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/tiger/yarn_deploy/jdk/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/opt/tiger/cuda/compat:/opt/tiger/cuda/lib64:
2020-08-08 00:50:47.050274: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcufft.so.10.0'; dlerror: libcufft.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/tiger/yarn_deploy/jdk/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/opt/tiger/cuda/compat:/opt/tiger/cuda/lib64:
2020-08-08 00:50:47.050351: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcurand.so.10.0'; dlerror: libcurand.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/tiger/yarn_deploy/jdk/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/opt/tiger/cuda/compat:/opt/tiger/cuda/lib64:
2020-08-08 00:50:47.050446: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusolver.so.10.0'; dlerror: libcusolver.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/tiger/yarn_deploy/jdk/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/opt/tiger/cuda/compat:/opt/tiger/cuda/lib64:
2020-08-08 00:50:47.050542: W tensorflow/stream_executor/platform/default/dso_loader.cc:55] Could not load dynamic library 'libcusparse.so.10.0'; dlerror: libcusparse.so.10.0: cannot open shared object file: No such file or directory; LD_LIBRARY_PATH: /opt/tiger/yarn_deploy/jdk/jre/lib/amd64/server:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native/ufs:/opt/tiger/yarn_deploy/hadoop/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lib/native:/opt/tiger/yarn_deploy/hadoop_current/lzo/lib:/opt/tiger/cuda/compat:/opt/tiger/cuda/lib64:
2020-08-08 00:50:47.053622: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudnn.so.7
2020-08-08 00:50:47.053651: W tensorflow/core/common_runtime/gpu/gpu_device.cc:1662] Cannot dlopen some GPU libraries. Please make sure the missing libraries mentioned above are installed properly if you would like to use GPU. Follow the guide at https://www.tensorflow.org/install/gpu for how to download and setup the required libraries for your platform.
Skipping registering GPU devices...
2020-08-08 00:50:47.053942: I tensorflow/core/platform/cpu_feature_guard.cc:142] Your CPU supports instructions that this TensorFlow binary was not compiled to use: AVX2 AVX512F FMA
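Every "Could not load dynamic library" warning above is the same failure: none of the CUDA 10.0 libraries resolve from the LD_LIBRARY_PATH shown, so this TensorFlow process gives up on GPU registration ("Skipping registering GPU devices...") and stays on CPU. A minimal probe to confirm which of these libraries are actually loadable from the same environment (library names copied from the warnings; the probe itself is only a debugging sketch, not part of the training script):

import ctypes

libs = ["libcudart.so.10.0", "libcublas.so.10.0", "libcufft.so.10.0",
        "libcurand.so.10.0", "libcusolver.so.10.0", "libcusparse.so.10.0",
        "libcudnn.so.7"]
for name in libs:
    try:
        ctypes.CDLL(name)            # same dlopen lookup TensorFlow performs
        print("OK   ", name)
    except OSError as err:
        print("MISS ", name, "->", err)

These warnings come from the tf_dl.py data-loading path; whether they are related to the illegal memory access in the PyTorch forward pass is a separate question.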
2020-08-08 00:50:47.064216: I tensorflow/core/platform/profile_utils/cpu_utils.cc:94] CPU Frequency: 2400000000 Hz
2020-08-08 00:50:47.068920: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0xe0d8460 initialized for platform Host (this does not guarantee that XLA will be used). Devices:
2020-08-08 00:50:47.068950: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Host, Default Version
2020-08-08 00:50:57.107605: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x3987b60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-08-08 00:50:57.107655: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.107664: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.107671: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.107678: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.107685: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (4): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.107692: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (5): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.107699: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (6): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.107705: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (7): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.127696: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-08 00:50:57.127743: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      
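By the time XLA initializes on the CUDA platform, all eight V100s (compute capability 7.0) are visible, so the devices themselves are reachable. A quick cross-check that the PyTorch/DeepSpeed side sees the same eight GPUs (a sketch, assuming it is run in the same environment as the training script):

import torch

print(torch.cuda.device_count())     # expected: 8
for i in range(torch.cuda.device_count()):
    print(i, torch.cuda.get_device_name(i), torch.cuda.get_device_capability(i))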
2020-08-08 00:50:57.133790: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x9d2bbd10 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-08-08 00:50:57.133833: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.133842: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.133848: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.133855: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.133862: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (4): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.133868: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (5): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.133874: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (6): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.133880: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (7): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152050: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1180] Device interconnect StreamExecutor with strength 1 edge matrix:
2020-08-08 00:50:57.152083: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1186]      
2020-08-08 00:50:57.152788: I tensorflow/compiler/xla/service/service.cc:168] XLA service 0x9cca1e60 initialized for platform CUDA (this does not guarantee that XLA will be used). Devices:
2020-08-08 00:50:57.152824: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (0): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152833: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (1): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152840: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (2): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152847: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (3): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152853: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (4): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152860: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (5): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152866: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (6): Tesla V100-SXM2-32GB, Compute Capability 7.0
2020-08-08 00:50:57.152872: I tensorflow/compiler/xla/service/service.cc:176]   StreamExecutor device (7): Tesla V100-SXM2-32GB, Compute Capability 7.0
08/08/2020 00:51:10 - WARNING - tensorflow -   From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/data/util/random_seed.py:58: where (from tensorflow.python.ops.array_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use tf.where in 2.0, which has the same broadcast rule as np.where
08/08/2020 00:51:10 - WARNING - tensorflow -   
The TensorFlow contrib module will not be included in TensorFlow 2.0.
For more information, please see:
  * https://github.com/tensorflow/community/blob/master/rfcs/20180907-contrib-sunset.md
  * https://github.com/tensorflow/addons
  * https://github.com/tensorflow/io (for I/O related ops)
If you depend on functionality not listed there, please file an issue.

08/08/2020 00:51:11 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:73: parallel_interleave (from tensorflow.contrib.data.python.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.parallel_interleave(...)`.
08/08/2020 00:51:11 - WARNING - tensorflow -   From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/data/python/ops/interleave_ops.py:77: parallel_interleave (from tensorflow.python.data.experimental.ops.interleave_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.interleave(map_func, cycle_length, block_length, num_parallel_calls=tf.data.experimental.AUTOTUNE)` instead. If sloppy execution is desired, use `tf.data.Options.experimental_determinstic`.
08/08/2020 00:51:11 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:87: map_and_batch (from tensorflow.contrib.data.python.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.experimental.map_and_batch(...)`.
08/08/2020 00:51:11 - WARNING - tensorflow -   From /usr/local/lib/python3.7/dist-packages/tensorflow_core/contrib/data/python/ops/batching.py:276: map_and_batch (from tensorflow.python.data.experimental.ops.batching) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.data.Dataset.map(map_func, num_parallel_calls)` followed by `tf.data.Dataset.batch(batch_size, drop_remainder)`. Static tf.data optimizations will take care of using the fused implementation.
08/08/2020 00:51:11 - WARNING - tensorflow -   From /usr/local/lib/python3.7/dist-packages/tensorflow_core/python/autograph/converters/directives.py:119: The name tf.parse_single_example is deprecated. Please use tf.io.parse_single_example instead.

08/08/2020 00:51:11 - WARNING - tensorflow -   From /opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/tf_dl.py:111: to_int32 (from tensorflow.python.ops.math_ops) is deprecated and will be removed in a future version.
Instructions for updating:
Use `tf.cast` instead.
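
(For reference, the replacements these tf_dl.py warnings point at look roughly like the sketch below. This is a hedged, standalone illustration against the TF 1.15-era API, not the actual tf_dl.py code; the shard pattern and feature spec are invented for the example.)

    import tensorflow as tf

    # Hypothetical shard pattern and feature spec, only for illustration.
    files = tf.data.Dataset.list_files("/path/to/shard-*.tfrecord")

    # tf.contrib.data.parallel_interleave -> tf.data.experimental.parallel_interleave
    dataset = files.apply(
        tf.data.experimental.parallel_interleave(
            tf.data.TFRecordDataset, cycle_length=4, sloppy=True))

    def parse(record):
        # tf.parse_single_example -> tf.io.parse_single_example
        feats = tf.io.parse_single_example(
            record, {"input_ids": tf.io.FixedLenFeature([128], tf.int64)})
        # tf.to_int32(x) -> tf.cast(x, tf.int32)
        return tf.cast(feats["input_ids"], tf.int32)

    # tf.contrib.data.map_and_batch -> tf.data.experimental.map_and_batch
    dataset = dataset.apply(
        tf.data.experimental.map_and_batch(parse, batch_size=8))
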
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
08/08/2020 00:51:12 - INFO - turing.logger -   Training Epoch: 1
08/08/2020 00:51:12 - INFO - turing.logger -   worker-0: begin epoch 1 current_sample_count 0 shard_length 1000 global_data_samples 0
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
  0%|                                                                                                                              | 0/1000 [00:00<?, ?it/s]>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/nvidia/modelingpreln.py:1061: UserWarning: This overload of nonzero is deprecated:
        nonzero(Tensor input, *, Tensor out)
Consider using one of the following signatures instead:
        nonzero(Tensor input, *, bool as_tuple) (Triggered internally at  /pytorch/torch/csrc/utils/python_arg_parser.cpp:766.)
  (masked_lm_labels + 1).view(-1)).view(-1)
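
(The UserWarning above comes from the masked-LM label selection in modelingpreln.py; in PyTorch 1.6 any call to torch.nonzero without an explicit as_tuple argument raises it. A minimal standalone sketch with made-up label values, not the model code itself:)

    import torch

    # Toy labels: -1 marks unmasked positions, >= 0 is a masked-LM target id.
    masked_lm_labels = torch.tensor([[-1, 5, -1, 9]])

    # torch.nonzero((masked_lm_labels + 1).view(-1)) triggers the deprecation
    # warning; passing as_tuple=False keeps the old behaviour and silences it.
    label_positions = torch.nonzero(
        (masked_lm_labels + 1).view(-1), as_tuple=False).view(-1)

    print(label_positions)  # tensor([1, 3]) -> indices that carry a label
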
  0%|                                                                                                                              | 0/1000 [00:00<?, ?it/s]>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
[2020-08-08 00:51:26,691] [INFO] [fp16_unfused_optimizer.py:241:_update_scale] Grad overflow on iteration: 0
[2020-08-08 00:51:26,692] [INFO] [fp16_unfused_optimizer.py:243:_update_scale] Reducing dynamic loss scale from 65536.0 to 32768.0
[2020-08-08 00:51:26,692] [INFO] [fp16_unfused_optimizer.py:143:step_fused_lamb] [deepspeed] OVERFLOW! Skipping step. Attempted loss scale: 65536.0, reducing to 32768.0
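
(The overflow/skip on the very first step is normal dynamic-loss-scaling behaviour rather than the failure itself: the scaler starts high and halves until gradients stop overflowing. For context, a hedged sketch of the fp16 section of a DeepSpeed config that produces this behaviour; the key names are the documented DeepSpeed fp16 options, but the values are illustrative and not taken from ds_train_bert_bsz64k_seq128_m.sh:)

    # Illustrative fp16 section of the DeepSpeed config JSON, written as a
    # Python dict for readability; "loss_scale": 0 selects dynamic scaling.
    fp16_config = {
        "fp16": {
            "enabled": True,
            "loss_scale": 0,             # 0 => dynamic loss scaling
            "initial_scale_power": 16,   # start at 2**16 = 65536, as in the log
            "loss_scale_window": 1000,
            "hysteresis": 2,
            "min_loss_scale": 1,
        }
    }
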
  0%|                                                                                                                    | 1/1000 [00:13<3:51:42, 13.92s/it]>>>>>>>>input_ids: torch.Size([8, 128]), input_mask:torch.Size([8, 128]), segment_ids: torch.Size([8, 128]), next_sentence_labels: torch.Size([8, 1]), mask_labels: torch.Size([8, 128])
  0%|                                                                                                                              | 0/1000 [00:10<?, ?it/s]
Traceback (most recent call last):
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 598, in <module>
    main()
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 591, in main
    run(args, model, optimizer, start_epoch)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 555, in run
    train_tfrecords(args, index, model, optimizer, train_data)
  File "/opt/tiger/workspace/DeepSpeed/DeepSpeedExamples2/bing_bert/deepspeed_train.py", line 258, in train_tfrecords
    model.network.step()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_light.py", line 812, in step
    self.optimizer.step()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/fp16_unfused_optimizer.py", line 158, in step
    return self.step_fused_lamb()
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/fp16_unfused_optimizer.py", line 132, in step_fused_lamb
    norm_groups.append(get_weight_norm(grads_groups_flat[i], mpu=self.mpu))
  File "/usr/local/lib/python3.7/dist-packages/deepspeed/pt/deepspeed_utils.py", line 234, in get_weight_norm
    total_norm_cuda = torch.cuda.FloatTensor([float(total_norm)])
RuntimeError: CUDA error: an illegal memory access was encountered
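
(CUDA reports illegal memory accesses asynchronously, so the Python frame above, the torch.cuda.FloatTensor call inside get_weight_norm, is most likely only where the error surfaced, not the op that faulted. A minimal, assumed debugging sketch to localize the faulting kernel; it is not part of the original training script:)

    import os

    # Must be set before CUDA is initialized; exporting CUDA_LAUNCH_BLOCKING=1
    # in the shell that launches the training script works equally well.
    os.environ["CUDA_LAUNCH_BLOCKING"] = "1"

    import torch  # torch imported after setting the variable, before any CUDA work

    # ... re-run the same forward/backward/step here; with synchronous launches
    # the RuntimeError is raised at the kernel that actually performed the
    # illegal access instead of at a later host-side call.
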
terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7ff4a28221e2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7ff4a2a70f92 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7ff4a28109cd in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x541322 (0x7ff493010322 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5413c6 (0x7ff4930103c6 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: /usr/bin/python3() [0x5968b1]
frame #6: /usr/bin/python3() [0x5bc890]
frame #7: /usr/bin/python3() [0x4d2ece]
frame #8: /usr/bin/python3() [0x5bcae8]
frame #9: /usr/bin/python3() [0x59688d]
frame #10: /usr/bin/python3() [0x5bc890]
frame #11: /usr/bin/python3() [0x4d2ece]
frame #12: /usr/bin/python3() [0x5bcb01]
frame #13: /usr/bin/python3() [0x59688d]
frame #14: /usr/bin/python3() [0x5bc890]
frame #15: /usr/bin/python3() [0x4d2ece]
frame #16: /usr/bin/python3() [0x5bcb01]
frame #17: /usr/bin/python3() [0x59688d]
frame #18: /usr/bin/python3() [0x5bc890]
frame #19: /usr/bin/python3() [0x4d2ece]
frame #20: /usr/bin/python3() [0x5bcb01]
frame #21: /usr/bin/python3() [0x59688d]
frame #22: /usr/bin/python3() [0x5bc890]
frame #23: /usr/bin/python3() [0x4d2ece]
frame #24: /usr/bin/python3() [0x5bcb01]
frame #25: /usr/bin/python3() [0x59688d]
frame #26: _PyTrash_thread_destroy_chain + 0x35 (0x647675 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x5d6108]
frame #28: /usr/bin/python3() [0x5d64c3]
frame #29: PyDict_SetItem + 0x337 (0x5b8ca7 in /usr/bin/python3)
frame #30: _PyModule_ClearDict + 0x107 (0x5aaae7 in /usr/bin/python3)
frame #31: PyImport_Cleanup + 0x354 (0x5386b4 in /usr/bin/python3)
frame #32: Py_FinalizeEx + 0x6e (0x633f9e in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x653fcd]
frame #34: _Py_UnixMain + 0x2e (0x65420e in /usr/bin/python3)
frame #35: __libc_start_main + 0xeb (0x7ff4a920f09b in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5df66a in /usr/bin/python3)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f047dcae1e2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f047defcf92 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f047dc9c9cd in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x541322 (0x7f048ec35322 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5413c6 (0x7f048ec353c6 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: /usr/bin/python3() [0x5968b1]
frame #6: /usr/bin/python3() [0x5bc890]
frame #7: /usr/bin/python3() [0x4d2ece]
frame #8: /usr/bin/python3() [0x5bcae8]
frame #9: /usr/bin/python3() [0x59688d]
frame #10: /usr/bin/python3() [0x5bc890]
frame #11: /usr/bin/python3() [0x4d2ece]
frame #12: /usr/bin/python3() [0x5bcb01]
frame #13: /usr/bin/python3() [0x59688d]
frame #14: /usr/bin/python3() [0x5bc890]
frame #15: /usr/bin/python3() [0x4d2ece]
frame #16: /usr/bin/python3() [0x5bcb01]
frame #17: /usr/bin/python3() [0x59688d]
frame #18: /usr/bin/python3() [0x5bc890]
frame #19: /usr/bin/python3() [0x4d2ece]
frame #20: /usr/bin/python3() [0x5bcb01]
frame #21: /usr/bin/python3() [0x59688d]
frame #22: /usr/bin/python3() [0x5bc890]
frame #23: /usr/bin/python3() [0x4d2ece]
frame #24: /usr/bin/python3() [0x5bcb01]
frame #25: /usr/bin/python3() [0x59688d]
frame #26: _PyTrash_thread_destroy_chain + 0x35 (0x647675 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x5d6108]
frame #28: /usr/bin/python3() [0x5d64c3]
frame #29: PyDict_SetItem + 0x337 (0x5b8ca7 in /usr/bin/python3)
frame #30: _PyModule_ClearDict + 0x107 (0x5aaae7 in /usr/bin/python3)
frame #31: PyImport_Cleanup + 0x354 (0x5386b4 in /usr/bin/python3)
frame #32: Py_FinalizeEx + 0x6e (0x633f9e in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x653fcd]
frame #34: _Py_UnixMain + 0x2e (0x65420e in /usr/bin/python3)
frame #35: __libc_start_main + 0xeb (0x7f04937fc09b in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5df66a in /usr/bin/python3)

terminate called after throwing an instance of 'c10::Error'
  what():  CUDA error: an illegal memory access was encountered
Exception raised from create_event_internal at /pytorch/c10/cuda/CUDACachingAllocator.cpp:687 (most recent call first):
frame #0: c10::Error::Error(c10::SourceLocation, std::string) + 0x42 (0x7f71a31041e2 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #1: c10::cuda::CUDACachingAllocator::raw_delete(void*) + 0xad2 (0x7f71a3352f92 in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10_cuda.so)
frame #2: c10::TensorImpl::release_resources() + 0x4d (0x7f71a30f29cd in /usr/local/lib/python3.7/dist-packages/torch/lib/libc10.so)
frame #3: <unknown function> + 0x541322 (0x7f71a3eca322 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #4: <unknown function> + 0x5413c6 (0x7f71a3eca3c6 in /usr/local/lib/python3.7/dist-packages/torch/lib/libtorch_python.so)
frame #5: /usr/bin/python3() [0x5968b1]
frame #6: /usr/bin/python3() [0x5bc890]
frame #7: /usr/bin/python3() [0x4d2ece]
frame #8: /usr/bin/python3() [0x5bcae8]
frame #9: /usr/bin/python3() [0x59688d]
frame #10: /usr/bin/python3() [0x5bc890]
frame #11: /usr/bin/python3() [0x4d2ece]
frame #12: /usr/bin/python3() [0x5bcb01]
frame #13: /usr/bin/python3() [0x59688d]
frame #14: /usr/bin/python3() [0x5bc890]
frame #15: /usr/bin/python3() [0x4d2ece]
frame #16: /usr/bin/python3() [0x5bcb01]
frame #17: /usr/bin/python3() [0x59688d]
frame #18: /usr/bin/python3() [0x5bc890]
frame #19: /usr/bin/python3() [0x4d2ece]
frame #20: /usr/bin/python3() [0x5bcb01]
frame #21: /usr/bin/python3() [0x59688d]
frame #22: /usr/bin/python3() [0x5bc890]
frame #23: /usr/bin/python3() [0x4d2ece]
frame #24: /usr/bin/python3() [0x5bcb01]
frame #25: /usr/bin/python3() [0x59688d]
frame #26: _PyTrash_thread_destroy_chain + 0x35 (0x647675 in /usr/bin/python3)
frame #27: /usr/bin/python3() [0x5d6108]
frame #28: /usr/bin/python3() [0x5d64c3]
frame #29: PyDict_SetItem + 0x337 (0x5b8ca7 in /usr/bin/python3)
frame #30: _PyModule_ClearDict + 0x107 (0x5aaae7 in /usr/bin/python3)
frame #31: PyImport_Cleanup + 0x354 (0x5386b4 in /usr/bin/python3)
frame #32: Py_FinalizeEx + 0x6e (0x633f9e in /usr/bin/python3)
frame #33: /usr/bin/python3() [0x653fcd]
frame #34: _Py_UnixMain + 0x2e (0x65420e in /usr/bin/python3)
frame #35: __libc_start_main + 0xeb (0x7f71a8a9109b in /lib/x86_64-linux-gnu/libc.so.6)
frame #36: _start + 0x2a (0x5df66a in /usr/bin/python3)

@RezaYazdaniAminabadi
Contributor

Hi @gongwei-130,

Thanks for running the test. So, it seems there is nothing wrong with the forward of the transformer kernels.
Can you also run this test: "pytest tests/unit/test_dynamic_loss_scale.py::test_unfused_no_overflow -sv", and share the result?
Thanks.

Best regards,
Reza

@gongwei-130
Author

Hi @gongwei-130,

Thanks for running the test. So, it seems there is nothing wrong with the forward of the transformer kernels.
Can you also run this test: "pytest tests/unit/test_dynamic_loss_scale.py::test_unfused_no_overflow -sv", and share the result?
Thanks.

Best regards,
Reza

Done.

$ pytest tests/unit/test_dynamic_loss_scale.py::test_unfused_no_overflow -sv
==================================================================== test session starts ====================================================================
platform linux -- Python 3.7.3, pytest-6.0.1, py-1.9.0, pluggy-0.13.1 -- /usr/bin/python3
cachedir: .pytest_cache
rootdir: /opt/tiger/workspace/DeepSpeed
plugins: forked-1.3.0
collected 1 item

tests/unit/test_dynamic_loss_scale.py::test_unfused_no_overflow [2020-08-08 03:27:24,031] [INFO] [__init__.py:90:initialize] DeepSpeed info: version=0.2.0, git-hash=None, git-branch=None
[2020-08-08 03:27:24,032] [INFO] [deepspeed_config.py:430:_set_batch_related_parameters]  After Train batch 1 micro_batch 1 and grad_acc 1
[2020-08-08 03:27:24,032] [INFO] [deepspeed_light.py:414:_init_distributed] Set device to local rank 0 within node.
[2020-08-08 03:27:27,447] [INFO] [deepspeed_light.py:75:_initialize_parameter_parallel_groups] data_parallel_size: 1, parameter_parallel_size: 1
libibverbs: Warning: couldn't open config directory '/etc/libibverbs.d'.
[2020-08-08 03:27:27,548] [INFO] [deepspeed_light.py:502:_configure_optimizer] Using DeepSpeed Optimizer param name lamb as basic optimizer
[2020-08-08 03:27:27,548] [INFO] [deepspeed_light.py:504:_configure_optimizer] DeepSpeed Basic Optimizer = FusedLamb (
Parameter Group 0
    betas: (0.9, 0.999)
    bias_correction: True
    eps: 1e-08
    lr: 0.00015
    max_coeff: 10.0
    max_grad_norm: 0.0
    min_coeff: 0.01
    weight_decay: 0.0
)
[2020-08-08 03:27:27,548] [INFO] [deepspeed_light.py:572:_configure_fp16_optimizer] Creating fp16 unfused optimizer with dynamic loss scale
[2020-08-08 03:27:27,548] [INFO] [fp16_unfused_optimizer.py:36:__init__] Fused Lamb Legacy : True 
[2020-08-08 03:27:27,549] [WARNING] [deepspeed_light.py:370:_configure_lr_scheduler] DeepSpeed using client LR scheduler
[2020-08-08 03:27:27,549] [INFO] [deepspeed_light.py:372:_configure_lr_scheduler] DeepSpeed LR Scheduler = None
[2020-08-08 03:27:27,550] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=0, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:443:print] DeepSpeedLight configuration:
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   activation_checkpointing_config  <deepspeed.pt.deepspeed_checkpointing_config.DeepSpeedActivationCheckpointingConfig object at 0x7f477c612518>
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   allgather_size ............... 500000000
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   allreduce_always_fp32 ........ False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   amp_enabled .................. False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   amp_params ................... False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   disable_allgather ............ False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   dump_state ................... False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   dynamic_loss_scale_args ...... {'init_scale': 256, 'scale_window': 2, 'delayed_shift': 2, 'min_scale': 1}
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   fp16_enabled ................. True
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   global_rank .................. 0
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   gradient_accumulation_steps .. 1
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   gradient_clipping ............ 0.0
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   gradient_predivide_factor .... 1.0
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   initial_dynamic_scale ........ 256
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   loss_scale ................... 0
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   memory_breakdown ............. False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   optimizer_legacy_fusion ...... False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   optimizer_name ............... lamb
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   optimizer_params ............. {'lr': 0.00015}
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   prescale_gradients ........... False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   scheduler_name ............... None
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   scheduler_params ............. None
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   sparse_gradients_enabled ..... False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   steps_per_print .............. 1
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   tensorboard_enabled .......... False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   tensorboard_job_name ......... DeepSpeedJobName
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   tensorboard_output_path ...... 
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   train_batch_size ............. 1
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   train_micro_batch_size_per_gpu  1
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   wall_clock_breakdown ......... False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   world_size ................... 1
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   zero_allow_untested_optimizer  False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   zero_config .................. <deepspeed.pt.deepspeed_zero_config.DeepSpeedZeroConfig object at 0x7f477c6124e0>
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   zero_enabled ................. False
[2020-08-08 03:27:27,550] [INFO] [deepspeed_config.py:447:print]   zero_optimization_stage ...... 0
[2020-08-08 03:27:27,551] [INFO] [deepspeed_config.py:453:print]   json = {
    "fp16":{
        "enabled":true,
        "initial_scale_power":8,
        "loss_scale":0,
        "loss_scale_window":2
    },
    "optimizer":{
        "params":{
            "lr":0.00015
        },
        "type":"Lamb"
    },
    "steps_per_print":1,
    "train_batch_size":1
}
[2020-08-08 03:27:27,554] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=1, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,554] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=2, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,555] [INFO] [fp16_unfused_optimizer.py:252:_update_scale] No Grad overflow for 2 iterations
[2020-08-08 03:27:27,555] [INFO] [fp16_unfused_optimizer.py:254:_update_scale] Increasing dynamic loss scale from 256 to 512.0
[2020-08-08 03:27:27,555] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=3, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,556] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=4, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,556] [INFO] [fp16_unfused_optimizer.py:252:_update_scale] No Grad overflow for 2 iterations
[2020-08-08 03:27:27,556] [INFO] [fp16_unfused_optimizer.py:254:_update_scale] Increasing dynamic loss scale from 512.0 to 1024.0
[2020-08-08 03:27:27,556] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=5, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,557] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=6, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,557] [INFO] [fp16_unfused_optimizer.py:252:_update_scale] No Grad overflow for 2 iterations
[2020-08-08 03:27:27,557] [INFO] [fp16_unfused_optimizer.py:254:_update_scale] Increasing dynamic loss scale from 1024.0 to 2048.0
[2020-08-08 03:27:27,557] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=7, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,558] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=8, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,558] [INFO] [fp16_unfused_optimizer.py:252:_update_scale] No Grad overflow for 2 iterations
[2020-08-08 03:27:27,558] [INFO] [fp16_unfused_optimizer.py:254:_update_scale] Increasing dynamic loss scale from 2048.0 to 4096.0
[2020-08-08 03:27:27,558] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=9, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
[2020-08-08 03:27:27,559] [INFO] [deepspeed_light.py:960:_report_progress] rank:0 step=10, skipped=0, lr=[0.00015], mom=[(0.9, 0.999)]
PASSED

===================================================================== 1 passed in 4.48s ====================================================================
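
For reference, the scale increases in this log are the expected dynamic loss scaling behavior: with the config above, init_scale = 2**initial_scale_power = 256 and scale_window = 2, so the scale doubles after every two overflow-free steps and would be cut back on an overflow. A minimal, generic sketch of that logic (an illustration only, not DeepSpeed's implementation):

# Generic dynamic loss scaler sketch (illustration only, not DeepSpeed's code).
# The scale doubles after `scale_window` consecutive overflow-free steps and is
# halved (down to `min_scale`) whenever a gradient overflow is detected.
class DynamicLossScaler:
    def __init__(self, init_scale=256.0, scale_window=2, scale_factor=2.0, min_scale=1.0):
        self.scale = init_scale
        self.scale_window = scale_window
        self.scale_factor = scale_factor
        self.min_scale = min_scale
        self.good_steps = 0  # consecutive steps without overflow

    def update(self, overflow: bool):
        if overflow:
            # Skip the optimizer step and back off the scale.
            self.scale = max(self.scale / self.scale_factor, self.min_scale)
            self.good_steps = 0
        else:
            self.good_steps += 1
            if self.good_steps % self.scale_window == 0:
                self.scale *= self.scale_factor  # e.g. 256 -> 512 -> 1024 ...

scaler = DynamicLossScaler()
for step in range(1, 11):
    scaler.update(overflow=False)
    print(step, scaler.scale)  # doubles at steps 2, 4, 6, 8, 10, matching the log above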

@RezaYazdaniAminabadi
Contributor

Hi @gongwei-130

Sorry for the delayed response.
It seems that all of these components work fine in the standalone unit tests, so the illegal memory access only shows up when they are used together in your training environment.
I have tried to reproduce the issue with batch size 64 (micro batch size 8) on 8 V100 GPUs, but I can run without any error using both the PyTorch and the DeepSpeed transformer kernels.
Have you modified the training scripts or the modeling file that you are running? If so, could you please share them so that I can investigate this problem further?
Thanks.

Best regards,
Reza
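
A generic way to narrow down this kind of failure, independent of the script comparison above, is to force synchronous CUDA kernel launches so the Python traceback points at the op that actually faulted rather than at a later allocator call. This is standard PyTorch debugging practice, not something suggested in this thread; a minimal sketch:

# Standard CUDA debugging aid (generic PyTorch practice, not DeepSpeed-specific):
# synchronous launches make the illegal access surface at the offending kernel call.
# The variable must be set before the first CUDA call, e.g. at the very top of the
# training entry point, or exported in the shell before launching the run.
import os
os.environ["CUDA_LAUNCH_BLOCKING"] = "1"  # serialize kernel launches for precise tracebacks

import torch  # imported after the variable is set so the CUDA context picks it up

Exporting the variable in the environment before invoking the existing launch script has the same effect, at the cost of slower execution.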
