failed to load the pretrained v2 model to run Go bot #138

hejin · 2019-02-16T15:02:13Z

Hi guys,

I followed the instructions on the project homepage exactly (all software versions match) and tried to run the Go bot with the pretrained v2 model, but it failed with this message:
"
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var".
Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked".
"

The machine is a 24-core x86-64 box with an NVIDIA V100 GPU (16 GB).

The full log is below. Thanks very much!

(base) roobot@ELF:~/play-ELF/ELF/scripts/elfgames/go$ ./run.sh /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
Python version: 3.7.1 (default, Dec 14 2018, 19:28:38)
[GCC 7.3.0]
PyTorch version: 1.0.1.post2
CUDA version 10.0.130
Conda env: base
[2019-02-16 22:29:30.383] [rlpytorch.model_loader.load_env0] [info] Loading env
<module 'elfgames.go.game' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/game.py'> elfgames.go.game
<module 'elfgames.go.df_model3' from '/home/roobot/play-ELF/ELF/src_py/elfgames/go/df_model3.py'> elfgames.go.df_model3
[2019-02-16 22:29:30.394] [rlpytorch.model_loader.load_env0] [info] Parsed options: {'T': 1,
'actor_only': False,
'adam_eps': 0.001,
'additional_labels': ['aug_code', 'move_idx'],
'batchsize': 16,
'batchsize2': -1,
'black_use_policy_network_only': False,
'bn': True,
'bn_eps': 1e-05,
'bn_momentum': 0.1,
'cheat_eval_new_model_wins_half': False,
'cheat_selfplay_random_result': False,
'check_loaded_options': False,
'client_max_delay_sec': 1200,
'comment': '',
'data_aug': -1,
'dim': 256,
'dist_rank': -1,
'dist_url': '',
'dist_world_size': -1,
'dump_record_prefix': '',
'epsilon': 0.0,
'eval_model_pair': '',
'eval_num_games': 400,
'eval_old_model': -1,
'eval_stats': '',
'eval_winrate_thres': 0.55,
'expected_num_clients': -1,
'following_pass': False,
'gpu': 0,
'greedy': True,
'keep_prev_selfplay': False,
'keys_in_reply': ['V', 'rv'],
'leaky_relu': False,
'list_files': [],
'load': '/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin',
'load_model_sleep_interval': 0.0,
'loglevel': 'debug',
'lr': 0.001,
'mcts_alpha': 0.0,
'mcts_epsilon': 0.0,
'mcts_persistent_tree': True,
'mcts_pick_method': 'most_visited',
'mcts_puct': 1.5,
'mcts_rollout_per_batch': 16,
'mcts_rollout_per_thread': 8192,
'mcts_root_unexplored_q_zero': False,
'mcts_threads': 2,
'mcts_unexplored_q_zero': False,
'mcts_use_prior': True,
'mcts_verbose': False,
'mcts_verbose_time': True,
'mcts_virtual_loss': 1,
'mode': 'online',
'model': 'online',
'momentum': 0.9,
'move_cutoff': -1,
'multipred_backprop': True,
'num_block': 20,
'num_future_actions': 1,
'num_games': 1,
'num_games_per_thread': -1,
'num_minibatch': 5000,
'num_reader': 50,
'num_reset_ranking': 5000,
'omit_keys': [],
'onload': [],
'opt_method': 'adam',
'parameter_print': False,
'parsed_args': ['df_console.py',
'--mode',
'online',
'--keys_in_reply',
'V',
'rv',
'--use_mcts',
'--mcts_verbose_time',
'--mcts_use_prior',
'--mcts_persistent_tree',
'--load',
'/home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin',
'--server_addr',
'localhost',
'--port',
'1234',
'--replace_prefix',
'resnet.module,resnet',
'--no_check_loaded_options',
'--no_parameter_print',
'--verbose',
'--gpu',
'0',
'--num_block',
'20',
'--dim',
'256',
'--mcts_puct',
'1.50',
'--batchsize',
'16',
'--mcts_rollout_per_batch',
'16',
'--mcts_threads',
'2',
'--mcts_rollout_per_thread',
'8192',
'--resign_thres',
'0.05',
'--mcts_virtual_loss',
'1',
'--loglevel',
'debug'],
'ply_pass_enabled': 0,
'policy_distri_cutoff': 0,
'policy_distri_training_for_all': False,
'port': 1234,
'preload_sgf': '',
'preload_sgf_move_to': -1,
'print_result': False,
'q_max_size': 1000,
'q_min_size': 10,
'ratio_pre_moves': 0,
'replace_prefix': ['resnet.module,resnet'],
'resign_thres': 0.05,
'sample_nodes': ['pi,a'],
'sample_policy': 'epsilon-greedy',
'selfplay_async': False,
'selfplay_init_num': 2000,
'selfplay_timeout_usec': 0,
'selfplay_update_num': 1000,
'server_addr': 'localhost',
'server_id': '',
'start_ratio_pre_moves': 0.5,
'store_greedy': False,
'suicide_after_n_games': -1,
'use_data_parallel': False,
'use_data_parallel_distributed': False,
'use_df_feature': False,
'use_fp16': False,
'use_mcts': True,
'use_mcts_ai2': False,
'verbose': True,
'weight_decay': 0.0,
'white_mcts_rollout_per_batch': -1,
'white_mcts_rollout_per_thread': -1,
'white_puct': -1.0,
'white_use_policy_network_only': False}
[2019-02-16 22:29:30.396] [rlpytorch.model_loader.load_env0] [info] Finished loading env
[2019-02-16 22:29:30.397] [elf::base::ThreadedDispatcherT-11] [info] Wait all games[1] to register their mailbox
human_actor: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'a', 'V'], 'batchsize': 1}
SharedMem: "human_actor", keys: ['a', 'V', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
actor_black: {'input': ['s', 'aug_code', 'move_idx'], 'reply': ['pi', 'V', 'a', 'rv'], 'timeout_usec': 10, 'batchsize': 16}
SharedMem: "actor_black", keys: ['a', 'V', 'rv', 'pi', 's', 'aug_code', 'move_idx']
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
a int64_t [16]
V float [16]
rv int64_t [16]
pi float [16, 362]
s float [16, 18, 19, 19]
aug_code int32_t [16]
move_idx int32_t [16]
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] Loading model from /home/roobot/play-ELF/ELF/scripts/elfgames/go/pretrained-go-19x19-v2.bin
[2019-02-16 22:29:34.512] [rlpytorch.model_loader.ModelLoader-1-model_indexNone] [info] replace_prefix for state dict: [['resnet.module', 'resnet']]
Traceback (most recent call last):
File "df_console.py", line 87, in <module>
main()
File "df_console.py", line 47, in main
model = model_loader.load_model(GC.params)
File "/home/roobot/play-ELF/ELF/src_py/rlpytorch/model_loader.py", line 161, in load_model
check_loaded_options=self.options.check_loaded_options)
File "/home/roobot/play-ELF/ELF/src_py/rlpytorch/model_base.py", line 147, in load
self.load_state_dict(sd)
File "/home/roobot/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py", line 769, in load_state_dict
self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for Model_PolicyValue:
Missing key(s) in state_dict: "init_conv.0.weight", "init_conv.0.bias", "init_conv.1.weight", "init_conv.1.bias", "init_conv.1.running_mean", "init_conv.1.running_var".
Unexpected key(s) in state_dict: "init_conv.module.0.weight", "init_conv.module.0.bias", "init_conv.module.1.weight", "init_conv.module.1.bias", "init_conv.module.1.running_mean", "init_conv.module.1.running_var", "init_conv.module.1.num_batches_tracked".
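
The missing and unexpected keys above differ only by a `.module` segment after `init_conv`, while the `replace_prefix` shown in the log rewrites only `resnet.module`. The following is a minimal sketch of that kind of prefix rewrite on a plain dictionary. It is not ELF's implementation of `--replace_prefix`; the helper name `replace_key_prefix` is hypothetical, the values are placeholders standing in for tensors, and whether rewriting `init_conv.module` alone is enough to load this checkpoint is an assumption.

```python
from collections import OrderedDict


def replace_key_prefix(state_dict, old_prefix, new_prefix):
    """Return a copy of state_dict with old_prefix renamed to new_prefix.

    Hypothetical helper for illustration. Only keys that equal old_prefix
    or start with old_prefix + "." are renamed; all others pass through.
    """
    renamed = OrderedDict()
    for key, value in state_dict.items():
        if key == old_prefix or key.startswith(old_prefix + "."):
            key = new_prefix + key[len(old_prefix):]
        renamed[key] = value
    return renamed


# Key names copied from the error message; the values are placeholders.
checkpoint = OrderedDict([
    ("init_conv.module.0.weight", "w0"),
    ("init_conv.module.0.bias", "b0"),
    ("init_conv.module.1.running_mean", "m1"),
])

fixed = replace_key_prefix(checkpoint, "init_conv.module", "init_conv")
print(list(fixed.keys()))
```

With a real checkpoint, the renamed dictionary would be what gets passed to `load_state_dict`.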

l1t1 · 2019-02-16T23:41:37Z

#133 (comment)
I still have two errors that are not solved by using replace_prefix.

l1t1 · 2019-02-17T00:13:33Z

Did you try server.sh and client.sh?

hejin · 2019-02-17T02:05:07Z

No :(
I will try. Thanks much! @l1t1

yuandong-tian · 2019-02-18T06:17:09Z

This is probably because of the version of PyTorch. A fix is on the way.

yuandong-tian · 2019-02-18T06:23:58Z

@hejin @l1t1 what version of pytorch did you use? We use PyTorch 1.0.

l1t1 · 2019-02-18T07:33:13Z

I use 1.0.1 with elf_convert.py too, but the Windows binary df_console.exe shouldn't require the user to install PyTorch.

Python 3.7.1 (v3.7.1:260ec2c36a, Oct 20 2018, 14:57:15) [MSC v.1915 64 bit (AMD64)] on win32
Type "help", "copyright", "credits" or "license" for more information.
>>> import torch
>>> print(torch.__version__)
1.0.1

l1t1 · 2019-02-18T07:51:37Z

Suggestion: df_console.exe should also support loading elfv2.bin and training files such as 1500000.bin.

jma127 · 2019-02-20T22:59:56Z

Could you please try the newly-revised gtp.sh in master?

l1t1 · 2019-02-21T00:57:47Z

I downloaded today's build:

D:\elfv2>\tool\wget -c https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
--2019-02-21 07:45:54--  https://dl.fbaipublicfiles.com/elfopengo/play/play_opengo_v2.zip
Length: 1076887016 (1.0G) [application/zip]
Saving to: 'play_opengo_v2.zip'

play_opengo_v2.zip            100%[=================================================>]   1.00G

2019-02-21 08:30:52 (391 KB/s) - 'play_opengo_v2.zip' saved [1076887016/1076887016]

Then I ran the CPU version with the built-in Sabaki,
setting the engine to D:\elfv2\play_opengo_v2\elf_cpu_full\elf\df_console.exe.
It doesn't work at all:

○ newelfv2> name 
connection failed
○ newelfv2> version 
connection failed
○ newelfv2> protocol_version 
connection failed
○ newelfv2> list_commands 
connection failed
○ newelfv2> komi 6.5
connection failed
[5504] Failed to execute script df_console
Traceback (most recent call last):
  File "df_console.py", line 92, in <module>
  File "df_console.py", line 85, in main
  File "elf\utils_elf.py", line 435, in run
  File "elf\utils_elf.py", line 383, in _call
  File "elf\utils_elf.py", line 253, in cpu2gpu
  File "elf\utils_elf.py", line 253, in <dictcomp>
  File "site-packages\torch\cuda\__init__.py", line 161, in _lazy_init
  File "site-packages\torch\cuda\__init__.py", line 75, in _check_driver
AssertionError: Torch not compiled with CUDA enabled
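
The traceback above ends in `torch.cuda` initialization, called from `cpu2gpu` in a build where Torch was compiled without CUDA. A common way to avoid this class of failure is to request a CUDA device only when one is usable. The sketch below shows that pattern only; `choose_device` is a hypothetical helper, not code from ELF, and it is an assumption that `df_console` could be adapted this way.

```python
def choose_device(gpu_index, cuda_available):
    """Pick a device string for tensor placement.

    Hypothetical helper: returns "cuda:N" only when CUDA is usable and a
    non-negative GPU index was requested, and "cpu" in every other case.
    """
    if cuda_available and gpu_index is not None and gpu_index >= 0:
        return "cuda:%d" % gpu_index
    return "cpu"


# On a CPU-only build the availability check is False, so this stays on CPU.
print(choose_device(0, False))
print(choose_device(0, True))
```

In a PyTorch program the second argument would typically come from `torch.cuda.is_available()`, and the result would be passed to `torch.device(...)`.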

l1t1 · 2019-02-21T01:48:29Z

But the GPU version works:

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console

list_commands
= boardsize
clear_board
exit
final_score
genmove
komi
list_commands
name
play
protocol_version
quit
showboard
version

play b d16
=

genmove w
= N1

l1t1 · 2019-02-21T02:06:28Z

The GPU version also supports loading weights with --load:

D:\>fc /b D:\elfv2\play_opengo_v2\elf_gpu_full\elf\model-v2.bin d:\elfv2.bin |more
Comparing files D:\ELFV2\PLAY_OPENGO_V2\ELF_GPU_FULL\ELF\model-v2.bin and D:\ELFV2.BIN
FC: no differences encountered

Some tests:

quit
[2019-02-21 09:52:26.508] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:52:26.692] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.259] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 3 y = 15 move: dp please try again
[2019-02-21 09:52:27.369] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 09:52:27.682] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 09:52:27.684] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 09:52:27.687] [elf::base::Context-3] [info] Stop tmp pool...

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/elfv2.bin
version
= 1.0

quit
[2019-02-21 09:55:16.300] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 09:55:16.301] [elfgames::go::GoGameSelfPlay-0-15] [warning] Invalid move: x = 0 y = 1 move: ab please try again
[2019-02-21 09:55:16.303] [elfgames::go::mcts::MCTSActor-21] [error] model version 1 and required version 1290000 are not consistent

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load d:/1500000.bin
genmove b
= D3


? Invalid input


? Invalid input


? Invalid input


? Invalid input


? Invalid input


? Invalid input

genmove w
= C16

quit
[2019-02-21 10:08:29.307] [elf::base::Context-3] [info] Prepare to stop ...
[2019-02-21 10:08:30.431] [elf::base::Context-3] [info] Stop all game threads ...
[2019-02-21 10:08:30.933] [elf::base::Context-3] [info] All games sent notification, Waiting until they join
[2019-02-21 10:08:30.937] [elf::base::Context-3] [info] Stop all collectors ...
[2019-02-21 10:08:30.957] [elf::base::Context-3] [info] Stop tmp pool...

l1t1 · 2019-02-21T02:20:07Z

Testing the ELF v1 weights:

D:\elfv2\play_opengo_v2\elf_gpu_full\elf>df_console --load  d:/pretrained-go-19x19-v1.bin --num_block 20 --dim 224

? Invalid input


? Invalid input

genmove b
= Q16


? Invalid input
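
The sessions above show `? Invalid input` replies separated by blank lines. If those replies were triggered by empty input lines (an assumption; the transcript does not show what was typed), it is worth noting that the GTP version 2 specification has the engine discard empty and whitespace-only lines during preprocessing. Below is a minimal sketch of that preprocessing step; `preprocess_gtp_input` is a hypothetical name, and this is not how `df_console` parses its input.

```python
def preprocess_gtp_input(raw):
    """Return the command lines left after GTP-style preprocessing.

    Steps: drop control characters other than tab, strip '#' comments,
    turn tabs into spaces, and discard empty or whitespace-only lines.
    Trimming each surviving line is a convenience added in this sketch.
    """
    commands = []
    for line in raw.split("\n"):
        # Remove CR and other control characters, keeping only tab.
        line = "".join(
            ch for ch in line
            if ch == "\t" or (ord(ch) >= 32 and ord(ch) != 127)
        )
        # Everything from '#' onward is a comment.
        line = line.split("#", 1)[0]
        line = line.replace("\t", " ")
        if line.strip():
            commands.append(line.strip())
    return commands


session = "genmove b\n\n\r\n   \n# a comment\ngenmove w\n"
print(preprocess_gtp_input(session))
```

Only the two `genmove` commands survive, so an engine following this step would not answer the blank lines at all.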

l1t1 mentioned this issue Feb 23, 2019

the model with windows binary is not the final version #134

Open
