Evaluation gets stuck #12

Open
Juanhui28 opened this issue Oct 14, 2022 · 8 comments

@Juanhui28

Hi,

It seems there is still a chance for the evaluation to get stuck. When we run train_shallow_wikikgv2.sh, it trains for 4799999 steps and then gets stuck during the evaluation. When we stop it with a keyboard interrupt, we get the following message:

[Screenshot 2022-10-13, 10:20:48 PM]

And when we run train_concat_wikikgv2.sh, it gets stuck at the very first evaluation. When we stop it with a keyboard interrupt, it shows error messages similar to those from train_shallow_wikikgv2.sh.

[Screenshot 2022-10-13, 10:23:14 PM]

Could you please help check this? Any help is appreciated!

@hyren
Collaborator

hyren commented Oct 14, 2022

Hi, can you try running with a single GPU?
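
For reference, a minimal sketch of a single-GPU run, assuming the scripts pick up visible devices from CUDA_VISIBLE_DEVICES (which PyTorch honors) and are launched from the repository root:

```bash
# Expose only the first GPU to the process, then launch the training script.
# Assumes the script honors CUDA_VISIBLE_DEVICES; adjust the GPU index and path as needed.
CUDA_VISIBLE_DEVICES=0 bash training/vec_scripts/train_shallow_wikikgv2.sh
```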

@Juanhui28
Author

Juanhui28 commented Oct 14, 2022

Hi,

We tried a single GPU with both train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh, and they both get stuck during the evaluation. Thanks.

@hyren
Collaborator

hyren commented Oct 14, 2022

Just to make sure, have you pulled the latest changes? Which script are you running? We will look into this and try to reproduce it.

@Juanhui28
Author

Juanhui28 commented Oct 14, 2022

Hi, yes, we have already pulled the latest changes. We are running train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh from the training/vec_scripts folder. Thanks!

@Hanjun-Dai
Collaborator

Hi there,
I'm not sure if the GPU is compatible with the async op. Could you please try adding the --train_async_rw=False flag?
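
As a sketch of where that flag would go (assuming the scripts forward extra command-line arguments to the underlying trainer; otherwise the flag would need to be appended to the launch command inside the script itself):

```bash
# Hypothetical invocation: pass the flag through the wrapper script to disable
# asynchronous read/write. The actual launch line inside the script may differ.
bash training/vec_scripts/train_shallow_wikikgv2.sh --train_async_rw=False
```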

@Juanhui28
Author

Hi,
Thank you for the follow-up. We added this flag in the script. We found that the training can still get stuck with multiple GPUs, but it runs fine with a single GPU.
Thank you!
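
For anyone debugging a similar hang: one non-destructive way to see where a stuck multi-GPU run is blocked, instead of relying on a keyboard interrupt, is to dump the Python stack with py-spy (an external tool, not part of this repo):

```bash
# py-spy is an external profiler; install it and dump the stack of the stuck
# python worker. Replace the placeholder PID with the real one (e.g. from
# nvidia-smi or ps). Attaching may require sudo / ptrace permissions.
pip install py-spy
PID=12345   # placeholder: pid of the hung training process
sudo py-spy dump --pid "$PID"
```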

@Hanjun-Dai
Collaborator

Really sorry for the back-and-forth! I guess it is mostly due to compatibility issues with the customized kernel.
Would you mind sharing more information about your CUDA, PyTorch, and Python versions?

@Juanhui28
Author

Hi, the information is listed as follows:
CUDA: 11.6
PyTorch: 1.12.1
Python: 3.9
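
For completeness, PyTorch ships an environment collector that reports all of these versions (plus GPU and driver details) in one go:

```bash
# Prints Python, PyTorch, CUDA, cuDNN, and GPU information in a single report.
python -m torch.utils.collect_env
```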
