Evaluation gets stuck #12

Open
Juanhui28 opened this issue Oct 14, 2022 · 8 comments

@Juanhui28

Hi,

It seems there is still a chance for the evaluation to get stuck. When we run train_shallow_wikikgv2.sh, it trains for 4799999 steps and then gets stuck during the evaluation. When we stop it with a keyboard interrupt, we get the following message:

[Screenshot 2022-10-13, 10:20:48 PM]

And when we run train_concat_wikikgv2.sh, it gets stuck at the very first evaluation. When we stop it with a keyboard interrupt, it shows error messages similar to those from train_shallow_wikikgv2.sh.

[Screenshot 2022-10-13, 10:23:14 PM]

Could you please help check this? Any help is appreciated!

@hyren
Collaborator

hyren commented Oct 14, 2022

Hi, can you try running with a single GPU?
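
For reference, a minimal sketch of a single-GPU run, assuming the scripts pick up visible devices from CUDA_VISIBLE_DEVICES (which PyTorch honors) and are launched from the repository root:

```bash
# Expose only the first GPU to the process, then launch the training script.
# Assumes the script honors CUDA_VISIBLE_DEVICES; adjust the GPU index and path as needed.
CUDA_VISIBLE_DEVICES=0 bash training/vec_scripts/train_shallow_wikikgv2.sh
```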

@Juanhui28
Author

Juanhui28 commented Oct 14, 2022

Hi,

We tried a single GPU with both train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh, and they both get stuck during the evaluation. Thanks.

@hyren
Collaborator

hyren commented Oct 14, 2022

Just to make sure, have you pulled the latest changes? Which script are you running? We will look into this and try to reproduce it.

@Juanhui28
Author

Juanhui28 commented Oct 14, 2022

Hi, yes, we have already pulled the latest changes. We are running train_shallow_wikikgv2.sh and train_concat_wikikgv2.sh from the training/vec_scripts folder. Thanks!

@Hanjun-Dai
Collaborator

Hi there,
I'm not sure if the GPU is compatible with the async op. Could you please try adding the --train_async_rw=False flag?
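
As a sketch of where that flag would go (assuming the scripts forward extra command-line arguments to the underlying trainer; otherwise the flag would need to be appended to the launch command inside the script itself):

```bash
# Hypothetical invocation: pass the flag through the wrapper script to disable
# asynchronous read/write. The actual launch line inside the script may differ.
bash training/vec_scripts/train_shallow_wikikgv2.sh --train_async_rw=False
```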

@Juanhui28
Author

Hi,
Thank you for the follow-up. We added this flag in the script. We found that the training can still get stuck with multiple GPUs, but it runs fine with a single GPU.
Thank you!
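
For anyone debugging a similar hang: one non-destructive way to see where a stuck multi-GPU run is blocked, instead of relying on a keyboard interrupt, is to dump the Python stack with py-spy (an external tool, not part of this repo):

```bash
# py-spy is an external profiler; install it and dump the stack of the stuck
# python worker. Replace the placeholder PID with the real one (e.g. from
# nvidia-smi or ps). Attaching may require sudo / ptrace permissions.
pip install py-spy
PID=12345   # placeholder: pid of the hung training process
sudo py-spy dump --pid "$PID"
```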

@Hanjun-Dai
Collaborator

Really sorry for the back-and-forth! I guess it is mostly due to compatibility issues with the customized kernel.
Would you mind sharing more information about your CUDA, PyTorch, and Python versions?

@Juanhui28
Author

Hi, the information is listed as follows:
CUDA: 11.6
PyTorch: 1.12.1
Python: 3.9
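
For completeness, PyTorch ships an environment collector that reports all of these versions (plus GPU and driver details) in one go:

```bash
# Prints Python, PyTorch, CUDA, cuDNN, and GPU information in a single report.
python -m torch.utils.collect_env
```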
