Tensorflow Cloud Tutorial failed to train model #341

Open
karlunho opened this issue Jun 19, 2021 · 3 comments

Comments

@karlunho

When I follow the TensorFlow Cloud tutorial (https://www.tensorflow.org/cloud/tutorials/overview), the model fails to train on GCP. I've attached the log file and a screenshot of the error. I'm a Google employee and my LDAP is alankho; I'm happy to share my GCP project with whoever is working on the bug.

The failure happened in Google AI Platform, about 5 hours into training the model.

downloaded-logs-20210618-235625.csv

@karlunho
Author

I tried again with TPUs, but that run also errored out.

downloaded-logs-20210619-002835.csv

@karlunho
Author

I can't tell whether AI Platform is using the GPUs properly: during training, the GPU page shows 0% utilization, whereas the CPU page shows close to 80%. I've uploaded screenshots for the team.

Screen Shot 2021-06-19 at 8 32 57 AM
Screen Shot 2021-06-19 at 8 31 19 AM
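
One way to narrow down the 0% GPU utilization would be to check whether TensorFlow can see a GPU at all inside the training environment. The following is a minimal sketch, not something taken from the tutorial: the helper name `visible_devices` is hypothetical, and it assumes TensorFlow 2.x, where `tf.config.list_physical_devices` is available. An empty GPU list would be consistent with the job falling back to the CPU, though it would not by itself explain the training failure.

```python
def visible_devices():
    """Return the device names TensorFlow can see, grouped by type.

    Hypothetical helper for illustration; assumes TensorFlow 2.x.
    Falls back to empty lists if TensorFlow is not installed.
    """
    try:
        import tensorflow as tf
    except ImportError:
        return {"tensorflow_installed": False, "GPU": [], "CPU": []}
    return {
        "tensorflow_installed": True,
        "GPU": [d.name for d in tf.config.list_physical_devices("GPU")],
        "CPU": [d.name for d in tf.config.list_physical_devices("CPU")],
    }


if __name__ == "__main__":
    info = visible_devices()
    print(info)
    if info["tensorflow_installed"] and not info["GPU"]:
        # No GPU registered with TensorFlow, so ops would be placed on the CPU.
        print("No GPU visible to TensorFlow; training would run on the CPU.")
```

Running this at the start of the training script and looking for its output in the job logs would show which devices the job actually had available.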

@aarushisoni

Hi, my name is Aarushi Soni. I want to contribute to this issue. Is it still open? I am a first-time contributor, so please guide me through the process.
