Running out of RAM on cloud TPU when reading data from Cloud Storage #67
-
Hi! I am trying to run the `vit_s16_i1k.py` script on a TPU-v3-8 machine. I put the data in a Google Cloud Storage bucket and I am running the following command:

``TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'` ``

The training runs for a few iterations and then fails once the host runs out of RAM.

I have been able to work around this issue by creating a data disk, mounting it on the TPU VM, and putting the data there. In that case the same process only uses 205G of RAM and runs normally.
Replies: 2 comments
-
Try reducing your batch size, since your workaround of using a data disk and mounting it on the TPU VM seems to have alleviated the issue by reducing memory usage.
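Below is a minimal sketch of what that could look like, assuming the config exposes a top-level `batch_size` field the way `big_vision/configs/vit_s16_i1k.py` does; the field name and default may differ in your checkout.

```python
# Hedged sketch: derive a lower-memory variant of a big_vision config by shrinking
# the global batch size. Assumes an ml_collections.ConfigDict with a top-level
# `batch_size` field (as in big_vision/configs/vit_s16_i1k.py); adjust to your version.
import ml_collections


def with_smaller_batch(base_config: ml_collections.ConfigDict,
                       batch_size: int = 512) -> ml_collections.ConfigDict:
    config = ml_collections.ConfigDict(base_config)  # copy so the original stays untouched
    config.batch_size = batch_size  # smaller global batch -> less host RAM per training step
    return config
```

Since the trainer reads its config through ml_collections config flags, a command-line override such as `--config.batch_size=512` should have the same effect, assuming that flag path matches your config.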
-
Only saw this now. Knowing Pavel, I think it's not relevant to him anymore, but for reference, here are two more options, at the slight expense of a little speed:

1. Set `cache_raw` to False in the config: `big_vision/big_vision/configs/vit_s16_i1k.py`, line 48 at c01707f.
2. Set `cache_final` to False: `big_vision/big_vision/evaluators/classification.py`, line 59 at c01707f.
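For concreteness, here is a hedged sketch of both changes as config edits. It assumes the big_vision-style `ml_collections` config from `vit_s16_i1k.py` (a top-level `cache_raw` field) and that each evaluator spec in `config.evals` is dict-like with a `cache_final` key that reaches the `Evaluator` in `big_vision/evaluators/classification.py`; the exact `config.evals` wiring varies between versions of the repo, so check yours.

```python
# Hedged sketch: disable both caches, trading some input-pipeline speed for host RAM.
# Assumptions (not confirmed against your checkout): `config.cache_raw` exists as in
# big_vision/configs/vit_s16_i1k.py, and evaluator specs accept a `cache_final` key
# that is forwarded to big_vision/evaluators/classification.py's Evaluator.
import ml_collections


def disable_dataset_caches(config: ml_collections.ConfigDict) -> ml_collections.ConfigDict:
    # Option 1: don't keep the raw (undecoded) training examples resident in host RAM;
    # the data is re-read from Cloud Storage on every pass instead.
    config.cache_raw = False

    # Option 2: don't cache the final pre-processed evaluation batches either.
    evals = config.get("evals", {})
    if isinstance(evals, (dict, ml_collections.ConfigDict)):
        for eval_spec in evals.values():
            if isinstance(eval_spec, (dict, ml_collections.ConfigDict)):
                eval_spec["cache_final"] = False

    return config
```

If the fields exist in your config, the same can also be done from the command line via ml_collections config-flag overrides, e.g. `--config.cache_raw=False`.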