Running out of RAM on cloud TPU when reading data from Cloud Storage #67
-
Hi! I am trying to run the `vit_s16_i1k.py` script on a TPU-v3-8 machine. I put the data in a Google Cloud Storage bucket and I am running the following command:

``TFDS_DATA_DIR=gs://bucket-name/ python3 -m big_vision.train --config big_vision/configs/vit_s16_i1k.py --workdir workdirs/i1k_training_`date '+%m-%d_%H%M'` ``

The training runs for a few iterations and then fails once the host runs out of RAM.

I have been able to work around this issue by creating a data disk, mounting it on the TPU VM, and putting the data there. In that case the same process only uses 205G of RAM and runs normally.
Replies: 2 comments
-
Try reducing your batch size, since your workaround of using a data disk and mounting it on the TPU VM seems to have alleviated the issue by reducing memory usage.
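Below is a minimal sketch of what that could look like, assuming the config exposes a top-level `batch_size` field the way `big_vision/configs/vit_s16_i1k.py` does; the field name and default may differ in your checkout.

```python
# Hedged sketch: derive a lower-memory variant of a big_vision config by shrinking
# the global batch size. Assumes an ml_collections.ConfigDict with a top-level
# `batch_size` field (as in big_vision/configs/vit_s16_i1k.py); adjust to your version.
import ml_collections


def with_smaller_batch(base_config: ml_collections.ConfigDict,
                       batch_size: int = 512) -> ml_collections.ConfigDict:
    config = ml_collections.ConfigDict(base_config)  # copy so the original stays untouched
    config.batch_size = batch_size  # smaller global batch -> less host RAM per training step
    return config
```

Since the trainer reads its config through ml_collections config flags, a command-line override such as `--config.batch_size=512` should have the same effect, assuming that flag path matches your config.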
-
Only saw this now. Knowing Pavel, I think it's not relevant to him anymore, but for reference, here are two more options, at the slight expense of a little speed:

1. Set `cache_raw` to False in the config: `big_vision/big_vision/configs/vit_s16_i1k.py`, line 48 at c01707f.
2. Set `cache_final` to False: `big_vision/big_vision/evaluators/classification.py`, line 59 at c01707f.
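For concreteness, here is a hedged sketch of both changes as config edits. It assumes the big_vision-style `ml_collections` config from `vit_s16_i1k.py` (a top-level `cache_raw` field) and that each evaluator spec in `config.evals` is dict-like with a `cache_final` key that reaches the `Evaluator` in `big_vision/evaluators/classification.py`; the exact `config.evals` wiring varies between versions of the repo, so check yours.

```python
# Hedged sketch: disable both caches, trading some input-pipeline speed for host RAM.
# Assumptions (not confirmed against your checkout): `config.cache_raw` exists as in
# big_vision/configs/vit_s16_i1k.py, and evaluator specs accept a `cache_final` key
# that is forwarded to big_vision/evaluators/classification.py's Evaluator.
import ml_collections


def disable_dataset_caches(config: ml_collections.ConfigDict) -> ml_collections.ConfigDict:
    # Option 1: don't keep the raw (undecoded) training examples resident in host RAM;
    # the data is re-read from Cloud Storage on every pass instead.
    config.cache_raw = False

    # Option 2: don't cache the final pre-processed evaluation batches either.
    evals = config.get("evals", {})
    if isinstance(evals, (dict, ml_collections.ConfigDict)):
        for eval_spec in evals.values():
            if isinstance(eval_spec, (dict, ml_collections.ConfigDict)):
                eval_spec["cache_final"] = False

    return config
```

If the fields exist in your config, the same can also be done from the command line via ml_collections config-flag overrides, e.g. `--config.cache_raw=False`.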