
Training Time Issue #83

Open
imethanlee opened this issue Mar 15, 2022 · 4 comments

Comments

@imethanlee

Hi,

What is the expected time to train a PET model on the yelp_full dataset (with the default arguments)? I started training the day before yesterday on an RTX 3090 GPU and it is still running.

Thanks.

@timoschick
Owner

I don't know how efficient RTX 3090s are, but with a single Nvidia GeForce 1080Ti, training PET (not iPET) with the default parameters is a matter of a few hours. In case you haven't fixed the issue yourself yet, could you provide me with the exact command that you've used to train the model? Also, did you check (e.g., with nvidia-smi) whether the GPU is actually being used?
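For anyone hitting the same problem, a minimal sketch for verifying that PyTorch actually sees and uses the GPU, run from the same environment that launches pet/cli.py (the tensor sizes are arbitrary; watching nvidia-smi in a second terminal works just as well):

# Sanity check: can PyTorch see the GPU, and does a small matmul run on it?
import torch

print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("Device:", torch.cuda.get_device_name(0))
    x = torch.randn(1024, 1024, device="cuda")
    y = x @ x                      # should show up as utilization in nvidia-smi
    torch.cuda.synchronize()       # wait for the kernel to finish before printing
    print("GPU matmul OK:", tuple(y.shape))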

@jmcrey

jmcrey commented Apr 19, 2022

Hi @timoschick,

I am having the same issue here. I started training on an RTX 3090 yesterday and it is still running. The command I am using is as follows:

python pet/cli.py \
    --method pet \
    --pattern_ids 0 3 5 \
    --data_dir ${DATA_DIR} \
    --model_type albert \
    --model_name_or_path albert-xxlarge-v2 \
    --task_name boolq \
    --output_dir ${OUTPUT_DIR} \
    --do_train \
    --do_eval \
    --pet_per_gpu_eval_batch_size 8 \
    --pet_per_gpu_train_batch_size 2 \
    --pet_gradient_accumulation_steps 8 \
    --pet_max_steps 250 \
    --pet_max_seq_length 256 \
    --pet_repetitions 3 \
    --sc_per_gpu_train_batch_size 2 \
    --sc_per_gpu_unlabeled_batch_size 2 \
    --sc_gradient_accumulation_steps 8 \
    --sc_max_steps 5000 \
    --sc_max_seq_length 256 \
    --sc_repetitions 1
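For a rough sense of the workload these flags imply (a sketch that assumes pet_max_steps and sc_max_steps count optimizer steps, each consuming batch_size × gradient_accumulation examples, which may not match the codebase exactly):

# Back-of-the-envelope example count for the command above (assumption-laden sketch).
pet_per_pattern = 2 * 8 * 250            # batch 2 * grad accum 8 * 250 steps = 4,000 examples
pet_total = pet_per_pattern * 3 * 3      # 3 pattern ids * 3 repetitions = 36,000 examples
sc_total = 2 * 8 * 5000                  # final classifier: 80,000 examples
print(pet_total, sc_total)               # hours of work on a working GPU, not days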

@jmcrey

jmcrey commented Apr 20, 2022

Just a heads up -- I bumped up the version of PyTorch to 1.8.0 and CUDA to 11.3 and that solved the performance issues. I am now able to run through the first 126 epochs in about 12 minutes compared to 1.5 hours. I am still waiting to see if this affects the results, but the performance is much better.
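For reference, a quick way to confirm which PyTorch/CUDA build is actually active in an environment (an older build without support for the 3090's Ampere architecture, or a CPU-only wheel, would explain this kind of slowdown):

# Print the PyTorch version and the CUDA toolkit it was built against.
import torch

print("torch:", torch.__version__)              # e.g. 1.8.0
print("built with CUDA:", torch.version.cuda)   # None means a CPU-only build
print("cuDNN:", torch.backends.cudnn.version())
print("GPU visible:", torch.cuda.is_available())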

@jacksonchen1998

jacksonchen1998 commented Feb 16, 2023

@jmcrey So, did the results turn out OK?

I'm now using a 1080 Ti with CUDA 11.5 and TensorRT, training for 3 epochs.
My pre-trained model is RoBERTa-large and the dataset is AG News; the other arguments are set to the defaults.
It looks like training takes about half a day.
