
Hi~ #1601

Closed
leroy182 opened this issue Oct 30, 2024 · 0 comments

Comments

@leroy182

Hi~

First of all, thank you very much for your open-source work, and I'm sorry that you still need to help with SimCSE-related questions in 2024.

I am trying to train supervised SimCSE-BERT-base on four 24 GB RTX 4090 GPUs.

The .sh script I am using is shown below (the only change is replacing torch.distributed.launch with torchrun):

#!/bin/bash

# In this example, we show how to train SimCSE using multiple GPU cards and PyTorch's distributed data parallel on the supervised NLI dataset.
# Set how many GPUs to use

NUM_GPU=4

# Randomly set a port number
# If you encounter an "address already in use" error, just run again or manually set an available port.
# PORT_ID=$(expr $RANDOM + 1000)

# Allow multiple threads
export OMP_NUM_THREADS=8
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Use distributed data parallel
# If you only want to use one card, uncomment the "python train.py" line and comment out the torchrun line below
# python train.py \
# python -m torch.distributed.launch --nproc_per_node $NUM_GPU train.py \

torchrun --nproc_per_node $NUM_GPU train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/nli_for_simcse.csv \
    --output_dir result/my-sup-simcse-bert-base-uncased \
    --num_train_epochs 3 \
    --per_device_train_batch_size 128 \
    --learning_rate 5e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

I have not made any changes to train.py.

However, shortly after I run this .sh script in the terminal, the program fails with: torch.cuda.OutOfMemoryError: CUDA out of memory.

According to other issues, supervised SimCSE-BERT-base was trained on four RTX 3090 GPUs, which also have 24 GB each, so in theory the program should not run out of memory on four 4090s.

I would like to ask what might be causing this and how it can be resolved.
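
For context, this configuration gives an effective global batch size of 4 GPUs × 128 = 512, and in the supervised setting each training example contributes three sentences (a premise, an entailment, and a hard-negative contradiction), so each card encodes roughly 128 × 3 = 384 sequences per step. One common way to lower per-GPU memory while keeping the same effective batch size is to halve the per-device batch and compensate with gradient accumulation. The variant below is only an untested sketch, not a recipe from the SimCSE authors; also note that because the contrastive loss uses in-batch negatives, accumulation reduces the number of negatives seen per update and may not exactly reproduce the published numbers:

torchrun --nproc_per_node $NUM_GPU train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/nli_for_simcse.csv \
    --output_dir result/my-sup-simcse-bert-base-uncased \
    --num_train_epochs 3 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 2 \
    --learning_rate 5e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"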

Python 3.9.16
torch 2.2.1
transformers 4.2.1
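
Given the environment above, one quick way to rule out other processes already holding memory on the cards (a standard nvidia-smi query, nothing SimCSE-specific) is:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

If every GPU reports close to 0 MiB used before the script is launched, the out-of-memory error is coming from the training run itself rather than from leftover processes.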

Originally posted by @Anonymous-AI1 in princeton-nlp/SimCSE#286
