
Hi~ #1601

Closed
leroy182 opened this issue Oct 30, 2024 · 0 comments

Comments

@leroy182

Hi~

First of all, thank you very much for your open-source work, and I'm sorry that you still need to help with SimCSE-related questions in 2024.

I am trying to train supervised SimCSE-BERT-base on four 24 GB RTX 4090 GPUs.

The .sh script I am using is shown below (the only change is replacing torch.distributed.launch with torchrun):

#!/bin/bash

# In this example, we show how to train SimCSE using multiple GPU cards and PyTorch's distributed data parallel on the supervised NLI dataset.
# Set how many GPUs to use

NUM_GPU=4

# Randomly set a port number
# If you encounter an "address already in use" error, just run again or manually set an available port.
# PORT_ID=$(expr $RANDOM + 1000)

# Allow multiple threads
export OMP_NUM_THREADS=8
export CUDA_VISIBLE_DEVICES=0,1,2,3

# Use distributed data parallel
# If you only want to use one card, uncomment the "python train.py" line and comment out the torchrun line below
# python train.py \
# python -m torch.distributed.launch --nproc_per_node $NUM_GPU train.py \

torchrun --nproc_per_node $NUM_GPU train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/nli_for_simcse.csv \
    --output_dir result/my-sup-simcse-bert-base-uncased \
    --num_train_epochs 3 \
    --per_device_train_batch_size 128 \
    --learning_rate 5e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"

I have not made any changes to train.py.

However, shortly after I run this .sh script in the terminal, the program fails with: torch.cuda.OutOfMemoryError: CUDA out of memory.

According to other issues, supervised SimCSE-BERT-base was trained on four RTX 3090 GPUs, which also have 24 GB each, so in theory the program should not run out of memory on four 4090s.

I would like to ask what might be causing this and how it can be resolved.
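
For context, this configuration gives an effective global batch size of 4 GPUs × 128 = 512, and in the supervised setting each training example contributes three sentences (a premise, an entailment, and a hard-negative contradiction), so each card encodes roughly 128 × 3 = 384 sequences per step. One common way to lower per-GPU memory while keeping the same effective batch size is to halve the per-device batch and compensate with gradient accumulation. The variant below is only an untested sketch, not a recipe from the SimCSE authors; also note that because the contrastive loss uses in-batch negatives, accumulation reduces the number of negatives seen per update and may not exactly reproduce the published numbers:

torchrun --nproc_per_node $NUM_GPU train.py \
    --model_name_or_path bert-base-uncased \
    --train_file data/nli_for_simcse.csv \
    --output_dir result/my-sup-simcse-bert-base-uncased \
    --num_train_epochs 3 \
    --per_device_train_batch_size 64 \
    --gradient_accumulation_steps 2 \
    --learning_rate 5e-5 \
    --max_seq_length 32 \
    --evaluation_strategy steps \
    --metric_for_best_model stsb_spearman \
    --load_best_model_at_end \
    --eval_steps 125 \
    --pooler_type cls \
    --overwrite_output_dir \
    --temp 0.05 \
    --do_train \
    --do_eval \
    --fp16 \
    "$@"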

Python 3.9.16
torch 2.2.1
transformers 4.2.1
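
Given the environment above, one quick way to rule out other processes already holding memory on the cards (a standard nvidia-smi query, nothing SimCSE-specific) is:

nvidia-smi --query-gpu=index,memory.used,memory.total --format=csv

If every GPU reports close to 0 MiB used before the script is launched, the out-of-memory error is coming from the training run itself rather than from leftover processes.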

Originally posted by @Anonymous-AI1 in princeton-nlp/SimCSE#286
