Can you provide the running config of 65B models? #7
Comments
Hi. You can change …
I have the same problem: the LLaMA 65B model on 8 × V100 hits OOM. Are there any other parameters that should be set?
message details: …
Sorry, I misunderstood your question.
Thanks, I succeeded in running the 65B model with batch_size=1 and data_max_length=835. Each epoch takes 47 min (26 GB GPU memory, on an 8×V100 node with NVLink). To achieve the performance reported in the paper, did the 3090 GPUs have NVLink?
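The ~26 GB per GPU reported above is plausible: with the weights sharded evenly across 8 GPUs (as DeepSpeed ZeRO-3 partitioning does), the fp16 parameters alone account for roughly 16 GB per card, with the remainder going to activations, gradients, and framework overhead. A quick back-of-envelope sketch (the function name and the even-sharding assumption are mine, not from the thread):

```python
# Rough per-GPU memory estimate for a 65B-parameter model sharded
# evenly across 8 GPUs in fp16 (2 bytes per parameter). This covers
# weights only; the ~26 GB observed in practice also includes
# activations, gradients, and runtime overhead.

def per_gpu_param_memory_gb(n_params: float, bytes_per_param: int, n_gpus: int) -> float:
    """Parameter memory per GPU in GB (decimal) under even sharding."""
    return n_params * bytes_per_param / n_gpus / 1e9

print(per_gpu_param_memory_gb(65e9, 2, 8))  # ~16.25 GB of weights per V100
```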
Glad to hear that! |
Solved here. #28 |
Hi, I'd like to run a 65B LLaMA model with LOMO. What config should I use to run the training on an 8×RTX 3090 machine?
It would be very nice if you added a config/args_lomo.yaml and config/ds_config.json for 65B models.
Thanks.
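For anyone looking for a starting point before an official config lands: a sketch along the lines below may work, using the batch_size=1 and data_max_length=835 values from the successful run reported in this thread. The field names here are assumptions modeled on the repo's existing configs, not the maintainers' actual 65B settings; compare them against the shipped config/args_lomo.yaml before use.

```yaml
# Hypothetical config/args_lomo_65b.yaml -- field names are assumed,
# not taken from the repository; adjust to match the real args_lomo.yaml.
model_name_or_path: /path/to/llama-65b
per_device_train_batch_size: 1      # reported working value for 65B
data_max_length: 835                # reported working value for 65B
deepspeed: config/ds_config.json    # ZeRO-3, to shard the ~130 GB of fp16 weights
```

Note that on 3090s (24 GB each, vs. the 32 GB V100s in the run above) even these settings may still OOM; see #28, where this was reportedly solved.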