can you provide the running config of 65b models? #7

Closed
cyz14 opened this issue Jun 20, 2023 · 7 comments

cyz14 commented Jun 20, 2023

Hi, I'd like to run a 65B LLaMA model with LOMO. What config should I use to run the training on an 8× RTX 3090 machine?
It would be very nice if you added a config/args_lomo.yaml and config/ds_config.json for 65B models.
Thanks.

KaiLv69 (Collaborator) commented Jun 20, 2023

Hi. You can change model_name_or_path in config/args_lomo.yaml to the name or path of the 65B model to do that.
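
For example, assuming the 65B weights have already been converted to HuggingFace format and live at the placeholder path below (the path is illustrative, not part of the repo), the only required change in config/args_lomo.yaml would be:

# model
model_name_or_path: '/path/to/llama-65b-hf'  # placeholder path; point this at your own 65B HF checkpoint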

JoinHands commented
I have the same problem: LLaMA 65B on 8× V100 hits OOM. Are there any other parameters that should be set?
My args:

# model
model_name_or_path: '/data/home/scv9622/run/LLaMA/65B_hf'
# data
dataset_name: 'multirc'
refresh: false
data_tag: 'base'
train_on_inputs: false
data_max_length: 1024
# training
# trainer
tag: 'lomo'
output_dir: 'outputs'
overwrite_output_dir: true
deepspeed: 'config/ds_config.json'
do_train: true
do_eval: false
evaluation_strategy: 'epoch'
per_device_train_batch_size: 16
per_device_eval_batch_size: 2
learning_rate: 0.03
weight_decay: 0
num_train_epochs: 10
lr_scheduler_type: 'linear'
warmup: 0.1
clip_grad_norm: 1.0
save_strategy: 'no'
save_total_limit: 0
seed: 42
#bf16: true
remove_unused_columns: false
load_best_model_at_end: false
metric_for_best_model: 'acc'
group_by_length: false
#report_to: 'wandb'
dataloader_pin_memory: false
gradient_checkpointing: true
predict_with_generate: true
My ds_config.json:

{
    "bf16": {
        "enabled": false
    },
    "fp16": {
        "enabled": true
    },
    "zero_allow_untested_optimizer": true,
    "zero_force_ds_cpu_optimizer": false,

    "zero_optimization": {
        "stage": 3,
        "overlap_comm": true,
        "contiguous_gradients": true,
        "sub_group_size": 1e8,
        "stage3_max_live_parameters": 1e8,
        "stage3_max_reuse_distance": 1e8,
        "stage3_gather_16bit_weights_on_model_save": true
    },


    "gradient_accumulation_steps": 1,
    "steps_per_print": 2000,
    "train_micro_batch_size_per_gpu": 2,
    "wall_clock_breakdown": false
}

Error message:
"torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 968.00 MiB (GPU 0; 31.75 GiB total capacity; 25.87 GiB already allocated; 805.75 MiB free; 29.64 GiB reserved in total by PyTorch) If reserved memory is >> allocated memory try setting max_split_size_mb to avoid fragmentation. See documentation for Memory Management and PYTORCH_CUDA_ALLOC_CONF"

KaiLv69 (Collaborator) commented Jun 20, 2023

Sorry, I misunderstood your question.
batch_size 16 with data_max_length 1024 is not suitable for the 65B model on RTX 3090 or V100, because the activations are too large.
Maybe you can set batch_size to 1 or 2, or shorten data_max_length? :)
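
As a rough sketch of that suggestion in args_lomo.yaml (the exact values are only a starting point for fitting in memory, not a tuned recommendation):

# training
per_device_train_batch_size: 2   # down from 16; drop to 1 if this still OOMs
# data
data_max_length: 512             # shorter sequences shrink activation memory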

JoinHands commented
Thanks, I succeeded in running the 65B model with batch_size=1 and data_max_length=835; each epoch takes 47 min (26 GB GPU memory, 8× V100 node with NVLink). To achieve the performance reported in the paper, did the RTX 3090 GPUs have NVLink?

From the paper: "Finally, we successfully train the 65B model using 8 RTX 3090 GPUs, achieving a throughput of 4.93 TGS. Utilizing such a server configuration and LOMO, the training process on 1000 samples, each containing 512 tokens, requires approximately 3.6 hours."

KaiLv69 (Collaborator) commented Jun 21, 2023

Glad to hear that!
To speed up, you can turn off loss scaling (e.g., use BF16) and clip gradients by value instead of by norm (set clip_grad_norm in the config to None and clip_grad_value to 1.0 or so) to save the extra computation. By the way, your speed (47 min for 1000 samples with 835 tokens) is already faster than the result reported in the paper.
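
A minimal sketch of that change in args_lomo.yaml, assuming BF16-capable GPUs (RTX 3090/Ampere supports BF16; V100 does not, so on V100 you would keep FP16 and only swap the clipping options):

bf16: true             # train in BF16 so FP16 loss scaling is no longer needed (requires BF16-capable hardware)
clip_grad_norm: null   # i.e. None: skip the extra computation for the gradient norm
clip_grad_value: 1.0   # clip gradients element-wise by value instead

The fp16/bf16 blocks in ds_config.json would need to be flipped to match ("bf16": {"enabled": true}, "fp16": {"enabled": false}).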

alisyzhu commented
(Replying to @JoinHands' comment above.)

Hi, I ran the 65B LLaMA model with batch_size=1 and data_max_length=512 (32 GB GPU memory, 8× V100 node), but it failed. Could you tell me your successful config?
I tried 65B/33B with LOMO and LOMO+LoRA, and they all failed. (error screenshot attached)

This is my args_lomo.yaml file (screenshot attached).
This is my ds_config.json file (screenshot attached).

KaiLv69 (Collaborator) commented Jun 29, 2023

(Replying to @alisyzhu's comment above.)

Solved here. #28

KaiLv69 closed this as completed on Jul 19, 2023.