
[BUG] LoRA fine-tuning issue #342

Closed
2 tasks done
SHIMURA0 opened this issue Jul 16, 2024 · 21 comments

@SHIMURA0

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched FAQ

Current Behavior

During LoRA fine-tuning I hit "image start token != image end tokens", as well as:
UserWarning: None of the inputs have requires_grad=True
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation
I suspect this is related to how my dataset is prepared. How can I fix it? Thanks in advance!

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS: Ubuntu
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

What does your dataset currently look like?

@SHIMURA0
Author

The JSON file is one big list of dicts. Each dict contains an "ID", an "image" field holding the image path, and a "conversations" field, which is a list of dialogue turns.

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

Can you show the conversations in detail?

@SHIMURA0
Author

{
"role": "user",
"content": "Classify the image as label 0 or 1."
},
{
"role": "assistant",
"content": "This image is classified as label 0."
},

@SHIMURA0
Author

Strangely, when I add .to("cuda") after AutoModel.from_pretrained(), the bug goes away, but then I get a CUDA out of memory error instead.
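For context, a minimal sketch of the two loading styles being compared here, assuming a MiniCPM-V checkpoint loaded through transformers (the model path and dtype below are placeholders, not the repo's exact code):

```python
# Sketch only: contrasting eager GPU placement with letting the training
# script manage placement. MODEL_PATH is an assumed checkpoint identifier.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "openbmb/MiniCPM-Llama3-V-2_5"  # assumption: adjust to your local path

# Style 1: eager placement -- every parameter is copied to the default CUDA
# device immediately, which can trigger "CUDA out of memory" for a large
# vision-language model on a single card.
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).to("cuda")

# Style 2: load on CPU and let the training entry point (e.g. the Trainer /
# DeepSpeed setup in finetune.py) decide where parameters live.
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
```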

@SHIMURA0
Author

On a related note: my server has 8 NVIDIA GPUs, but I can only use 7 of them (all except index 0). Do I need to modify the distributed-training code in finetune.py and in finetune_lora.sh? 🤔

@SHIMURA0
Author

But I still need to solve the original problem first 😂

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

Your training data does not include the <image> token. Please organize your data as described here: https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#data-preparation
You don't need to modify finetune.py. Instead, set which GPUs are visible before running the script; see https://discuss.pytorch.org/t/what-does-export-cuda-visible-devices-1-really-do/90340
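For illustration, a rough sketch of a single training record in the shape described in the data-preparation guide linked above (field names should be double-checked against the repo's example file):

```python
# Rough sketch of one training record; not the repo's own example.
import json

record = {
    "id": "0",
    "image": "/path/to/image_0.jpg",
    "conversations": [
        {
            "role": "user",
            # The <image> placeholder marks where the image is inserted;
            # omitting it places the image at the front of the conversation.
            "content": "<image>\nClassify the image as label 0 or 1."
        },
        {
            "role": "assistant",
            "content": "This image is classified as label 0."
        }
    ]
}

with open("train.json", "w") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```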

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

What GPUs are you using?

@SHIMURA0
Author

OK, thanks, I'll take a look first.
The GPUs are 8× NVIDIA V100.
Do you mean the <image> token? I originally prepared the data following that guide, but then I saw the sentence "If you don't provide <image>, the image will be placed at the front of the conversation" and removed the token. So you're saying the token is required, right? I'll try it tomorrow.
Thanks again.

@Mihaiii

Mihaiii commented Jul 16, 2024

I also get "RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation" when running the LoRA script. I made sure my data is in the required format.

Attached is the output I get when running the command: output_lora.txt

FWIW, I also tried with this version of finetune.py and I get the same error as on the current main branch of the official repo.

Regarding deps, I installed the ones in requirements.txt plus the pinned versions mentioned in this PR.

I use only one machine (no parallelism).

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:46:00.0 Off |                    0 |
| N/A   32C    P0             41W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

@qyc-98
Collaborator

qyc-98 commented Jul 17, 2024

Hi, we have updated the code, please try again.

@SHIMURA0
Author

Adding the token didn't help; I still get the original error.

@SHIMURA0
Author

Hi, we have updated the code, please try again.

With the latest code I get a new error:
AttributeError: "ModulesToSaveWrapper" object has no attribute "embeddings"
It occurs in modeling_minicpmv.py in the model files, at lines 164 and 72.

@SHIMURA0
Author

I used a local cache of the model downloaded from modelscope, and in finetune.py I changed the model loading to use modelscope's AutoModel and AutoTokenizer to load that local cache. Could that have an effect?

@qyc-98
Collaborator

qyc-98 commented Jul 17, 2024

I suggest you re-download the model directly from Hugging Face.

@Mihaiii

Mihaiii commented Jul 17, 2024

Hi, we have updated the code, please try again.

I can confirm it's working now with "--bf16 true --bf16_full_eval true --fp16 false --fp16_full_eval false". Initially I tried FP16 and got an error saying "Attempting to unscale FP16 gradients", so I switched to BF16.

Thank you for the fix!
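For reference, the precision flags mentioned above correspond to standard transformers TrainingArguments fields; a minimal sketch of the equivalent in Python (this is not finetune_lora.sh itself, and output_dir is a placeholder):

```python
# Minimal sketch mapping the script flags onto transformers.TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/lora",   # placeholder output directory
    bf16=True,                  # --bf16 true
    bf16_full_eval=True,        # --bf16_full_eval true
    fp16=False,                 # --fp16 false
    fp16_full_eval=False,       # --fp16_full_eval false
)
```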

@qyc-98
Collaborator

qyc-98 commented Jul 17, 2024

You are welcome!

@SHIMURA0
Author

The embedding problem is solved, but are the Hugging Face and modelscope models really slightly different?
The current problem is CUDA out of memory. I've already tried the usual fixes, so I suspect something is off in the distributed-training setup. If I only want to train on a single card (GPU 2), how should I modify finetune.py?
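Following the earlier advice about GPU visibility, a sketch of restricting training to a single physical GPU (here GPU 2) without touching finetune.py, by limiting what CUDA can see before torch initializes:

```python
# Sketch: restrict the process to physical GPU 2. The same effect is usually
# achieved in the shell before launching the script, e.g.
#   CUDA_VISIBLE_DEVICES=2 sh finetune_lora.sh
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # must be set before importing torch

import torch
print(torch.cuda.device_count())  # -> 1; the visible GPU is now addressed as "cuda:0"
```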

@SHIMURA0
Author

Actually, that's not it: none of my GPUs are in use at the moment, yet I still get CUDA out of memory.

@SHIMURA0
Author

OK, I solved it.

@qyc-98 qyc-98 closed this as completed Jul 18, 2024