
[BUG] LoRA fine-tuning issue #342

Closed
2 tasks done
SHIMURA0 opened this issue Jul 16, 2024 · 21 comments

@SHIMURA0

Is there an existing issue / discussion for this?

  • I have searched the existing issues / discussions

Is there an existing answer for this in FAQ?

  • I have searched FAQ

Current Behavior

During LoRA fine-tuning I hit "image start token != image end tokens", as well as:
UserWarning: None of the inputs have requires_grad=True
RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation
I suspect this is related to how my dataset is prepared. How can I fix it? Thanks in advance!

Expected Behavior

No response

Steps To Reproduce

No response

Environment

- OS: Ubuntu
- Python:
- Transformers:
- PyTorch:
- CUDA (`python -c 'import torch; print(torch.version.cuda)'`):

Anything else?

No response

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

What does your dataset currently look like?

@SHIMURA0
Author

The JSON file is one big list of dicts. Each dict contains an "ID", an "image" field holding the image path, and a "conversations" field, which is a list of dialogue turns.

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

Can you show the conversations in detail?

@SHIMURA0
Author

{
"role": "user",
"content": "Classify the image as label 0 or 1."
},
{
"role": "assistant",
"content": "This image is classified as label 0."
},

@SHIMURA0
Author

Strangely, when I add .to("cuda") after AutoModel.from_pretrained(), the bug goes away, but then I get a CUDA out of memory error instead.
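For context, a minimal sketch of the two loading styles being compared here, assuming a MiniCPM-V checkpoint loaded through transformers (the model path and dtype below are placeholders, not the repo's exact code):

```python
# Sketch only: contrasting eager GPU placement with letting the training
# script manage placement. MODEL_PATH is an assumed checkpoint identifier.
import torch
from transformers import AutoModel, AutoTokenizer

MODEL_PATH = "openbmb/MiniCPM-Llama3-V-2_5"  # assumption: adjust to your local path

# Style 1: eager placement -- every parameter is copied to the default CUDA
# device immediately, which can trigger "CUDA out of memory" for a large
# vision-language model on a single card.
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16).to("cuda")

# Style 2: load on CPU and let the training entry point (e.g. the Trainer /
# DeepSpeed setup in finetune.py) decide where parameters live.
model = AutoModel.from_pretrained(MODEL_PATH, trust_remote_code=True,
                                  torch_dtype=torch.bfloat16)
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH, trust_remote_code=True)
```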

@SHIMURA0
Author

On a related note: my server has 8 NVIDIA GPUs, but I can only use 7 of them (all except index 0). Do I need to modify the distributed-training code in finetune.py and in finetune_lora.sh? 🤔

@SHIMURA0
Author

But I still need to solve the original problem first 😂

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

Your training data does not include the <image> token. Please organize your data as described here: https://github.com/OpenBMB/MiniCPM-V/tree/main/finetune#data-preparation
You don't need to modify finetune.py. Instead, set which GPUs are visible before running the script; see https://discuss.pytorch.org/t/what-does-export-cuda-visible-devices-1-really-do/90340
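For illustration, a rough sketch of a single training record in the shape described in the data-preparation guide linked above (field names should be double-checked against the repo's example file):

```python
# Rough sketch of one training record; not the repo's own example.
import json

record = {
    "id": "0",
    "image": "/path/to/image_0.jpg",
    "conversations": [
        {
            "role": "user",
            # The <image> placeholder marks where the image is inserted;
            # omitting it places the image at the front of the conversation.
            "content": "<image>\nClassify the image as label 0 or 1."
        },
        {
            "role": "assistant",
            "content": "This image is classified as label 0."
        }
    ]
}

with open("train.json", "w") as f:
    json.dump([record], f, ensure_ascii=False, indent=2)
```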

@qyc-98
Collaborator

qyc-98 commented Jul 16, 2024

What GPUs are you using?

@SHIMURA0
Author

OK, thanks, I'll take a look first.
The GPUs are 8× NVIDIA V100.
Do you mean the <image> token? I originally prepared the data following that guide, but then I saw the sentence "If you don't provide <image>, the image will be placed at the front of the conversation" and removed the token. So you're saying the token is required, right? I'll try it tomorrow.
Thanks again.

@Mihaiii

Mihaiii commented Jul 16, 2024

I also get "RuntimeError: a view of a leaf Variable that requires grad is being used in an in-place operation" when running the LoRA script. I made sure my data is in the required format.

Attached is the output I get when running the command: output_lora.txt

FWIW, I also tried with this version of finetune.py and I get the same error as on the current main branch of the official repo.

Regarding deps, I installed the ones in requirements.txt plus the pinned versions mentioned in this PR.

I use only one machine (no parallelism).

Here is the nvidia-smi output:

+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 555.42.02              Driver Version: 555.42.02      CUDA Version: 12.5     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100 80GB PCIe          On  |   00000000:46:00.0 Off |                    0 |
| N/A   32C    P0             41W /  300W |       1MiB /  81920MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+
                                                                                         
+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI        PID   Type   Process name                              GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

@qyc-98
Collaborator

qyc-98 commented Jul 17, 2024

Hi, we have updated the code, please try again.

@SHIMURA0
Author

Adding the token didn't help; I still get the original error.

@SHIMURA0
Author

Hi, we have updated the code, please try again.

With the latest code I get a new error:
AttributeError: "ModulesToSaveWrapper" object has no attribute "embeddings"
It occurs in modeling_minicpmv.py in the model files, at lines 164 and 72.

@SHIMURA0
Author

I used a local cache of the model downloaded from modelscope, and in finetune.py I changed the model loading to use modelscope's AutoModel and AutoTokenizer to load that local cache. Could that have an effect?

@qyc-98
Collaborator

qyc-98 commented Jul 17, 2024

I suggest you re-download the model directly from Hugging Face.

@Mihaiii

Mihaiii commented Jul 17, 2024

Hi, we have updated the code, please try again.

I can confirm it's working now with "--bf16 true --bf16_full_eval true --fp16 false --fp16_full_eval false". Initially I tried FP16 and got an error saying "Attempting to unscale FP16 gradients", so I switched to BF16.

Thank you for the fix!
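For reference, the precision flags mentioned above correspond to standard transformers TrainingArguments fields; a minimal sketch of the equivalent in Python (this is not finetune_lora.sh itself, and output_dir is a placeholder):

```python
# Minimal sketch mapping the script flags onto transformers.TrainingArguments.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="output/lora",   # placeholder output directory
    bf16=True,                  # --bf16 true
    bf16_full_eval=True,        # --bf16_full_eval true
    fp16=False,                 # --fp16 false
    fp16_full_eval=False,       # --fp16_full_eval false
)
```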

@qyc-98
Collaborator

qyc-98 commented Jul 17, 2024

You are welcome!

@SHIMURA0
Author

The embedding problem is solved, but are the Hugging Face and modelscope models really slightly different?
The current problem is CUDA out of memory. I've already tried the usual fixes, so I suspect something is off in the distributed-training setup. If I only want to train on a single card (GPU 2), how should I modify finetune.py?
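Following the earlier advice about GPU visibility, a sketch of restricting training to a single physical GPU (here GPU 2) without touching finetune.py, by limiting what CUDA can see before torch initializes:

```python
# Sketch: restrict the process to physical GPU 2. The same effect is usually
# achieved in the shell before launching the script, e.g.
#   CUDA_VISIBLE_DEVICES=2 sh finetune_lora.sh
import os
os.environ["CUDA_VISIBLE_DEVICES"] = "2"   # must be set before importing torch

import torch
print(torch.cuda.device_count())  # -> 1; the visible GPU is now addressed as "cuda:0"
```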

@SHIMURA0
Author

Actually, that's not it: none of my GPUs are in use at the moment, yet I still get CUDA out of memory.

@SHIMURA0
Author

OK, I solved it.

@qyc-98 qyc-98 closed this as completed Jul 18, 2024