About llava preprocessing: left padding or right padding? #190

Open
powermano opened this issue Oct 29, 2024 · 6 comments
powermano commented Oct 29, 2024

Looking at the official code, it seems padding is applied on the right: by default, a trailing `<image>` token is moved to the front, and then right-side padding is used.

By default, a trailing `<image>` is moved to the front:

```python
def preprocess_multimodal(
    sources: Sequence[str],
    data_args: DataArguments
) -> Dict:
    is_multimodal = data_args.is_multimodal
    if not is_multimodal:
        return sources

    for source in sources:
        for sentence in source:
            if DEFAULT_IMAGE_TOKEN in sentence['value']:
                # Strip <image> from wherever it appears and re-insert it at the front.
                sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '').strip()
                sentence['value'] = DEFAULT_IMAGE_TOKEN + '\n' + sentence['value']
                sentence['value'] = sentence['value'].strip()
                if "mmtag" in conversation_lib.default_conversation.version:
                    sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '<Image>' + DEFAULT_IMAGE_TOKEN + '</Image>')
            replace_token = DEFAULT_IMAGE_TOKEN
            if data_args.mm_use_im_start_end:
                replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
            sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)

    return sources
```
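For reference, here is a minimal sketch of what that rewrite does to a single conversation turn (the sample dialogue is made up for illustration; `DEFAULT_IMAGE_TOKEN` is `"<image>"` in llava.constants):

```python
# Illustration only: a made-up conversation with a trailing <image> token.
source = [
    {"from": "human", "value": "What is in this picture?\n<image>"},
    {"from": "gpt", "value": "A cat sitting on a sofa."},
]

# After preprocess_multimodal, the <image> token is stripped from wherever it
# appeared and re-inserted at the very beginning of the human turn:
#   {"from": "human", "value": "<image>\nWhat is in this picture?"}
```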

`torch.nn.utils.rnn.pad_sequence` seems to pad on the right. If `<image>` is at the front, then padding at the end is probably the better choice, right?
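A quick standalone check of that claim (not from the repo): `pad_sequence` always appends the padding value after each sequence, i.e. it pads on the right.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5])
print(pad_sequence([a, b], batch_first=True, padding_value=0))
# tensor([[1, 2, 3],
#         [4, 5, 0]])   <- the shorter sequence is padded at the end (right side)
```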

```python
@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances]
                                  for key in ("input_ids", "labels"))
        # pad_sequence appends padding at the end of each sequence (right padding).
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids,
            batch_first=True,
            padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(labels,
                                                 batch_first=True,
                                                 padding_value=IGNORE_INDEX)
        input_ids = input_ids[:, :self.tokenizer.model_max_length]
        labels = labels[:, :self.tokenizer.model_max_length]
        batch = dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

        if 'image' in instances[0]:
            images = [instance['image'] for instance in instances]
            if all(x is not None and x.shape == images[0].shape for x in images):
                batch['images'] = torch.stack(images)
            else:
                batch['images'] = images

        return batch
```
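If someone did want left padding in this collator (e.g. to match the `padding_side='left'` commonly used at generation time), one illustrative option, not something the repo does, is to flip each sequence, right-pad, and flip back:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_sequence_left(sequences, padding_value):
    """Left-pad a list of 1-D tensors by flipping, right-padding, then flipping back.
    Hypothetical helper for illustration; not part of the repo's data.py."""
    flipped = [seq.flip(0) for seq in sequences]
    padded = pad_sequence(flipped, batch_first=True, padding_value=padding_value)
    return padded.flip(1)

# input_ids = pad_sequence_left(input_ids, self.tokenizer.pad_token_id)
# labels    = pad_sequence_left(labels, IGNORE_INDEX)
```

The attention mask computed from `input_ids.ne(pad_token_id)` stays valid either way, but the `[:, :model_max_length]` truncation would then cut tokens from the right end, which is where the real content sits under left padding, so that step would also need adjusting.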

@powermano (Author)

@yuanzhoulvpi2017 Could you help take a look?

@yuanzhoulvpi2017 (Owner)

Take a look at this link: #186 (comment)

@powermano (Author)

> Take a look at this link: #186 (comment)

OK, I looked at the data: `<image>` is sometimes at the front and sometimes at the back, so I feel it would be better to standardize on a single position. The official code seems to handle this case by always moving `<image>` to the front. Should this be added to data.py?

@yuanzhoulvpi2017 (Owner)

I don't think this has an impact; where the token sits shouldn't matter much. Of course, I look forward to your experimental results~

@powermano (Author)

I can only train with the qwen1.5-0.5B model, since I only have a 4080 card and batch_size_per_gpu is just 2. The loss feels very large; could that be because the LLM is too small? I'll run a few experiments to see whether it makes a difference.

@weiaicunzai

If your experiments show that left vs. right padding really does affect performance, you could dig deeper into why; you might even get a paper out of it. I personally quite like papers that turn small details like this into real improvements.
