About llava preprocessing: left padding or right padding? #190

Open
powermano opened this issue Oct 29, 2024 · 6 comments
powermano commented Oct 29, 2024

Looking at the official code, it seems padding is applied on the right: by default, a trailing `<image>` token is moved to the front, and then right-side padding is used.

By default, a trailing `<image>` is moved to the front:

```python
def preprocess_multimodal(
    sources: Sequence[str],
    data_args: DataArguments
) -> Dict:
    is_multimodal = data_args.is_multimodal
    if not is_multimodal:
        return sources

    for source in sources:
        for sentence in source:
            if DEFAULT_IMAGE_TOKEN in sentence['value']:
                # Strip <image> from wherever it appears and re-insert it at the front.
                sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '').strip()
                sentence['value'] = DEFAULT_IMAGE_TOKEN + '\n' + sentence['value']
                sentence['value'] = sentence['value'].strip()
                if "mmtag" in conversation_lib.default_conversation.version:
                    sentence['value'] = sentence['value'].replace(DEFAULT_IMAGE_TOKEN, '<Image>' + DEFAULT_IMAGE_TOKEN + '</Image>')
            replace_token = DEFAULT_IMAGE_TOKEN
            if data_args.mm_use_im_start_end:
                replace_token = DEFAULT_IM_START_TOKEN + replace_token + DEFAULT_IM_END_TOKEN
            sentence["value"] = sentence["value"].replace(DEFAULT_IMAGE_TOKEN, replace_token)

    return sources
```
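For reference, here is a minimal sketch of what that rewrite does to a single conversation turn (the sample dialogue is made up for illustration; `DEFAULT_IMAGE_TOKEN` is `"<image>"` in llava.constants):

```python
# Illustration only: a made-up conversation with a trailing <image> token.
source = [
    {"from": "human", "value": "What is in this picture?\n<image>"},
    {"from": "gpt", "value": "A cat sitting on a sofa."},
]

# After preprocess_multimodal, the <image> token is stripped from wherever it
# appeared and re-inserted at the very beginning of the human turn:
#   {"from": "human", "value": "<image>\nWhat is in this picture?"}
```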

`torch.nn.utils.rnn.pad_sequence` seems to pad on the right. If `<image>` is at the front, then padding at the end is probably the better choice, right?
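A quick standalone check of that claim (not from the repo): `pad_sequence` always appends the padding value after each sequence, i.e. it pads on the right.

```python
import torch
from torch.nn.utils.rnn import pad_sequence

a = torch.tensor([1, 2, 3])
b = torch.tensor([4, 5])
print(pad_sequence([a, b], batch_first=True, padding_value=0))
# tensor([[1, 2, 3],
#         [4, 5, 0]])   <- the shorter sequence is padded at the end (right side)
```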

```python
@dataclass
class DataCollatorForSupervisedDataset(object):
    """Collate examples for supervised fine-tuning."""

    tokenizer: transformers.PreTrainedTokenizer

    def __call__(self, instances: Sequence[Dict]) -> Dict[str, torch.Tensor]:
        input_ids, labels = tuple([instance[key] for instance in instances]
                                  for key in ("input_ids", "labels"))
        # pad_sequence appends padding at the end of each sequence (right padding).
        input_ids = torch.nn.utils.rnn.pad_sequence(
            input_ids,
            batch_first=True,
            padding_value=self.tokenizer.pad_token_id)
        labels = torch.nn.utils.rnn.pad_sequence(labels,
                                                 batch_first=True,
                                                 padding_value=IGNORE_INDEX)
        input_ids = input_ids[:, :self.tokenizer.model_max_length]
        labels = labels[:, :self.tokenizer.model_max_length]
        batch = dict(
            input_ids=input_ids,
            labels=labels,
            attention_mask=input_ids.ne(self.tokenizer.pad_token_id),
        )

        if 'image' in instances[0]:
            images = [instance['image'] for instance in instances]
            if all(x is not None and x.shape == images[0].shape for x in images):
                batch['images'] = torch.stack(images)
            else:
                batch['images'] = images

        return batch
```
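If someone did want left padding in this collator (e.g. to match the `padding_side='left'` commonly used at generation time), one illustrative option, not something the repo does, is to flip each sequence, right-pad, and flip back:

```python
import torch
from torch.nn.utils.rnn import pad_sequence

def pad_sequence_left(sequences, padding_value):
    """Left-pad a list of 1-D tensors by flipping, right-padding, then flipping back.
    Hypothetical helper for illustration; not part of the repo's data.py."""
    flipped = [seq.flip(0) for seq in sequences]
    padded = pad_sequence(flipped, batch_first=True, padding_value=padding_value)
    return padded.flip(1)

# input_ids = pad_sequence_left(input_ids, self.tokenizer.pad_token_id)
# labels    = pad_sequence_left(labels, IGNORE_INDEX)
```

The attention mask computed from `input_ids.ne(pad_token_id)` stays valid either way, but the `[:, :model_max_length]` truncation would then cut tokens from the right end, which is where the real content sits under left padding, so that step would also need adjusting.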

@powermano (Author)

@yuanzhoulvpi2017 Could you help take a look?

@yuanzhoulvpi2017 (Owner)

Take a look at this link: #186 (comment)

@powermano (Author)

> Take a look at this link: #186 (comment)

OK, I looked at the data: `<image>` is sometimes at the front and sometimes at the back, so I feel it would be better to standardize on a single position. The official code seems to handle this case by always moving `<image>` to the front. Should this be added to data.py?

@yuanzhoulvpi2017 (Owner)

I don't think this has an impact; where the token sits shouldn't matter much. Of course, I look forward to your experimental results~

@powermano (Author)

I can only train with the qwen1.5-0.5B model, since I only have a 4080 card and batch_size_per_gpu is just 2. The loss feels very large; could that be because the LLM is too small? I'll run a few experiments to see whether it makes a difference.

@weiaicunzai

If your experiments show that left vs. right padding really does affect performance, you could dig deeper into why; you might even get a paper out of it. I personally quite like papers that turn small details like this into real improvements.
