fix mask_user_labels #130

Open
wants to merge 1 commit into base: main

Conversation

DoffeBupt

Dear authors of StarCoder: while using the framework, I noticed that mask_user_labels sometimes does not work properly. On investigation, I found an issue with how the function is invoked. Here are my modifications:

  1. In the group_texts() function in train.py, result["input_ids"] is a list of lists (one block-sized list per chunk), whereas mask_user_labels operates on a single flat list. As a result, mask_user_labels did not work as expected. I use a loop to process each row's labels separately, so the function can be called normally.
  2. In the mask_user_labels function, I also added masking for system-related labels. Moreover, the condition current_idx < len(labels) in the while loop should come first; otherwise the condition is meaningless, because labels[current_idx] is evaluated before the bounds check and can raise an out-of-bounds IndexError.
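Point 2 relies on Python's short-circuit evaluation of `and`. A minimal, self-contained sketch of the fix (stop_token is a placeholder value, not the actual special-token id derived from dialogue_template):

```python
labels = [10, 20, 30]
current_idx = len(labels)  # e.g. after scanning past the end of a turn
stop_token = 0             # placeholder, not a real special-token id

# Wrong order: labels[current_idx] is indexed before the bounds check,
# so this raises IndexError when current_idx == len(labels):
#   while labels[current_idx] != stop_token and current_idx < len(labels): ...

# Right order: the bounds check short-circuits before any indexing happens.
while current_idx < len(labels) and labels[current_idx] != stop_token:
    current_idx += 1

print(current_idx)  # 3: the loop exits safely without indexing out of bounds
```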

@wanglongxingtianxia

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    labels = concatenated_examples["input_ids"].copy()
    mask_user_labels(tokenizer, dialogue_template, labels)
    concatenated_examples["labels"] = labels
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

Can we change this?

@DoffeBupt
Author


Change

    mask_user_labels(tokenizer, dialogue_template, labels)

to

    for label in labels:
        mask_user_labels(tokenizer, dialogue_template, label)

I think this should make sense.
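If it helps, here is a minimal sketch of why the per-row loop matters when the labels are a list of block-sized lists. mask_in_place is a stand-in for mask_user_labels (which mutates a flat list in place), -1 is a made-up marker token, and -100 is the usual ignore index for the loss:

```python
def mask_in_place(labels, marker=-1):
    """Stand-in for mask_user_labels: mask tokens up to the first marker."""
    for i, token in enumerate(labels):
        if token == marker:
            break
        labels[i] = -100  # ignored by the loss

# Two block-sized rows, as produced by the chunking step in group_texts.
rows = [[1, 2, -1, 3], [4, -1, 5, 6]]

# Calling the masker once on `rows` would compare whole sub-lists against
# token ids and mask nothing; calling it per row works as intended.
for row in rows:
    mask_in_place(row)

print(rows)  # [[-100, -100, -1, 3], [-100, -1, 5, 6]]
```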

@wanglongxingtianxia

concatenated_examples['input_ids'] is a one-dimensional list
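Both observations can be checked with toy data: after chain(), concatenated_examples["input_ids"] is indeed one-dimensional, but the chunking step that builds result makes result["input_ids"] nested again. A small sketch mirroring the quoted group_texts (made-up token values, block_size of 2):

```python
from itertools import chain

block_size = 2
examples = {"input_ids": [[1, 2, 3], [4, 5]]}

# After chain(), the concatenated list is one-dimensional...
concatenated = {k: list(chain(*examples[k])) for k in examples}
print(concatenated["input_ids"])  # [1, 2, 3, 4, 5]

# ...but after splitting into block-sized chunks, it is nested again,
# which is the shape mask_user_labels receives in the proposed fix.
total_length = (len(concatenated["input_ids"]) // block_size) * block_size
result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated.items()
}
print(result["input_ids"])  # [[1, 2], [3, 4]]
```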
