fix mask_user_labels #130

Open
wants to merge 1 commit into base: main

Conversation

DoffeBupt

Dear authors of StarCoder: while using the framework, I noticed that mask_user_labels sometimes does not work properly. On investigation, I found an issue with how the function is invoked. Here are my modifications:

  1. In the group_texts() function in train.py, result["input_ids"] is a list of lists (one block-sized list per chunk), whereas mask_user_labels operates on a single flat list. As a result, mask_user_labels did not work as expected. I use a loop to process each row's labels separately, so the function can be called normally.
  2. In the mask_user_labels function, I also added masking for system-related labels. Moreover, the condition current_idx < len(labels) in the while loop should come first; otherwise the condition is meaningless, because labels[current_idx] is evaluated before the bounds check and can raise an out-of-bounds IndexError.
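Point 2 relies on Python's short-circuit evaluation of `and`. A minimal, self-contained sketch of the fix (stop_token is a placeholder value, not the actual special-token id derived from dialogue_template):

```python
labels = [10, 20, 30]
current_idx = len(labels)  # e.g. after scanning past the end of a turn
stop_token = 0             # placeholder, not a real special-token id

# Wrong order: labels[current_idx] is indexed before the bounds check,
# so this raises IndexError when current_idx == len(labels):
#   while labels[current_idx] != stop_token and current_idx < len(labels): ...

# Right order: the bounds check short-circuits before any indexing happens.
while current_idx < len(labels) and labels[current_idx] != stop_token:
    current_idx += 1

print(current_idx)  # 3: the loop exits safely without indexing out of bounds
```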

@wanglongxingtianxia

def group_texts(examples):
    # Concatenate all texts.
    concatenated_examples = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated_examples[list(examples.keys())[0]])
    # We drop the small remainder, we could add padding if the model supported it instead of this drop, you can
    # customize this part to your needs.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    # Split by chunks of max_len.
    labels = concatenated_examples["input_ids"].copy()
    mask_user_labels(tokenizer, dialogue_template, labels)
    concatenated_examples["labels"] = labels
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated_examples.items()
    }
    return result

Can we change this?

@DoffeBupt
Author


Change

    mask_user_labels(tokenizer, dialogue_template, labels)

to

    for label in labels:
        mask_user_labels(tokenizer, dialogue_template, label)

I think this should make sense.
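If it helps, here is a minimal sketch of why the per-row loop matters when the labels are a list of block-sized lists. mask_in_place is a stand-in for mask_user_labels (which mutates a flat list in place), -1 is a made-up marker token, and -100 is the usual ignore index for the loss:

```python
def mask_in_place(labels, marker=-1):
    """Stand-in for mask_user_labels: mask tokens up to the first marker."""
    for i, token in enumerate(labels):
        if token == marker:
            break
        labels[i] = -100  # ignored by the loss

# Two block-sized rows, as produced by the chunking step in group_texts.
rows = [[1, 2, -1, 3], [4, -1, 5, 6]]

# Calling the masker once on `rows` would compare whole sub-lists against
# token ids and mask nothing; calling it per row works as intended.
for row in rows:
    mask_in_place(row)

print(rows)  # [[-100, -100, -1, 3], [-100, -1, 5, 6]]
```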

@wanglongxingtianxia

concatenated_examples['input_ids'] is a one-dimensional list
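Both observations can be checked with toy data: after chain(), concatenated_examples["input_ids"] is indeed one-dimensional, but the chunking step that builds result makes result["input_ids"] nested again. A small sketch mirroring the quoted group_texts (made-up token values, block_size of 2):

```python
from itertools import chain

block_size = 2
examples = {"input_ids": [[1, 2, 3], [4, 5]]}

# After chain(), the concatenated list is one-dimensional...
concatenated = {k: list(chain(*examples[k])) for k in examples}
print(concatenated["input_ids"])  # [1, 2, 3, 4, 5]

# ...but after splitting into block-sized chunks, it is nested again,
# which is the shape mask_user_labels receives in the proposed fix.
total_length = (len(concatenated["input_ids"]) // block_size) * block_size
result = {
    k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
    for k, t in concatenated.items()
}
print(result["input_ids"])  # [[1, 2], [3, 4]]
```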
