
train_new_from_iterator() does not work when pre_tokenizer is null #35315

Closed

cecheta opened this issue Dec 18, 2024 · 3 comments
cecheta (Contributor) commented Dec 18, 2024

System Info

- transformers version: 4.47.1
- Platform: Ubuntu 20.04.6 LTS
- Python version: 3.10

Who can help?

@ArthurZucker, @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the steps listed in https://huggingface.co/learn/nlp-course/chapter6/2, but use microsoft/Phi-3.5-mini-instruct as the model instead of gpt2:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("code_search_net", "python", trust_remote_code=True)

def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

training_corpus = get_training_corpus()

old_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
```

```
Traceback (most recent call last):
  File "/home/azureuser/tokenizer.py", line 16, in <module>
    tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
  File "/anaconda/envs/myenv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 819, in train_new_from_iterator
    or tokenizer_json["pre_tokenizer"]["type"] == "Sequence"
TypeError: 'NoneType' object is not subscriptable
```
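The failure can be reproduced without downloading the model by simulating the parsed tokenizer.json (a minimal sketch; the dict literal below is an assumption standing in for Phi-3.5's actual tokenizer.json, whose `pre_tokenizer` field is null):

```python
# Simulated parsed tokenizer.json for a model whose pre_tokenizer is null,
# as is the case for microsoft/Phi-3.5-mini-instruct.
tokenizer_json = {"pre_tokenizer": None}

try:
    # The check from tokenization_utils_fast.py in the traceback above:
    tokenizer_json["pre_tokenizer"]["type"] == "Sequence"
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable
```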

Expected behavior

The tokenizer should be trained.
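A null-safe version of that check (a sketch of the kind of guard the fix needs, not the actual patch) would test for `None` before subscripting:

```python
tokenizer_json = {"pre_tokenizer": None}  # as parsed from Phi-3.5's tokenizer.json

# Guard against a null pre_tokenizer before reading its "type" field.
pre_tokenizer = tokenizer_json.get("pre_tokenizer")
is_sequence = pre_tokenizer is not None and pre_tokenizer["type"] == "Sequence"
print(is_sequence)  # False, instead of raising TypeError
```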

@cecheta cecheta added the bug label Dec 18, 2024
cecheta (Contributor, Author) commented Dec 18, 2024

This appears to have regressed in #33556.

Rocketknight1 (Member) commented

pinging @umarbutler @itazap from PR #33556

ArthurZucker (Collaborator) commented

Closing, as the fix PR is merged!
