
train_new_from_iterator() does not work when pre_tokenizer is null #35315

Closed

cecheta opened this issue Dec 18, 2024 · 3 comments
cecheta (Contributor) commented Dec 18, 2024

System Info

- transformers version: 4.47.1
- Platform: Ubuntu 20.04.6 LTS
- Python version: 3.10

Who can help?

@ArthurZucker, @itazap

Information

  • The official example scripts
  • My own modified scripts

Tasks

  • An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
  • My own task or dataset (give details below)

Reproduction

Follow the steps listed in https://huggingface.co/learn/nlp-course/chapter6/2, but use microsoft/Phi-3.5-mini-instruct as the model instead of gpt2:

```python
from datasets import load_dataset
from transformers import AutoTokenizer

raw_datasets = load_dataset("code_search_net", "python", trust_remote_code=True)

def get_training_corpus():
    dataset = raw_datasets["train"]
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]

training_corpus = get_training_corpus()

old_tokenizer = AutoTokenizer.from_pretrained("microsoft/Phi-3.5-mini-instruct")

tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
```

```
Traceback (most recent call last):
  File "/home/azureuser/tokenizer.py", line 16, in <module>
    tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
  File "/anaconda/envs/myenv/lib/python3.10/site-packages/transformers/tokenization_utils_fast.py", line 819, in train_new_from_iterator
    or tokenizer_json["pre_tokenizer"]["type"] == "Sequence"
TypeError: 'NoneType' object is not subscriptable
```
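The failure can be reproduced without downloading the model by simulating the parsed tokenizer.json (a minimal sketch; the dict literal below is an assumption standing in for Phi-3.5's actual tokenizer.json, whose `pre_tokenizer` field is null):

```python
# Simulated parsed tokenizer.json for a model whose pre_tokenizer is null,
# as is the case for microsoft/Phi-3.5-mini-instruct.
tokenizer_json = {"pre_tokenizer": None}

try:
    # The check from tokenization_utils_fast.py in the traceback above:
    tokenizer_json["pre_tokenizer"]["type"] == "Sequence"
except TypeError as e:
    print(e)  # 'NoneType' object is not subscriptable
```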

Expected behavior

The tokenizer should be trained.
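A null-safe version of that check (a sketch of the kind of guard the fix needs, not the actual patch) would test for `None` before subscripting:

```python
tokenizer_json = {"pre_tokenizer": None}  # as parsed from Phi-3.5's tokenizer.json

# Guard against a null pre_tokenizer before reading its "type" field.
pre_tokenizer = tokenizer_json.get("pre_tokenizer")
is_sequence = pre_tokenizer is not None and pre_tokenizer["type"] == "Sequence"
print(is_sequence)  # False, instead of raising TypeError
```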

@cecheta cecheta added the bug label Dec 18, 2024
cecheta (Contributor, Author) commented Dec 18, 2024

This appears to have regressed in #33556.

Rocketknight1 (Member) commented

pinging @umarbutler @itazap from PR #33556

ArthurZucker (Collaborator) commented

Closing, as the fix PR is merged!
