AttributeError: 'HFDatasetDataModule' object has no attribute 'tokenizer' #12080

Open · j40903272 opened this issue Feb 6, 2025 · 1 comment
Labels: bug (Something isn't working)

Describe the bug

I am running this script: https://github.com/NVIDIA/NeMo/blob/main/scripts/llm/pretraining.py
I modified line #160 to change MockDataModule to HFDatasetDataModule so I can use my own data.

from nemo.collections.llm.gpt.data import PreTrainingDataModule, HFDatasetDataModule, MockDataModule

pretrain.data = run.Config(
    HFDatasetDataModule,
    "YDTsai/pretrain_test",
    seq_length=pretrain.data.seq_length,
    global_batch_size=pretrain.data.global_batch_size,
    micro_batch_size=pretrain.data.micro_batch_size,
)

The script then fails with the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
    fdl_runner_app()
  File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 340, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 680, in main
    return _main(
           ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
    fdl_fn()
  File "/nemo_run/code/nemo/collections/llm/api.py", line 150, in pretrain
    return train(
           ^^^^^^
  File "/nemo_run/code/nemo/collections/llm/api.py", line 96, in train
    app_state = _setup(
                ^^^^^^^
  File "/nemo_run/code/nemo/collections/llm/api.py", line 838, in _setup
    _use_tokenizer(model, data, tokenizer)
  File "/nemo_run/code/nemo/collections/llm/api.py", line 795, in _use_tokenizer
    _set_with_io(model, "tokenizer", data.tokenizer)
                                     ^^^^^^^^^^^^^^
AttributeError: 'HFDatasetDataModule' object has no attribute 'tokenizer'

In addition, HFDatasetDataModule does not accept 'tokenizer' as an init argument, so any tokenizer passed to it is forwarded all the way down to the load_dataset function and causes other problems.

# this will not work
from nemo.collections.common.tokenizers.huggingface import AutoTokenizer
pretrain.data.tokenizer = AutoTokenizer("meta-llama/Llama-3.1-8B")
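
One workaround that seems plausible (a sketch only; this subclass is my own, not part of NeMo) is to wrap HFDatasetDataModule so a tokenizer can be attached without being forwarded to load_dataset:

# Hypothetical wrapper, assuming HFDatasetDataModule forwards unknown
# kwargs to load_dataset: keep the tokenizer for ourselves and expose it
# as the attribute that _use_tokenizer expects.
from nemo.collections.llm.gpt.data import HFDatasetDataModule

class HFDatasetDataModuleWithTokenizer(HFDatasetDataModule):
    def __init__(self, *args, tokenizer=None, **kwargs):
        super().__init__(*args, **kwargs)  # tokenizer is NOT passed through
        self.tokenizer = tokenizer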

Steps/Code to reproduce bug

  1. Replace MockDataModule with HFDatasetDataModule as described above.
  2. Run the script.

Expected behavior

HFDatasetDataModule should either take its tokenizer from the model or accept one explicitly as an init argument.
If I am doing this wrong, please let me know the right way to build my own datamodule, because I cannot find any documentation or examples for it.
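
For illustration, this is the kind of configuration I would expect to be able to write (hypothetical; HFDatasetDataModule does not accept a tokenizer argument today):

# Hypothetical usage, assuming the datamodule accepted a tokenizer init
# argument instead of forwarding it to load_dataset. Wrapping the
# tokenizer in run.Config would also keep it serializable for nemo_run.
import nemo_run as run
from nemo.collections.common.tokenizers.huggingface import AutoTokenizer
from nemo.collections.llm.gpt.data import HFDatasetDataModule

pretrain.data = run.Config(
    HFDatasetDataModule,
    "YDTsai/pretrain_test",
    tokenizer=run.Config(AutoTokenizer, "meta-llama/Llama-3.1-8B"),
    seq_length=pretrain.data.seq_length,
    global_batch_size=pretrain.data.global_batch_size,
    micro_batch_size=pretrain.data.micro_batch_size,
)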

Environment overview (please complete the following information)

I am using Docker with the nvcr.io/nvidia/nemo:dev image, updated on 02/03/2025 at 5:48 PM.

j40903272 added the bug label on Feb 6, 2025

j40903272 (Author) commented:

In the llm.api.pretrain function:

return train(
    model=model,
    data=data,
    trainer=trainer,
    log=log,
    resume=resume,
    optim=optim,
    tokenizer="data",
)

In the llm.api.finetune function:

return train(
    model=model,
    data=data,
    trainer=trainer,
    log=log,
    resume=resume,
    optim=optim,
    tokenizer="model",
    model_transform=peft,
)

pretrain explicitly passes tokenizer="data", which is what causes the problem.

def _use_tokenizer(model: pl.LightningModule, data: pl.LightningDataModule, tokenizer: TokenizerType) -> None:
    if tokenizer == "data":
        _set_with_io(model, "tokenizer", data.tokenizer)
    elif tokenizer == "model":
        _set_with_io(data, "tokenizer", model.tokenizer)
    else:

Not every datamodule has a tokenizer attribute.
It is possible to assign a tokenizer to the datamodule directly, but this fails when run with nemo_run because the attribute cannot be serialized.
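
To make the failure mode concrete (my understanding of the nemo_run path; not verified against its internals):

# Assigning the tokenizer after construction works for a plain local run,
# but the attribute is set outside the run.Config, so nemo_run cannot
# serialize it when the job is packaged and launched.
from nemo.collections.common.tokenizers.huggingface import AutoTokenizer
from nemo.collections.llm.gpt.data import HFDatasetDataModule

data = HFDatasetDataModule("YDTsai/pretrain_test")
data.tokenizer = AutoTokenizer("meta-llama/Llama-3.1-8B")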

This might also explain why there is a TODO here:

if tokenizer:  # TODO: Improve this
    _use_tokenizer(model, data, tokenizer)

A simple fix would be to check whether the datamodule has a tokenizer attribute and raise a warning or an exception if it does not.
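
For example, something like this around the existing call site (a sketch only, not a tested patch against NeMo):

# Hypothetical guard before _use_tokenizer (sketch, untested):
if tokenizer:  # TODO: Improve this
    if tokenizer == "data" and not hasattr(data, "tokenizer"):
        raise ValueError(
            f"tokenizer='data' was requested, but {type(data).__name__} "
            "has no 'tokenizer' attribute; use a datamodule that exposes "
            "one, or pass tokenizer='model'."
        )
    _use_tokenizer(model, data, tokenizer)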
