AttributeError: 'HFDatasetDataModule' object has no attribute 'tokenizer' #12080

Open · j40903272 opened this issue Feb 6, 2025 · 1 comment
Labels: bug (Something isn't working)

Describe the bug

I am running this script: https://github.com/NVIDIA/NeMo/blob/main/scripts/llm/pretraining.py
I modified line #160 to change MockDataModule to HFDatasetDataModule so I can use my own data.

from nemo.collections.llm.gpt.data import PreTrainingDataModule, HFDatasetDataModule, MockDataModule

pretrain.data = run.Config(
    HFDatasetDataModule,
    "YDTsai/pretrain_test",
    seq_length=pretrain.data.seq_length,
    global_batch_size=pretrain.data.global_batch_size,
    micro_batch_size=pretrain.data.micro_batch_size,
)

The script then fails with the following error:

Traceback (most recent call last):
  File "<frozen runpy>", line 198, in _run_module_as_main
  File "<frozen runpy>", line 88, in _run_code
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 66, in <module>
    fdl_runner_app()
  File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 340, in __call__
    raise e
  File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 323, in __call__
    return get_command(self)(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1161, in __call__
    return self.main(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 680, in main
    return _main(
           ^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/typer/core.py", line 198, in _main
    rv = self.invoke(ctx)
         ^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 1443, in invoke
    return ctx.invoke(self.callback, **ctx.params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/click/core.py", line 788, in invoke
    return __callback(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/usr/local/lib/python3.12/dist-packages/typer/main.py", line 698, in wrapper
    return callback(**use_params)
           ^^^^^^^^^^^^^^^^^^^^^^
  File "/opt/NeMo-Run/src/nemo_run/core/runners/fdl_runner.py", line 62, in fdl_direct_run
    fdl_fn()
  File "/nemo_run/code/nemo/collections/llm/api.py", line 150, in pretrain
    return train(
           ^^^^^^
  File "/nemo_run/code/nemo/collections/llm/api.py", line 96, in train
    app_state = _setup(
                ^^^^^^^
  File "/nemo_run/code/nemo/collections/llm/api.py", line 838, in _setup
    _use_tokenizer(model, data, tokenizer)
  File "/nemo_run/code/nemo/collections/llm/api.py", line 795, in _use_tokenizer
    _set_with_io(model, "tokenizer", data.tokenizer)
                                     ^^^^^^^^^^^^^^
AttributeError: 'HFDatasetDataModule' object has no attribute 'tokenizer'

In addition, HFDatasetDataModule does not accept 'tokenizer' as an init argument, so any tokenizer passed to it is forwarded all the way down to the load_dataset function and causes other problems.

# this will not work
from nemo.collections.common.tokenizers.huggingface import AutoTokenizer
pretrain.data.tokenizer = AutoTokenizer("meta-llama/Llama-3.1-8B")
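
One workaround that seems plausible (a sketch only; this subclass is my own, not part of NeMo) is to wrap HFDatasetDataModule so a tokenizer can be attached without being forwarded to load_dataset:

# Hypothetical wrapper, assuming HFDatasetDataModule forwards unknown
# kwargs to load_dataset: keep the tokenizer for ourselves and expose it
# as the attribute that _use_tokenizer expects.
from nemo.collections.llm.gpt.data import HFDatasetDataModule

class HFDatasetDataModuleWithTokenizer(HFDatasetDataModule):
    def __init__(self, *args, tokenizer=None, **kwargs):
        super().__init__(*args, **kwargs)  # tokenizer is NOT passed through
        self.tokenizer = tokenizer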

Steps/Code to reproduce bug

  1. Replace MockDataModule with HFDatasetDataModule as described above.
  2. Run the script.

Expected behavior

HFDatasetDataModule should either take its tokenizer from the model or accept one explicitly as an init argument.
If I am doing this wrong, please let me know the right way to build my own datamodule, because I cannot find any documentation or examples for it.
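
For illustration, this is the kind of configuration I would expect to be able to write (hypothetical; HFDatasetDataModule does not accept a tokenizer argument today):

# Hypothetical usage, assuming the datamodule accepted a tokenizer init
# argument instead of forwarding it to load_dataset. Wrapping the
# tokenizer in run.Config would also keep it serializable for nemo_run.
import nemo_run as run
from nemo.collections.common.tokenizers.huggingface import AutoTokenizer
from nemo.collections.llm.gpt.data import HFDatasetDataModule

pretrain.data = run.Config(
    HFDatasetDataModule,
    "YDTsai/pretrain_test",
    tokenizer=run.Config(AutoTokenizer, "meta-llama/Llama-3.1-8B"),
    seq_length=pretrain.data.seq_length,
    global_batch_size=pretrain.data.global_batch_size,
    micro_batch_size=pretrain.data.micro_batch_size,
)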

Environment overview (please complete the following information)

I am using Docker with the nvcr.io/nvidia/nemo:dev image, updated on 02/03/2025 at 5:48 PM.

j40903272 added the bug label on Feb 6, 2025

j40903272 (Author) commented:

In the llm.api.pretrain function:

return train(
    model=model,
    data=data,
    trainer=trainer,
    log=log,
    resume=resume,
    optim=optim,
    tokenizer="data",
)

In the llm.api.finetune function:

return train(
    model=model,
    data=data,
    trainer=trainer,
    log=log,
    resume=resume,
    optim=optim,
    tokenizer="model",
    model_transform=peft,
)

pretrain explicitly passes tokenizer="data", which is what causes the problem.

def _use_tokenizer(model: pl.LightningModule, data: pl.LightningDataModule, tokenizer: TokenizerType) -> None:
    if tokenizer == "data":
        _set_with_io(model, "tokenizer", data.tokenizer)
    elif tokenizer == "model":
        _set_with_io(data, "tokenizer", model.tokenizer)
    else:

Not every datamodule has a tokenizer attribute.
It is possible to assign a tokenizer to the datamodule directly, but this fails when run with nemo_run because the attribute cannot be serialized.
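
To make the failure mode concrete (my understanding of the nemo_run path; not verified against its internals):

# Assigning the tokenizer after construction works for a plain local run,
# but the attribute is set outside the run.Config, so nemo_run cannot
# serialize it when the job is packaged and launched.
from nemo.collections.common.tokenizers.huggingface import AutoTokenizer
from nemo.collections.llm.gpt.data import HFDatasetDataModule

data = HFDatasetDataModule("YDTsai/pretrain_test")
data.tokenizer = AutoTokenizer("meta-llama/Llama-3.1-8B")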

This might also explain why there is a TODO here:

if tokenizer:  # TODO: Improve this
    _use_tokenizer(model, data, tokenizer)

A simple fix would be to check whether the datamodule has a tokenizer attribute and raise a warning or an exception if it does not.
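
For example, something like this around the existing call site (a sketch only, not a tested patch against NeMo):

# Hypothetical guard before _use_tokenizer (sketch, untested):
if tokenizer:  # TODO: Improve this
    if tokenizer == "data" and not hasattr(data, "tokenizer"):
        raise ValueError(
            f"tokenizer='data' was requested, but {type(data).__name__} "
            "has no 'tokenizer' attribute; use a datamodule that exposes "
            "one, or pass tokenizer='model'."
        )
    _use_tokenizer(model, data, tokenizer)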
