builder_cls is None while loading bookcorpus #85

Charlly-D · 2024-11-16T01:48:38Z

At this step:
traindata = load_dataset(
    'bookcorpus', split='train'
)
the call
builder_cls = get_dataset_builder_class(dataset_module, dataset_name=dataset_name)
returns None for builder_cls, so the next statement fails:
builder_instance: DatasetBuilder = builder_cls(
TypeError: 'NoneType' object is not callable

pennyLuo-hub · 2024-11-25T12:17:08Z

You can download the data from this link (https://storage.googleapis.com/huggingface-nlp/datasets/bookcorpus/bookcorpus.tar.bz2) and extract it to a folder named "bookcorpus". I solved the issue by doing this, and I hope it helps you as well. @Charlly-D
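The download-and-extract step described above could be scripted roughly as follows. This is only a sketch using the Python standard library: the helper names `download_archive` and `extract_archive` are made up for illustration, and the target folder name "bookcorpus" is taken from the comment above.

```python
import tarfile
import urllib.request
from pathlib import Path

# URL quoted in the comment above.
BOOKCORPUS_URL = (
    "https://storage.googleapis.com/huggingface-nlp/datasets/"
    "bookcorpus/bookcorpus.tar.bz2"
)


def download_archive(url: str, archive_path: str) -> str:
    """Hypothetical helper: download `url` to `archive_path` and return the path."""
    urllib.request.urlretrieve(url, archive_path)
    return archive_path


def extract_archive(archive_path: str, dest_dir: str) -> list:
    """Hypothetical helper: extract a tar archive and return its member names."""
    Path(dest_dir).mkdir(parents=True, exist_ok=True)
    # "r:*" lets tarfile detect the compression (bz2, gz, xz) on its own.
    with tarfile.open(archive_path, "r:*") as tar:
        names = tar.getnames()
        if hasattr(tarfile, "data_filter"):
            # Newer Python versions: use the safer extraction filter.
            tar.extractall(dest_dir, filter="data")
        else:
            tar.extractall(dest_dir)
    return names


# Example usage (not run here because it downloads a large file):
# archive = download_archive(BOOKCORPUS_URL, "bookcorpus.tar.bz2")
# print(extract_archive(archive, "bookcorpus"))
```

How the extracted folder is then picked up depends on the loading code in this repository, so the path passed to `extract_archive` may need adjusting.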

Charlly-D · 2024-11-26T02:52:18Z

Oh, thank you very much. May I ask if you know why it doesn't work? I also find that when a dataset has to be loaded through a .py script, it reports the error "UnicodeDecodeError: 'utf-8' codec can't decode byte 0x8b in position 1: invalid start byte". @pennyLuo-hub

pennyLuo-hub · 2024-11-26T03:03:40Z

Maybe it's because the file is bookcorpus.tar.bz2 and hasn't been extracted. After extraction, the data consists of books_large_p1.txt and books_large_p2.txt. @Charlly-D
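One way to check whether a file is still compressed before reading it as text is to look at its first few bytes. The sketch below is an illustration, not part of the original fix: `sniff_compression` is a hypothetical helper, and the signatures are the standard gzip, bzip2 and xz magic numbers. As a side note, a gzip stream starts with the bytes 0x1f 0x8b, which would produce exactly the "byte 0x8b in position 1" error when decoded as UTF-8, so the file being read may have been gzip-compressed rather than bzip2-compressed.

```python
from typing import Optional

# Well-known magic numbers at the start of compressed files.
MAGIC_NUMBERS = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd7zXZ\x00": "xz",
}


def sniff_compression(path: str) -> Optional[str]:
    """Hypothetical helper: return the compression format name, or None."""
    with open(path, "rb") as f:
        head = f.read(6)  # longest signature above is 6 bytes
    for magic, name in MAGIC_NUMBERS.items():
        if head.startswith(magic):
            return name
    return None


def is_utf8_text(path: str, sample_size: int = 4096) -> bool:
    """Hypothetical helper: check that the start of the file decodes as UTF-8."""
    with open(path, "rb") as f:
        sample = f.read(sample_size)
    try:
        # errors are only expected from genuinely non-text content here;
        # a multi-byte character cut at the sample boundary is tolerated.
        sample.decode("utf-8", errors="strict")
    except UnicodeDecodeError as err:
        return err.start >= len(sample) - 3
    return True
```

If `sniff_compression` reports a format for a file that is expected to be plain text such as books_large_p1.txt, the file most likely still needs to be decompressed first.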

Charlly-D · 2024-11-26T03:13:17Z

Thank you very much. @pennyLuo-hub
