push_to_hub payload too large error when using large ClassLabel feature #7392

Open
DavidRConnell opened this issue Feb 11, 2025 · 1 comment

Describe the bug

Calling datasets.DatasetDict.push_to_hub raises HfHubHTTPError: 413 Client Error: Payload Too Large for url when the dataset contains a large ClassLabel feature, even if the total size of the dataset is small.

Steps to reproduce the bug

import random
import sys

import datasets

random.seed(42)


def random_str(sz):
    return "".join(chr(random.randint(ord("a"), ord("z"))) for _ in range(sz))


data = datasets.DatasetDict(
    {
        str(i): datasets.Dataset.from_dict(
            {
                "label": [list(range(3)) for _ in range(10)],
                "abstract": [random_str(10_000) for _ in range(10)],
            },
        )
        for i in range(3)
    }
)
features = data["1"].features.copy()
features["label"] = datasets.Sequence(
    datasets.ClassLabel(names=[str(i) for i in range(50_000)])
)
data = data.map(lambda examples: {}, features=features)

feat_size = sys.getsizeof(data["1"].features["label"].feature.names)
print(f"Size of ClassLabel names: {feat_size}")
# Size of ClassLabel names: 444376


data.push_to_hub("dconnell/pubtator3_test")

Note that the push succeeds if the ClassLabel has fewer names or if the ClassLabel is replaced with Value("int64").
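
The size printed by the script comes from sys.getsizeof on the names list, which counts only the list object and its element pointers, not the strings themselves. The standard-library sketch below compares that figure with two rough alternatives for the same 50,000 names. It assumes the names end up written out as text somewhere in the upload, which this issue does not confirm, and it uses a JSON array only as a stand-in for whatever format is actually used.

```python
import json
import sys

# Same label names as in the reproduction script.
names = [str(i) for i in range(50_000)]

# sys.getsizeof on a list counts the list object and its element
# pointers only, not the string objects those pointers refer to.
list_only = sys.getsizeof(names)

# Adding the size of every string object approximates the in-memory footprint.
with_strings = list_only + sum(sys.getsizeof(n) for n in names)

# Byte length of the names serialized as a JSON array. This is a proxy
# (assumption): the real upload format may differ and may add overhead.
json_bytes = len(json.dumps(names).encode("utf-8"))

print(f"list object only:     {list_only:>10,} bytes")
print(f"list plus strings:    {with_strings:>10,} bytes")
print(f"JSON-serialized text: {json_bytes:>10,} bytes")
```

If the serialized figure grows with the number of names while the row data stays the same, that is consistent with the observation that fewer names or a plain Value("int64") column avoids the error.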

Expected behavior

The dataset should be pushed to the Hub without error.

Environment info

  • datasets version: 3.2.0
  • Platform: Linux-5.15.0-126-generic-x86_64-with-glibc2.35
  • Python version: 3.12.8
  • huggingface_hub version: 0.28.1
  • PyArrow version: 19.0.0
  • Pandas version: 2.2.3
  • fsspec version: 2024.9.0