Optimized sequence encoding for scalars #7393

Merged
merged 1 commit into from
Feb 13, 2025

lukasgd
Contributor

@lukasgd lukasgd commented Feb 11, 2025

The change in #3197 introduced redundant list comprehensions when obj is a long sequence of scalars. This becomes a noticeable overhead in _apply_feature_types_on_example when loading data from an IterableDataset. The check for scalars in encode_nested_example proposed here eliminates it.
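To illustrate the idea, here is a minimal sketch of a scalar fast path. It is not the code in this PR or in datasets: the names encode_naive, encode_with_scalar_check and SCALAR_TYPES are hypothetical, and the schema argument is left out to keep the example short.

```python
from typing import Any

# Assumption: the set of types treated as "scalar" in this sketch.
SCALAR_TYPES = (int, float, str, bytes, bool)


def encode_naive(obj: Any) -> Any:
    """Hypothetical baseline: recurse into every element of a sequence."""
    if isinstance(obj, (list, tuple)):
        return [encode_naive(item) for item in obj]
    return obj


def encode_with_scalar_check(obj: Any) -> Any:
    """Hypothetical fast path: skip the per-element pass for scalar sequences."""
    if isinstance(obj, (list, tuple)):
        # Find the first non-None element to decide what the sequence holds.
        for first in obj:
            if first is not None:
                break
        else:
            # Empty sequence or only None values: nothing to encode.
            return obj
        if isinstance(first, SCALAR_TYPES):
            # Scalars need no encoding, so the sequence is returned unchanged
            # (this sketch assumes the sequence is homogeneous).
            return obj
        return [encode_with_scalar_check(item) for item in obj]
    return obj
```

The fast path inspects only the first non-None element, so for a flat list of scalars the cost no longer grows with one recursive call per element.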

In the following code example

```python
import time
from datasets.features import Sequence, Value
from datasets.features.features import encode_nested_example

# A flat sequence of 100000 integer scalars
schema = Sequence(Value("int32"))
obj = list(range(100000))

# Time a single call to encode_nested_example
start = time.perf_counter()
result = encode_nested_example(schema, obj)
stop = time.perf_counter()

print(f"Time spent is {stop - start} sec")
```

encode_nested_example becomes 492x faster (from 0.0769 to 0.0002 sec) for the list of length 100000, and 322x faster (from 0.00814 to 0.00003 sec) for a list of length 10000, on a GH200 system. This makes the encoding overhead unnoticeable when loading data with tokenization.

A second change avoids creating arrays from scalars and then re-extracting them when casting to Python objects (in that case obj == obj.__array__()[()]), which prevents a regression in the array write benchmarks.
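The round trip mentioned above can be reproduced with plain NumPy. This is only an illustration of why the intermediate array is redundant for a scalar, not the casting code from datasets; the variable names are made up for the example.

```python
import numpy as np

scalar = np.int32(7)

# Wrapping a scalar yields a 0-dimensional array...
as_array = scalar.__array__()

# ...and indexing it with an empty tuple extracts the scalar again.
round_tripped = as_array[()]

# The value and its type survive unchanged, so the detour adds only overhead.
assert as_array.ndim == 0
assert round_tripped == scalar
assert type(round_tripped) is np.int32
```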

Member

@lhoestq lhoestq left a comment

Cool ! LGTM :)

@lhoestq lhoestq merged commit a7cb715 into huggingface:main Feb 13, 2025
14 checks passed