Optimized sequence encoding for scalars #7393

lukasgd · 2025-02-11T20:30:44Z

The change in #3197 introduced redundant list comprehensions when obj is a long sequence of scalars. This adds noticeable overhead in _apply_feature_types_on_example when loading data from an IterableDataset. The check for scalars in encode_nested_example proposed here eliminates that overhead.

In the following code example

import time
from datasets.features import Sequence, Value
from datasets.features.features import encode_nested_example

schema = Sequence(Value("int32"))
obj = list(range(100000))

start = time.perf_counter()
result = encode_nested_example(schema, obj)
stop = time.perf_counter()

print(f"Time spent is {stop-start} sec")

encode_nested_example becomes 492x faster (from 0.0769 to 0.0002 sec) for the list of length 100000, and 322x faster (from 0.00814 to 0.00003 sec) for a list of length 10000, on a GH200 system. This makes the encoding overhead unnoticeable when loading data with tokenization.
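The idea behind the scalar check can be illustrated with a minimal, self-contained sketch. It is not the code from this PR: encode_element, is_plain_scalar, and both encode_sequence_* functions are hypothetical names, and the sketch assumes the sequence is homogeneous, so inspecting the first element is enough.

```python
def encode_element(value):
    """Hypothetical per-element encoder; plain scalars come back unchanged."""
    return value


def is_plain_scalar(value):
    """Assumed scalar test for this sketch; the real check may differ."""
    return isinstance(value, (bool, int, float, str, bytes))


def encode_sequence_naive(obj):
    # Visits every element through a list comprehension, even when
    # the elements are scalars that need no encoding at all.
    return [encode_element(x) for x in obj]


def encode_sequence_with_scalar_check(obj):
    # Fast path (assumption: homogeneous sequence): if the first element
    # is a plain scalar, skip the per-element encoding entirely.
    if len(obj) > 0 and is_plain_scalar(obj[0]):
        return list(obj)
    return [encode_element(x) for x in obj]


values = list(range(100000))
# Both paths produce the same result; only the amount of work differs.
assert encode_sequence_with_scalar_check(values) == encode_sequence_naive(values)
```

The fast path still copies the sequence with list(obj), but that is a single C-level copy rather than one Python-level call per element.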

A second change avoids creating arrays from scalars and then re-extracting them when casting to Python objects (obj == obj.__array__()[()] in that case), which prevents a regression in the array write benchmarks.
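The identity quoted above can be checked directly with NumPy. This is only an illustration of why the round trip is redundant for a NumPy scalar, not the casting code changed in this PR; the variable names are chosen for the example.

```python
import numpy as np

# A NumPy scalar, as it might appear inside an example being cast.
obj = np.int32(7)

# Wrapping the scalar in a 0-d array ...
as_array = obj.__array__()
assert as_array.ndim == 0

# ... and indexing with an empty tuple extracts the same scalar again.
roundtrip = as_array[()]
assert roundtrip == obj
assert type(roundtrip) is type(obj)
```

Because the round trip returns an equal value of the same type, skipping the array creation for scalars leaves the result unchanged.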

HuggingFaceDocBuilderDev · 2025-02-13T16:46:53Z

The docs for this PR live here. All of your documentation changes will be reflected on that endpoint. The docs are available until 30 days after the last update.

lhoestq

Cool ! LGTM :)

Optimized sequence encoding for scalars

6a3984a

lhoestq approved these changes Feb 13, 2025

lhoestq merged commit a7cb715 into huggingface:main Feb 13, 2025
14 checks passed
