System Info
modalities v0.02
🐛 Describe the bug
When instantiating the dataloader, there are two inefficiencies that cause very long instantiation times when using large datasets.

1. The index generation for the packed data is done in a for loop over all samples. Since `block_size` and `num_samples` are known, the for loop can be replaced with a vectorized operation (see the sketch below):

https://github.com/Modalities/modalities/blob/c9b4aabd1d216931c33cbf2b10e227a2502767f2/src/modalities/dataloader/dataset.py#L331-L334
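To make the vectorization idea concrete, here is a minimal sketch. It assumes the packed-data index is a sequence of `(offset, length)` pairs for fixed-size blocks; the function names and the exact layout are illustrative assumptions, not the actual code in `dataset.py`.

```python
import numpy as np

# Hypothetical layout: the packed-data index maps each sample to an
# (offset, length) pair of fixed-size blocks. The real index format in
# modalities' dataset.py may differ.

def build_index_loop(num_samples: int, block_size: int) -> list[tuple[int, int]]:
    # Current style: one Python-level iteration per sample.
    return [(i * block_size, block_size) for i in range(num_samples)]

def build_index_vectorized(num_samples: int, block_size: int) -> np.ndarray:
    # Vectorized equivalent: all offsets are computed in a single numpy call.
    offsets = np.arange(num_samples, dtype=np.int64) * block_size
    lengths = np.full(num_samples, block_size, dtype=np.int64)
    return np.stack([offsets, lengths], axis=1)

if __name__ == "__main__":
    # Both variants produce the same index; the vectorized one avoids
    # iterating a million times in Python.
    assert (build_index_vectorized(1_000_000, 2048)
            == np.asarray(build_index_loop(1_000_000, 2048))).all()
```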
2. The Dataloader implementation uses a `ResumableBatchSampler` to allow for skipping samples (e.g., for warm starts). The index in the `ResumableBatchSampler` is created in the constructor (modalities/src/modalities/dataloader/samplers.py, line 28 in c9b4aab) and the samples are later skipped via modalities/src/modalities/dataloader/samplers.py, line 41 in c9b4aab. Since we only have an iterable coming from the `DistributedSampler`, this for loop is the only way to build a copy of the index. A solution would be to adapt the original `DistributedSampler` with sample-skipping functionality here:

https://github.com/pytorch/pytorch/blob/e248c1d7ebe437094d42d6cad0acf5ffd0a27cad/torch/utils/data/distributed.py#L114

Skipping samples directly in the `DistributedSampler` would allow us to remove the for loop.
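A minimal sketch of what such an adaptation could look like, assuming the number of samples to skip is passed at construction time; the class name `SkippingDistributedSampler` and the `num_skip_samples` argument are illustrative and not part of the PyTorch or modalities API.

```python
import itertools

from torch.utils.data import Dataset, DistributedSampler


class SkippingDistributedSampler(DistributedSampler):
    """DistributedSampler variant that skips the first `num_skip_samples`
    indices of the epoch, e.g. when resuming from a warm start."""

    def __init__(self, dataset: Dataset, num_skip_samples: int = 0, **kwargs):
        super().__init__(dataset, **kwargs)
        self.num_skip_samples = num_skip_samples

    def __iter__(self):
        # Reuse the parent's shuffling/sharding logic and lazily drop the
        # already-consumed prefix, instead of copying the whole index in a
        # for loop inside the batch sampler.
        return itertools.islice(super().__iter__(), self.num_skip_samples, None)

    def __len__(self) -> int:
        return max(self.num_samples - self.num_skip_samples, 0)
```

With something like this, the batch sampler (or the dataloader directly) could consume an already-skipped iterable, e.g. `SkippingDistributedSampler(dataset, num_skip_samples=10_000, num_replicas=1, rank=0)`, without building its own copy of the index.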