System Info
modalities v0.02
🐛 Describe the bug
When instantiating the dataloader, there are two inefficiencies that cause very long instantiation times when using large datasets.

1. The index generation for the packed data is done in a for loop over all samples. Since `block_size` and `num_samples` are known, the for loop can be replaced with a vectorized operation (see the sketch below):

https://github.com/Modalities/modalities/blob/c9b4aabd1d216931c33cbf2b10e227a2502767f2/src/modalities/dataloader/dataset.py#L331-L334
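To make the vectorization idea concrete, here is a minimal sketch. It assumes the packed-data index is a sequence of `(offset, length)` pairs for fixed-size blocks; the function names and the exact layout are illustrative assumptions, not the actual code in `dataset.py`.

```python
import numpy as np

# Hypothetical layout: the packed-data index maps each sample to an
# (offset, length) pair of fixed-size blocks. The real index format in
# modalities' dataset.py may differ.

def build_index_loop(num_samples: int, block_size: int) -> list[tuple[int, int]]:
    # Current style: one Python-level iteration per sample.
    return [(i * block_size, block_size) for i in range(num_samples)]

def build_index_vectorized(num_samples: int, block_size: int) -> np.ndarray:
    # Vectorized equivalent: all offsets are computed in a single numpy call.
    offsets = np.arange(num_samples, dtype=np.int64) * block_size
    lengths = np.full(num_samples, block_size, dtype=np.int64)
    return np.stack([offsets, lengths], axis=1)

if __name__ == "__main__":
    # Both variants produce the same index; the vectorized one avoids
    # iterating a million times in Python.
    assert (build_index_vectorized(1_000_000, 2048)
            == np.asarray(build_index_loop(1_000_000, 2048))).all()
```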
2. The Dataloader implementation uses a `ResumableBatchSampler` to allow for skipping samples (e.g., for warm starts). The index in the `ResumableBatchSampler` is created in the constructor (modalities/src/modalities/dataloader/samplers.py, line 28 in c9b4aab) and the samples are later skipped via modalities/src/modalities/dataloader/samplers.py, line 41 in c9b4aab. Since we only have an iterable coming from the `DistributedSampler`, this for loop is the only way to build a copy of the index. A solution would be to adapt the original `DistributedSampler` with sample-skipping functionality here:

https://github.com/pytorch/pytorch/blob/e248c1d7ebe437094d42d6cad0acf5ffd0a27cad/torch/utils/data/distributed.py#L114

Skipping samples directly in the `DistributedSampler` would allow us to remove the for loop.
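A minimal sketch of what such an adaptation could look like, assuming the number of samples to skip is passed at construction time; the class name `SkippingDistributedSampler` and the `num_skip_samples` argument are illustrative and not part of the PyTorch or modalities API.

```python
import itertools

from torch.utils.data import Dataset, DistributedSampler


class SkippingDistributedSampler(DistributedSampler):
    """DistributedSampler variant that skips the first `num_skip_samples`
    indices of the epoch, e.g. when resuming from a warm start."""

    def __init__(self, dataset: Dataset, num_skip_samples: int = 0, **kwargs):
        super().__init__(dataset, **kwargs)
        self.num_skip_samples = num_skip_samples

    def __iter__(self):
        # Reuse the parent's shuffling/sharding logic and lazily drop the
        # already-consumed prefix, instead of copying the whole index in a
        # for loop inside the batch sampler.
        return itertools.islice(super().__iter__(), self.num_skip_samples, None)

    def __len__(self) -> int:
        return max(self.num_samples - self.num_skip_samples, 0)
```

With something like this, the batch sampler (or the dataloader directly) could consume an already-skipped iterable, e.g. `SkippingDistributedSampler(dataset, num_skip_samples=10_000, num_replicas=1, rank=0)`, without building its own copy of the index.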