[Data] iter_torch_batches very slow on video data #50128
Comments
I found some relevant issues:
I realized my entire dataset would fit in RAM, so I just circumvented the problem by loading everything into pinned memory and then using CUDA streams to load the next four batches onto the GPU. Here's my code if it helps anyone:

```python
import ray
import ray.train.torch
import torch


def iter_torch_batches_fast(dataset: ray.data.Dataset, batch_size: int):
    def pin_batch(batch):
        # Pin CPU tensors so host-to-device copies can run asynchronously.
        for key in batch:
            if isinstance(batch[key], torch.Tensor):
                batch[key] = batch[key].pin_memory()
        return batch

    # Materialize the whole dataset into pinned host memory up front.
    pinned = [pin_batch(batch) for batch in dataset.iter_torch_batches(
        batch_size=batch_size,
        prefetch_batches=4,
        device="cpu",
        dtypes=torch.float32,
    )]

    class PinnedDatasetStreamer:
        def __init__(self, pinned):
            self.pinned = iter(pinned)
            self.queue = []
            # Keep up to four batches in flight on the GPU.
            for _ in range(4):
                self.add_to_queue()

        def add_to_queue(self):
            device = ray.train.torch.get_device()
            try:
                batch = next(self.pinned)
                # Copy the next batch on a side stream so the transfer can
                # overlap with compute on the default stream.
                s = torch.cuda.Stream()
                with torch.cuda.stream(s):
                    b = {}
                    for key in batch:
                        if isinstance(batch[key], torch.Tensor):
                            b[key] = batch[key].to(device, non_blocking=True)
                        else:
                            b[key] = batch[key]
                    self.queue.append((b, s))
            except StopIteration:
                pass

        def __next__(self):
            if len(self.queue) == 0:
                raise StopIteration
            (batch, s) = self.queue.pop(0)
            self.add_to_queue()
            # Make the current stream wait for the copy stream to finish.
            torch.cuda.current_stream().wait_stream(s)
            return batch

    class PinnedDataset:
        def __init__(self, pinned):
            self.pinned = pinned

        def __iter__(self):
            return PinnedDatasetStreamer(self.pinned)

    return PinnedDataset(pinned)
```

With this, basically all the time is spent in training (data loading is now 0.01 sec per epoch, training is 5.4 sec).
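For reference, here is a minimal sketch of how a helper like this could be wired into a Ray Train loop; the tiny synthetic dataset, linear model, and hyperparameters are placeholders of mine, not from the original report, and `iter_torch_batches_fast` is assumed to be defined as above:

```python
import ray
import ray.train.torch
import torch
from ray.train import ScalingConfig
from ray.train.torch import TorchTrainer


def train_loop_per_worker():
    device = ray.train.torch.get_device()
    # Placeholder model and optimizer, just to show where the loader slots in.
    model = torch.nn.Linear(8, 1).to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
    loss_fn = torch.nn.MSELoss()

    # Placeholder dataset, small enough to fit in (pinned) host memory.
    ds = ray.data.from_items(
        [{"x": [float(i)] * 8, "y": [float(i % 2)]} for i in range(256)]
    )
    loader = iter_torch_batches_fast(ds, batch_size=16)

    for _ in range(2):  # epochs
        for batch in loader:
            optimizer.zero_grad()
            loss = loss_fn(model(batch["x"]), batch["y"])
            loss.backward()
            optimizer.step()


trainer = TorchTrainer(
    train_loop_per_worker,
    scaling_config=ScalingConfig(num_workers=1, use_gpu=True),
)
trainer.fit()
```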
Hey @FredrikNoren, we've noticed this issue recently as well.
@raulchen I tried tweaking the prefetch_batches first, but didn't see any performance improvements in my case. I also tried enabling actor_prefetcher_enabled but didn't see any gain either. I think in my case there are multiple bottlenecks:
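For anyone trying the same two knobs, here is a hedged sketch of where they are set; `prefetch_batches` is the documented `iter_torch_batches` argument, while placing `actor_prefetcher_enabled` on `ray.data.DataContext` is an assumption of mine that may differ across Ray versions:

```python
import ray
import torch

# Sketch of the two settings discussed above. NOTE: `actor_prefetcher_enabled`
# on DataContext is an assumption; verify it exists in your Ray version.
ctx = ray.data.DataContext.get_current()
ctx.actor_prefetcher_enabled = True

ds = ray.data.range(1_024)  # placeholder dataset

for batch in ds.iter_torch_batches(
    batch_size=16,
    prefetch_batches=4,  # number of batches to prefetch ahead of the consumer
):
    _ = batch["id"]  # the training step would go here
```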
Very insightful observations @FredrikNoren! Can you share a bit more info about the shape of your data? For example:
That step happens when we convert from the internal representation (Arrow) to NumPy -- by default NumPy requires a contiguous slab of memory for its batch, hence the concatenation of the chunks produced by PyArrow. This step could obviously be circumvented, but that is use-case dependent and could have performance repercussions.
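To illustrate the contiguity point, here is a small standalone PyArrow example (my own sketch, not Ray's internal code path):

```python
import numpy as np
import pyarrow as pa

# A column that arrives as several Arrow chunks, e.g. one chunk per block.
chunks = [pa.array(np.random.rand(1_000)) for _ in range(8)]
col = pa.chunked_array(chunks)

# A single chunk can be viewed by NumPy without copying ...
view = chunks[0].to_numpy(zero_copy_only=True)

# ... but a NumPy batch needs one contiguous buffer, so turning the whole
# chunked column into an array has to concatenate (i.e. copy) the chunks.
batch = col.combine_chunks().to_numpy()
print(view.shape, batch.shape)  # (1000,) (8000,)
```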
@alexeykudinkin Sure!
@FredrikNoren thanks for the insights!
What happened + What you expected to happen
I'm training a model on video data, and I noticed that 95% of the training time was spent just waiting for the next batch (an epoch took around 14 sec, and 13 of those seconds were spent waiting for the next batch). So I did some digging, and here's the flame graph for CPU usage:
This made me suspicious of the batching code, so I changed my batch_size from 16 to 1, which took my training time from 14 sec to 10 sec and the time to fetch a batch from 13 sec to 5 sec. I profiled it again, and here's the new flame graph:
Still not great, but a bit better. However, I'm using torchcodec to load the video data, which gives me a torch tensor back, so it feels a bit unnecessary for it to be converted to NumPy and then back to torch again (a sketch of that round trip is included further below). So my questions are:
Other notes:
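As background for the NumPy round trip mentioned above, here is a minimal sketch of the pattern in question; the random-frame decode function is a placeholder of mine, not the actual torchcodec pipeline:

```python
import ray
import torch


def decode_video(row: dict) -> dict:
    # Placeholder for a torchcodec-based decoder: random frames stand in for
    # real decoding, but like torchcodec it produces a torch.Tensor.
    frames = torch.rand(8, 3, 64, 64)
    # Ray Data blocks are Arrow/NumPy-backed, so the tensor is converted to
    # NumPy when the row is returned from the map function ...
    return {"frames": frames.numpy()}


ds = ray.data.range(32).map(decode_video)

# ... and converted back to torch tensors when batches are consumed.
for batch in ds.iter_torch_batches(batch_size=4):
    assert isinstance(batch["frames"], torch.Tensor)  # shape (4, 8, 3, 64, 64)
```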
Versions / Dependencies
ray 2.41.0
torch 2.5.1
Reproduction script
None
Issue Severity
None