Suggestion for improving parallel data loading speed #1249
@dfalbel let me know what you think. This should not be too hard to implement, I guess.
The approach sounds reasonable; my questions are:
I would argue that serialization is not that expensive in torch, because with safetensors it's essentially a [...]
Yes, I am pretty sure.
I made the assumption that only the batch dimension varies (and that we know its maximum value). But I now realize that Transformers or RNNs can also have a variable sequence length. This would mean we have to require the user to specify the maximum size of each dimension.
I thought we could still communicate those shapes from the worker processes through a connection. Because these objects are much smaller, I thought this would no longer be an issue, but maybe that's wrong. But yeah, maybe this is not as simple as I first thought...
But we can't fully rely on the [...]
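A minimal sketch of the max-dimension idea above, assuming a single 3d batch (batch, sequence, features) whose maximum sizes the user declares up front. All names here (`max_batch_size`, `max_seq_len`, ...) are hypothetical, and `SharedObject::share()` with `copyOnWrite = FALSE` is assumed to allow writing into the buffer in place:

```r
library(SharedObject)

max_batch_size <- 32
max_seq_len    <- 512
n_features     <- 64

# One shared buffer, allocated at the maximum size so it never has to grow.
# copyOnWrite = FALSE so a worker can write into it in place.
buf <- share(
  array(0, dim = c(max_batch_size, max_seq_len, n_features)),
  copyOnWrite = FALSE
)

# Worker side (simulated here): write a smaller batch into the top-left corner.
actual_batch <- 16
actual_len   <- 128
buf[seq_len(actual_batch), seq_len(actual_len), ] <-
  rnorm(actual_batch * actual_len * n_features)

# Only the small shape vector c(actual_batch, actual_len, n_features) would be
# sent through the worker connection; the main process then slices the buffer:
batch <- buf[seq_len(actual_batch), seq_len(actual_len), , drop = FALSE]
```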
Also, can you maybe explain again (if you know) why this is not a problem for the PyTorch dataloader?
AFAICT PyTorch uses similar ideas, e.g.: https://pytorch.org/docs/stable/generated/torch.Tensor.share_memory_.html
I was looking at the PyTorch source code to see how to make use of [...]. You'll also need a [...]. Then use [...] to create a tensor pointing to the shared memory location. Unfortunately, most of this is Python-specific C++ code, so we would need to re-implement it on the Lantern side.
Following up on a discussion we had previously. The main bottleneck when using a parallel dataloader is currently the serialization-deserialization roundtrip that happens every time a batch is sent from one of the workers to the main process.
My suggestion for solving this problem would be to rely on the `SharedObject` library, which allows sharing matrices between processes. I think we could do the following (assuming for simplicity that the dataloader returns only a single 2d tensor, but this can easily be generalized):
For `num_workers` workers, we create `num_workers` shared matrices, i.e. matrix M1 for worker W1, matrix M2 for worker W2, and so on. Worker W1 then writes its batch into M1, and the main process copies the buffer out of M1 to create tensor T1.
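A rough sketch of this copy-based variant, assuming `SharedObject::share(copyOnWrite = FALSE)` gives a matrix that workers can fill in place. The workers themselves are simulated in a single process here; this is not a proposed dataloader API, just an illustration:

```r
library(torch)
library(SharedObject)

num_workers    <- 2
max_batch_size <- 32
n_features     <- 10

# One shared matrix per worker; handing it to a worker process only transfers
# a small descriptor, not the data itself.
shared_mats <- lapply(seq_len(num_workers), function(i) {
  share(matrix(0, nrow = max_batch_size, ncol = n_features),
        copyOnWrite = FALSE)
})

# Worker W1 fills M1 in place (simulated in this process).
shared_mats[[1]][] <- rnorm(max_batch_size * n_features)

# Main process: copy the buffer out of M1 into a regular torch tensor T1.
t1 <- torch_tensor(shared_mats[[1]])
```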
I think we can even do without the copying, but things can go wrong here, so it should be made opt-in: to avoid copying the buffer from M1 to create tensor T1, we could use `torch_tensor_from_buffer` so that M1 and T1 also share memory. What is tricky now is to determine when worker W1 is allowed to write into matrix M1 again.
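An opt-in, zero-copy sketch of that step. It assumes `torch_tensor_from_buffer()` can wrap the shared matrix's memory directly and that the shape/dtype arguments behave as shown; the exact signature, and the column-major vs. row-major layout question, would need to be checked against the real function:

```r
library(torch)
library(SharedObject)

max_batch_size <- 32
n_features     <- 10

m1 <- share(matrix(0, nrow = max_batch_size, ncol = n_features),
            copyOnWrite = FALSE)

# View M1's memory as a tensor instead of copying it. R doubles are 64-bit,
# hence dtype = "float64"; R matrices are column-major, which may require
# passing a transposed shape or a follow-up permute.
t1 <- torch_tensor_from_buffer(
  m1,
  shape = c(max_batch_size, n_features),
  dtype = "float64"
)

# Because T1 and M1 share memory, writing into M1 also changes T1.
# This is exactly why reusing M1 too early is dangerous.
m1[1, 1] <- 42
t1[1, 1]
```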
One reasonable heuristic here would be to say that W1 can write into M1 again once a batch has been loaded from another worker Wi with i != 1.
This can still lead to issues when the batches are saved somewhere and not only used for a forward pass.
I still think it would be useful to offer this option, as it might be okay in many use cases.
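A tiny sketch of the bookkeeping this heuristic implies on the main-process side (all names hypothetical): when a batch from worker j is handed to the user, Mj is marked as in use and every other buffer is released.

```r
num_workers <- 2

# TRUE means the buffer may still be referenced by the user and must not be
# overwritten by its worker yet.
locked <- rep(FALSE, num_workers)

mark_consumed <- function(locked, j) {
  locked[] <- FALSE   # every other worker may now refill its buffer
  locked[j] <- TRUE   # the batch from worker j is still "live"
  locked
}

locked <- mark_consumed(locked, 1)  # after a batch from W1: M1 locked, M2 free
locked <- mark_consumed(locked, 2)  # after a batch from W2: W1 may reuse M1
```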