Implement cluster-aware dataloaders #13
Comments
Here is my opening suggestion for how to do it. Don't hesitate to criticize it :)

Code organization

Basically, …

Details

```python
import os

import milatools.datasets
from torchvision.datasets import ImageFolder

path = milatools.datasets.resolve(
    # Placeholders used in the path templates below
    template={"name": "X", "tmp": os.environ["SLURM_TMPDIR"]},
    # Places where the data can be found
    sources=[
        "local://../data/{name}",
        "mila:///network/datasets/{name}",
        "mila:///network/scratch/u/user/{name}",
        "https://download.site",
        "s3://blah",
        "magnet:?xt=abcdef",
    ],
    # Place where the data can be cached (if downloaded)
    cache=["mila:///network/scratch/u/user/{name}"],
    # Places where the data must go
    destinations=["local://../data/{name}", "mila://{tmp}/{name}"],
    # Whether the data should be untarred, unzipped, etc.
    unpack=True,
    # Optionally, md5 hash for verification
    md5="xyz",
)
data = ImageFolder(path)
```

I am sure the interface can be improved, but the basic idea is to let the user specify many paths for the data, separated into "sources" where it can be found, "cache" where the download can be cached, and "destinations" where it must be copied or unpacked for actual processing. In a local setting, the same path can be used for all roles. Each environment could fill in some paths by default, so that e.g. the …
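As a rough illustration of what `resolve` could do internally under this proposal, here is a minimal sketch: check destinations, then cached copies, then sources, and copy or unpack the first hit. Nothing below is an existing milatools API; the helper names, the URI handling, and the behaviour are assumptions made for illustration only.

```python
import os
import shutil


def _expand(paths, template):
    """Fill in {name}-style placeholders in each path template."""
    return [p.format(**template) for p in paths]


def _as_local_path(uri):
    """Strip a local:// or mila:// prefix to get a plain filesystem path."""
    return uri.split("://", 1)[1] if "://" in uri else uri


def resolve(template, sources, destinations, cache=(), unpack=False, md5=None):
    sources = _expand(sources, template)
    destinations = _expand(destinations, template)
    cache = _expand(list(cache), template)

    # 1. If the data is already at a destination, use it as-is.
    for dest in destinations:
        path = _as_local_path(dest)
        if os.path.exists(path):
            return path

    # 2. Otherwise look through cached copies first, then sources.
    target = _as_local_path(destinations[0])
    for src in cache + sources:
        if not src.startswith(("local://", "mila://")):
            continue  # http/s3/magnet sources would be downloaded into `cache` here
        path = _as_local_path(src)
        if not os.path.exists(path):
            continue
        if unpack and path.endswith((".tar", ".tar.gz", ".zip")):
            shutil.unpack_archive(path, target)
        else:
            shutil.copytree(path, target, dirs_exist_ok=True)
        return target  # md5 verification would happen before returning

    raise FileNotFoundError(f"could not resolve dataset {template['name']!r}")
```

The actual download, checksum, and cluster-specific logic would of course need much more care; the point is only that the three path roles are enough to drive a simple resolution loop.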
I would add a …
I agree with the general ideas described above. Yes, I think we want to use ….

While I think it makes sense to return torchvision / PyTorch dataset instances whenever we can, we should not commit to that feature for all datasets. A lot of datasets do not follow PyTorch's standard structure, and some are sufficiently complex that I do not want to have to understand how the different auxiliary files interact with each other and which one is used in which case. So we should also allow retrieving the location of the dataset's root in destination, so that custom dataset classes can be implemented by the users when necessary.

We could allow users to then submit those classes as PRs, but this would add another layer of complexity to handle, as it is likely that some dataset classes would not generalize to all use cases, specifically when a dataset contains multiple data types (audio, video, images, sensor data, ...). In those cases, a well-documented array of dataset classes could be necessary. So I see returning PyTorch datasets as a strong feature, but still something that could be implemented in a second step.
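To illustrate the "just give me the root" path mentioned above, a user-side sketch could look like the following; the dataset class, the file extension, and the idea that milatools only handles placement are assumptions for illustration:

```python
import os

from torch.utils.data import Dataset


class MySensorDataset(Dataset):
    """Hypothetical custom dataset reading raw files under the resolved root."""

    def __init__(self, root):
        # milatools would only guarantee that `root` is in the right place;
        # parsing the files is entirely up to the user.
        self.files = sorted(
            os.path.join(root, f) for f in os.listdir(root) if f.endswith(".bin")
        )

    def __len__(self):
        return len(self.files)

    def __getitem__(self, index):
        with open(self.files[index], "rb") as fh:
            return fh.read()
```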
Wouldn't the above be cleaner? Later, users could go as far as: …

get_current() is constant anyway, and we can do the get_builder call behind the scenes the first time.
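If I read the get_current() / get_builder() idea correctly, a sketch could look like the following; the class names, the environment-variable check, and the builders table are all assumptions rather than an agreed design:

```python
import functools
import os


class LocalEnvironment:
    """Default environment: datasets live wherever the user put them."""
    builders = {"ImageNet": lambda root="../data/imagenet": root}  # placeholder builders


class MilaEnvironment:
    """Mila cluster: datasets come from the shared /network/datasets folder."""
    builders = {"ImageNet": lambda root="/network/datasets/imagenet": root}


@functools.lru_cache(maxsize=None)
def get_current():
    """Detect the environment once; the result is constant for the process."""
    if os.environ.get("SLURM_CLUSTER_NAME") == "mila":
        return MilaEnvironment()
    return LocalEnvironment()


def get_builder(name):
    """Look up the dataset builder for the current environment, behind the scenes."""
    return get_current().builders[name]


# e.g. path = get_builder("ImageNet")()
```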
About multi-framework support, we could have something of this kind in milatools/datasets.py:

```python
import warnings

try:
    import torch
except ImportError:
    warnings.warn(
        "`torch` is not installed. To have PyTorch support in milatools.datasets, "
        "please `pip install torch` or whatever."
    )
else:
    from milatools.datasets.torch import *
```

That way we just don't load what is not available. If they want to import the module anyway, they can use the full path, but that's on them.
I'm not sure what you mean. Also, I may have expressed myself poorly; what I meant by "…" …
I'm thinking that we should support the range of ways people typically use to load datasets. If some people use torchvision, we should have a compatible wrapper; if others do something custom with the data on disk, we should have something to help them ensure the data is in the proper place and return the dataset's final location; and so on.
Possibly, but it depends on how the environments declare these things, e.g. whether they are attributes or keys in a dictionary. It's an implementation detail.
True, that would be ideal. Or just …
True, it could be done at the module level.
It would be hidden away for sure, but it's best to get it right the first time ;) The main point I want to make about environments is that I would prefer to test for which cluster we are on, and use the result of that test to switch in the appropriate set of dataloaders wholesale from …
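A sketch of that "switch in wholesale" idea is below. The cluster names, the environment-variable check, and the per-environment tables are assumptions; the placeholder builders just return paths so the snippet stays self-contained:

```python
import os


def _detect_cluster():
    """Guess which cluster we are on; the heuristic here is only an assumption."""
    cluster = os.environ.get("SLURM_CLUSTER_NAME", "")
    if cluster == "mila":
        return "mila"
    if cluster in ("cedar", "graham", "beluga", "narval"):
        return "cc"
    return "local"


def _mila_imagenet(root=None):
    return "/network/datasets/imagenet"   # placeholder for the real loader


def _local_imagenet(root="../data/imagenet"):
    return root                           # placeholder for the real loader


# One table per environment; the whole table is swapped in at import time.
_DATASETS = {
    "mila": {"ImageNet": _mila_imagenet},
    "cc": {"ImageNet": _local_imagenet},  # CC-specific paths would go here
    "local": {"ImageNet": _local_imagenet},
}

ImageNet = _DATASETS[_detect_cluster()]["ImageNet"]
```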
That would suggest that the symbols available in …

I think it's better to require the user to be explicit about what they want.
I suggest that we first focus our efforts on providing a drop-in replacement for the torchvision.datasets module, as was mentioned above, and make it public as a single, simple package on PyPI. I propose the following design requirements: …
I started coding up / brainstorming a few things around this; here's what I have so far: …

Please do let me know what you think.
I am somewhat agnostic about whether we should have a separate package or an omnibus one. I will note that putting everything together still won't amount to a large amount of code, and that we can separate optional dependencies as …

Also, torchvision is one thing, but we also have to address the general problem of dataset management: if I want to use data that can be downloaded from URI X (regardless of its format), how do I get it where it needs to be in all the compute environments I want to run experiments in? Can I cache it? Where can I cache it? This is more or less what a milavision package has to do with the torchvision datasets, but it is more general than torchvision: someone who is using tf or jax for their project should have access to this functionality without having a torch dependency. But to avoid code duplication, wherever it is that we put this generic functionality, it's going to be a dependency of milavision. So now we have two packages.
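One way the optional-dependency split could look is sketched below; the package name, the extras, and the pinned dependencies are assumptions, not decisions that were made in this thread:

```python
# Hypothetical setup.py: one generic package whose framework support lives
# behind extras, so a tf/jax user never pulls in torch.
from setuptools import find_packages, setup

setup(
    name="milatools",
    version="0.0.1",
    packages=find_packages(),
    install_requires=["requests"],            # only the generic download/copy layer
    extras_require={
        "torch": ["torch", "torchvision"],    # installed via `pip install milatools[torch]`
    },
)
```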
Even if we don't have a given dataset, we should still override the default behavior to download it to SLURM_TMPDIR and, optionally (perhaps behind a flag), cache it in the user's scratch.
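A minimal sketch of that default behaviour is below; the helper name, the scratch location, and the `cache_in_scratch` flag are assumptions made only to illustrate the flow (check scratch, otherwise download to fast local disk, optionally keep a copy):

```python
import os
import shutil
import urllib.request


def fetch(url, name, cache_in_scratch=False):
    """Download `url` into SLURM_TMPDIR, reusing/filling a scratch cache."""
    tmpdir = os.environ.get("SLURM_TMPDIR", "/tmp")
    scratch = os.path.expanduser("~/scratch")          # assumed scratch location
    cached = os.path.join(scratch, "datasets", name)
    target = os.path.join(tmpdir, name)

    if os.path.exists(cached):                         # reuse a previous download
        shutil.copy(cached, target)
        return target

    urllib.request.urlretrieve(url, target)            # download to fast local disk
    if cache_in_scratch:                               # optionally keep a copy around
        os.makedirs(os.path.dirname(cached), exist_ok=True)
        shutil.copy(target, cached)
    return target
```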
I agree, but I'd argue that we need to nail the first problem before moving on to the second.
I'm more concerned about the level of commitment we're aiming at, not whether we should implement the loaders for the datasets we can. We could maybe have a look into the …

Writing a loader for those that would not be specific to a particular use case looks out of my reach, as I don't even know which files contain what data and whether they all need to be processed (extracted, copied, loaded, ...) at any given time.

With this in mind, I would be in favour of implementing a good base focused on providing the dataset in a location that makes sense (with cache and the other machinery), which looks to me like where we gain the most value: the next step of feeding the path into the existing torchvision datasets/loaders is not much more work for the user (or for us to implement a layer on top of it). Once the data is in place, writing a dataset class for it should also not be much more work for the user (or us). That is, until we start using a better dataset format where applicable (maybe something like bcachefs ;)).
Fair point, but in my mind the second problem is sort of a prerequisite for solving the first one. Possibly, though, I might be underestimating its relative complexity. I might try my hand at the general, non-torchvision problem and see how it goes.

You are quite right that making loaders for some datasets is beyond our expertise and/or may fail to match user requirements. In those situations, I agree with your assessment: what we want is pure filesystem/HTTP tooling that merely takes care of moving the "stuff" to the right place. I think that is both the least and the most we can do for these datasets.

One thing apparent in your examples that I hadn't thought of is that the user may only need a subset of the files, in which case we may only want to copy those over. I have some ideas about this that I might draft as code somewhere.
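For the subset-of-files case, something like a hypothetical `files=` glob filter could work; this is only a sketch, not a committed interface:

```python
import fnmatch
import os
import shutil


def copy_subset(src_root, dst_root, files=("*",)):
    """Copy only the entries of src_root whose names match one of the globs."""
    os.makedirs(dst_root, exist_ok=True)
    for entry in os.listdir(src_root):
        if any(fnmatch.fnmatch(entry, pattern) for pattern in files):
            src = os.path.join(src_root, entry)
            dst = os.path.join(dst_root, entry)
            if os.path.isdir(src):
                shutil.copytree(src, dst, dirs_exist_ok=True)
            else:
                shutil.copy2(src, dst)


# e.g. only the annotation files of a large multi-modal dataset:
# copy_subset("/network/datasets/something", os.environ["SLURM_TMPDIR"], files=["*.json"])
```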
Purpose
Provide an interface, akin to `from milatools.datasets.torch import ImageNet`, to acquire and use certain datasets in a location-aware way. For example, when running on the Mila cluster, it should copy ImageNet from its standard location in /network/datasets to the local scratch. Elsewhere, it should use the user-provided path (and, if necessary, download the data).

Given existing attempts in both #8 and #10 by @manuel-delverme and @Delaunay, I'm going to take a step back and create this issue so that we have a central way to discuss this. I want to make sure we are all on the same page.
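For illustration, the intended usage could look like the sketch below. Since the package does not exist yet, this is purely a mock-up of the proposed drop-in swap, and whether `root` can be omitted on known clusters is an open design question, not a decision:

```python
# Today: plain torchvision, with a hard-coded root the user must manage.
#   from torchvision.datasets import ImageNet
#   dataset = ImageNet("../data/imagenet", split="train")

# Proposed: same class name, imported from the location-aware package instead.
from milatools.datasets.torch import ImageNet  # proposed module, not yet published

# On the Mila cluster this would copy the data from /network/datasets to the
# local scratch behind the scenes; elsewhere it would fall back to a user path.
dataset = ImageNet(split="train")
```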
Requisites
- Mirrors the `torchvision` or `torchtext` interface, so all the user needs to do is swap out the import: …
- Supports important locations: the user's local machine, the Mila cluster, the CC cluster, and possibly others in the future. For this reason I believe the code should be modular with respect to the concept of "location" or "environment".
- Supports important frameworks: for now this is arguably just PyTorch, but we need to be careful not to induce unnecessary dependencies on a single framework. If, e.g., a large number of researchers move to Jax, we should be able to gracefully support it without clashing with PyTorch support.
Anyone is welcome to contribute to this discussion.