Added new parameter 'compute_key' #390
…default function for computing a sample key. Added information on this parameter to the README.
Can you say more on how you build these UUIDs? I have been thinking of simply generating a UUID during download instead of using these shard-id-prefixed numbers.
As an example, using the thread-safe uuid library:

    import uuid

    def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
        # A random UUID4 replaces the shard-prefixed number entirely.
        unique_id = uuid.uuid4()
        return f"{unique_id}"

Or combining this with an additional column:

    pairs = {}  # not used in this snippet

    def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
        # Prefix the random UUID4 with a value taken from the sample's row.
        unique_id = uuid.uuid4()
        return f"{additional_columns['someColumn']}_{unique_id}"

I think the point is to allow 'advanced' users to decide what their approach to this is.
Yes, pre-computing the UUIDs is the most appropriate approach in this case: store them in the input file (CSV etc.), then pass them through additional_parameters. This avoids race conditions, which helps people unfamiliar with them.
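A minimal sketch of that pre-computation step, using only the standard library. The column name `uid`, the helper `add_uuid_column`, and the sample URLs are hypothetical and chosen for illustration; nothing here is part of img2dataset itself.

```python
import csv
import io
import uuid


def add_uuid_column(rows, column="uid"):
    """Return copies of the input rows with a fresh UUID4 string stored under `column`."""
    return [{**row, column: str(uuid.uuid4())} for row in rows]


# Hypothetical input: a tiny in-memory URL list standing in for the real CSV file.
source = io.StringIO(
    "url,caption\n"
    "http://example.com/a.jpg,first image\n"
    "http://example.com/b.jpg,second image\n"
)
rows = add_uuid_column(list(csv.DictReader(source)))

# Write the augmented rows back out; this is the file the downloader would read.
out = io.StringIO()
writer = csv.DictWriter(out, fieldnames=["url", "caption", "uid"])
writer.writeheader()
writer.writerows(rows)
print(out.getvalue())
```

Because the identifiers are generated once, before any download workers start, the key function only has to read a value from the row and needs no shared state.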
Yes, that could be an appropriate solution. I do think, however, that your current approach works well and is clear. That said, a true UUID would conform better to webdataset standards, as two separate runs of img2dataset into distinct folders run the risk of overlapping basename + key pairs (something I have come across), hence the PR. If this PR goes ahead, I do think the documentation should mention that the function (compute_key) needs to be thread safe. This could complicate things for users, so an alternative solution would be to pass an additional parameter "suffix" or "prefix" to add to the output keys, or 'uuid' to override the keys, but ultimately this is less flexible than allowing full key changes.
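The "prefix" alternative mentioned above could be sketched as a small wrapper around any key function. This assumes the five-argument signature used earlier in this thread; `make_prefixed` and `numeric_key` are hypothetical names, and `numeric_key` is only a stand-in for a run-local numeric key, not the library's actual implementation.

```python
def make_prefixed(compute_key, prefix):
    """Wrap a key function so every key it returns starts with `prefix`."""

    def prefixed(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns=None):
        base = compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns)
        return f"{prefix}_{base}"

    return prefixed


def numeric_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns=None):
    # Stand-in: zero-padded shard id followed by the zero-padded sample index.
    return f"{shard_id:0{oom_shard_count}d}{key:0{oom_sample_per_shard}d}"


run_a = make_prefixed(numeric_key, "runA")
run_b = make_prefixed(numeric_key, "runB")

print(run_a(7, 3, 4, 5))  # runA_000030007
print(run_b(7, 3, 4, 5))  # runB_000030007
```

The wrapper keeps no mutable state, so it can be called from several threads at once, and two runs with different prefixes no longer produce the same key for the same shard and sample index.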
Hey @rom1504, when you get the chance, any feedback on this PR?
Added a new parameter 'compute_key' that allows users to override the default function for computing a sample key (compute_key). This should allow finer control over the output format of the downloaded dataset.
An example use case is the following:
If the dataset has some additional_data specified, one field of which is a uid that is unique across the dataset, a user could simply do the following:
Then pass this function to the downloader, changing the default output from:
To:
As far as I am aware, this customization still aligns with WebDataset principles.
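As a rough sketch of the use case described above, a key function that reuses a dataset-wide uid might look like the following. It assumes the five-argument signature from this thread and a hypothetical column named `uid`; the validation step is an added precaution rather than something the PR specifies, based on WebDataset treating everything after the first "." in a file's basename as the extension.

```python
def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    """Use a dataset-wide uid column as the sample key instead of a numeric key."""
    uid = str(additional_columns["uid"])  # "uid" is a hypothetical column name
    # Keep dots and path separators out of the key so that the basename of each
    # output file still maps to exactly one sample.
    if not uid or "." in uid or "/" in uid:
        raise ValueError(f"uid {uid!r} is not usable as a sample key")
    return uid


print(compute_key(0, 0, 4, 5, {"uid": "img_00042"}))  # img_00042
```

The function only reads from its arguments, so it is safe to call from multiple download threads.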