Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added new parameter 'compute_key' #390

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open

Conversation

edwardguil
Copy link
Contributor

@edwardguil edwardguil commented Jan 14, 2024

Added a new parameter 'compute_key', that allows users to override the default function for computing a sample key (compute_key). This should allow finer control of the output format of the downloaded dataset.

An example use case is the following:

If the dataset had some additional_data which was specified, one of which was a uid across the dataset, a user could simply do the following:

def compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    return str(additional_columns['uid'])

Then pass this function to the downloader. Hence changing the default output from:

  • output_folder
    • 00000
      • 000000000.jpg
      • 000000000.txt
      • 000000001.jpg
      • ...

To:

  • output_folder
    • 00000
      • some_uid_1.jpg
      • some_uid_1.txt
      • some_uid_2.jpg
      • ...

As far as I am aware this customization still aligns with Web-dataset principles.

…default function for computing a sample key. Added information on this parameter to the README.
@rom1504
Copy link
Owner

rom1504 commented Jan 14, 2024

Can you say more on how you build these uuids?
Is it useful to you to precompute them in advance?

I have been thinking to simply generate some uuid during download instead of using these shard id prefixed numbers

@edwardguil
Copy link
Contributor Author

edwardguil commented Jan 14, 2024

Can you say more on how you build these uuids?

As an example, using the thread safe uuid library:

import uuid
compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    unique_id = uuid.uuid4()
    return f"{unique_id}"

Or combining this with an additional column:

pairs = {}
compute_key(key, shard_id, oom_sample_per_shard, oom_shard_count, additional_columns):
    unique_id = uuid.uuid4()
    return f"{additional_columns['someColumn']}_{unique_id}"

I think the point is too allow 'advanced' users to decide what their approach to this is.

Is it useful to you to precompute them in advance?

Yes, pre-computing the uuids in this case is most appropriate. Storing them within the input file (csv etc), then passing them through additional_parameters. This helps avoid the case of race conditions, for people unfamiliar with it.

I have been thinking to simply generate some uuid during download instead of using these shard id prefixed numbers

Yes, that could be an appropriate solution, I do think however your current approach works well, and is clear. In saying this a true UUID would confirm better to webdataset standards, as two separate runs of img2dataset into distinct folders do run the risk of of overlapping basename + key pairs (something I have come across), hence the PR.

If this pr goes ahead, I do think that the documentation should mention the function (compute_key) needs to be thread safe. This could complicate it for users, so an alternative solution would be to pass an additional parameter "suffix" or "prefix" to add to the output keys, or 'uuid' to override the keys, but ultimately this is not as optimal as allowing full key changes.

@edwardguil
Copy link
Contributor Author

Hey @rom1504, when you get the chance. Any feedback on this PR?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants