Replace specific opt-out support with datadiligence package for more general opt-out support #312
The logic currently implemented to respect opt-outs (checking HTTP headers) is insufficient for the evolving opt-out landscape. Other methods should also be respected (e.g., HaveIBeenTrained (HIBT), Content Authenticity Initiative (CAI), and ArtStation opt-outs). The datadiligence package aims to encapsulate and manage these methods as standards are introduced and evolve.
This repository would benefit from shifting the responsibility of respecting opt-outs to a dedicated package. This PR replaces the existing opt-out logic in this repository with calls to the datadiligence package. These changes should free the maintainers of img2dataset to focus on their goals and priorities without needing to revisit this logic in the future.
This PR primarily does a few things:
respect_optouts, which is also controllable via the disallowed_header_directives argument for backwards compatibility. The new argument default is set to True to maintain parity with current behavior.
Performance
In the previous PR for these changes, #218, performance metrics were requested, so I will provide them here as well. This test was run with 1M records from the laion-art dataset. The tests were run in parallel on two separate machines, both AWS EC2 m6a.2xlarge (8 vCPU, 32GB RAM).
The command run on both machines was identical:
With Spawning API enabled
This test was run at ~9:20 AM EST May 15th, 2023.
This PR:
real 48m21.964s
user 297m26.488s
sys 22m21.457s
Current head:
real 47m54.071s
user 299m17.610s
sys 22m44.699s
This is an increase of about 28 seconds in wall-clock time, roughly 1% (well under 2%), with the Spawning API (preprocessing step) enabled. Because this step only executes when the required environment variable is set, I believe the performance impact is acceptable: a developer must intentionally enable this feature to experience the (relatively small) impact.
Without Spawning API
This test was run at ~noon EST May 15th, 2023. In this test, the Spawning API environment variable was not set, and thus the preprocessing step was skipped.
This PR:
real 44m19.208s
user 299m5.399s
sys 22m28.679s
Current head:
real 44m14.070s
user 302m57.648s
sys 23m3.228s
The difference in runtime is negligible (about 5 seconds of wall-clock time), since with the preprocessing step skipped this PR performs largely the same logic as the current head.
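For anyone who wants to re-check the comparison, here is a minimal sketch that converts the `real` values reported above into seconds and computes the relative wall-clock overhead of this PR against the current head. The helper names parse_time and overhead_percent are made up for this illustration and are not part of img2dataset or datadiligence.

```python
import re


def parse_time(value: str) -> float:
    """Convert a `time`-style duration such as '48m21.964s' to seconds."""
    match = re.fullmatch(r"(\d+)m(\d+(?:\.\d+)?)s", value.strip())
    if match is None:
        raise ValueError(f"unrecognized duration: {value!r}")
    minutes, seconds = match.groups()
    return int(minutes) * 60 + float(seconds)


def overhead_percent(pr: str, head: str) -> float:
    """Relative wall-clock overhead of the PR over the current head, in percent."""
    pr_s, head_s = parse_time(pr), parse_time(head)
    return (pr_s - head_s) / head_s * 100


# `real` values copied from the two test runs above.
with_api = overhead_percent("48m21.964s", "47m54.071s")
without_api = overhead_percent("44m19.208s", "44m14.070s")

print(f"With Spawning API:    {with_api:.2f}%")
print(f"Without Spawning API: {without_api:.2f}%")
```

Only the `real` (wall-clock) figures are compared here; the `user` and `sys` totals are summed across cores and were slightly lower for this PR in both runs, so a single run per configuration should be read as indicative rather than conclusive.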