Replace specific opt-out support with datadiligence package for more general opt-out support #312

Padge91 · 2023-05-16T18:08:42Z

The logic currently implemented to respect opt-outs (checking HTTP headers) is insufficient for the advancing opt-out landscape. Other methods should also be respected (e.g. HaveIBeenTrained (HIBT), Content Authenticity Initiative (CAI), ArtStation opt-outs, etc.). The datadiligence package aims to encapsulate and manage these methods as standards are introduced and evolve.

This repository would benefit from shifting the responsibility of respecting opt-outs to a dedicated package. This PR replaces the existing opt-out logic in this repository with calls to the datadiligence package. These changes should free the maintainers of img2dataset to focus on their goals and priorities without needing to revisit this logic in the future.

This PR does three things:

- Replaces the HTTP header validation logic with function calls to the datadiligence package, which perform similar logic.
- Adds a pre-processing step to optionally call the Spawning API to filter opt-outs (those made through HIBT, Spawning content partners such as ArtStation, etc.).
- Adds a more general and consistent argument to respect opt-outs, respect_optouts, which is also controllable via the disallowed_header_directives argument for backward compatibility. The new argument defaults to True to maintain parity with current behavior.
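For readers unfamiliar with the header-based check being replaced, here is a minimal sketch of that kind of logic. It is not the code from img2dataset or datadiligence: the function name `header_opts_out`, the `DISALLOWED` set, and the simplified parsing of `X-Robots-Tag` values are all assumptions made for illustration.

```python
from typing import Collection, Mapping, Optional

# Illustrative directive names (an assumption); use whatever set you honor.
DISALLOWED = frozenset({"noai", "noimageai", "noindex", "noimageindex"})


def header_opts_out(
    headers: Mapping[str, str],
    user_agent_token: Optional[str] = None,
    disallowed: Collection[str] = DISALLOWED,
) -> bool:
    """Return True if an X-Robots-Tag header carries a disallowed directive.

    Simplified parsing: a value is either "directive, directive" or
    "ua-token: directive, directive". Directives that themselves contain
    a colon (e.g. unavailable_after) are not handled in this sketch.
    """
    wanted_token = (user_agent_token or "").lower()
    for name, value in headers.items():
        if name.lower() != "x-robots-tag":
            continue
        prefix, sep, rest = value.partition(":")
        if sep:
            # The directives are scoped to one user agent token.
            token = prefix.strip().lower()
            directive_text = rest
        else:
            # No token: the directives apply to every client.
            token = None
            directive_text = value
        applies = token is None or token == wanted_token
        directives = {d.strip().lower() for d in directive_text.split(",")}
        if applies and directives & set(disallowed):
            return True
    return False


print(header_opts_out({"X-Robots-Tag": "noai, noimageai"}))
print(header_opts_out({"X-Robots-Tag": "otherbot: noai"}, "img2dataset"))
```

The first call reports an opt-out because the directives apply to all clients; the second does not, because the directive is scoped to a different user agent token.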

Performance

Performance metrics were requested on the previous PR for these changes, #218, so I am providing them here as well. The tests used 1M records from the laion-art dataset and were run in parallel on two separate machines, both AWS EC2 m6a.2xlarge (8 vCPU, 32 GB RAM).

The command run on both machines was identical:

time python3 -m img2dataset.main --url_list ./1m.parquet --input-format "parquet" --url_col "URL" \
--caption_col "TEXT" --output_format webdataset --output_folder test --processes_count 8 \
--thread_count 32 --image_size 384 --resize_only_if_bigger=True --resize_mode="keep_ratio" \
 --skip_reencode=True

With Spawning API enabled

This test was run at ~9:20 AM EST May 15th, 2023.

This PR:
real 48m21.964s
user 297m26.488s
sys 22m21.457s

Current head:
real 47m54.071s
user 299m17.610s
sys 22m44.699s

With the Spawning API preprocessing step enabled, overall runtime increased by about 28 seconds (roughly 1%, from 47m54s to 48m22s). Because this step only executes when the required environment variable is set, I believe the performance impact is acceptable: a developer must intentionally opt in to this feature to experience the (relatively small) slowdown.
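The opt-in behavior described above can be sketched as follows. This is only an illustration of gating a batched pre-filter on an environment variable: the variable name `EXAMPLE_OPTOUT_API_KEY`, the function `filter_opted_out`, and the injected `lookup` callable (standing in for a remote opt-out service) are hypothetical and are not the names used by datadiligence or the Spawning API.

```python
import os
from typing import Callable, Iterable, List, Mapping, Sequence, Set

# Hypothetical variable name, used for illustration only.
API_KEY_ENV = "EXAMPLE_OPTOUT_API_KEY"


def filter_opted_out(
    urls: Iterable[str],
    lookup: Callable[[Sequence[str]], Set[str]],
    batch_size: int = 1000,
    env: Mapping[str, str] = os.environ,
) -> List[str]:
    """Drop opted-out URLs, but only when the API key variable is set.

    `lookup` receives one batch of URLs and returns the subset that has
    opted out. When the variable is unset or empty, the step is skipped
    and every URL is returned unchanged.
    """
    url_list = list(urls)
    if not env.get(API_KEY_ENV):
        return url_list  # feature not enabled: no lookups, no overhead
    kept: List[str] = []
    for start in range(0, len(url_list), batch_size):
        batch = url_list[start:start + batch_size]
        opted_out = lookup(batch)
        kept.extend(u for u in batch if u not in opted_out)
    return kept


def fake_lookup(batch: Sequence[str]) -> Set[str]:
    # Stand-in for a remote call: pretend URLs containing "optout" opted out.
    return {u for u in batch if "optout" in u}


urls = ["https://a.example/1.jpg", "https://b.example/optout.jpg"]
print(filter_opted_out(urls, fake_lookup, env={}))
print(filter_opted_out(urls, fake_lookup, env={API_KEY_ENV: "key"}))
```

Passing the lookup in as a callable keeps the sketch runnable without network access, and batching keeps the number of remote calls small relative to the number of URLs.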

Without Spawning API

This test was run at ~noon EST May 15th, 2023. In this test, the Spawning API environment variable was not set, and thus the preprocessing step was skipped.

This PR:
real 44m19.208s
user 299m5.399s
sys 22m28.679s

Current head:
real 44m14.070s
user 302m57.648s
sys 23m3.228s

The difference in runtime is negligible (about 5 seconds, or 0.2%), since without the preprocessing step this PR performs largely the same logic as the current head.
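To make the comparison reproducible, the relative overhead can be computed directly from the `real` wall-clock values reported above. The helper names `to_seconds` and `overhead_percent` are introduced here for illustration only.

```python
import re


def to_seconds(stamp: str) -> float:
    """Convert a `time` value such as '48m21.964s' to seconds."""
    match = re.fullmatch(r"(\d+)m([\d.]+)s", stamp)
    if match is None:
        raise ValueError(f"unrecognized time format: {stamp!r}")
    return int(match.group(1)) * 60 + float(match.group(2))


def overhead_percent(pr_time: str, head_time: str) -> float:
    """Relative wall-clock increase of the PR over the current head, in percent."""
    head = to_seconds(head_time)
    return (to_seconds(pr_time) - head) / head * 100


# `real` values from the two test runs reported in this PR.
print(f"With Spawning API:    {overhead_percent('48m21.964s', '47m54.071s'):.2f}%")  # 0.97%
print(f"Without Spawning API: {overhead_percent('44m19.208s', '44m14.070s'):.2f}%")  # 0.19%
```

Only the `real` figures are compared; in both runs the `user` and `sys` times are slightly lower for this PR than for the current head.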

Padge91 added 6 commits May 15, 2023 08:25

Integrating datadiligence package to replace the disallowed_header_directives and include preprocessing with Spawning API.

5394256

Updating count of records to use length of remaining urls.

5ca0d34

Moving datadiligence function call to inside respect_optouts check.

9c95243

Increasing version of datadiligence package. Fixing linter issue in downloader.

a3bded3

Fixing black formatting issues.

9cd4b2c

Updated argument to pass in response object, so body can also be used to digest file metadata.

dba4308

This was referenced Jul 11, 2023

Implement the W3C TDM Reservation Protocol and enable a more standard opt-out mechanism #308

Open

Updated img2dataset to pull from the Spawning-Inc fork mlfoundations/datacomp#29

Open

rom1504 added the filtering label Jan 9, 2024

Added logic to accept source.plus API keys and forward them to the export endpoints. Also added logic to properly handle redirects.

4a91987

solonovamax mentioned this pull request Oct 16, 2024

Use a Reasonable User Agent #436

Open

Fixing simple mistake. Could use in, but that should be slower than just checking the first index.

3fa8cce
