
refactor hashing/dedupe #299

Merged: 36 commits merged into jbsparrow:master from depended-fix, Nov 25, 2024

Conversation

@datawhores (Collaborator) commented Nov 17, 2024

This pull request is a revamp of the hashing/dedupe system.

I think it can be split into 2-3 major changes

  1. Redesigning the database tables to support multiple hash types per file. The hash types currently supported by the script are xxhash, md5, and sha256, but more hashes could be supported (a rough sketch of such a table follows below).
  • Supporting other hash types will allow preventing downloads from sites like coomer if we, for example, already have the download from another site like bunkrr.

  • Add a date column: this can be used to remove keep_previous and keep_current, and basically simplify the deduping process for the user.

  2. Revamp the settings.
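
As a rough illustration of the first change, here is a minimal sketch of a hash table keyed on (file, hash type) with a date column; the table and column names are placeholders, not necessarily what this PR implements:

```python
import sqlite3

# Illustrative schema only: one row per (file, hash_type) pair, plus a date
# column so dedupe can sort by download date instead of relying on the
# keep_previous / keep_current flags.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hash (
    folder            TEXT NOT NULL,
    download_filename TEXT NOT NULL,
    file_size         INTEGER,
    hash_type         TEXT NOT NULL,  -- 'xxh128', 'md5', 'sha256', ...
    hash              TEXT NOT NULL,
    file_date         INTEGER,        -- unix timestamp used when deduping by age
    PRIMARY KEY (folder, download_filename, hash_type)
);
"""

with sqlite3.connect("cyberdrop.db") as conn:
    conn.executescript(SCHEMA)
```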

@datawhores marked this pull request as draft November 17, 2024 19:24
@NTFSvolume (Collaborator) left a comment

I do think the options are too convoluted

In my opinion, hashing and deduplication should be independent. Users should be able to hash files without actually taking any action on them, just store the hashes in the database.

Using Enums for the options would be better. These could be the options (a sketch of how they might be declared follows at the end of this comment):

Hashing options

  • --hashing (type: Enum):

| Value | Enum | Default | Description |
| --- | --- | --- | --- |
| OFF | 0 | YES | Do not hash any file |
| IN_PLACE | 1 | | Hash a file immediately after its download finishes |
| POST_DOWNLOAD | 2 | | Hash all files at once, after all the downloads have finished |

  • --add_md5_hash (type: bool) (default: False)

  • --add_sha256_hash (type: bool) (default: False)

Dedupe options

  • --delete-if-hash-seen-before (type: bool) (default: False):

This will delete any newly downloaded file if the hash was already in the database. It will compare the file by all the enabled hashes, specified with the hashing option. This will do nothing if hashing is OFF.

This option will only take effect while CDL is on a run and it will only delete files downloaded in the current run cause realistically, there is never a case where you would want to delete files that you already downloaded. That is what the manual dedupe option is for.

  • --dedupe (type: Enum):

| Value | Enum | Default | Description |
| --- | --- | --- | --- |
| OFF | 0 | YES | Do nothing |
| KEEP_OLDEST | 1 | | Keep only one copy, the first one downloaded. The date to compare should be the file creation date. Any file that was not downloaded by CDL will be skipped and not taken into account for matches. |
| KEEP_NEWEST | 2 | | Keep only one copy, the last one downloaded. The date to compare should be the file creation date. Any file that was not downloaded by CDL will be skipped and not taken into account for matches. |
| KEEP_OLDEST_ALL | 3 | | Keep only one copy, the first one downloaded. The date to compare should be the file creation date. It will take into account every single file in the folder, even ones not downloaded by CDL. |
| KEEP_NEWEST_ALL | 4 | | Keep only one copy, the last one downloaded. The date to compare should be the file creation date. It will take into account every single file in the folder, even ones not downloaded by CDL. |

This option will only take effect when the user manually chooses to run the Scan folder and hash files option on the UI. This will do nothing if hashing is OFF. Whether hashing is IN_PLACE or POST_DOWNLOAD does not matter in this case; it just needs to be enabled.

  • --send-to-trash (type: bool) (default: True):

Whether to send deleted files to the trash or delete them permanently.
This option will be taken into account for both --delete-if-hash-seen-before and --dedupe.


I'm not familiar with the hashing section of the codebase so I'm not sure if this covers all the currently supported cases, but these should cover all the scenarios that a user may actually want to use.
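
A minimal sketch of how the proposed options could be declared in Python; the names mirror the lists above and are illustrative only, not the actual CDL settings code:

```python
from enum import IntEnum


class Hashing(IntEnum):
    OFF = 0            # do not hash any file (default)
    IN_PLACE = 1       # hash each file right after its download finishes
    POST_DOWNLOAD = 2  # hash all files at once after every download is done


class Dedupe(IntEnum):
    OFF = 0              # do nothing (default)
    KEEP_OLDEST = 1      # keep the first downloaded copy, skip files CDL did not download
    KEEP_NEWEST = 2      # keep the last downloaded copy, skip files CDL did not download
    KEEP_OLDEST_ALL = 3  # keep the first copy, consider every file in the folder
    KEEP_NEWEST_ALL = 4  # keep the last copy, consider every file in the folder


# Boolean flags from the proposal, with their suggested defaults.
add_md5_hash = False
add_sha256_hash = False
delete_if_hash_seen_before = False
send_to_trash = True
```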

@datawhores changed the title from "Depended fix" to "refactor hashing/dedupe" on Nov 19, 2024
@datawhores (Collaborator, Author) commented Nov 19, 2024

The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

KEEP_OLDEST
KEEP_NEWEST

I think these two options would serve many people well. Figuring out which files were downloaded by CDL would be challenging, and the original idea behind some of the dedupe settings was to allow files to be deduped even if they were moved from the original location.

With that said, for KEEP_OLDEST and KEEP_NEWEST there is a check that filters the initial list to ensure that all the files being removed exist on the system. Then the first element is skipped before going through the rest of the sorted list.

KEEP_OLDEST_ALL
KEEP_NEWEST_ALL

Same as above, but with the check for file existence removed. KEEP_NEWEST_ALL and KEEP_NEWEST might end up being the same.

But for KEEP_OLDEST_ALL this could prevent the same download from being seen again, so only content that is fresh to the user will be seen. It will also allow the user to move files to a new location without having to rehash them.
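
A sketch of the pass described above, assuming the database gives us (path, date) pairs that share a hash; the existence filter is the only difference between KEEP_OLDEST/KEEP_NEWEST and the *_ALL variants:

```python
from pathlib import Path


def files_to_delete(
    matches: list[tuple[Path, float]], keep_oldest: bool, check_exists: bool
) -> list[Path]:
    """Return the copies of one hash group that should be removed.

    matches: (path, download/creation date) pairs sharing the same hash.
    keep_oldest: True for KEEP_OLDEST(_ALL), False for KEEP_NEWEST(_ALL).
    check_exists: True for KEEP_OLDEST / KEEP_NEWEST, False for the *_ALL variants.
    """
    if check_exists:
        # KEEP_OLDEST / KEEP_NEWEST: only consider copies that still exist on disk.
        matches = [(path, date) for path, date in matches if path.exists()]
    # Sort so the copy to keep comes first, skip it, delete the rest.
    matches = sorted(matches, key=lambda m: m[1], reverse=not keep_oldest)
    return [path for path, _ in matches[1:]]
```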

@datawhores marked this pull request as ready for review November 19, 2024 07:57
@NTFSvolume (Collaborator) commented Nov 19, 2024

> The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

Don't really understand this. If I download a new copy from a new mirror in a forum, I would not want to keep that file, I would want to keep the file I already had. If I could skip the download altogether by the hash, even better.

> I think these two options would serve many people well. Figuring out which files were downloaded by CDL would be challenging, and the original idea behind some of the dedupe settings was to allow files to be deduped even if they were moved from the original location.

I didn't know this. I thought the file path was a unique key for hash lookups. If the file was moved, how do you know its hash?

> With that said, for KEEP_OLDEST and KEEP_NEWEST there is a check that filters the initial list to ensure that all the files being removed exist on the system. Then the first element is skipped before going through the rest of the sorted list.
>
> KEEP_OLDEST_ALL KEEP_NEWEST_ALL
>
> Same as above, but with the check for file existence removed. KEEP_NEWEST_ALL and KEEP_NEWEST might end up being the same.

If its path, size and date match an entry on the items table, then it was downloaded by CDL. If this will not be taken into account, KEEP_NEWEST_ALL is redundant and should be removed.

> But for KEEP_OLDEST_ALL this could prevent the same download from being seen again, so only content that is fresh to the user will be seen. It will also allow the user to move files to a new location without having to rehash them.

I don't understand this part. How would KEEP_OLDEST and KEEP_OLDEST_ALL be different?
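
For reference, the path/size/date check mentioned above could look roughly like the following sketch; `items` and the column names are placeholders, not the real CDL schema:

```python
import sqlite3


def downloaded_by_cdl(conn: sqlite3.Connection, path: str, size: int, date: int) -> bool:
    """Return True if a file with this path, size and date exists in the history table.

    'items', 'download_path', 'file_size' and 'file_date' are placeholder names,
    not the actual CDL schema.
    """
    row = conn.execute(
        "SELECT 1 FROM items WHERE download_path = ? AND file_size = ? AND file_date = ? LIMIT 1",
        (path, size, date),
    ).fetchone()
    return row is not None
```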

(Nine review comments on cyberdrop_dl/clients/hash_client.py, cyberdrop_dl/managers/config_manager.py, cyberdrop_dl/managers/manager.py, cyberdrop_dl/utils/args/browser_cookie_extraction.py and cyberdrop_dl/utils/data_enums_classes/hash.py were marked outdated and resolved.)
@jbsparrow (Owner) commented Nov 19, 2024

I think overall the hashing is very confusing, even after the proposed changes. Additionally, if the files are identical, why does it matter which one we keep? Keeping the oldest would probably make sense. I also don't understand how we can track the files' movements.

I think most users wouldn't be using all of the hashing & deduplication options, and may have trouble understanding them.

On the topic of multiple hash columns, I was thinking we could categorize downloads by the type of porn, which often appears to correspond to the hash used. Hentai seems to usually be hashed with md5, and normal pornography seems to be sha256.

Therefore, we can set a forum and host map for what they usually host, and also set a referrer map. E.g. gofile or bunkr may usually host normal porn, but if referred by f95zone for example, consider them to be hosting hentai, and calculate the md5 hash.

I'm not completely sure that makes sense, but I figure a duplicate between hentai and normal porn is extremely unlikely, and as a result we can skip some hashing.
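
A purely illustrative sketch of the host/referrer map idea; the site names and the assumption that hash type follows content category come from this comment, not from anything CDL currently does:

```python
# Hash type to compute, keyed by host, with referrer-based overrides.
# All mappings here are examples from the discussion, not real CDL config.
HOST_HASH_MAP = {
    "coomer": "sha256",
    "gofile": "sha256",
    "bunkr": "sha256",
}

REFERER_OVERRIDES = {
    "f95zone": "md5",  # links referred from f95zone treated as hentai -> md5
}


def pick_hash_type(host: str, referer: str | None) -> str:
    """Choose which extra hash to compute for a download."""
    if referer and referer in REFERER_OVERRIDES:
        return REFERER_OVERRIDES[referer]
    return HOST_HASH_MAP.get(host, "md5")
```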

@datawhores (Collaborator, Author) commented Nov 19, 2024

> The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

> Don't really understand this. If I download a new copy from a new mirror in a forum, I would not want to keep that file, I would want to keep the file I already had. If I could skip the download altogether by the hash, even better.

Yeah, I think you were right: the user would want to delete any new downloads if the hash was already in a previous run. The case I was referring to was the same hash appearing in the same run. In this case you would just pick one file to keep, and it probably wouldn't matter which one.

The only thing is if the user deletes a file and for some reason wants to get it back. There is no way to get the file again without using --ignore-history and turning off hashing.

And if we have a file existence check then the user would not be allowed to move files without rehashing.

> I also don't understand how we can track the files' movements.

We can only assume that it was moved or deleted. So it would be up to the user to decide if they want to do some sort of existence check before removing files.

> I didn't know this. I thought the file path was a unique key for hash lookups. If the file was moved, how do you know its hash?

We can search the database by hashes, and if the hash appears again, then we know that a file was downloaded with that hash at one point. Even if the file doesn't exist, it might be at a new location.

> I don't understand this part. How would KEEP_OLDEST and KEEP_OLDEST_ALL be different?

I think I can remove these from the main download script. The only difference was that KEEP_OLDEST_ALL was going to ignore whether a match existed on the system. As long as it was in the db, it was only going to keep the oldest entry and delete the rest.

> If its path, size and date match an entry on the items table, then it was downloaded by CDL

Yeah, that makes sense; the only thing is if the file was sorted by the script. Then we would not know the new location.

Also, I'm not sure if we can do better than the solutions on the market for something manual. The whole advantage of deduping or preventing downloads from the script would be that it would be automatic.
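
The hash lookup described above ("we can search the database by hashes") could be as simple as the following sketch; table and column names are placeholders:

```python
import sqlite3


def previous_locations(conn: sqlite3.Connection, hash_type: str, hash_value: str) -> list[str]:
    """Return every path that ever produced this hash, whether or not it still exists.

    Because the lookup is keyed on the hash, a match is found even if the
    original file was moved or deleted since it was downloaded.
    """
    rows = conn.execute(
        "SELECT folder || '/' || download_filename FROM hash WHERE hash_type = ? AND hash = ?",
        (hash_type, hash_value),
    ).fetchall()
    return [row[0] for row in rows]
```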

@NTFSvolume (Collaborator) commented

> I think overall the hashing is very confusing, even after the proposed changes. Additionally, if the files are identical, why does it matter which one we keep? Keeping the oldest would probably make sense. I also don't understand how we can track the files' movements.
>
> I think most users wouldn't be using all of the hashing & deduplication options, and may have trouble understanding them.

Keeping the oldest one is almost always better cause the user may already have organized the file and its current path is the "permanent" path.

> On the topic of multiple hash columns, I was thinking we could categorize downloads by the type of porn, which often appears to correspond to the hash used. Hentai seems to usually be hashed with md5, and normal pornography seems to be sha256.
>
> Therefore, we can set a forum and host map for what they usually host, and also set a referrer map. E.g. gofile or bunkr may usually host normal porn, but if referred by f95zone for example, consider them to be hosting hentai, and calculate the md5 hash.
>
> I'm not completely sure that makes sense, but I figure a duplicate between hentai and normal porn is extremely unlikely, and as a result we can skip some hashing.

Thinking about this, md5 actually needs to be enabled by default on every file cause that is the hash that almost everyone is currently using. Disabling md5 would render their current hashes table useless.

The performance hit of running xxh128 + md5 is basically the same as running just md5 cause xxhash is really fast. sha256, on the other hand, is even slower than md5, so it could be optional, but even if it is optional, it should be applied to every file no matter where it came from, because otherwise no match will ever be found. Ex: if sha256 is only computed when downloading from coomer, it will never find a duplicate match cause all files on coomer have a unique hash. The URL itself has the hash, so skipping by history is enough to not download duplicate files.
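
A sketch of why running xxh128 alongside md5 is nearly free: the file is read once and every enabled digest is fed the same chunks. `xxhash` here is the third-party Python package; whether CDL structures it exactly like this is an assumption:

```python
import hashlib
from pathlib import Path

import xxhash  # third-party package


def hash_file(path: Path, with_md5: bool = True, with_sha256: bool = False) -> dict[str, str]:
    """Read the file once and feed every enabled digest from the same chunks."""
    digests = {"xxh128": xxhash.xxh128()}
    if with_md5:
        digests["md5"] = hashlib.md5()
    if with_sha256:
        digests["sha256"] = hashlib.sha256()
    with path.open("rb") as fp:
        while chunk := fp.read(1024 * 1024):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```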

> The only thing is if the user deletes a file and for some reason wants to get it back. There is no way to get the file again without using --ignore-history and turning off hashing.

--ignore-history should be enough to download the file again. I think the main problem is that the hash is the current unique key of the table. But the unique key should be the file path, not the hash.

> And if we have a file existence check then the user would not be allowed to move files without rehashing.

Moving the files should require re-hashing. I don't know how we are tracking file movement right now, but best case scenario we are guessing, cause it's impossible to know.

> Also, I'm not sure if we can do better than the solutions on the market for something manual. The whole advantage of deduping or preventing downloads from the script would be that it would be automatic.

I think it is better to just remove manual deduplication altogether. Only keep hashing (OFF, IN_PLACE and POST_DOWNLOAD) and --delete-if-hash-seen-before.

CDL will skip files by URL and delete files by hash if they were previously downloaded; it does not matter if the original file exists on the file system or not. The file was downloaded by CDL at some point, that's why it is being deleted. It will only delete files downloaded in the current run. Using --ignore-history will also disable --delete-if-hash-seen-before.

The user can use an external dedupe tool to manage their files if they want more control, but CDL will have a clear purpose:

Do NOT download files downloaded before.

CDL will accomplish this by either skipping by URL or deleting by hash.
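
A sketch of that flow for a single file downloaded in the current run; `send2trash` is the usual library for trashing files, but whether CDL uses it, and the helper names here, are assumptions:

```python
from pathlib import Path

from send2trash import send2trash  # third-party; assumption that CDL would use it


def post_download_check(path: Path, seen_before: bool,
                        delete_if_seen: bool, to_trash: bool) -> bool:
    """Delete a file downloaded in the current run if its hash was already known.

    Returns True if the file was removed. `seen_before` would come from a hash
    lookup against previous runs; --ignore-history disables this check entirely.
    """
    if not (delete_if_seen and seen_before):
        return False
    if to_trash:
        send2trash(str(path))  # --send-to-trash (default)
    else:
        path.unlink()          # permanent deletion
    return True
```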

@datawhores (Collaborator, Author) commented Nov 21, 2024

> The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

> Don't really understand this. If I download a new copy from a new mirror in a forum, I would not want to keep that file, I would want to keep the file I already had. If I could skip the download altogether by the hash, even better.

> I think these two options would serve many people well. Figuring out which files were downloaded by CDL would be challenging, and the original idea behind some of the dedupe settings was to allow files to be deduped even if they were moved from the original location.

> I didn't know this. I thought the file path was a unique key for hash lookups. If the file was moved, how do you know its hash?

> With that said, for KEEP_OLDEST and KEEP_NEWEST there is a check that filters the initial list to ensure that all the files being removed exist on the system. Then the first element is skipped before going through the rest of the sorted list.
>
> KEEP_OLDEST_ALL KEEP_NEWEST_ALL
>
> Same as above, but with the check for file existence removed. KEEP_NEWEST_ALL and KEEP_NEWEST might end up being the same.

> If its path, size and date match an entry on the items table, then it was downloaded by CDL. If this will not be taken into account, KEEP_NEWEST_ALL is redundant and should be removed.

> But for KEEP_OLDEST_ALL this could prevent the same download from being seen again, so only content that is fresh to the user will be seen. It will also allow the user to move files to a new location without having to rehash them.

> I don't understand this part. How would KEEP_OLDEST and KEEP_OLDEST_ALL be different?

There is an option right now to manually scan; however, it does not remove entries, it just adds them.

I think the automatic hashing and removal process is much simpler and covers the mainstream use case.

@NTFSvolume merged commit 04effd5 into jbsparrow:master Nov 25, 2024
2 of 3 checks passed
@datawhores deleted the depended-fix branch December 1, 2024 22:07