
refactor hashing/dedupe #299

Merged: 36 commits merged into jbsparrow:master from depended-fix, Nov 25, 2024

Conversation

@datawhores (Collaborator) commented Nov 17, 2024

This pull request is a revamp of the hashing/dedupe system.

I think it can be split into 2-3 major changes

  1. Redesigning the database tables to support multiple hash types per file. The hash types currently supported by the script are xxhash, md5, and sha256, but more hashes could be supported (a rough sketch of such a table follows below).
  • Supporting other hash types will allow preventing downloads from sites like coomer if we, for example, already have the download from another site like bunkrr.

  • Add a date column: this can be used to remove keep_previous and keep_current, and basically simplify the deduping process for the user.

  2. Revamp the settings.
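
As a rough illustration of the first change, here is a minimal sketch of a hash table keyed on (file, hash type) with a date column; the table and column names are placeholders, not necessarily what this PR implements:

```python
import sqlite3

# Illustrative schema only: one row per (file, hash_type) pair, plus a date
# column so dedupe can sort by download date instead of relying on the
# keep_previous / keep_current flags.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hash (
    folder            TEXT NOT NULL,
    download_filename TEXT NOT NULL,
    file_size         INTEGER,
    hash_type         TEXT NOT NULL,  -- 'xxh128', 'md5', 'sha256', ...
    hash              TEXT NOT NULL,
    file_date         INTEGER,        -- unix timestamp used when deduping by age
    PRIMARY KEY (folder, download_filename, hash_type)
);
"""

with sqlite3.connect("cyberdrop.db") as conn:
    conn.executescript(SCHEMA)
```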

@datawhores marked this pull request as draft November 17, 2024 19:24
@NTFSvolume (Collaborator) left a comment

I do think the options are too convoluted

In my opinion, hashing and deduplication should be independent. Users should be able to hash files without actually taking any action on them, just store the hashes in the database.

Using Enums for the options would be better. These could be the options (a sketch of how they might be declared follows at the end of this comment):

Hashing options

  • --hashing (type: Enum):

| Value | Enum | Default | Description |
| --- | --- | --- | --- |
| OFF | 0 | YES | Do not hash any file |
| IN_PLACE | 1 | | Hash a file immediately after its download finishes |
| POST_DOWNLOAD | 2 | | Hash all files at once, after all the downloads have finished |

  • --add_md5_hash (type: bool) (default: False)

  • --add_sha256_hash (type: bool) (default: False)

Dedupe options

  • --delete-if-hash-seen-before (type: bool) (default: False):

This will delete any newly downloaded file if the hash was already in the database. It will compare the file by all the enabled hashes, specified with the hashing option. This will do nothing if hashing is OFF.

This option will only take effect while CDL is on a run and it will only delete files downloaded in the current run cause realistically, there is never a case where you would want to delete files that you already downloaded. That is what the manual dedupe option is for.

  • --dedupe (type: Enum):

| Value | Enum | Default | Description |
| --- | --- | --- | --- |
| OFF | 0 | YES | Do nothing |
| KEEP_OLDEST | 1 | | Keep only one copy, the first one downloaded. The date to compare should be the file creation date. Any file that was not downloaded by CDL will be skipped and not taken into account for matches. |
| KEEP_NEWEST | 2 | | Keep only one copy, the last one downloaded. The date to compare should be the file creation date. Any file that was not downloaded by CDL will be skipped and not taken into account for matches. |
| KEEP_OLDEST_ALL | 3 | | Keep only one copy, the first one downloaded. The date to compare should be the file creation date. It will take into account every single file in the folder, even ones not downloaded by CDL. |
| KEEP_NEWEST_ALL | 4 | | Keep only one copy, the last one downloaded. The date to compare should be the file creation date. It will take into account every single file in the folder, even ones not downloaded by CDL. |

This option will only take effect when the user manually chooses to run the Scan folder and hash files option on the UI. This will do nothing if hashing is OFF. Whether hashing is IN_PLACE or POST_DOWNLOAD does not matter in this case; it just needs to be enabled.

  • --send-to-trash (type: bool) (default: True):

Whether to send deleted files to the trash or delete them permanently.
This option will be taken into account for both --delete-if-hash-seen-before and --dedupe.


I'm not familiar with the hashing section of the codebase so I'm not sure if this covers all the currently supported cases, but these should cover all the scenarios that a user may actually want to use.
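
A minimal sketch of how the proposed options could be declared in Python; the names mirror the lists above and are illustrative only, not the actual CDL settings code:

```python
from enum import IntEnum


class Hashing(IntEnum):
    OFF = 0            # do not hash any file (default)
    IN_PLACE = 1       # hash each file right after its download finishes
    POST_DOWNLOAD = 2  # hash all files at once after every download is done


class Dedupe(IntEnum):
    OFF = 0              # do nothing (default)
    KEEP_OLDEST = 1      # keep the first downloaded copy, skip files CDL did not download
    KEEP_NEWEST = 2      # keep the last downloaded copy, skip files CDL did not download
    KEEP_OLDEST_ALL = 3  # keep the first copy, consider every file in the folder
    KEEP_NEWEST_ALL = 4  # keep the last copy, consider every file in the folder


# Boolean flags from the proposal, with their suggested defaults.
add_md5_hash = False
add_sha256_hash = False
delete_if_hash_seen_before = False
send_to_trash = True
```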

@datawhores changed the title from "Depended fix" to "refactor hashing/dedupe" on Nov 19, 2024
@datawhores (Collaborator, Author) commented Nov 19, 2024

The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

KEEP_OLDEST
KEEP_NEWEST

I think these two options would serve many people well. Figuring out which files were downloaded by CDL would be challenging, and the original idea behind some of the dedupe settings was to allow files to be deduped even if they were moved from the original location.

With that said, for KEEP_OLDEST and KEEP_NEWEST there is a check that filters the initial list to ensure that all the files being removed exist on the system. Then the first element is skipped before going through the rest of the sorted list.

KEEP_OLDEST_ALL
KEEP_NEWEST_ALL

Same as above, but with the check for file existence removed. KEEP_NEWEST_ALL and KEEP_NEWEST might end up being the same.

But for KEEP_OLDEST_ALL this could prevent the same download from being seen again, so only content that is fresh to the user will be seen. It will also allow the user to move files to a new location without having to rehash them.
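
A sketch of the pass described above, assuming the database gives us (path, date) pairs that share a hash; the existence filter is the only difference between KEEP_OLDEST/KEEP_NEWEST and the *_ALL variants:

```python
from pathlib import Path


def files_to_delete(
    matches: list[tuple[Path, float]], keep_oldest: bool, check_exists: bool
) -> list[Path]:
    """Return the copies of one hash group that should be removed.

    matches: (path, download/creation date) pairs sharing the same hash.
    keep_oldest: True for KEEP_OLDEST(_ALL), False for KEEP_NEWEST(_ALL).
    check_exists: True for KEEP_OLDEST / KEEP_NEWEST, False for the *_ALL variants.
    """
    if check_exists:
        # KEEP_OLDEST / KEEP_NEWEST: only consider copies that still exist on disk.
        matches = [(path, date) for path, date in matches if path.exists()]
    # Sort so the copy to keep comes first, skip it, delete the rest.
    matches = sorted(matches, key=lambda m: m[1], reverse=not keep_oldest)
    return [path for path, _ in matches[1:]]
```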

@datawhores marked this pull request as ready for review November 19, 2024 07:57
@NTFSvolume (Collaborator) commented Nov 19, 2024

> The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

Don't really understand this. If I download a new copy from a new mirror in a forum, I would not want to keep that file, I would want to keep the file I already had. If I could skip the download altogether by the hash, even better.

> I think these two options would serve many people well. Figuring out which files were downloaded by CDL would be challenging, and the original idea behind some of the dedupe settings was to allow files to be deduped even if they were moved from the original location.

I didn't know this. I thought the file path was a unique key for hash lookups. If the file was moved, how do you know its hash?

> With that said, for KEEP_OLDEST and KEEP_NEWEST there is a check that filters the initial list to ensure that all the files being removed exist on the system. Then the first element is skipped before going through the rest of the sorted list.
>
> KEEP_OLDEST_ALL KEEP_NEWEST_ALL
>
> Same as above, but with the check for file existence removed. KEEP_NEWEST_ALL and KEEP_NEWEST might end up being the same.

If its path, size and date match an entry on the items table, then it was downloaded by CDL. If this will not be taken into account, KEEP_NEWEST_ALL is redundant and should be removed.

> But for KEEP_OLDEST_ALL this could prevent the same download from being seen again, so only content that is fresh to the user will be seen. It will also allow the user to move files to a new location without having to rehash them.

I don't understand this part. How would KEEP_OLDEST and KEEP_OLDEST_ALL be different?
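
For reference, the path/size/date check mentioned above could look roughly like the following sketch; `items` and the column names are placeholders, not the real CDL schema:

```python
import sqlite3


def downloaded_by_cdl(conn: sqlite3.Connection, path: str, size: int, date: int) -> bool:
    """Return True if a file with this path, size and date exists in the history table.

    'items', 'download_path', 'file_size' and 'file_date' are placeholder names,
    not the actual CDL schema.
    """
    row = conn.execute(
        "SELECT 1 FROM items WHERE download_path = ? AND file_size = ? AND file_date = ? LIMIT 1",
        (path, size, date),
    ).fetchone()
    return row is not None
```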

(Nine review comments on cyberdrop_dl/clients/hash_client.py, cyberdrop_dl/managers/config_manager.py, cyberdrop_dl/managers/manager.py, cyberdrop_dl/utils/args/browser_cookie_extraction.py and cyberdrop_dl/utils/data_enums_classes/hash.py were marked outdated and resolved.)
@jbsparrow (Owner) commented Nov 19, 2024

I think overall the hashing is very confusing, even after the proposed changes. Additionally, if the files are identical, why does it matter which one we keep? Keeping the oldest would probably make sense. I also don't understand how we can track the files' movements.

I think most users wouldn't be using all of the hashing & deduplication options, and may have trouble understanding them.

On the topic of multiple hash columns, I was thinking we could categorize downloads by the type of porn, which often appears to correspond to the hash used. Hentai seems to usually be hashed with md5, and normal pornography seems to be sha256.

Therefore, we can set a forum and host map for what they usually host, and also set a referrer map. E.g. gofile or bunkr may usually host normal porn, but if referred by f95zone for example, consider them to be hosting hentai, and calculate the md5 hash.

I'm not completely sure that makes sense, but I figure a duplicate between hentai and normal porn is extremely unlikely, and as a result we can skip some hashing.
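
A purely illustrative sketch of the host/referrer map idea; the site names and the assumption that hash type follows content category come from this comment, not from anything CDL currently does:

```python
# Hash type to compute, keyed by host, with referrer-based overrides.
# All mappings here are examples from the discussion, not real CDL config.
HOST_HASH_MAP = {
    "coomer": "sha256",
    "gofile": "sha256",
    "bunkr": "sha256",
}

REFERER_OVERRIDES = {
    "f95zone": "md5",  # links referred from f95zone treated as hentai -> md5
}


def pick_hash_type(host: str, referer: str | None) -> str:
    """Choose which extra hash to compute for a download."""
    if referer and referer in REFERER_OVERRIDES:
        return REFERER_OVERRIDES[referer]
    return HOST_HASH_MAP.get(host, "md5")
```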

@datawhores (Collaborator, Author) commented Nov 19, 2024

> The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

> Don't really understand this. If I download a new copy from a new mirror in a forum, I would not want to keep that file, I would want to keep the file I already had. If I could skip the download altogether by the hash, even better.

Yeah, I think you were right: the user would want to delete any new downloads if the hash was already in a previous run. The case I was referring to was the same hash appearing in the same run. In this case you would just pick one file to keep, and it probably wouldn't matter which one.

The only thing is if the user deletes a file and for some reason wants to get it back. There is no way to get the file again without using --ignore-history and turning off hashing.

And if we have a file existence check then the user would not be allowed to move files without rehashing.

> I also don't understand how we can track the files' movements.

We can only assume that it was moved or deleted. So it would be up to the user to decide if they want to do some sort of existence check before removing files.

> I didn't know this. I thought the file path was a unique key for hash lookups. If the file was moved, how do you know its hash?

We can search the database by hashes, and if the hash appears again, then we know that a file was downloaded with that hash at one point. Even if the file doesn't exist, it might be at a new location.

> I don't understand this part. How would KEEP_OLDEST and KEEP_OLDEST_ALL be different?

I think I can remove these from the main download script. The only difference was that KEEP_OLDEST_ALL was going to ignore whether a match existed on the system. As long as it was in the db, it was only going to keep the oldest entry and delete the rest.

> If its path, size and date match an entry on the items table, then it was downloaded by CDL

Yeah, that makes sense; the only thing is if the file was sorted by the script. Then we would not know the new location.

Also, I'm not sure if we can do better than the solutions on the market for something manual. The whole advantage of deduping or preventing downloads from the script would be that it would be automatic.
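
The hash lookup described above ("we can search the database by hashes") could be as simple as the following sketch; table and column names are placeholders:

```python
import sqlite3


def previous_locations(conn: sqlite3.Connection, hash_type: str, hash_value: str) -> list[str]:
    """Return every path that ever produced this hash, whether or not it still exists.

    Because the lookup is keyed on the hash, a match is found even if the
    original file was moved or deleted since it was downloaded.
    """
    rows = conn.execute(
        "SELECT folder || '/' || download_filename FROM hash WHERE hash_type = ? AND hash = ?",
        (hash_type, hash_value),
    ).fetchall()
    return [row[0] for row in rows]
```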

@NTFSvolume (Collaborator) commented

> I think overall the hashing is very confusing, even after the proposed changes. Additionally, if the files are identical, why does it matter which one we keep? Keeping the oldest would probably make sense. I also don't understand how we can track the files' movements.
>
> I think most users wouldn't be using all of the hashing & deduplication options, and may have trouble understanding them.

Keeping the oldest one is almost always better cause the user may already have organized the file and its current path is the "permanent" path.

> On the topic of multiple hash columns, I was thinking we could categorize downloads by the type of porn, which often appears to correspond to the hash used. Hentai seems to usually be hashed with md5, and normal pornography seems to be sha256.
>
> Therefore, we can set a forum and host map for what they usually host, and also set a referrer map. E.g. gofile or bunkr may usually host normal porn, but if referred by f95zone for example, consider them to be hosting hentai, and calculate the md5 hash.
>
> I'm not completely sure that makes sense, but I figure a duplicate between hentai and normal porn is extremely unlikely, and as a result we can skip some hashing.

Thinking about this, md5 actually needs to be enabled by default on every file cause that is the hash that almost everyone is currently using. Disabling md5 would render their current hashes table useless.

The performance hit of running xxh128 + md5 is basically the same as running just md5 cause xxhash is really fast. sha256, on the other hand, is even slower than md5, so it could be optional, but even if it is optional, it should be applied to every file no matter where it came from, because otherwise no match will ever be found. Ex: if sha256 is only computed when downloading from coomer, it will never find a duplicate match cause all files on coomer have a unique hash. The URL itself has the hash, so skipping by history is enough to not download duplicate files.
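
A sketch of why running xxh128 alongside md5 is nearly free: the file is read once and every enabled digest is fed the same chunks. `xxhash` here is the third-party Python package; whether CDL structures it exactly like this is an assumption:

```python
import hashlib
from pathlib import Path

import xxhash  # third-party package


def hash_file(path: Path, with_md5: bool = True, with_sha256: bool = False) -> dict[str, str]:
    """Read the file once and feed every enabled digest from the same chunks."""
    digests = {"xxh128": xxhash.xxh128()}
    if with_md5:
        digests["md5"] = hashlib.md5()
    if with_sha256:
        digests["sha256"] = hashlib.sha256()
    with path.open("rb") as fp:
        while chunk := fp.read(1024 * 1024):
            for digest in digests.values():
                digest.update(chunk)
    return {name: digest.hexdigest() for name, digest in digests.items()}
```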

> The only thing is if the user deletes a file and for some reason wants to get it back. There is no way to get the file again without using --ignore-history and turning off hashing.

--ignore-history should be enough to download the file again. I think the main problem is that the hash is the current unique key of the table. But the unique key should be the file path, not the hash.

> And if we have a file existence check then the user would not be allowed to move files without rehashing.

Moving the files should require re-hashing. I don't know how we are tracking file movement right now, but best case scenario we are guessing, cause it's impossible to know.

> Also, I'm not sure if we can do better than the solutions on the market for something manual. The whole advantage of deduping or preventing downloads from the script would be that it would be automatic.

I think it is better to just remove manual deduplication altogether. Only keep hashing (OFF, IN_PLACE and POST_DOWNLOAD) and --delete-if-hash-seen-before.

CDL will skip files by URL and delete files by hash if they were previously downloaded; it does not matter if the original file exists on the file system or not. The file was downloaded by CDL at some point, that's why it is being deleted. It will only delete files downloaded in the current run. Using --ignore-history will also disable --delete-if-hash-seen-before.

The user can use an external dedupe tool to manage their files if they want more control, but CDL will have a clear purpose:

Do NOT download files downloaded before.

CDL will accomplish this by either skipping by URL or deleting by hash.
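
A sketch of that flow for a single file downloaded in the current run; `send2trash` is the usual library for trashing files, but whether CDL uses it, and the helper names here, are assumptions:

```python
from pathlib import Path

from send2trash import send2trash  # third-party; assumption that CDL would use it


def post_download_check(path: Path, seen_before: bool,
                        delete_if_seen: bool, to_trash: bool) -> bool:
    """Delete a file downloaded in the current run if its hash was already known.

    Returns True if the file was removed. `seen_before` would come from a hash
    lookup against previous runs; --ignore-history disables this check entirely.
    """
    if not (delete_if_seen and seen_before):
        return False
    if to_trash:
        send2trash(str(path))  # --send-to-trash (default)
    else:
        path.unlink()          # permanent deletion
    return True
```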

@datawhores (Collaborator, Author) commented Nov 21, 2024

> The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror

> Don't really understand this. If I download a new copy from a new mirror in a forum, I would not want to keep that file, I would want to keep the file I already had. If I could skip the download altogether by the hash, even better.

> I think these two options would serve many people well. Figuring out which files were downloaded by CDL would be challenging, and the original idea behind some of the dedupe settings was to allow files to be deduped even if they were moved from the original location.

> I didn't know this. I thought the file path was a unique key for hash lookups. If the file was moved, how do you know its hash?

> With that said, for KEEP_OLDEST and KEEP_NEWEST there is a check that filters the initial list to ensure that all the files being removed exist on the system. Then the first element is skipped before going through the rest of the sorted list.
>
> KEEP_OLDEST_ALL KEEP_NEWEST_ALL
>
> Same as above, but with the check for file existence removed. KEEP_NEWEST_ALL and KEEP_NEWEST might end up being the same.

> If its path, size and date match an entry on the items table, then it was downloaded by CDL. If this will not be taken into account, KEEP_NEWEST_ALL is redundant and should be removed.

> But for KEEP_OLDEST_ALL this could prevent the same download from being seen again, so only content that is fresh to the user will be seen. It will also allow the user to move files to a new location without having to rehash them.

> I don't understand this part. How would KEEP_OLDEST and KEEP_OLDEST_ALL be different?

There is an option right now to manually scan; however, it does not remove entries, it just adds them.

I think the automatic hashing and removal process is much simpler and covers the mainstream use case.

@NTFSvolume merged commit 04effd5 into jbsparrow:master Nov 25, 2024
2 of 3 checks passed
@datawhores deleted the depended-fix branch December 1, 2024 22:07