refactor hashing/dedupe #299
Conversation
…nd to support new database schema
also add hash_type to client as class props
remove hash after downloading
… individual downloads
I do think the options are too convoluted.
In my opinion, hashing and deduplication should be independent. The user should be able to hash files without actually taking any action on them, just store the hashes in the database.
Using Enums for the options would be better. These could be the options:
Hashing options

--hashing (type: Enum):

Value | Enum | Default | Description
---|---|---|---
OFF | 0 | YES | Do not hash any file
IN_PLACE | 1 | | Hash a file immediately after its download finishes
POST_DOWNLOAD | 2 | | Hashes all files at once, after all the downloads have finished

- --add_md5_hash (type: bool) (default: False)
- --add_sha256_hash (type: bool) (default: False)
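A minimal sketch of how these options could be modelled, assuming a plain enum plus a settings dataclass; the names `Hashing` and `HashSettings` are illustrative, not CDL's actual API:

```python
from dataclasses import dataclass
from enum import IntEnum


class Hashing(IntEnum):
    OFF = 0            # do not hash any file (default)
    IN_PLACE = 1       # hash each file right after its download finishes
    POST_DOWNLOAD = 2  # hash everything once, after all downloads have finished


@dataclass
class HashSettings:
    hashing: Hashing = Hashing.OFF
    add_md5_hash: bool = False
    add_sha256_hash: bool = False
```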
Dedupe options

--delete-if-hash-seen-before (type: bool) (default: False):

This will delete any newly downloaded file if its hash was already in the database. The file will be compared by all the enabled hashes, specified with the hashing option. This will do nothing if hashing is OFF.

This option will only take effect while CDL is on a run, and it will only delete files downloaded in the current run because, realistically, there is never a case where you would want to delete files that you already downloaded. That is what the manual dedupe option is for.
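Roughly, that behaviour could look like the sketch below; the function name and the sets standing in for the database and per-run state are assumptions for illustration, not CDL code:

```python
from pathlib import Path

from send2trash import send2trash  # third-party package; permanent delete otherwise


def maybe_delete_duplicate(
    file: Path,
    file_hash: str,
    known_hashes: set[str],        # hashes already stored in the database
    current_run_files: set[Path],  # files downloaded during the current run
    use_trash: bool = True,
) -> bool:
    """Delete a freshly downloaded file whose hash is already known."""
    if file_hash not in known_hashes or file not in current_run_files:
        return False  # never touch files from previous runs
    if use_trash:
        send2trash(str(file))
    else:
        file.unlink()
    return True
```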
--dedupe (type: Enum):

Value | Enum | Default | Description
---|---|---|---
OFF | 0 | YES | Do nothing
KEEP_OLDEST | 1 | | Keep only one copy, the first one downloaded. The date to compare should be the file creation date. Any file that was not downloaded by CDL will be skipped and will not be taken into account for matches.
KEEP_NEWEST | 2 | | Keep only one copy, the last one downloaded. The date to compare should be the file creation date. Any file that was not downloaded by CDL will be skipped and will not be taken into account for matches.
KEEP_OLDEST_ALL | 3 | | Keep only one copy, the first one downloaded. The date to compare should be the file creation date. It will take into account every single file in the folder, even ones not downloaded by CDL.
KEEP_NEWEST_ALL | 4 | | Keep only one copy, the last one downloaded. The date to compare should be the file creation date. It will take into account every single file in the folder, even ones not downloaded by CDL.

This option will only take effect when the user manually chooses to run the Scan folder and hash files option in the UI. It will do nothing if hashing is OFF. Whether hashing is IN_PLACE or POST_DOWNLOAD does not matter in this case; it just needs to be enabled.
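The proposed values map directly onto a small enum; this is just a sketch mirroring the table above, not existing code:

```python
from enum import IntEnum


class Dedupe(IntEnum):
    OFF = 0              # do nothing (default)
    KEEP_OLDEST = 1      # keep the first copy; skip files not downloaded by CDL
    KEEP_NEWEST = 2      # keep the last copy; skip files not downloaded by CDL
    KEEP_OLDEST_ALL = 3  # keep the first copy; consider every file in the folder
    KEEP_NEWEST_ALL = 4  # keep the last copy; consider every file in the folder
```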
--send-to-trash (type: bool) (default: True):

Whether to send deleted files to the trash or delete them permanently.

This option will be taken into account for both --delete-if-hash-seen-before and --dedupe.
I'm not familiar with the hashing section of the codebase, so I'm not sure if this covers all the currently supported cases, but it should cover all the scenarios that a user may actually want to use.
move set default settings for config; enable disabling of file deletion; reword some options for hashing/deletion:
also add enum to check if hash should be generated after each download
The reason for deleting files already downloaded is that it is common to get duplicate files when downloading from forums. Users will upload the same content to another host to serve as a mirror.

KEEP_OLDEST: I think these two options would serve many people well. Figuring out which files were downloaded by CDL would be challenging, and the original idea behind some of the dedupe settings was to allow files to be deduped even if they were moved from their original location. With that said, for KEEP_OLDEST and KEEP_NEWEST there is a check to filter the initial list to ensure that all the files being removed exist on the system. Then the first element is skipped before going through the rest of the sorted list (see the sketch below).

KEEP_OLDEST_ALL: same as above, but with the check for file existence removed. For KEEP_OLDEST_ALL this could prevent the same download from being seen again, so only content that is fresh to the user will be seen. It will also allow the user to move files to a new location without having to rehash them.
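If I'm reading that description right, the KEEP_OLDEST vs KEEP_OLDEST_ALL difference is roughly the following; `matches` is a hypothetical list of (path, created_date) rows for one hash, and this is only a sketch of the described logic, not the actual implementation:

```python
from pathlib import Path


def files_to_delete(matches: list[tuple[Path, float]], require_exists: bool) -> list[Path]:
    """Return every match except the oldest; optionally drop rows whose file is gone."""
    if require_exists:  # KEEP_OLDEST / KEEP_NEWEST: only consider files still on disk
        matches = [(p, d) for p, d in matches if p.exists()]
    matches.sort(key=lambda row: row[1])  # sort by creation date, oldest first
    return [p for p, _ in matches[1:]]    # keep the first entry, delete the rest
```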
Don't really undestand this. If I download a new copy from a new mirror in a forum, i would not want to keep that file, i would want to keep the file i already had. If i could skip the download altogether by the hash, even better
I didn't know this. I thought the file path was an unique key for hashes lookup. If the file was moved, how do you know its hash?
If its
I don't understand this part. How would
I think overall the hashing is very confusing, even after the proposed changes. Additionally, if the files are identical, why does it matter which one we keep? Keeping the oldest would probably make sense. I also don't understand how we can track the files' movements. I think most users wouldn't be using all of the hashing & deduplication options, and may have trouble understanding them.

On the topic of multiple hash columns, I was thinking we can categorize downloads by the type of porn, which appears to often correspond to the hash used. Hentai seems to usually be hashed with md5, and normal pornography seems to be sha256. Therefore, we can set a forum and host map for what they usually host, and also set a referrer map. E.g. gofile or bunkr may usually host normal porn, but if referred by f95zone, for example, consider them to be hosting hentai and calculate the md5 hash. I'm not completely sure that makes sense, but I figure a duplicate between hentai and normal porn is extremely unlikely, and as a result we can skip some hashing.
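Something like the tiny map below could express that idea; the mappings and names are purely hypothetical guesses, not shipped CDL configuration:

```python
HOST_DEFAULT_HASH = {"coomer": "sha256", "bunkr": "sha256", "gofile": "sha256"}
REFERRER_OVERRIDE = {"f95zone": "md5"}  # forum referrers that imply md5-hashed content


def pick_hash_algorithm(host: str, referrer: str | None) -> str:
    """Choose which hash to compute based on the host and, if present, the referrer."""
    if referrer in REFERRER_OVERRIDE:
        return REFERRER_OVERRIDE[referrer]
    return HOST_DEFAULT_HASH.get(host, "sha256")
```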
Yeah, I think you were right: the user would want to delete any new downloads if the hash was already seen in a previous run. The case I was referring to was if the same hash appears in the same run. In this case you would just pick one file to keep, and it probably wouldn't matter which one. The only thing is if the user deletes a file and for some reason wants to get it back; there is no way to get the file again without using --ignore-history and turning off hashing. And if we have a file existence check, then the user would not be allowed to move files without rehashing.
We can only assume that it was moved or deleted. So it would be up to the user to decide if they want to do some sort of existence check before removing files
We can search the database by hashes, and if the hash appears again, then we know that a file with that hash was downloaded at one point. Even if the file doesn't exist, it might have been moved to a new location.
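In other words, the lookup only needs the hash itself, not a valid path. Roughly like this sketch, where the table and column names (hash, hash_value) are assumptions about the schema rather than CDL's actual one:

```python
import sqlite3


def previously_downloaded(db: sqlite3.Connection, file_hash: str) -> bool:
    """True if any row matches this hash, regardless of where the file is now."""
    row = db.execute(
        "SELECT 1 FROM hash WHERE hash_value = ? LIMIT 1", (file_hash,)
    ).fetchone()
    return row is not None
```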
I think I can remove these from the main download script. The only difference was that KEEP_OLDEST_ALL was going to ignore whether a match existed on the system. As long as it was in the db, it was only going to keep the oldest entry and delete the rest.
Yeah, that makes sense. The only thing is if the file was sorted by the script; then we would not know the new location. Also, I'm not sure if we can do better than the solutions on the market for something manual. The whole advantage of deduping or preventing downloads from the script would be that it would be automatic.
Keeping the oldest one is almost always better because the user may have already organized the file and its current path is the "permanent" path.
Thinking about this, Performance hit of running
Moving the files should require re-hashing. I don't know how we are tracking file movement right now, but best case scenario we are guessing, because it's impossible to know.
I think it is better to just remove manual deduplication altogether and only keep hashing. CDL will skip files by URL and delete files by hash if they were previously downloaded; it does not matter if the original file exists on the file system or not. The file was downloaded by CDL at some point, and that's why it is being deleted. It will only delete files downloaded in the current run. The user can use an external dedupe tool to manage their files if they want more control, but CDL will have a clear purpose:
CDL will accomplish this by either skipping by URL or deleting by hash.
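A minimal sketch of that simplified flow, assuming plain sets stand in for the URL history and hash tables; the function and parameter names are hypothetical, not CDL's:

```python
import hashlib
from pathlib import Path


def post_download_check(
    file: Path, url: str, url_history: set[str], known_hashes: set[str]
) -> str:
    """After a download finishes: hash the file and delete it if the hash was seen before."""
    digest = hashlib.sha256(file.read_bytes()).hexdigest()
    if digest in known_hashes:
        file.unlink()  # or send to trash, depending on --send-to-trash
        return "deleted duplicate"
    known_hashes.add(digest)
    url_history.add(url)  # future runs skip this URL before downloading at all
    return "kept"
```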
There is an option right now to manually scan. However, it does not remove entries, it just adds them. I think the automatic hashing and removal process is much simpler, and covers a mainstream use case.
remove dedupe enum, use bool compare with enum class
Co-authored-by: NTFSvolume <[email protected]>
also print name and not value
This pull request is a revamp of the hashing/dedupe system.

I think it can be split into 2-3 major changes:
- Supporting other hash types will allow preventing downloads from sites like coomer if, for example, we already have the download from another site like bunkrr.
- Add a date column: this can be used to remove keep_previous and keep_current, and basically simplify the deduping process for the user.
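For illustration, the date column could look something like this; the table and column names are assumptions for the sketch, and the actual schema may differ:

```python
import sqlite3

# Hypothetical schema: one row per (file, hash type), plus a download date so
# KEEP_OLDEST / KEEP_NEWEST can compare dates directly.
SCHEMA = """
CREATE TABLE IF NOT EXISTS hash (
    file_path     TEXT,
    hash_type     TEXT,   -- e.g. 'md5' or 'sha256'
    hash_value    TEXT,
    downloaded_at REAL,   -- the proposed date column
    PRIMARY KEY (file_path, hash_type)
);
"""

with sqlite3.connect(":memory:") as db:
    db.executescript(SCHEMA)
```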