Fork cdxj indexer #428

Open
wants to merge 3 commits into
base: main

benoit74 (Collaborator) commented Dec 20, 2024

cdxj_indexer maintenance is laggy and complex. It is becoming a concern for warc2zim.

By laggy I mean, for instance, webrecorder/cdxj-indexer#26, which highlights that as of today, using cdxj-indexer forces us to use an old idna version (< 3, which means from before 2021).

By complex I mean, for instance, webrecorder/cdxj-indexer#24, which highlights that as of today, using cdxj-indexer forces us to use the PyAMF library, which is no longer maintained (this lib is not maintained by the webrecorder team).

I doubt these two issues can be solved quickly, given their potentially complex consequences for a lib that is widely shared and probably runs on various Python versions.

I want to stress that this is not really a problem of lack of maintenance effort by the cdxj_indexer maintainer.

Since the pace of cdxj_indexer changes is very limited, and our usage of the codebase quite modest, I propose to "fork" cdxj_indexer in warc2zim and solve these two issues (and a few others) on our side. If this proves to work well, we will be in a better position to suggest fixes for the upstream issues, but again, confirming the impact on a significant ecosystem is not necessarily easy.

Note that this is not a total "fork", but rather a "duplication" of the useful cdxj_indexer code in our own codebase.

@benoit74 benoit74 self-assigned this Dec 20, 2024
@benoit74 benoit74 force-pushed the fork_cdxj_indexer branch 2 times, most recently from 114963c to c23af69 Compare December 20, 2024 14:15
@benoit74 benoit74 changed the base branch from main to scraperlib_4_1 December 20, 2024 14:15
@benoit74 benoit74 force-pushed the scraperlib_4_1 branch 2 times, most recently from e069c8b to 1218df0 Compare January 7, 2025 15:53
Base automatically changed from scraperlib_4_1 to main January 7, 2025 16:00
benoit74 (Collaborator, Author) commented Jan 7, 2025

@kelson42 do you have any concerns about this move? Maybe we should discuss it live.

@benoit74 benoit74 marked this pull request as ready for review January 7, 2025 16:12
@benoit74 benoit74 requested a review from rgaudin January 7, 2025 16:13
@benoit74 benoit74 added this to the 2.3.0 milestone Jan 10, 2025