Avoid being blocked by Cloudflare #2134

Open
benoit74 opened this issue Jan 13, 2025 · 4 comments

Comments

@benoit74
Contributor

mwoffliner version : 1.14.0

Task: https://farm.openzim.org/pipeline/1e755f21-4805-4cf8-8fa1-63fd5a5dc9d5/debug
Recipe: https://farm.openzim.org/recipes/cyclowiki.org_rus_all
Request: openzim/zim-requests#9

Log:

[error] [2025-01-13T13:46:08.126Z] Failed to run mwoffliner after [1s]: {
	"stack": "Error: mwUrl [https://cyclowiki.org] is not valid.\n    at file:///tmp/mwoffliner/lib/sanitize-argument.js:134:15\n    at process.processTicksAndRejections (node:internal/process/task_queues:95:5)\n    at async sanitize_mwUrl (file:///tmp/mwoffliner/lib/sanitize-argument.js:133:5)\n    at async sanitize_all (file:///tmp/mwoffliner/lib/sanitize-argument.js:55:5)",
	"message": "mwUrl [https://cyclowiki.org] is not valid."
}
[error] [2025-01-13T13:46:08.127Z] 

**********

mwUrl [https://cyclowiki.org] is not valid.

**********

Explanation: the first check of mwUrl seems to be failing. This could be because Cloudflare is protecting this website. To be investigated.
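
To help tell a Cloudflare block apart from a genuinely invalid mwUrl, something like the following could be used during investigation. This is only a sketch: `looksLikeCloudflareBlock`, `getHeader` and `HeaderMap` are hypothetical names, not part of mwoffliner, and the issue does not show what the real check in `sanitize-argument.js` inspects. It relies on Cloudflare's documented `cf-mitigated: challenge` response header, plus a weaker heuristic (an assumption) that a 403/503 with `server: cloudflare` is probably a block.

```typescript
// Hypothetical helper, NOT mwoffliner code: guess whether a failed response
// came from Cloudflare's protection layer rather than from the wiki itself.
type HeaderMap = Record<string, string | undefined>

// Case-insensitive header lookup; returns a trimmed, lowercased value or ''.
function getHeader(headers: HeaderMap, name: string): string {
  const wanted = name.toLowerCase()
  for (const key of Object.keys(headers)) {
    if (key.toLowerCase() === wanted) {
      return (headers[key] ?? '').trim().toLowerCase()
    }
  }
  return ''
}

function looksLikeCloudflareBlock(status: number, headers: HeaderMap): boolean {
  // Cloudflare sets `cf-mitigated: challenge` on its challenge pages.
  if (getHeader(headers, 'cf-mitigated') === 'challenge') {
    return true
  }
  // Assumption: a 403 or 503 served by Cloudflare is likely a block page.
  const blockedStatus = status === 403 || status === 503
  return blockedStatus && getHeader(headers, 'server') === 'cloudflare'
}
```

If this returns true for the response to the first mwUrl request, the "is not valid" error is more likely a protection issue than a bad URL. It will miss blocks that use other status codes or strip these headers.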

@audiodude
Member

Yes, we have had this exact problem with Cloudflare before.

@audiodude
Member

See #2039

@kelson42 added this to the 1.15.0 milestone on Jan 14, 2025

@benoit74
Contributor Author

Yeah, I saw this other issue, where we just skipped the test to solve it. Now that we have a repro, I suspect we might be able to do something by passing a proper User-Agent. At least, this is what we managed to do in other scrapers. It is not bullet-proof, but a "bad" User-Agent triggers Cloudflare protections much more easily. By "bad", I mean something which does not look at all like a browser.
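
For illustration, the idea of passing a browser-like User-Agent on the URL check could be sketched roughly as below. Everything here is an assumption made for the example: `BROWSER_LIKE_USER_AGENT`, `buildRequestHeaders`, `isUrlReachable` and the `FetchLike` type are hypothetical names, the User-Agent string is just one plausible browser value, and this is not how mwoffliner currently performs its check. The HTTP client is passed in as a parameter so the sketch does not depend on a particular library.

```typescript
// Assumption: an example browser-like value; any realistic browser string could be used.
const BROWSER_LIKE_USER_AGENT =
  'Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36'

// Minimal shape of the HTTP client this sketch needs (hypothetical type).
type FetchLike = (
  url: string,
  init: { headers: Record<string, string> },
) => Promise<{ ok: boolean }>

// Build request headers that look like they come from a browser.
function buildRequestHeaders(userAgent: string = BROWSER_LIKE_USER_AGENT): Record<string, string> {
  return {
    'User-Agent': userAgent,
    Accept: 'text/html,application/json;q=0.9,*/*;q=0.8',
  }
}

// Hypothetical URL check: true when the server answers with a 2xx status.
async function isUrlReachable(url: string, fetchImpl: FetchLike, userAgent?: string): Promise<boolean> {
  try {
    const response = await fetchImpl(url, { headers: buildRequestHeaders(userAgent) })
    return response.ok
  } catch {
    // Network-level failure (DNS, TLS, timeout, ...): treat as not reachable.
    return false
  }
}
```

On Node 18 or later, the global `fetch` should be usable as `fetchImpl`, for example `await isUrlReachable('https://cyclowiki.org', fetch)`. As noted above, this is not bullet-proof: Cloudflare may still challenge the request based on other signals.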

@benoit74 changed the title from "cyclowiki is failing with mwUrl [https://cyclowiki.org] is not valid." to "Avoid being blocked by Cloudflare" on Jan 14, 2025

@kelson42
Collaborator

> Not bullet-proof, but a "bad" User-Agent triggers much more easily Cloudflare protections. By "bad", I mean something which does not look at all like a browser.

Worth a try indeed. If it works, we should create an option for that.
