Any reason valid links to pdf files might raise false alarms #105
Comments
A PDF file is not a machine-readable text file, so you should not ask the checker to parse it (or you should add it to the ignore list). |
Hmm...did you follow the link to the failed tests? I am not using it to check links in pdf files. It is failing on links to PDF files which I can browse to fine. |
My apologies - I did not! It looks like it has nothing to do with the PDF files; those servers have bad certificates:
HTTPSConnectionPool(host='iitj.ac.in', port=443): Max retries exceeded with url: /uploaded_docs/cc/HPC_training/mcmuserguide.pdf (Caused by SSLError(SSLCertVerificationError(1, '[SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1129)')))
You can reproduce with two lines of Python:
import requests
requests.get('https://iitj.ac.in/uploaded_docs/cc/HPC_training/mcmuserguide.pdf')
You can ask the webmasters to update their certs, and if you don't have that control, you'll have to add them to the skiplist. |
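For readers who want to tell a certificate failure apart from a genuinely dead link, here is a minimal sketch using only the Python standard library. It is not how urlchecker itself is implemented; the function name `classify_link`, its return labels, and the injectable `opener` parameter are all assumptions made for illustration.

```python
import socket
import ssl
import urllib.error
import urllib.request


def classify_link(url, opener=urllib.request.urlopen, timeout=10):
    """Return 'ok', 'bad-certificate', 'timeout', or 'unreachable'.

    Hypothetical helper: the labels and the `opener` hook are illustrative
    and are not part of urlchecker.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return "ok" if response.status < 400 else "unreachable"
    except urllib.error.HTTPError:
        # The server answered, but with an HTTP error status (4xx/5xx).
        return "unreachable"
    except urllib.error.URLError as err:
        # urlopen wraps low-level socket and TLS errors in URLError.reason.
        if isinstance(err.reason, ssl.SSLCertVerificationError):
            return "bad-certificate"
        if isinstance(err.reason, socket.timeout):
            return "timeout"
        return "unreachable"
    except socket.timeout:
        # A timeout while reading the response may surface unwrapped.
        return "timeout"
    except OSError:
        return "unreachable"
```

Calling `classify_link` on the failing URL quoted above would be one way to confirm whether the certificate chain, rather than the PDF itself, is the problem on a given machine.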
Any chance you'd be willing to add a feature to ignore bad certs? (maybe even make it the default). It's a common scenario and asking people to populate skip lists for such a common scenario seems onerous. And, it's confusing why my browser is able to follow links fine that the |
The browser, and actually depending on the browser, does a lot of wonky things to "just make the page load." If you use command line / core tools that enforce best practices to check certificates, you tend to see the truth. And actually, we go to some lengths to try to emulate a web driver, but it's not perfect. We can definitely consider that feature. You'll still have the timeout issue on the second PDF, however. |
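As a rough illustration of what "ignoring bad certs" means at the TLS level, the sketch below builds a standard-library SSL context with verification switched off. The names `build_ssl_context`, `fetch_status`, and the `verify_certs` parameter are hypothetical; this is not the option that was added to urlchecker, only a sketch of the general technique.

```python
import ssl
import urllib.request


def build_ssl_context(verify_certs=True):
    """Return an SSL context; `verify_certs` is a hypothetical option name."""
    context = ssl.create_default_context()
    if not verify_certs:
        # Order matters: hostname checking must be turned off before
        # verify_mode can be set to CERT_NONE.
        context.check_hostname = False
        context.verify_mode = ssl.CERT_NONE
    return context


def fetch_status(url, verify_certs=True, timeout=10):
    """Fetch a URL and return its HTTP status code."""
    context = build_ssl_context(verify_certs)
    with urllib.request.urlopen(url, timeout=timeout, context=context) as response:
        return response.status
```

Disabling verification only removes the certificate check. A link that fails because the server is slow or unresponsive would still fail, which matches the remark above about the timeout on the second PDF.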
Here you go! Please test this out locally, and let me know if the new option works. urlstechie/urlchecker-python#89
|
@vsoch thanks so much! Lemme give this a try. |
Thank you!! Heads up I'm breaking for dinner, but will be back later. |
Am running into ssl version issues...
|
You need to add the |
I am using that flag, though the command and error I pasted didn't include it. It's an issue with macOS SSL, Python, and the virtual env. |
I don't have a Mac that I use for programming, but I'd follow that GitHub link and see if you can track down the issue. This is unrelated to urlchecker and the PR - it seems like it's an issue with the Python/ssl versions on your system. |
Ok, well it might help if I was in the correct branch of the clone. I've done that now. And, I built a docker ubuntu container..., but strangely, I am getting cert errors...
|
Ok, so I think
|
I would leave out branch and just run with |
I don't have confidence I am running the correct version in my docker container. I am still messing with it. |
Let me know if you want some help to write a Dockerfile for it. |
Ok, I am quite confident I've got it installed and am using the correct branch/version of
|
A few things:
|
Right, I gave all scenarios I tried which included with and without All that being said, still not working (could be my container setup). I don't think it's doing the task-launch in the loop over files.
|
Try removing file types? I did test it on a directory in tmp with one markdown file and a link (in markdown too, that's important) and it worked, but I since added a raw string and that might have broken it. We also have some bug that the regex is not working as it did before - pinging @SuperKogito he was going to look into that today. |
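To make the point about the link being "in markdown too" concrete, here is a small sketch of pulling inline markdown link targets out of text with a regular expression. The pattern and the name `extract_markdown_urls` are assumptions for illustration only; urlchecker's actual regex is different and is not reproduced here.

```python
import re

# Matches inline markdown links of the form [label](http://... or https://...)
# and captures only the URL. Bare URLs outside that syntax are not matched.
MARKDOWN_LINK = re.compile(r"\[[^\]]*\]\((https?://[^\s)]+)\)")


def extract_markdown_urls(text):
    """Return the targets of inline markdown links found in `text`."""
    return MARKDOWN_LINK.findall(text)
```

A pattern this narrow finds `[guide](https://example.com/guide.pdf)` but skips a bare `https://example.com/guide.pdf`, which is one way a test file can appear to contain a link that a checker never visits.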
no change...
|
Let me try removing the raw string I added and I'll let you know, repull install and try again. |
okay pushed. |
Ok, it's going now. Getting a ton of error messages...
Any way to silence that? I mean, maybe one at the beginning or end would be good...but it's echoing on every link. Not urgent. Put it on the todo list. |
Ok, that worked...now trying with certs enabled to confirm a difference in behavior. |
Yeah no worries about that - this is a non-work, for fun open source project, so I'm good to prioritize based on that! I usually can add comments like this during the day and then actual work during non work hours. |
Ok, what I am seeing withOUT
|
@vsoch by the way...if you need a proj/task to charge for some time on this, I think I can accommodate. Lemme know. |
@markcmiller86 that might be reflecting the setup on your Mac? I appreciate that, but this project has a FUNDING.yml meaning folks can find it with GitHub sponsors, and is clearly scoped outside of lab work. I have this registered as an outside business agreement and I set a pretty clear line between lab work and these projects, so I don't think that would work. I'm pretty good at getting stuff done, so I can say I will be able to work on the underlying issues sooner than later, but absolutely not on lab time (I'm taking a quick break and drinking hot chocolate right now). ☕ |
Also double check you installed |
I checked
I installed But, I get your point. Maybe the container is misconfigured. I certainly don't have much experience with them and I didn't launch it to use |
Ok, I gave up on docker. Installed on pascal. Asked ChatGPT for known sites with bad certs...
Created this file
With |
Also, not sure what you are doing on the back end as far as testing
Yes, for testing tools that check the validity and functionality of URLs in text files, it's helpful to use a variety of test websites and addresses that simulate different scenarios. Here are several categories and examples:
When using these resources, it's important to consider the impact of your testing on third-party services. Ensure that your testing complies with any usage policies or terms of service to avoid causing undue load or other issues. |
Apparently the action does not work with URLs pointing at content. See also: - urlstechie/urlchecker-action#105
I am getting false positives, both of which have to do with .pdf files: https://github.com/betterscientificsoftware/bssw.io/actions/runs/7734371239/job/21088244489?pr=1633
Any reason to suspect the checker has trouble with .pdf files?