
fatal error, been suspended #48

Open
dvlp123456 opened this issue Jan 22, 2025 · 3 comments

@dvlp123456

Hi, my friend,
I'm new to crawling. I crawled several websites with siteone-crawler and got correct reports.
However, three websites did not produce a correct report. The error messages look like this:

Progress report           | URL                                                   | Status | Type     | Time   | Size   | Cache  | Access.  | Best pr.
-------------------------------------------------------------------------------------------------------------------------------------------------------    
1/1     | 100% |>>>>>>>>>>| https://www.XXXXXX.com/+%random-query% | -1:CON | Other    | 5 s    | 0 B    | none    |          |
The analysis has been suspended because no working URL could be found. Please check the URL/domain.
Progress report           | URL                                                   | Status | Type     | Time   | Size   | Cache  | Access.  | Best pr.
-------------------------------------------------------------------------------------------------------------------------------------------------------    

Fatal error: Uncaught TypeError: str_contains(): Argument #1 ($haystack) must be of type string, array given in /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php:262
Stack trace:
#0 /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php(262): str_contains(Array, 'max-age=0')
#1 /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php(171): Crawler\Analysis\SecurityAnalyzer->checkStrictTransportSecurity(Array, Object(Crawler\Analysis\Result\UrlAnalysisResult))
#2 /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php(130): Crawler\Analysis\SecurityAnalyzer->checkHeaders(Array, true, Object(Crawler\Analysis\Result\UrlAnalysisResult))
#3 /XXXXX/siteone-crawler/src/Crawler/Analysis/Manager.php(169): Crawler\Analysis\SecurityAnalyzer->analyzeVisitedUrl(Object(Crawler\Result\VisitedUrl), '\n<!DOCTYPE html...', Object(DOMDocument), Array)
#4 /XXXXX/siteone-crawler/src/Crawler/Manager.php(197): Crawler\Analysis\Manager->analyzeVisitedUrl(Object(Crawler\Result\VisitedUrl), '\n<!DOCTYPE html...', Array)
#5 [internal function]: Crawler\Manager->visitedUrlCallback(Object(Crawler\Result\VisitedUrl), '\n<!DOCTYPE html...', Array)
#6 /XXXX/siteone-crawler/src/Crawler/Crawler.php(585): call_user_func(Array, Object(Crawler\Result\VisitedUrl), '\n<!DOCTYPE html...', Array)
#7 [internal function]: Crawler\Crawler->processNextUrl()
#8 {main}
  thrown in /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php on line 262
Progress report           | URL                                                   | Status | Type     | Time   | Size   | Cache  | Access.  | Best pr.
-------------------------------------------------------------------------------------------------------------------------------------------------------    
1/2     | 50%  |>>>>>     | https://XXXXX.XXXXX.cn/+%random-query% | 302    | Redirect | 826 ms | 83 B   | 0 s     |          |

Fatal error: Uncaught TypeError: str_contains(): Argument #1 ($haystack) must be of type string, array given in /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php:262
Stack trace:
#0 /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php(262): str_contains(Array, 'max-age=0')
#1 /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php(171): Crawler\Analysis\SecurityAnalyzer->checkStrictTransportSecurity(Array, Object(Crawler\Analysis\Result\UrlAnalysisResult))
#2 /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php(130): Crawler\Analysis\SecurityAnalyzer->checkHeaders(Array, true, Object(Crawler\Analysis\Result\UrlAnalysisResult))
#3 /XXXXX/siteone-crawler/src/Crawler/Analysis/Manager.php(169): Crawler\Analysis\SecurityAnalyzer->analyzeVisitedUrl(Object(Crawler\Result\VisitedUrl), '<!DOCTYPE html>...', Object(DOMDocument), Array)
#4 /XXXXX/siteone-crawler/src/Crawler/Manager.php(197): Crawler\Analysis\Manager->analyzeVisitedUrl(Object(Crawler\Result\VisitedUrl), '<!DOCTYPE html>...', Array)
#5 [internal function]: Crawler\Manager->visitedUrlCallback(Object(Crawler\Result\VisitedUrl), '<!DOCTYPE html>...', Array)
#6 /XXXXX/siteone-crawler/src/Crawler/Crawler.php(585): call_user_func(Array, Object(Crawler\Result\VisitedUrl), '<!DOCTYPE html>...', Array)
#7 [internal function]: Crawler\Crawler->processNextUrl()
#8 {main}
  thrown in /XXXXX/siteone-crawler/src/Crawler/Analysis/SecurityAnalyzer.php on line 262

My environment:

  1. SiteOne Crawler, v1.0.8.20240824
  2. Windows WSL 2.3.26.0 Debian 12.7 bookworm
  3. OpenSSL 3.0.14 4 Jun 2024 (Library: OpenSSL 3.0.14 4 Jun 2024)

I would be very grateful if you could help solve this problem!

janreges added a commit that referenced this issue Jan 22, 2025
…nto a concatenated string, instead of an array, refs #48
@janreges (Owner)

Hi @dvlp123456,

To fix the last two cases, please try the current version from the main branch, where I just pushed the fix. If you are using the crawler in WSL, it's straightforward - see the tutorial https://crawler.siteone.io/installation-and-requirements/manual-installation/#linux-x64-or-wsl-on-windows
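
For context, the crash happened because a response header value arrived as an array rather than a string, and the fix concatenates multi-valued headers into one string before substring checks such as the max-age=0 test. A minimal shell sketch of that idea (the header values below are illustrative, not taken from the affected sites):

```shell
# A server may send the same header multiple times, so the raw value
# can arrive as an array of strings rather than a single string.
values=("max-age=31536000; includeSubDomains" "preload")

# Join the values into one comma-separated string before any substring check
joined=$(IFS=','; echo "${values[*]}")
echo "$joined"

# The HSTS check can now safely look for 'max-age=0' in a plain string
if [[ "$joined" == *"max-age=0"* ]]; then
  echo "weak HSTS: max-age=0"
else
  echo "HSTS max-age looks OK"
fi
```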

In the first case, I see that the request timed out after 5 seconds. Since you're using random parameter generation in the GET query, this may have bypassed the cache, and if the test site is dynamically generated, it may be very slow. Try setting --timeout=30, for example.
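
For illustration, the amended invocation might look like this (example.com is a placeholder; the command is echoed rather than run here):

```shell
# --timeout is the per-request limit in seconds; 30 s gives slow,
# dynamically generated pages room to respond
echo "./crawler --url=https://www.example.com/ --timeout=30"
```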

@dvlp123456 (Author)

Hello @janreges, thank you very much! After I updated the crawler, cases 2 and 3 were solved perfectly. Case 1 still shows the same error message, although I set the timeout to 120 and removed the --add-random-query-params parameter:

./crawler --url=https://www.XXXXX.com/ \
  --output=text --workers=2 --memory-limit=1024M --timeout=120 \
  --max-queue-length=3000 --max-visited-urls=10000 --max-url-length=5000 \
  --max-non200-responses-per-basename=10 --remove-query-params \
  --show-scheme-and-host --do-not-truncate-url \
  --output-html-report=tmp/myreport.html \
  --output-json-file=wr_test_dir/report.json \
  --output-text-file=wr_test_dir/report.txt \
  --add-timestamp-to-output-file --add-host-to-output-file \
  --ignore-store-file-error \
  --sitemap-xml-file=wr_test_dir/sitemap.xml \
  --sitemap-txt-file=wr_test_dir/sitemap.txt \
  --sitemap-base-priority=0.5 --sitemap-priority-increase=0.1

@janreges (Owner)

Please try using ping yourdomain.xyz to find out the IP address of this site (e.g. 1.2.3.4) and then add --resolve='yourdomain.xyz:443:1.2.3.4'.

This will bypass possible problems related to DNS resolution.

More info about --resolve: https://github.com/janreges/siteone-crawler?tab=readme-ov-file#advanced-crawler-settings
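
Putting the two steps together might look like this (yourdomain.xyz and 1.2.3.4 are placeholders for your domain and the IP reported by ping; the final command is echoed rather than run here):

```shell
HOST="yourdomain.xyz"   # placeholder domain
IP="1.2.3.4"            # placeholder; take this from the ping output
# e.g. IP=$(ping -c1 "$HOST" | sed -n 's/.*(\([0-9.]*\)).*/\1/p' | head -n1)

# Pin HTTPS traffic for the host to the known IP, bypassing DNS resolution
echo "./crawler --url=https://${HOST}/ --resolve='${HOST}:443:${IP}'"
```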

If the problem persists, it is possible that the target site or its firewall detects the use of a crawler and drops the connection. In this case, try forcing a custom user agent, e.g. using --user-agent='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36!'. The ! at the end prevents the siteone-crawler/version signature from being appended to the user agent, which is the default behavior.
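
When passing the user agent on the command line, quote the whole value so the spaces survive; the Chrome string below is the one suggested above, with the trailing ! as the signature-suppression marker:

```shell
UA='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/132.0.0.0 Safari/537.36!'
# The argument as the crawler would receive it
echo "--user-agent=${UA}"
```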
