Crawler does not include external files (pdf) #1057
Comments
Thanks for reporting this. PDF issues are often a matter of configuration, so they are hard to debug. Could you check whether this is reproducible with the Crawler Devbox?
Hi again, I understand time is valuable. As you have probably already noticed, we are doing this for our work. I very much doubt that trying to reproduce the issue on a Linux container with ddev is going to help us. I also guess that sponsoring you with a one-time payment of 25€ is not going to cut it. We are exploring a few other options, like ke_search. If, and that's a big IF, we could support you by paying for your support, I see a few options here:
It would be really great if there were another way to contact you, other than via GitHub.
Hi @zillion42 I cannot take on work at the moment due to personal reasons, but you could ask in the #TYPO3 #Crawler chat on Slack. Perhaps someone there could help you better. I can click the release button if a fix is provided, but I cannot do much more currently. https://typo3.org/community/meet/chat-slack You can also contact me via Slack, but I have a slow response time, for the same reasons as above.
Hi @zillion42 I know it's been a while, but I have tested this in the Crawler devbox (ddev). If I don't add other items on the page with the PDFs, the page doesn't get indexed, so there needs to be additional text on the page, not just a header and links. Try to see if that changes anything for you. The pages and PDFs are indexed correctly in my setup.
Bug Report
Current Behavior
The crawler builds its queue with:
c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:buildQueue 1 reindex --depth=20 --mode=queue
but omits any external files. All other HTML content is queued, processed, and indexed just fine.
If we enable frontend indexing by
'disableFrontendIndexing' => '0'
and browse a page containing PDFs (via local file collections), the PDFs are added to the queue. We're using the xpdf-tools-win-4.05 32-bit binaries; pdftotext.exe is tested and works. PDFs can be processed in the queue after they have been added through frontend indexing, and PDF content is successfully found via the TYPO3 search once the queue has been processed.
Expected behavior/output
Building the queue should add external files, since
'useCrawlerForExternalFiles' => '1'
is enabled; see the configuration sketch below.
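For reference, this is roughly where the two settings quoted above live in our setup. Whether they sit under the 'EXTENSIONS' key of LocalConfiguration.php depends on the TYPO3 version, and the pdftools_path value shown here is an assumed example for the xpdf binaries; adjust to wherever your instance stores the indexed_search extension configuration.
// typo3conf/LocalConfiguration.php (excerpt), a minimal sketch of the relevant settings
return [
    'EXTENSIONS' => [
        'indexed_search' => [
            // '0' enables frontend indexing (the workaround); '1' disables it
            'disableFrontendIndexing' => '1',
            // external files such as PDFs should be queued by the crawler
            'useCrawlerForExternalFiles' => '1',
            // assumed path to the directory containing pdftotext.exe
            'pdftools_path' => 'C:\\xpdf-tools-win-4.05\\bin32\\',
        ],
    ],
];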
Steps to reproduce
Build the queue with the settings described (below), then check the queue in table tx_crawler_queue. No external files are present. A query sketch for verifying this follows.
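To make the check concrete, this is roughly how the queue can be inspected from within TYPO3. Spotting external-file entries by looking for ".pdf" in the parameters column is an assumption, and the exact execute method depends on the TYPO3/DBAL version; the same check can of course be done directly in HeidiSQL.
use TYPO3\CMS\Core\Database\ConnectionPool;
use TYPO3\CMS\Core\Utility\GeneralUtility;

// Count queue entries whose parameters mention a PDF (assumption: external-file
// entries carry the file reference in the 'parameters' column).
$queryBuilder = GeneralUtility::makeInstance(ConnectionPool::class)
    ->getQueryBuilderForTable('tx_crawler_queue');
$pdfEntries = $queryBuilder
    ->count('*')
    ->from('tx_crawler_queue')
    ->where(
        $queryBuilder->expr()->like(
            'parameters',
            $queryBuilder->createNamedParameter('%.pdf%')
        )
    )
    ->executeQuery()   // ->execute() on older TYPO3/DBAL versions
    ->fetchOne();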
Environment
Possible Solution
Unfortunately, the only workaround is to enable frontend indexing with
'disableFrontendIndexing' => '0'
and add all external files to the queue manually. Not really a working solution.
Additional context
Edit:
Enabling frontend indexing and browsing a page which contains external files does not immediately add those files to the queue. We found that before external files can be added to the queue, we first have to go to the indexing module in the backend and delete all previously queued content by clicking the trash icon at the top (we have to click the trash icon several times and make sure the whole page is no longer indexed). After that, reloading the page in the frontend adds the external files.
Edit2:
Building the queue multiple consecutive times from the console, the info module, or the scheduler, as other people have reported, does not help.
Edit3:
It might help to start with clean tables in the database. We have quite large indexing tables; unfortunately, until we mirror our current environment to a testing environment, truncating those tables is not an option.
Edit4:
We recently installed an HTTPS certificate for our site and had trouble building or processing anything. This was resolved by setting "Protocol for crawling" to "Force HTTPS" for all pages and setting the correct Base URL in the crawling configuration on PID 1.
Edit5:
This is how the file path backslashes are escaped on Windows (screenshot taken directly from the tx_crawler_queue table in HeidiSQL). Maybe there is a problem with all the backslash escaping that only occurs on Windows?
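For illustration, here is a small standalone PHP snippet (not crawler code) showing how a Windows path picks up doubled backslashes once it is JSON-encoded. Whether the crawler actually stores its queue parameters as JSON, and the example file name used below, are assumptions; only the docroot is taken from our command line above.
// Hypothetical queue parameters containing a Windows file path
$params = ['file' => 'C:\\httpd\\Apache24\\htdocs\\ourSite\\fileadmin\\example.pdf'];
echo json_encode($params);
// {"file":"C:\\httpd\\Apache24\\htdocs\\ourSite\\fileadmin\\example.pdf"}
// Every single backslash in the path is stored as "\\", which matches what the
// HeidiSQL view shows; decoding restores the original path:
echo json_decode(json_encode($params), true)['file'];
// C:\httpd\Apache24\htdocs\ourSite\fileadmin\example.pdf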
Edit6:
While processing PDFs (manually added) we often get the following output:
Edit7:
This can also occur:
Edit8:
Knowing what I know now, I have just indexed 1277 PDFs (2554, because they were queued twice, I don't know why), all searchable by content, going back all the way to the year 2016. It is very unfortunate that the crawler cannot do what can be done manually, which makes it unreliable and impractical to use.
We have quite a big site with almost daily edits, so it would be really great if we could figure out the problem.