
Crawler does not include external files (pdf) #1057

Open
zillion42 opened this issue Apr 3, 2024 · 4 comments
Labels
3rd party ext · TYPO3v10 · v11.x · v12.x

Comments

@zillion42

zillion42 commented Apr 3, 2024

Bug Report

Current Behavior
The crawler builds its queue with:
c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:buildQueue 1 reindex --depth=20 --mode=queue
but omits all external files. All other HTML content is queued, processed, and indexed just fine.
If we enable frontend indexing via 'disableFrontendIndexing' => '0' and browse a page containing PDFs (included via local file collections), the PDFs are added to the queue. We are using the xpdf-tools-win-4.05 32-bit binaries; pdftotext.exe is tested and works. PDFs added through frontend indexing can then be processed in the queue, and once the queue has been processed their content is found by the TYPO3 search.

Expected behavior/output
Building the queue should add external files, since 'useCrawlerForExternalFiles' => '1' is enabled.

Steps to reproduce
Build the queue with the settings described below, then check the tx_crawler_queue table: no external files are present.
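
For reference, a minimal check (a sketch, assuming the default crawler schema, where the queued URL or file reference ends up in the parameters column):

SELECT qid, page_id, parameters
FROM tx_crawler_queue
WHERE parameters LIKE '%.pdf%';

After crawler:buildQueue this returns no rows; only after a page is browsed with frontend indexing enabled do the PDF entries appear.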

Environment

  • Windows Server 2022 Standard 21H2 (VMware)
  • Crawler version(s): 11.0.7
  • TYPO3 version(s): 10.4.37
  • Is your TYPO3 installation set up with Composer (Composer Mode): no
'crawler' => [
    'cleanUpOldQueueEntries' => '1',
    'cleanUpProcessedAge' => '2',
    'cleanUpScheduledAge' => '7',
    'countInARun' => '1000',
    'crawlHiddenPages' => '0',
    'enableTimeslot' => '1',
    'frontendBasePath' => '/',
    'makeDirectRequests' => '1',
    'maxCompileUrls' => '10000',
    'phpBinary' => 'php',
    'phpPath' => 'C:/php/php.exe',
    'processDebug' => '0',
    'processLimit' => '20',
    'processMaxRunTime' => '1000',
    'processVerbose' => '0',
    'purgeQueueDays' => '14',
    'sleepAfterFinish' => '0',
    'sleepTime' => '0',
],
'indexed_search' => [
    'catdoc' => 'C:\\httpd\\Apache24\\bin\\catdoc',
    'debugMode' => '0',
    'disableFrontendIndexing' => '1',
    'enableMetaphoneSearch' => '1',
    'flagBitMask' => '192',
    'fullTextDataLength' => '0',
    'ignoreExtensions' => '',
    'indexExternalURLs' => '0',
    'maxAge' => '0',
    'maxExternalFiles' => '250',
    'minAge' => '0',
    'pdf_mode' => '20',
    'pdftools' => 'C:\\httpd\\Apache24\\bin\\pdf2txt',
    'ppthtml' => 'C:\\httpd\\Apache24\\bin\\catdoc',
    'trackIpInStatistic' => '2',
    'unrtf' => '',
    'unzip' => '',
    'useCrawlerForExternalFiles' => '1',
    'useMysqlFulltext' => '0',
    'xlhtml' => 'C:\\httpd\\Apache24\\bin\\catdoc',
],

Possible Solution
Unfortunately, the only workaround is to enable frontend indexing with 'disableFrontendIndexing' => '0' and add all external files to the queue manually by browsing the pages. That is not really a workable solution.

Additional context
Edit:
Enabling frontend indexing and browsing a page that contains external files does not immediately add those files to the queue. We found that before external files can be added, we first have to open the indexing module in the backend and delete all previously indexed content by clicking the trash icon at the top (the icon has to be clicked several times, until the whole page is no longer indexed). After that, reloading the page in the frontend adds the external files.
Edit2:
Building the queue multiple consecutive times from the console, the info module, or the scheduler, as other people have reported, does not help.
Edit3:
It might help to start with clean tables in the database. Our indexing tables are quite large, and until we mirror the current environment to a testing environment, truncating those tables is not an option.
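
For a disposable test copy, starting clean would mean truncating the indexed_search and crawler tables (a sketch, assuming the standard table names; never run this against the production database):

TRUNCATE TABLE index_phash;
TRUNCATE TABLE index_fulltext;
TRUNCATE TABLE index_rel;
TRUNCATE TABLE index_words;
TRUNCATE TABLE index_section;
TRUNCATE TABLE index_grlist;
TRUNCATE TABLE tx_crawler_queue;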
Edit4:
We recently installed an HTTPS certificate for our site and then had trouble building or processing anything. This was resolved by setting 'Protocol for crawling' to 'Force HTTPS for all pages' and entering the correct Base URL in the crawling configuration on PID 1.
Edit5:
This is how the file-path backslashes are escaped on Windows; the screenshot was taken directly from the tx_crawler_queue table in HeidiSQL. Maybe there is some problem with the backslash escaping that only occurs on Windows?
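
One possible explanation: if the parameters column is JSON-encoded (an assumption; recent crawler versions store it that way), doubled backslashes are ordinary JSON escaping rather than corruption. A hedged way to compare the raw and decoded values ('$.url' is an assumed key name):

SELECT qid, parameters,
       JSON_UNQUOTE(JSON_EXTRACT(parameters, '$.url')) AS decoded_url
FROM tx_crawler_queue
WHERE parameters LIKE '%.pdf%'
LIMIT 5;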
Edit6:
While processing PDFs (added manually), we often get the following output:

C:\Windows\system32>c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:processQueue --amount=1000
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Cannot load charset cp1251 - file not found
Unprocessed Items remaining:1570 (67c43d15a5)
3
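
The "Cannot load charset cp1251" message looks like it comes from catdoc (configured above for Word/Excel/PowerPoint files) failing to find its charset map files, rather than from pdftotext; that attribution is an assumption, though. Both tools can be tested by hand (the exact .exe paths below are guesses based on the configuration above; -enc and -d are the tools' real output-encoding flags):

C:\>C:\httpd\Apache24\bin\pdf2txt\pdftotext.exe -enc UTF-8 C:\temp\sample.pdf -
C:\>C:\httpd\Apache24\bin\catdoc\catdoc.exe -d utf-8 C:\temp\sample.doc

If catdoc prints the charset error on its own, its charset files are missing or not being found on Windows, which is independent of the PDF indexing.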

Edit7:
This can also occur:

C:\Windows\system32>c:\php\php.exe C:\httpd\Apache24\htdocs\ourSite\typo3\sysext\core\bin\typo3 crawler:processQueue --amount=1000
<warning>Doctrine\DBAL\Exception\UniqueConstraintViolationException: An exception occurred while executing 'INSERT INTO `index_words` (`wid`, `baseword`, `metaphone`) VALUES (?, ?, ?)' with params [207539845, "ziel\/e", "122391892"]:

Duplicate entry '207539845' for key 'PRIMARY'</warning>
Unprocessed Items remaining:1002 (5ed3e4899a)
5
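
Side note: the wid in index_words is derived from the word itself, so this warning can be inspected directly (a minimal sketch against the default indexed_search schema):

SELECT wid, baseword, metaphone
FROM index_words
WHERE wid = 207539845;

If the stored baseword differs from 'ziel/e', two different words hash to the same wid; if it is the same word, the entry was simply inserted twice.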

Edit8:
Knowing what I know now, I have just indexed 1277 PDFs (2554 queue entries, because each was queued twice; I don't know why), all searchable by content and going back to 2016. It is very unfortunate that the crawler cannot do what can be done manually; that makes it unreliable and impractical to use.

We have a quite big site with almost daily edits, so it would be really great if we could figure out the problem.

@tomasnorre
Owner

Thanks for reporting this. PDF issues are often a matter of configuration, so they are hard to debug.

Could you check if this is reproducible with the Crawler Devbox?
https://github.com/tomasnorre/crawler/blob/main/CONTRIBUTING.md#devbox

@tomasnorre added the TYPO3v10, 3rd party ext, and v11.x labels on Apr 3, 2024
@zillion42
Author

zillion42 commented Apr 3, 2024

Hi again,

I understand that time is valuable. As you have probably already noticed, we are doing this for our work. I very much doubt that trying to reproduce the issue in a Linux container with DDEV is going to help us. I also guess that sponsoring you with a one-time payment of 25 € is not going to cut it.

We are exploring a few other options, like ke_search.

If, and that's a big IF, we could pay you for your support, I see a few options:

  • We could pay you for support on our current system, which will surely be messy and hard to debug. You would have to sign a data processing agreement (Auftragsverarbeitungsvertrag).
  • We could pay you for support on a Windows VM with a dummy site, which would take some time to set up beforehand. Unfortunately, this would not reflect the complexity of the issue at hand: thousands of PDFs that have to be indexed reliably and incrementally on a daily basis.

It would be really great if there were another way to contact you other than via GitHub.

@tomasnorre
Owner

tomasnorre commented Apr 4, 2024

Hi @zillion42

I cannot take on work at the moment due to personal reasons, but you could ask in the #TYPO3 #Crawler chat on Slack; perhaps someone there can help you better. I can click the release button if a fix is provided, but I cannot do much more currently.

https://typo3.org/community/meet/chat-slack

You can also contact me via Slack, but expect a slow response time, for the same reasons as above.

@tomasnorre
Owner

Hi @zillion42

I know it's been a while, but I have now tested this in the Crawler devbox (ddev).

If I don't add other content to the page with the PDFs, it doesn't get indexed; there needs to be additional text on the page, not just a header and links.

Try it and see if that changes anything for you. The pages and PDFs are indexed correctly in my setup.
