Processor tesserocr-segment-line terminates with exception (TopologyException: Input geom 1 is invalid) #149
The error occurred with the standard workflow and urn:nbn:de:bsz:180-digad-8419. |
The error still occurs with revision 974459e. |
Since Tesseract only gives us bboxes here, the invalid polygon must be from the region. I need to know the exact workflow – what do you mean by standard workflow? Also, this might be another instance of "won't fix because PAGE coordinates must be correct on the input side" (we cannot make all processors robust to all sorts of coordinate invalidities/inconsistencies). So be prepared to wait for a fix in the page segmenter instead... |
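To check whether a region outline is the culprit before it reaches the line segmenter, one could test its point sequence for self-intersections. The following is only a minimal, dependency-free sketch: the helper names are hypothetical, it assumes the usual PAGE `points` syntax (`x1,y1 x2,y2 ...` with integer values), and it detects only edges that properly cross each other, which is one of several ways a polygon can be invalid.

```python
from itertools import combinations


def parse_points(points):
    """Parse a PAGE-style points string 'x1,y1 x2,y2 ...' into (x, y) tuples."""
    return [tuple(int(v) for v in pair.split(",")) for pair in points.split()]


def _orientation(a, b, c):
    """Sign of the cross product (b - a) x (c - a): 1, -1, or 0 if collinear."""
    cross = (b[0] - a[0]) * (c[1] - a[1]) - (b[1] - a[1]) * (c[0] - a[0])
    return (cross > 0) - (cross < 0)


def _properly_cross(p1, p2, q1, q2):
    """True if segments p1-p2 and q1-q2 cross at a point interior to both."""
    return (_orientation(p1, p2, q1) * _orientation(p1, p2, q2) < 0
            and _orientation(q1, q2, p1) * _orientation(q1, q2, p2) < 0)


def has_self_intersection(polygon):
    """Check every pair of edges of the closed polygon for a proper crossing.

    Touching edges and collinear overlaps are NOT reported by this sketch.
    """
    n = len(polygon)
    edges = [(polygon[i], polygon[(i + 1) % n]) for i in range(n)]
    return any(_properly_cross(a, b, c, d)
               for (a, b), (c, d) in combinations(edges, 2))


# A "bowtie" outline crosses itself; a plain square does not.
bowtie = parse_points("0,0 2,2 2,0 0,2")
square = parse_points("0,0 2,0 2,2 0,2")
print(has_self_intersection(bowtie), has_self_intersection(square))
```

The pairwise check is quadratic in the number of points, which should be acceptable for typical region outlines but is not meant as a replacement for a full validity test in a geometry library.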
"Standard" means one of the workflows suggested at https://ocr-d.de/en/workflows. I use this script:
|
Thanks @stweil for the neatly encapsulated script. Unfortunately though, I cannot reproduce the problem. Which versions of ocrd_anybaseocr, ocrd_cis and ocrd_segment have you been running? |
I used latest ocrd_all with ocrd_tesserocr updated to latest git release. |
A fresh run reproduced the problem ... All data is available here. |
I have tried again with (Dockerized) OCR-D/ocrd_all@dd35c37 (built at 2020-08-28T18:02:22Z) and ocrd_tesserocr 5761661 (that's your 974459e plus the release commit) – it runs smoothly. Perhaps it's an effect of differences between Ubuntu 18.04 (Docker, my host) and Debian (your host) in Shapely's base libraries? |
Can you compare the generated files on your side with my data (see link above) to see where they differ? |
I'll repeat the test as soon as @kba has finished a new |
The error still occurs. Tested with ocrd_all branch OCR-D/update-2020-09-07 on Debian buster. |
BTW, your script cannot have worked like that on the previous ocrd_all release (based on core 2.15), because that was not able to cope with OAI-PMH responses. And it does not work verbatim with the current version either, because you output to
Unfortunately, I have no permissions for your |
I am sorry. That's a known problem (see OCR-D/core#403). Access should work now. |
I have created |
@bertsky, I get the same error on another host with Debian bullseye and a local build of Python 3.7.9 for a different book using this script:
|
@stweil could you please repeat from |
Here is the result from a fresh run:
|
Does `import numpy as np` fix that? It could be that I was too thorough in cleaning up imports in the last round of refactoring... |
Sorry, I had forgotten to include that change in the commit. |
But the ordeal is not over yet: there is still one case I can see that can fail – when a polygon is invalid but simplification does not change anything, regardless of the tolerance level. (I have to detect that and re-order the point sequence...) |
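To illustrate what "re-ordering the point sequence" can mean, here is a minimal sketch that sorts the points by their angle around the mean of all vertices. This is a stand-in technique chosen for illustration, not necessarily what the actual fix does: the function name is hypothetical, and the approach changes the shape of concave outlines, so it is only a rough fallback.

```python
import math


def reorder_by_angle(points):
    """Return the points sorted counter-clockwise around their mean position.

    Assumption: used only as a last resort for an outline that stays invalid.
    Ties in angle are broken by distance from the mean.
    """
    cx = sum(x for x, _ in points) / len(points)
    cy = sum(y for _, y in points) / len(points)
    return sorted(
        points,
        key=lambda p: (math.atan2(p[1] - cy, p[0] - cx),
                       math.hypot(p[0] - cx, p[1] - cy)),
    )


# The self-crossing "bowtie" sequence becomes a plain square outline.
bowtie = [(0, 0), (2, 2), (2, 0), (0, 2)]
print(reorder_by_angle(bowtie))
```

Because the result is ordered by angle around a point inside the convex hull, neighbouring edges occupy separate angular sectors and do not cross, provided no two points share the same angle.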
6bbe873 should suffice. |
The workflow for PPN1024726142 now nearly passes. There is a new problem when creating the ALTO files, caused by a negative x coordinate. See issue #153 for more details. |
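One generic way to guard against out-of-page coordinates such as a negative x value is to clamp every point to the page bounds before writing it out. The sketch below only shows that idea; the function name and the page dimensions are hypothetical, and it is not claimed to be the fix adopted for issue #153.

```python
def clamp_points(points, width, height):
    """Clamp each (x, y) to the rectangle [0, width] x [0, height]."""
    return [(min(max(x, 0), width), min(max(y, 0), height))
            for x, y in points]


# A point left of the page and one below it are pulled back onto the page edge.
print(clamp_points([(-3, 10), (50, 120)], width=100, height=100))
```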
This issue was fixed in the latest code. |