Improve handling of LZW decoder table overflow #3032
Comments
Stumbled upon the same problem. I can't provide a test file for the usual reasons, but just like your test file, mine are produced by an Esko product, so maybe that's an early indicator.
It seems https://bugs.freedesktop.org/show_bug.cgi?id=103174 is connected -- apparently, there are files in the wild with LZW tables larger than the 4096 entries allowed by the standard. I'm not quite sure how I can confirm my files have larger tables, but I'm sure I'll figure it out. Either way, it seems there needs to be some support for that.
If you get the list assignment issue mentioned above, you most likely have a larger table; the relevant limit is stated in section 7.4.4.2 of the PDF 2.0 specification.
The patch from your linked report (https://bugs.freedesktop.org/attachment.cgi?id=134916&action=edit) basically skips adding table entries which would lead to an overflow. I briefly checked this when preparing the initial report above, but got some strange results. Further analysis might take some time.
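For illustration, a minimal sketch of the idea behind that patch, assuming a plain Python list as the decoding table (the names are mine, not the patch's or pypdf's):

```python
MAX_TABLE_SIZE = 4096  # entry limit for LZW tables in PDF

def add_table_entry(table, entry):
    # Only grow the table while it is below the limit; entries that
    # would overflow are silently dropped instead of raising an error.
    if len(table) < MAX_TABLE_SIZE:
        table.append(entry)
```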
My experience with LZW and compression algorithms in general does not go back further than the last hour, but isn't skipping table entries basically ignoring data, i.e. not decoding correctly? (Whatever "correctly" means in the case of a standard-violating encoding.)
It might depend on the actual algorithm. LZW basically creates a new table entry for every code word we have decoded. In the PDF case, the table should be reset after a maximum of 4096 entries. Whether there is data loss depends on the compressed stream itself - if there are code words > 4095 to be decompressed, throwing away the out-of-bounds reads would mean data loss (although from my experience, the probability of referencing the last table entries tends to be lower).
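To make those mechanics concrete, here is a minimal sketch of the table handling, operating on code words that have already been extracted from the bit stream; it is not pypdf's actual decoder, and all names are illustrative:

```python
CLEAR_CODE = 256       # resets the decoding table (PDF LZW)
EOD_CODE = 257         # end-of-data marker
MAX_TABLE_SIZE = 4096  # maximum table size per the specification

def fresh_table():
    # 256 single-byte entries plus placeholders for the two control codes
    return [bytes([i]) for i in range(256)] + [b"", b""]

def decode_codes(codes):
    table = fresh_table()
    previous = b""
    output = bytearray()
    for code in codes:
        if code == EOD_CODE:
            break
        if code == CLEAR_CODE:
            # A conforming encoder emits a clear code before the table
            # reaches 4096 entries; non-conforming streams may not.
            table = fresh_table()
            previous = b""
            continue
        if code < len(table):
            entry = table[code]
        elif code == len(table) and previous:
            # Standard LZW corner case: the code refers to the entry
            # that is about to be added.
            entry = previous + previous[:1]
        else:
            # An out-of-range code word is what surfaces as the
            # IndexError reported in this issue.
            raise IndexError(f"code {code} outside table of size {len(table)}")
        output.extend(entry)
        if previous:
            # Every decoded code word adds one new table entry; without
            # a bounds check the table can grow past 4096 entries.
            table.append(previous + entry[:1])
        previous = entry
    return bytes(output)
```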
Extracting the text of a given PDF file indicates that the LZW decoding table would overflow by raising an IndexError. Check if there is something we can do about this, or at least report a proper pypdf-specific exception.

Environment
Which environment were you using when you encountered the problem?
Code + PDF
This is a minimal, complete example that shows the issue. An example file is available here; I apparently do not own any rights on this file.
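The reproduction presumably boils down to plain text extraction with pypdf; a minimal sketch, assuming the example file is saved locally as example.pdf (the filename is illustrative, not the one from the report):

```python
from pypdf import PdfReader

reader = PdfReader("example.pdf")  # hypothetical local copy of the example file
for page in reader.pages:
    # Text extraction decodes the LZW-compressed content streams and
    # runs into the table overflow described in this issue.
    print(page.extract_text())
```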
Traceback
This is the complete traceback I see:
Using another PDF file, which I cannot link here, additionally produces a logger_warning message (the leading whitespace seems to be a typo in the warning message itself). It is unclear what the actual issue is, because the actual exception message is omitted in pypdf/pypdf/_page.py, lines 2150 to 2154 at c6dcdc6.
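A hedged sketch of the kind of diagnostic improvement being asked for here: include the caught exception in the warning text so the underlying failure is visible (raising a pypdf-specific exception such as pypdf.errors.PdfReadError would be an alternative, as suggested in the issue description). The code below is illustrative only and does not reproduce the actual lines referenced above:

```python
import logging

logger = logging.getLogger(__name__)

def decode_stream_with_diagnostics(decode, data):
    # Illustrative wrapper, not pypdf's actual extraction code.
    try:
        return decode(data)
    except IndexError as exc:
        # Including the caught exception in the message tells the user
        # what actually went wrong instead of emitting a bare warning.
        logger.warning("Skipping content stream, LZW decoding failed: %s", exc)
        return b""
```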