
Improve handling of LZW decoder table overflow #3032

Open
stefan6419846 opened this issue Jan 8, 2025 · 5 comments
Labels
is-robustness-issue From a users perspective, this is about robustness workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@stefan6419846
Collaborator

Extracting the text of a given PDF file raises an IndexError, indicating that the LZW decoding table would overflow. Check whether there is something we can do about this, or at least report a proper pypdf-specific exception.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.11.0-108013-tuxedo-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


pdf_file = 'a71cf4dab6840030878d668ae37a9edb10522aec.pdf'
with PdfReader(pdf_file) as reader:
    for index, page in enumerate(reader.pages, start=1):
        page.extract_text()
        list(page.images.items())

An example file is available here. I apparently do not own any rights to this file.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2136, in _extract_text
    text = self.extract_xform_text(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2430, in extract_xform_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 1882, in _extract_text
    content = ContentStream(content, pdf, "bytes")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1184, in __init__
    stream_data = stream.get_data()
                  ^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1111, in get_data
    decoded.set_data(decode_stream_data(self))
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 636, in decode_stream_data
    data = LZWDecode._decodeb(data, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 402, in _decodeb
    return LZWDecode.Decoder(data).decode()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 382, in decode
    return _LzwCodec().decode(self.data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 237, in decode
    self._add_entry_decode(self.decoding_table[old_code], string[0])
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 253, in _add_entry_decode
    self.decoding_table[self._table_index] = new_string
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
IndexError: list assignment index out of range

With another PDF file, which I cannot link here, pypdf additionally uses logger_warning to report the following (the leading whitespace appears to be a typo in the warning message itself):

 impossible to decode XFormObject /Fm2

It is unclear what the actual issue is, because the actual exception message is omitted in pypdf/pypdf/_page.py, lines 2150 to 2154 in c6dcdc6:

except Exception:
    logger_warning(
        f" impossible to decode XFormObject {operands[0]}",
        __name__,
    )

Further analysis shows that it is indeed this LZW issue.
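To make such failures diagnosable, the caught exception could be included in the warning text. A minimal sketch of that idea follows; `logger_warning`, `extract_xform_text`, and `operands` only mirror the names used in pypdf's `_page.py`, and this is an illustration, not the actual pypdf implementation:

```python
warnings = []


def logger_warning(msg: str, src: str) -> None:
    # Stand-in for pypdf's logger_warning helper; records messages for inspection.
    warnings.append(msg)


def extract_xform_text(operands) -> None:
    try:
        # Simulated LZW failure, matching the traceback above.
        raise IndexError("list assignment index out of range")
    except Exception as exception:
        # Forward the exception message so the root cause is visible to the user.
        logger_warning(
            f"Impossible to decode XFormObject {operands[0]}: {exception}",
            __name__,
        )


extract_xform_text(["/Fm2"])
```

With this change, the warning would read "Impossible to decode XFormObject /Fm2: list assignment index out of range" instead of silently dropping the cause.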

@stefan6419846 stefan6419846 added workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow is-robustness-issue From a users perspective, this is about robustness labels Jan 8, 2025
@rassie

rassie commented Jan 13, 2025

Stumbled upon the same problem. I can't provide a test file for the usual reasons, but just like your test file, mine are produced by an Esko product, so maybe that's an early indicator. qpdf cannot decode the file either (error decoding stream data for object 78 0: LZWDecoder: table full). RUPS and pdfbox-debugger do not produce an error, and neither do Evince and Ghostscript; however, I can't really say whether their output is correct. In particular, Ghostscript can rewrite the file (with the pdfwrite device) without problems.

@rassie

rassie commented Jan 13, 2025

It seems https://bugs.freedesktop.org/show_bug.cgi?id=103174 is related: apparently, there are files in the wild with LZW tables larger than the 4096 entries allowed by the standard. I'm not quite sure how I can confirm that my files have larger tables, but I'm sure I'll figure it out. Either way, it seems there needs to be some support for that.

@stefan6419846
Collaborator Author

If you get the list assignment issue mentioned above, you most likely have a larger table. The PDF 2.0 specification states in section 7.4.4.2:

Codes shall never be longer than 12 bits; therefore, entry 4095 is the last entry of the LZW table.

The patch from your linked report (https://bugs.freedesktop.org/attachment.cgi?id=134916&action=edit) basically skips adding table entries which would lead to an overflow. I briefly checked this when preparing the initial report above, but got some strange results. Further analysis might take some time.
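For reference, the skip-on-overflow approach from that patch could be sketched like this (hypothetical names and structure; pypdf's actual `_add_entry_decode` differs):

```python
MAX_TABLE_SIZE = 4096  # 12-bit codes: entry 4095 is the last legal slot


def add_entry(table: list, new_string: bytes, table_index: int) -> int:
    # Instead of raising IndexError on a full table, silently skip the
    # addition, as the freedesktop patch does, and return the (possibly
    # unchanged) next free index.
    if table_index < MAX_TABLE_SIZE:
        table[table_index] = new_string
        table_index += 1
    return table_index
```

This avoids the crash, but as discussed below, it may lose data if the stream later references codes beyond the table boundary.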

@rassie

rassie commented Jan 13, 2025

the patch from your linked report, basically skips adding table entries which would lead to an overflow.

My experience with LZW and compression algorithms in general does not reach farther than the last hour, but isn't skipping table entries basically ignoring data, i.e. not decoding correctly? (whatever "correctly" means in case of standard-violating encoding)

@stefan6419846
Collaborator Author

It might depend on the actual algorithm. LZW basically creates a new table entry for every code word that has been decoded. In the PDF case, the table should be reset after a maximum of 4096 entries. Whether there is data loss depends on the compressed stream itself: if there are code words > 4095 to be decompressed, throwing away the out-of-bounds entries would mean data loss (although in my experience, the probability of referencing the last table entries tends to be lower).
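The table lifecycle described above can be sketched as follows. This is a hypothetical simplification operating on a list of already-unpacked integer codes; real PDF streams pack variable-width (9 to 12 bit) codes into a bitstream, and the names here are illustrative, not pypdf's:

```python
CLEAR = 256  # PDF LZW clear-table marker
EOD = 257    # end-of-data marker


def lzw_decode_codes(codes):
    """Decode a sequence of already-unpacked LZW codes."""
    def fresh_table():
        # Single-byte entries 0..255; slots 256/257 are reserved markers.
        return [bytes([i]) for i in range(256)] + [b"", b""]

    table = fresh_table()
    result = bytearray()
    previous = None
    for code in codes:
        if code == CLEAR:
            table = fresh_table()
            previous = None
            continue
        if code == EOD:
            break
        if code < len(table):
            entry = table[code]
        elif code == len(table) and previous is not None:
            entry = previous + previous[:1]  # the KwKwK special case
        else:
            raise ValueError(f"invalid LZW code {code}")
        result += entry
        if previous is not None:
            if len(table) < 4096:  # entry 4095 is the last legal slot
                table.append(previous + entry[:1])
            # else: table full; a spec-compliant encoder must emit CLEAR here
        previous = entry
    return bytes(result)
```

The branch at the end is the crux of this issue: pypdf currently assumes the encoder always emits CLEAR before the table fills, so a non-conforming stream overruns the table and triggers the IndexError.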
