
Improve handling of LZW decoder table overflow #3032

Open
stefan6419846 opened this issue Jan 8, 2025 · 5 comments
Labels
is-robustness-issue From a users perspective, this is about robustness workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow

Comments

@stefan6419846
Collaborator

Extracting the text of a given PDF file raises an IndexError, indicating that the LZW decoding table would overflow. Check whether there is something we can do about this, or at least report a proper pypdf-specific exception.

Environment

Which environment were you using when you encountered the problem?

$ python -m platform
Linux-6.11.0-108013-tuxedo-x86_64-with-glibc2.39

$ python -c "import pypdf;print(pypdf._debug_versions)"
pypdf==5.1.0, crypt_provider=('cryptography', '44.0.0'), PIL=11.0.0

Code + PDF

This is a minimal, complete example that shows the issue:

from pypdf import PdfReader


pdf_file = 'a71cf4dab6840030878d668ae37a9edb10522aec.pdf'
with PdfReader(pdf_file) as reader:
    for index, page in enumerate(reader.pages, start=1):
        page.extract_text()
        list(page.images.items())

An example file is available here. I apparently do not own any rights to this file.

Traceback

This is the complete traceback I see:

Traceback (most recent call last):
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2136, in _extract_text
    text = self.extract_xform_text(
           ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 2430, in extract_xform_text
    return self._extract_text(
           ^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_page.py", line 1882, in _extract_text
    content = ContentStream(content, pdf, "bytes")
              ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1184, in __init__
    stream_data = stream.get_data()
                  ^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/generic/_data_structures.py", line 1111, in get_data
    decoded.set_data(decode_stream_data(self))
                     ^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 636, in decode_stream_data
    data = LZWDecode._decodeb(data, params)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 402, in _decodeb
    return LZWDecode.Decoder(data).decode()
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/filters.py", line 382, in decode
    return _LzwCodec().decode(self.data)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 237, in decode
    self._add_entry_decode(self.decoding_table[old_code], string[0])
  File "/tmp/venv/lib/python3.12/site-packages/pypdf/_codecs/_codecs.py", line 253, in _add_entry_decode
    self.decoding_table[self._table_index] = new_string
    ~~~~~~~~~~~~~~~~~~~^^^^^^^^^^^^^^^^^^^
IndexError: list assignment index out of range

With another PDF file, which I cannot link here, pypdf additionally uses logger_warning to report the following (the leading whitespace appears to be a typo in the warning message itself):

 impossible to decode XFormObject /Fm2

It is unclear what the actual issue is, because the actual exception message is omitted in pypdf/pypdf/_page.py, lines 2150 to 2154 in c6dcdc6:

except Exception:
    logger_warning(
        f" impossible to decode XFormObject {operands[0]}",
        __name__,
    )

Further analysis shows that it is indeed this LZW issue.
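To make such failures diagnosable, the caught exception could be included in the warning text. A minimal sketch of that idea follows; `logger_warning`, `extract_xform_text`, and `operands` only mirror the names used in pypdf's `_page.py`, and this is an illustration, not the actual pypdf implementation:

```python
warnings = []


def logger_warning(msg: str, src: str) -> None:
    # Stand-in for pypdf's logger_warning helper; records messages for inspection.
    warnings.append(msg)


def extract_xform_text(operands) -> None:
    try:
        # Simulated LZW failure, matching the traceback above.
        raise IndexError("list assignment index out of range")
    except Exception as exception:
        # Forward the exception message so the root cause is visible to the user.
        logger_warning(
            f"Impossible to decode XFormObject {operands[0]}: {exception}",
            __name__,
        )


extract_xform_text(["/Fm2"])
```

With this change, the warning would read "Impossible to decode XFormObject /Fm2: list assignment index out of range" instead of silently dropping the cause.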

@stefan6419846 stefan6419846 added workflow-text-extraction From a users perspective, text extraction is the affected feature/workflow is-robustness-issue From a users perspective, this is about robustness labels Jan 8, 2025
@rassie

rassie commented Jan 13, 2025

Stumbled upon the same problem. I can't provide a test file for the usual reasons, but just like your test file, mine are produced by an Esko product, so maybe that's an early indicator. qpdf cannot decode the file either (error decoding stream data for object 78 0: LZWDecoder: table full). RUPS and pdfbox-debugger do not produce an error, and neither do Evince and Ghostscript; however, I can't really say whether their output is correct. In particular, Ghostscript can rewrite the file (with the pdfwrite device) without problems.

@rassie

rassie commented Jan 13, 2025

It seems https://bugs.freedesktop.org/show_bug.cgi?id=103174 is related: apparently, there are files in the wild with LZW tables larger than the 4096 entries allowed by the standard. I'm not quite sure how I can confirm that my files have larger tables, but I'm sure I'll figure it out. Either way, it seems there needs to be some support for that.

@stefan6419846
Collaborator Author

If you get the list assignment issue mentioned above, you most likely have a larger table. The PDF 2.0 specification states in section 7.4.4.2:

Codes shall never be longer than 12 bits; therefore, entry 4095 is the last entry of the LZW table.

The patch from your linked report (https://bugs.freedesktop.org/attachment.cgi?id=134916&action=edit) basically skips adding table entries which would lead to an overflow. I briefly checked this when preparing the initial report above, but got some strange results. Further analysis might take some time.
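For reference, the skip-on-overflow approach from that patch could be sketched like this (hypothetical names and structure; pypdf's actual `_add_entry_decode` differs):

```python
MAX_TABLE_SIZE = 4096  # 12-bit codes: entry 4095 is the last legal slot


def add_entry(table: list, new_string: bytes, table_index: int) -> int:
    # Instead of raising IndexError on a full table, silently skip the
    # addition, as the freedesktop patch does, and return the (possibly
    # unchanged) next free index.
    if table_index < MAX_TABLE_SIZE:
        table[table_index] = new_string
        table_index += 1
    return table_index
```

This avoids the crash, but as discussed below, it may lose data if the stream later references codes beyond the table boundary.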

@rassie

rassie commented Jan 13, 2025

the patch from your linked report, basically skips adding table entries which would lead to an overflow.

My experience with LZW and compression algorithms in general does not reach farther than the last hour, but isn't skipping table entries basically ignoring data, i.e. not decoding correctly? (whatever "correctly" means in case of standard-violating encoding)

@stefan6419846
Collaborator Author

It might depend on the actual algorithm. LZW basically creates a new table entry for every code word that has been decoded. In the PDF case, the table should be reset after a maximum of 4096 entries. Whether there is data loss depends on the compressed stream itself: if there are code words > 4095 to be decompressed, throwing away the out-of-bounds entries would mean data loss (although in my experience, the probability of referencing the last table entries tends to be lower).
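The table lifecycle described above can be sketched as follows. This is a hypothetical simplification operating on a list of already-unpacked integer codes; real PDF streams pack variable-width (9 to 12 bit) codes into a bitstream, and the names here are illustrative, not pypdf's:

```python
CLEAR = 256  # PDF LZW clear-table marker
EOD = 257    # end-of-data marker


def lzw_decode_codes(codes):
    """Decode a sequence of already-unpacked LZW codes."""
    def fresh_table():
        # Single-byte entries 0..255; slots 256/257 are reserved markers.
        return [bytes([i]) for i in range(256)] + [b"", b""]

    table = fresh_table()
    result = bytearray()
    previous = None
    for code in codes:
        if code == CLEAR:
            table = fresh_table()
            previous = None
            continue
        if code == EOD:
            break
        if code < len(table):
            entry = table[code]
        elif code == len(table) and previous is not None:
            entry = previous + previous[:1]  # the KwKwK special case
        else:
            raise ValueError(f"invalid LZW code {code}")
        result += entry
        if previous is not None:
            if len(table) < 4096:  # entry 4095 is the last legal slot
                table.append(previous + entry[:1])
            # else: table full; a spec-compliant encoder must emit CLEAR here
        previous = entry
    return bytes(result)
```

The branch at the end is the crux of this issue: pypdf currently assumes the encoder always emits CLEAR before the table fills, so a non-conforming stream overruns the table and triggers the IndexError.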
