PSSyntax error when trying to extract text from a specific pdf #1080

LB207 · 2025-02-17T20:01:59Z

Bug report

A description of the bug

When I tried to extract page 69 from this Lloyds PDF document I received a PSSyntaxError. After investigating, I discovered that it is due to the #885 bug fix. By passing all keywords that end the code stream, we end up with some true and false booleans being split in two (see the dictionary in the error below).

Steps to reproduce the bug.

pdf: https://www.lloydsbankinggroup.com/assets/pdfs/investors/financial-performance/lloyds-banking-group-plc/2023/q4/2023-lbg-annual-report.pdf

>>> from pdfminer.high_level import extract_text
>>> text = extract_text("2023-lbg-annual-report.pdf", page_numbers=[69])
>>> print(text)

If relevant, include the output and/or error stacktrace.

The error message is very long, but this is the final line:

PSSyntaxError: Invalid dictionary construct: [/'CS', <PDFObjRef:113318>, /'I', False, /'K', /b'tr', /b'ue', /'S', /'Transparency', /'Type', /'Group']
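To illustrate the kind of split seen in the error above (`true` arriving as `tr` and `ue`), here is a minimal toy sketch. It is not pdfminer's parser: the function names `tokenize_naive` and `tokenize_carry`, the chunk size, and the sample bytes are all hypothetical, and I am only assuming that the real problem is a keyword being emitted before it is complete.

```python
# Toy model only: this is NOT pdfminer's implementation.
# It tokenizes whitespace-separated keywords from bytes read in fixed-size chunks.


def tokenize_naive(data: bytes, chunk_size: int) -> list:
    """Tokenize each chunk on its own, so a keyword that straddles
    a chunk boundary is emitted as two separate tokens."""
    tokens = []
    for start in range(0, len(data), chunk_size):
        tokens.extend(data[start:start + chunk_size].split())
    return tokens


def tokenize_carry(data: bytes, chunk_size: int) -> list:
    """Hold back a possibly unfinished trailing token and prepend it
    to the next chunk before emitting it."""
    tokens = []
    pending = b""
    for start in range(0, len(data), chunk_size):
        buf = pending + data[start:start + chunk_size]
        parts = buf.split()
        # If the buffer does not end in whitespace, the last token may be incomplete.
        if parts and not buf[-1:].isspace():
            pending = parts.pop()
        else:
            pending = b""
        tokens.extend(parts)
    if pending:
        tokens.append(pending)  # flush whatever is left at end of input
    return tokens


sample = b"/I false /K true"  # hypothetical fragment resembling the dictionary above
print(tokenize_naive(sample, 14))  # [b'/I', b'false', b'/K', b'tr', b'ue']
print(tokenize_carry(sample, 14))  # [b'/I', b'false', b'/K', b'true']
```

With a chunk boundary falling inside `true`, the naive version yields an odd number of dictionary items, which is consistent with an "Invalid dictionary construct" error; the carry-over version keeps the keyword whole.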

If you need any more information, please feel free to contact me,
LB207
