Won't open some PDFs #15

marco-zanon · 2021-04-18T15:53:35Z

While the package opens most PDF files normally, it fails on some of them and panics with the error "panic: malformed PDF: reading at offset 0: stream not present".

For example, the file "SP 10-2019 Relatório Analítico de Composições de Custos.pdf" (available at the URL "https://www.gov.br/dnit/pt-br/assuntos/planejamento-e-pesquisa/custos-e-pagamentos/custos-e-pagamentos-dnit/sistemas-de-custos/sicro/sudeste/espirito-santo/2019/outubro-1/es-outubro-2019.zip", after extracting the zip file) won't open with the "github.com/ledongthuc/pdf" package, but opens normally in regular PDF readers such as Adobe Reader.

FWIW, the entire error message that I get while trying to open the file is:

panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
github.com/ledongthuc/pdf.(*buffer).errorf(...)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:82
github.com/ledongthuc/pdf.(*buffer).reload(0xc04c7db790, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:95 +0x1fe
github.com/ledongthuc/pdf.(*buffer).readByte(0xc04c7db790, 0xc0003ff9d0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:71 +0x67
github.com/ledongthuc/pdf.(*buffer).readToken(0xc04c7db790, 0xc0732d6260, 0x1000)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:135 +0x47
github.com/ledongthuc/pdf.Interpret(0x0, 0x0, 0x0, 0x0, 0xc04c7db930)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/ps.go:64 +0x1ae
github.com/ledongthuc/pdf.Page.Content(0xc04f7395c0, 0x48, 0x4dad60, 0xc073356000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/page.go:816 +0x2db
main.extraiPDFAnalitico(0x539921, 0x49, 0x0)
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/pdf_analitico.go:50 +0x165
main.main()
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/main.go:18 +0xa8
exit status 2
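
Since the failure surfaces as a panic inside Page.Content rather than as a returned error, a caller can at least stop it from killing the whole process. The following is a minimal sketch, not a fix for the parser: the helper name safely is hypothetical, and the panicking closure is only a stand-in for the real call to the package.

```go
package main

import "fmt"

// safely runs fn and converts a panic raised inside it into an ordinary error.
// The name is hypothetical; it is not part of github.com/ledongthuc/pdf.
func safely(fn func()) (err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("recovered: %v", r)
		}
	}()
	fn()
	return nil
}

func main() {
	// Stand-in for the failing call; in real code the closure would call
	// the page's Content method instead of panicking directly.
	err := safely(func() {
		panic("malformed PDF: reading at offset 0: stream not present")
	})
	if err != nil {
		fmt.Println("skipping page:", err)
	}
}
```

With this in place the program can log the offending page or file and move on to the next one, although the affected content is still not extracted.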

mikaello · 2022-10-05T05:30:26Z

A workaround is to clean the PDF with mupdf before using this package:

$ sudo apt-get install mupdf-tools

$ mutool clean -s dirty.pdf clean.pdf

clean.pdf should now work with this package.

Credit @YspCoder
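
If the cleaning step has to run from a Go program, the mutool command above can probably be invoked through os/exec before the file is handed to this package. This is only a sketch under assumptions: mutool must be installed and on the PATH, and the helper name cleanCommand and the usage string are hypothetical.

```go
package main

import (
	"fmt"
	"os"
	"os/exec"
)

// cleanCommand builds the same invocation as the shell workaround:
// mutool clean -s <in> <out>. The helper name is hypothetical.
func cleanCommand(in, out string) *exec.Cmd {
	return exec.Command("mutool", "clean", "-s", in, out)
}

func main() {
	if len(os.Args) != 3 {
		fmt.Println("usage: pdfclean dirty.pdf clean.pdf")
		return
	}
	cmd := cleanCommand(os.Args[1], os.Args[2])
	// CombinedOutput runs the command and captures stdout and stderr together.
	if out, err := cmd.CombinedOutput(); err != nil {
		fmt.Fprintf(os.Stderr, "mutool failed: %v\n%s", err, out)
		os.Exit(1)
	}
	fmt.Println("wrote", os.Args[2])
}
```

The cleaned output file can then be opened with this package in place of the original.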

Another option is to use a Go wrapper around MuPDF to extract the text from the PDF: https://github.com/gen2brain/go-fitz

mikaello mentioned this issue Oct 5, 2022

parse pdf error #24

Open

oxisto mentioned this issue Oct 4, 2023

Support single-item array in page Contents key #32

Closed

Vanclief mentioned this issue Nov 5, 2023

PDF Loader fails with some PDFs tmc/langchaingo#348

Open

romanpickl mentioned this issue Mar 12, 2024

support array of content streams and parse them as a single stream #36

Open
