Won't open some PDFs #15

Open
marco-zanon opened this issue Apr 18, 2021 · 1 comment

@marco-zanon

The package opens most PDF files normally, but it fails on some of them, panicking with a "panic: malformed PDF: reading at offset 0: stream not present" error.

For example, the file "SP 10-2019 Relatório Analítico de Composições de Custos.pdf" (which you can get by extracting the zip file at "https://www.gov.br/dnit/pt-br/assuntos/planejamento-e-pesquisa/custos-e-pagamentos/custos-e-pagamentos-dnit/sistemas-de-custos/sicro/sudeste/espirito-santo/2019/outubro-1/es-outubro-2019.zip") won't open with your "github.com/ledongthuc/pdf" package, but it opens normally in any PDF reader (Adobe Reader, for instance).

FWIW, the entire error message that I get while trying to open the file is:

panic: malformed PDF: reading at offset 0: stream not present

goroutine 1 [running]:
github.com/ledongthuc/pdf.(*buffer).errorf(...)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:82
github.com/ledongthuc/pdf.(*buffer).reload(0xc04c7db790, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:95 +0x1fe
github.com/ledongthuc/pdf.(*buffer).readByte(0xc04c7db790, 0xc0003ff9d0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:71 +0x67
github.com/ledongthuc/pdf.(*buffer).readToken(0xc04c7db790, 0xc0732d6260, 0x1000)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/lex.go:135 +0x47
github.com/ledongthuc/pdf.Interpret(0x0, 0x0, 0x0, 0x0, 0xc04c7db930)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/ps.go:64 +0x1ae
github.com/ledongthuc/pdf.Page.Content(0xc04f7395c0, 0x48, 0x4dad60, 0xc073356000, 0x0, 0x0, 0x0, 0x0, 0x0, 0x0)
C:/Users/mazrodrigues/go/pkg/mod/github.com/ledongthuc/[email protected]/page.go:816 +0x2db
main.extraiPDFAnalitico(0x539921, 0x49, 0x0)
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/pdf_analitico.go:50 +0x165
main.main()
C:/Users/mazrodrigues/Documents/04_Golang/04_Faro/02_Tabela_Referencia/01_SICRO/04_trata_revisionais/v02/main.go:18 +0xa8
exit status 2
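
Until the underlying parsing problem is fixed, a caller can at least keep the program alive by turning the panic into an ordinary error. The following is a minimal sketch using only the standard library, not something the package itself provides: safeExtract is a hypothetical helper name, and the closure passed to it stands in for the call that panics (Page.Content() in the trace above).

```go
package main

import (
	"fmt"
)

// safeExtract (hypothetical helper) runs extract and converts any panic it
// raises into an error, so one bad file does not abort the whole program.
// In real use, extract would be a closure around the panicking call,
// such as Page.Content() from the stack trace.
func safeExtract(extract func() string) (text string, err error) {
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("pdf extraction panicked: %v", r)
		}
	}()
	return extract(), nil
}

func main() {
	// Stand-in for a call into the PDF package that panics on a bad file.
	bad := func() string {
		panic("malformed PDF: reading at offset 0: stream not present")
	}
	if _, err := safeExtract(bad); err != nil {
		fmt.Println("skipping file:", err)
	}
}
```

This only contains the failure; the affected file still yields no text, so it would need to be handled some other way.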

@mikaello

mikaello commented Oct 5, 2022

A workaround is to clean the PDF with MuPDF's mutool before using this package:

$ sudo apt-get install mupdf-tools

$ mutool clean -s dirty.pdf clean.pdf

clean.pdf should now work with this package.
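
If the cleaning step has to happen inside a Go program, the same command can be run through os/exec. This is a rough sketch, assuming mutool is installed and on the PATH; cleanCmd is a hypothetical helper name, and dirty.pdf and clean.pdf are the placeholder file names from the command above.

```go
package main

import (
	"fmt"
	"os/exec"
)

// cleanCmd (hypothetical helper) builds the same command as the shell
// workaround: mutool clean -s <dirty> <clean>
func cleanCmd(dirty, clean string) *exec.Cmd {
	return exec.Command("mutool", "clean", "-s", dirty, clean)
}

func main() {
	// Assumption: mutool comes from the mupdf-tools package and is on PATH.
	if _, err := exec.LookPath("mutool"); err != nil {
		fmt.Println("mutool not found; install mupdf-tools first:", err)
		return
	}
	// Placeholder file names; replace with real paths.
	out, err := cleanCmd("dirty.pdf", "clean.pdf").CombinedOutput()
	if err != nil {
		fmt.Printf("mutool clean failed: %v\n%s", err, out)
		return
	}
	fmt.Println("wrote clean.pdf")
}
```

The cleaned file can then be passed to this package in place of the original.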

Credit @YspCoder

Another option is to use a Go wrapper around MuPDF to extract the text from the PDF: https://github.com/gen2brain/go-fitz
