Skip to content
This repository has been archived by the owner on Jul 7, 2020. It is now read-only.

Support DCTDecodeFilter, which is no-op, actually #15

Open
wants to merge 4 commits into
base: master
Choose a base branch
from

Conversation

fawick
Copy link

@fawick fawick commented Jul 9, 2017

DCTDecode means that the blob is a raw jpeg which can be read right away.

cf. https://blog.idrsolutions.com/2011/07/extract-raw-jpeg-images-from-a-pdf-file/

@mpl
Copy link

mpl commented Aug 1, 2017

I'd suggest adding a tiny PDF in a testdata dir, with such an embedded image, plus a decoding test, but I see the lib does not have any tests at all, so maybe that's not wanted...

@@ -839,6 +839,8 @@ func applyFilter(rd io.Reader, name string, param Value) io.Reader {
case 12:
return &pngUpReader{r: zr, hist: make([]byte, 1+columns), tmp: make([]byte, 1+columns)}
}
case "DCTDecode":
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

// DCTDecode indicates that the Image XObject data is a full JPEG encoded image, so we return the original reader as is, and leave it up to the caller to decode the image.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea. thanks. Done.

@mpl
Copy link

mpl commented Aug 1, 2017

well, @bradfitz suggested a test too, so maybe that change can also be the one that introduces the first test.

@mpl
Copy link

mpl commented Aug 1, 2017

@josharian can we humbly ping you to have a look at this PR please?

@josharian
Copy link
Collaborator

Tests are definitely welcome. I looked at the change, and it seems fine to me, but I really know ~0 about PDFs, and I'm not at a point now where I want to learn enough to take ownership of this repo. Sorry.

Comment was suggested and worded by @mpl.
@fawick
Copy link
Author

fawick commented Aug 2, 2017

Okay, I will supply a test tonight.

@bradfitz, @mpl Given all the open PRs, is it worth considering to fork and maintain this package under go4.org?


func TestReaderExtractXObjectDCTDecode(t *testing.T) {
const (
testscan = "testdata/testscan.pdf"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

50K seems a bit large for such a basic test imho. It does not have to be a real scan coming from your scanner, as long as it is structured the same. Is it difficult to forge a pdf with a an image of less than 1K in it?
Examples of tiny images: camlistore.org/pkg/images/testdata

if err != nil {
t.Fatalf("could not decode embedded JPEG: %v", err)
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we should probably check that the decoded image is what we expect. we'd usually check it pixel by pixel but that might be overkill here, so it's probably ok to e.g. check the hashsum of its bytes against a hardcoded hashsum of what we know to be the initial image.

@mpl
Copy link

mpl commented Aug 11, 2017

@fawick to answer your question about forking: given that nobody seems to be willing to own that repo, then yes, it probably will come to this. I doubt it'll be on go4.org though. As far as Camlistore is concerned, simply vendoring github.com/fawick/pdf (which is in effect similar to what you were proposing initially) is probably the way to go.

Let's finish the review on here first though.

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants