Hex (encoded images) in \pict control groups is not removed. #24

Open
jtkiley opened this issue Aug 8, 2014 · 5 comments · May be fixed by #25

Comments

@jtkiley

jtkiley commented Aug 8, 2014

When I read an RTF and write it out as plain text (both with pyth), all of the hex for embedded images is included in the document. As expected, the \pict control group itself is gone.

At the moment, I'm preprocessing these files to wipe out the pict group (hex included) before using pyth, but, of course, it would be nice to avoid that. I'm not familiar enough with RTF versions to know if this is part of the 1.5 spec or a later one. However, these files run perfectly otherwise.
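For anyone needing the same workaround, a preprocessing pass along these lines might do the job. This is only a minimal sketch, not the script used here and not part of pyth: `strip_pict_groups` and its helpers are hypothetical names, and it assumes the image data is hex text inside a balanced `{\pict ...}` group (raw `\bin` data is not handled).

```python
def _starts_pict(rtf, pos):
    """True if the control word at pos is exactly \\pict (not a longer word)."""
    end = pos + len("\\pict")
    return rtf.startswith("\\pict", pos) and not rtf[end:end + 1].isalpha()


def _skip_group(rtf, start):
    """Given the index of an opening brace, return the index just past its match."""
    depth = 0
    i = start
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\":
            i += 2  # skip escaped characters such as \{ \} \\
            continue
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return i + 1
        i += 1
    return len(rtf)  # unbalanced input: drop the rest


def strip_pict_groups(rtf):
    r"""Return rtf with every {\pict ...} group removed, hex data included."""
    out = []
    i = 0
    while i < len(rtf):
        ch = rtf[i]
        if ch == "\\" and i + 1 < len(rtf):
            out.append(rtf[i:i + 2])  # copy escapes verbatim so \{ is not a group
            i += 2
        elif ch == "{" and _starts_pict(rtf, i + 1):
            i = _skip_group(rtf, i)  # jump over the whole picture group
        else:
            out.append(ch)
            i += 1
    return "".join(out)


sample = r"{\rtf1 Hello {\pict\pngblip 89504e47{\*\blipuid abc}0d0a} world\{x\}}"
print(strip_pict_groups(sample))
```

The cleaned string can then be handed to the RTF reader as usual. Wrapper groups around the picture (for example `\shppict`) are left in place, empty, by this sketch.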

I can send you an example, if needed.

@brendonh
Owner

brendonh commented Aug 8, 2014

Alright, so this is from this pull request, which I perhaps didn't think hard enough about: https://github.com/brendonh/pyth/pull/19/files

The stated purpose of that PR is to make it easier to filter out image data, by identifying it as such in the document object. But then none of the writers actually filter it out. Sigh.

I figure the right fix is to make Image a top-level type (instead of a Paragraph subclass), and then update the writers to ignore it. Or something.

@jtkiley
Author

jtkiley commented Aug 8, 2014

Having looked again, it looks like the images are getting recognized. I see image objects inside of paragraphs, so it may be as easy as just having the writers ignore them.

@brendonh
Owner

brendonh commented Aug 8, 2014

Right. So there are two bugs:

  1. Image is a Paragraph subclass, leading writers to interpret it as text instead of (correctly) crashing on an unknown type
  2. Writers don't know to skip it.

Someone should fix that! ;-)
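To make the two bugs concrete, here is a rough illustration using stand-in classes. None of this is pyth's actual code: `Text`, `Paragraph`, `Image` and both writer functions are hypothetical simplifications, assuming only that an image carries its hex data as string content and sits among a paragraph's inline items.

```python
class Text:
    def __init__(self, content):
        self.content = list(content)  # list of strings


class Paragraph:
    def __init__(self, content):
        self.content = list(content)  # inline items


class Image(Paragraph):
    """Stand-in image: its content is a list of hex strings."""


def write_plaintext_naive(paragraphs):
    # Bug 2: every inline item is joined as text, image hex included.
    lines = []
    for paragraph in paragraphs:
        lines.append("".join("".join(item.content) for item in paragraph.content))
    return "\n".join(lines)


def write_plaintext_skipping_images(paragraphs):
    # Possible fix: test for Image explicitly and leave it out of the output.
    lines = []
    for paragraph in paragraphs:
        parts = [
            "".join(item.content)
            for item in paragraph.content
            if not isinstance(item, Image)
        ]
        lines.append("".join(parts))
    return "\n".join(lines)


doc = [Paragraph([Text(["Before "]), Image(["89504e47"]), Text(["after"])])]

# Bug 1: the subclass relationship means a type check for Paragraph passes.
print(isinstance(doc[0].content[1], Paragraph))  # True
print(write_plaintext_naive(doc))                # Before 89504e47after
print(write_plaintext_skipping_images(doc))      # Before after
```

Because `Image` passes any `isinstance(..., Paragraph)` check, a writer that wants to skip or specially handle images has to test for `Image` before falling back to the generic paragraph path.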

@watercrossing
Contributor

@brendonh You should have highlighted me, since I wrote that original image support... I am not sure making Image a top-level class is the correct approach, since images actually appear in the flow of the paragraphs, and currently pyth checks religiously that a Document only contains Paragraphs.

Before my patch #19, image data would just be interpreted as plain text, so the output of the writers hasn't changed. I currently think adding functionality to the writers to ignore/handle the images is the best way forward.

@brendonh
Owner

Ugh, you are right on both counts. Okay, I'll do something about it.
