Hex (encoded images) in `\pict` control groups is not removed. #24

jtkiley · 2014-08-08T13:29:02Z

When I read an RTF and write it out as plain text (both with pyth), all of the hex for embedded images is included in the document. As expected, the \pict control group itself is gone.

At the moment, I'm preprocessing these files to wipe out the pict group (hex included) before using pyth, but, of course, it would be nice to avoid that. I'm not familiar enough with RTF versions to know if this is part of the 1.5 spec or a later one. However, these files run perfectly otherwise.
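For anyone needing the same workaround, here is a minimal sketch of that kind of preprocessing. It is not part of pyth, and `strip_pict_groups` is a hypothetical helper name. It assumes the image data is hex text inside a brace-delimited `{\pict ...}` group and does not handle raw `\bin` payloads.

```python
def strip_pict_groups(rtf_text):
    r"""Return rtf_text with every {\pict ...} group removed (hypothetical helper)."""
    marker = "{\\pict"
    out = []
    i = 0
    n = len(rtf_text)
    while i < n:
        # Only treat "{\pict" as a picture group if the control word ends there
        # (so a longer control word that merely starts with "pict" is left alone).
        if rtf_text.startswith(marker, i) and not rtf_text[i + len(marker):i + len(marker) + 1].isalpha():
            depth = 0
            while i < n:
                ch = rtf_text[i]
                if ch == "\\" and i + 1 < n:
                    i += 2  # skip the backslash and the next char, e.g. \{ or \}
                    continue
                if ch == "{":
                    depth += 1
                elif ch == "}":
                    depth -= 1
                    if depth == 0:
                        i += 1  # consume the brace that closes the pict group
                        break
                i += 1
            continue
        if rtf_text[i] == "\\" and i + 1 < n:
            # Copy escapes verbatim so an escaped brace never starts a group match.
            out.append(rtf_text[i:i + 2])
            i += 2
            continue
        out.append(rtf_text[i])
        i += 1
    return "".join(out)
```

The cleaned string can then be handed to the RTF reader in place of the original file contents.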

I can send you an example, if needed.


brendonh · 2014-08-08T13:42:52Z

Alright, so this is from this pull request, which I perhaps didn't think hard enough about: https://github.com/brendonh/pyth/pull/19/files

The stated purpose of that PR is to make it easier to filter out image data, by identifying it as such in the document object. But then none of the writers actually filter it out. Sigh.

I figure the right fix is to make Image a top-level type (instead of a Paragraph subclass), and then update the writers to ignore it. Or something.

jtkiley · 2014-08-08T14:15:34Z

Having looked again, it looks like the images are getting recognized. I see Image objects inside of paragraphs, so it may be as easy as just having the writers ignore them.

brendonh · 2014-08-08T14:17:06Z

Right. So there are two bugs:

1. Image is a Paragraph subclass, leading writers to interpret it as text instead of (correctly) crashing on an unknown type.
2. Writers don't know to skip it.

Someone should fix that! ;-)
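To illustrate the first bug, here is a small sketch using stand-in classes rather than pyth's real ones. `Paragraph`, `ImageAsParagraph`, `StandaloneImage`, and `write_element` are all hypothetical names, and the writer is assumed to dispatch with `isinstance`.

```python
class Paragraph(object):
    """Stand-in for a text paragraph; not pyth's actual class."""
    def __init__(self, content):
        self.content = content


class ImageAsParagraph(Paragraph):
    """Stand-in for an image modelled as a Paragraph subclass."""


class StandaloneImage(object):
    """Stand-in for an image modelled as its own type."""
    def __init__(self, content):
        self.content = content


def write_element(element):
    # isinstance() is true for subclasses too, so the image subclass
    # falls into the text branch and its hex data is written out.
    if isinstance(element, Paragraph):
        return "".join(element.content)
    raise TypeError("Unknown element type: %r" % type(element))


# The subclass is silently treated as text:
leaked = write_element(ImageAsParagraph(["89504e47"]))

# A separate type fails loudly instead:
try:
    write_element(StandaloneImage(["89504e47"]))
    crashed = False
except TypeError:
    crashed = True
```

Under these assumptions, `leaked` holds the hex string, while the standalone type raises `TypeError`.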

watercrossing · 2014-08-19T18:25:14Z

@brendonh You should have highlighted me, since I wrote that original image support... I am not sure making an Image a top level class is the correct approach, since images actually appear in the flow of the paragraphs, and currently pyth checks religiously that a Document only contains a Paragraph.

Before my patch #19, Image data would just be interpreted as plain text, so the output of all the writers hasn't changed. I currently think adding functionality to the writers to ignore/handle the images is the best way forward.
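A rough sketch of what "ignore the images in the writer" could look like for plain text follows. The `Text`, `Image`, and `Paragraph` classes and the `paragraph_to_plaintext` function are illustrative stand-ins, not pyth's actual document model or writer API. The only assumption carried over from this thread is that Image objects sit inside a paragraph's content.

```python
class Text(object):
    """Stand-in for a run of text inside a paragraph."""
    def __init__(self, content):
        self.content = content


class Image(object):
    """Stand-in for an embedded image carrying hex-encoded data."""
    def __init__(self, hex_data):
        self.hex_data = hex_data


class Paragraph(object):
    """Stand-in for a paragraph whose content mixes Text and Image items."""
    def __init__(self, content):
        self.content = content


def paragraph_to_plaintext(paragraph):
    parts = []
    for item in paragraph.content:
        if isinstance(item, Image):
            continue  # a plain-text writer has nothing useful to emit for an image
        parts.append(item.content)
    return "".join(parts)


example = Paragraph([Text("Before "), Image("89504e47"), Text("after")])
plain = paragraph_to_plaintext(example)
```

This keeps images in the paragraph flow, as described above, and leaves each writer free to skip them or render them in its own way.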

brendonh · 2014-08-20T04:38:19Z

Ugh, you are right on both counts. Okay, I'll do something about it.

jtkiley linked a pull request Aug 8, 2014 that will close this issue

Excludes Image objects when assembling plaintext content to write. #25

Open

brendonh mentioned this issue Aug 21, 2014

Add support for strikethrough in the RTF reader. #28

Merged
