Add img to replaced tags which get preserved in HTML from slicing. #26

andreip · 2018-07-05T22:44:10Z

fixes #22

Couldn't think of a different approach, since an img isn't really a block, so it'll never have a text within it, so no point in generating a different html in get_line_info functions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain text lines, since that could slice an img, need to also look at the start/end refs for replaced tags.

See more about a replaced element (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g. video, embed etc. ; not sure about iframe and how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it?

Full list would be a total of 9 replaced elements (or 10 if we also count input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it).

afzalIbnSH · 2021-09-15T07:22:53Z

Can this please be merged?

afzalIbnSH · 2021-09-15T09:51:28Z

I tested this PR in my project and it works nicely. Would be great if this can be merged. @wojcikstefan

@andreip

Originally solved by @andreip in closeio#26 In his words: "Couldn't think of a different approach, since an img isn't really a block, so it'll never have a text within it, so no point in generating a different html in get_line_info functions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain text lines, since that could slice an img, need to also look at the start/end refs for replaced tags. See more about a replaced element (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g. video, embed etc. ; not sure about iframe and how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it? Full list would be a total of 9 replaced elements (or 10 if we also count input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it)."

@andreip

Image in beginning of reply is incorrectly ignored. Fix. Originally reported in closeio#22 and solved by @andreip in closeio#26 In his words: "Couldn't think of a different approach, since an img isn't really a block, so it'll never have a text within it, so no point in generating a different html in get_line_info functions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain text lines, since that could slice an img, need to also look at the start/end refs for replaced tags. See more about a replaced element (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g. video, embed etc. ; not sure about iframe and how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it? Full list would be a total of 9 replaced elements (or 10 if we also count input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it)."

@andreip

Image in beginning of reply is incorrectly ignored. Fix. This is a copy of [PR](closeio#26) by @andreip In his words: > Couldn't think of a different approach, since an img isn't really a block, so it'll never have a text within it, so no point in generating a different html in get_line_info functions. Instead, what was missing was it being treated as a special case: don't want to slice a line from the HTML by just looking at the plain text lines, since that could slice an img, need to also look at the start/end refs for replaced tags. > > See more about a replaced element (https://developer.mozilla.org/en-US/docs/Web/CSS/Replaced_element). I think it might be worth adding a few more things to the list? e.g. video, embed etc. ; not sure about iframe and how that'd be treated in lxml parsing though, but I suppose you could have an iframe with just an image in it, in which case you'd still want to keep it? > > Full list would be a total of 9 replaced elements (or 10 if we also count input; although I'm not sure of all examples where that'd generate sth even if it apparently has no text in it).

thomasst

This looks like a good start, but is still incomplete and not catching all of the cases:

For example, the following test string

html = "Test 2. <img src=\"top-image\"> On Jun 05, 2018, at 09:56 AM, John Doe <[email protected]> wrote: <blockquote><img src=\"https://example.com\" class=\"fr-fic fr-dib\"> Some text 1. Bart</blockquote><img src=\"bottom-image\">"

should be parsed into:

  {'date': 'Jun 05, 2018, at 09:56 AM',
   'from': 'John Doe <[email protected]>',
   'html': '<div><img src="https://example.com" class="fr-fic fr-dib"><br>Some '
           'text 1.<br><br>Bart</div>',
   'html_bottom': '<img src="bottom-image">',
   'html_top': 'Test 2.<br><img src="top-image">',
   'type': 'reply'}

In this PR, it ignores the top-image (rather than having it in html_top) and includes the bottom-image as part of the html (rather than html_bottom).

Add img to replaced tags which get preserved in HTML from slicing.

b929ea6

afzalIbnSH mentioned this pull request Sep 15, 2021

Image in beginning of reply is incorrectly ignored #22

Open

afzalIbnSH mentioned this pull request Sep 16, 2021

Don't ignore images freightwalla/quotequail#2

Merged

afzalIbnSH mentioned this pull request Sep 16, 2021

Don't ignore images freightwalla/quotequail#3

Merged

AlecRosenbaum requested a review from thomasst September 16, 2021 12:53

thomasst reviewed Oct 15, 2021

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add img to replaced tags which get preserved in HTML from slicing. #26

Add img to replaced tags which get preserved in HTML from slicing. #26

andreip commented Jul 5, 2018 •

edited

Loading

afzalIbnSH commented Sep 15, 2021

afzalIbnSH commented Sep 15, 2021

thomasst left a comment

Add img to replaced tags which get preserved in HTML from slicing. #26

Are you sure you want to change the base?

Add img to replaced tags which get preserved in HTML from slicing. #26

Conversation

andreip commented Jul 5, 2018 • edited Loading

afzalIbnSH commented Sep 15, 2021

afzalIbnSH commented Sep 15, 2021

thomasst left a comment

Choose a reason for hiding this comment

andreip commented Jul 5, 2018 •

edited

Loading