invalid continuation byte #26

MattPeterson0 · 2020-08-08T03:07:40Z

Hello. This program is great. Setting it all up on Windows with no previous Python experience was an adventure, but once I got everything in place, it's fantastic. Thank you very much for making this.

Recently I've been getting a problem with write_html.py on one particular subreddit capture. Here's the error:

Traceback (most recent call last):
  File "write_html.py", line 774, in <module>
    generate_html(args.min_score, args.min_comments, hide_deleted_comments)
  File "write_html.py", line 119, in generate_html
    write_link_page(subs, l, sub, hide_deleted_comments)
  File "write_html.py", line 288, in write_link_page
    '###BODY###': snudown.markdown(c['body'].replace('&gt;','>')),
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 62: invalid continuation byte

It actually creates the posts in /r/ right up to the point where it crashes. With a little trial and error I was able to isolate the problem to a specific line in a specific .csv file: a comment that used a "U+2019 Right Single Quotation Mark" (UTF-8 encoding: 0xE2 0x80 0x99) as an apostrophe. When I replaced that character with a normal straight single quotation mark in the .csv file, it parsed the file fine. (I don't quite get "position 62" though. The apostrophe was the 45th character on the line.) The really puzzling thing is that other comments from the same user have the same character elsewhere in the same .csv file, and those don't cause a problem.
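For reference, here is a minimal sketch of how Python reports this kind of failure. It is not taken from write_html.py or snudown; the helper name `decode_failure` and the sample strings are made up for illustration. It shows two things that may bear on the report above: "invalid continuation byte" is raised when a lead byte such as 0xE2 is not followed by the continuation bytes it needs (a complete 0xE2 0x80 0x99 sequence decodes fine), and the reported position is a byte offset into the data being decoded, which can differ from the character's index in the text.

```python
def decode_failure(data: bytes):
    """Return (byte_offset, reason) if data is not valid UTF-8, else None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        return err.start, err.reason
    return None


# A complete U+2019 sequence (0xE2 0x80 0x99) is valid UTF-8.
good = "it\u2019s".encode("utf-8")

# Hypothetical damaged input: the lead byte 0xE2 is followed by an
# ordinary ASCII byte instead of its two continuation bytes.
broken = b"it\xe2s"

# Byte offsets and character indexes diverge as soon as an earlier
# character takes more than one byte.
text = "na\u00efve it\u2019s"
char_index = text.index("\u2019")
byte_index = text.encode("utf-8").index(b"\xe2")

print(decode_failure(good))
print(decode_failure(broken))
print(char_index, byte_index)
```

If the bytes reaching the decoder really were a full 0xE2 0x80 0x99 sequence, this error would not be expected, so one thing worth checking is whether the comment body is being cut or altered somewhere before it is decoded. That is only a guess from the error text, not something confirmed in this thread.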

Well, it crashed out again after I fixed that, this time on a comment from eight months later and at "position 159". I guess I have another buggy character to hunt down. I don't have time right now, but I will update later if this second one reveals any further clues.

libertysoft3 · 2020-08-15T04:15:52Z

did you try the windows thing here? https://github.com/libertysoft3/reddit-html-archiver#install

libertysoft3 · 2020-08-15T04:17:00Z

potential duplicate of #23

MattPeterson0 · 2020-08-15T04:36:40Z

Oh, this bit?

chcp 65001
set PYTHONIOENCODING=utf-8

I should have said that. Yes, with or without doing that, same issue. I even re-fetch_links.py'd the entire thing because I hadn't done the 65001/utf-8 thing the first time. Didn't help.
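A side note on why those two settings might not change anything here: PYTHONIOENCODING only governs stdin, stdout and stderr, not files the script opens itself. The sketch below is an illustration under assumptions, not how write_html.py actually reads its data. `read_rows` and the sample CSV bytes are hypothetical. It parses CSV bytes with an explicit encoding and shows what the same bytes look like when decoded as cp1252 instead.

```python
import csv
import io
import locale
import sys


def read_rows(raw: bytes, encoding: str = "utf-8"):
    """Parse CSV bytes using an explicit encoding rather than the platform default."""
    text = io.TextIOWrapper(io.BytesIO(raw), encoding=encoding, newline="")
    return list(csv.reader(text))


# Hypothetical sample: one comment containing U+2019 as an apostrophe.
sample = "author,body\nsomeone,it\u2019s fine\n".encode("utf-8")

utf8_rows = read_rows(sample)              # decoded as intended
cp1252_rows = read_rows(sample, "cp1252")  # same bytes, wrong codec

# ascii() keeps the output printable on any console encoding.
print(ascii(utf8_rows[1][1]))
print(ascii(cp1252_rows[1][1]))

# The encoding of the standard streams (what PYTHONIOENCODING controls)
# and the locale default that open() typically falls back to when no
# encoding argument is given.
print(sys.stdout.encoding)
print(locale.getpreferredencoding(False))
```

Decoding with the wrong codec produces mojibake rather than an exception in this sample, so it is not necessarily the cause of the crash above, but comparing the two printed encodings on the affected machine may help narrow things down.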

I also came back here and grabbed the current copy of write_html.py in case some update since my original download changed things. Nope, same problem.

It's very mysterious! Haven't had time to poke at it more, maybe next week.
