invalid continuation byte #26

Open
MattPeterson0 opened this issue Aug 8, 2020 · 3 comments
Comments

@MattPeterson0

Hello. This program is great. Setting it all up on Windows with no previous Python experience was an adventure, but once I got everything in place, it's fantastic. Thank you very much for making this.

Recently I've been getting a problem with write_html.py on one particular subreddit capture. Here's the error:

Traceback (most recent call last):
  File "write_html.py", line 774, in <module>
    generate_html(args.min_score, args.min_comments, hide_deleted_comments)
  File "write_html.py", line 119, in generate_html
    write_link_page(subs, l, sub, hide_deleted_comments)
  File "write_html.py", line 288, in write_link_page
    '###BODY###': snudown.markdown(c['body'].replace('&gt;','>')),
UnicodeDecodeError: 'utf-8' codec can't decode byte 0xe2 in position 62: invalid continuation byte

It actually creates the posts in /r/ right up to where it crashes, and with a little trial and error I was able to isolate the problem to a specific line in a specific .csv file: a comment that used a "U+2019 Right Single Quotation Mark" (UTF-8 encoding: 0xE2 0x80 0x99) as an apostrophe. When I replaced that character with a normal straight single quotation mark in the .csv file, it parsed the file fine. (I don't quite get "position 62" though; the apostrophe was the 45th character on the line.) The really puzzling thing is that other comments from the same user have the same character elsewhere in the same .csv file, and those don't cause a problem.
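
For reference, here is a minimal Python sketch of how this error message can arise. It is not taken from write_html.py: the helper name first_decode_error and the sample byte strings are made up for illustration. It shows that an intact 0xE2 0x80 0x99 sequence decodes fine, that a 0xE2 lead byte followed by something other than a continuation byte produces "invalid continuation byte", and that the reported position is a byte offset into the data being decoded, which can differ from a character count.

```python
# U+2019 (right single quotation mark) is three bytes in UTF-8:
# one lead byte (0xE2) followed by two continuation bytes (each in 0x80-0xBF).
APOSTROPHE = "\u2019".encode("utf-8")  # b"\xe2\x80\x99"


def first_decode_error(data: bytes):
    """Return (byte_offset, reason) for the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        return exc.start, exc.reason
    return None


# A complete three-byte sequence decodes cleanly.
intact = b"don" + APOSTROPHE + b"t"

# Hypothetical damaged input: the lead byte survives but its two
# continuation bytes are gone, so 0xE2 is followed by plain ASCII.
# The earlier two-byte character (0xC3 0xA9) pushes the byte offset
# of the bad byte one past its character index.
damaged = b"caf\xc3\xa9 don\xe2t"

print(first_decode_error(intact))   # None
print(first_decode_error(damaged))  # (9, 'invalid continuation byte')
```

In the damaged example the bad byte is the 9th character counting from 1 but sits at byte offset 9 counting from 0, because one earlier character takes two bytes. A mismatch like "45th character" versus "position 62" could come from the same effect, or from the offset being measured inside the comment body rather than the whole .csv line; the thread does not establish which.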

Well, it crashed again after I fixed that, but in a different place: an entry from eight months later, at "position 159". I guess I have another buggy character to hunt down. I don't have time right now. I will update later if this second one reveals any further clues.
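
One way to hunt for the remaining characters all at once, instead of one crash at a time, might be to scan the raw bytes of each .csv file. The sketch below is an assumption-laden helper, not part of reddit-html-archiver: scan_lines and scan_file are hypothetical names, and it only reports the first invalid sequence on each physical line. It would only flag bytes that are already invalid UTF-8 on disk, so an empty result would suggest the problem is introduced later.

```python
def scan_lines(raw_lines):
    """Return [(line_number, byte_offset, reason), ...] for lines that are not valid UTF-8.

    raw_lines is any iterable of bytes objects, one per physical line.
    Only the first invalid sequence on each line is reported.
    """
    problems = []
    for line_number, raw in enumerate(raw_lines, start=1):
        try:
            raw.decode("utf-8")
        except UnicodeDecodeError as exc:
            problems.append((line_number, exc.start, exc.reason))
    return problems


def scan_file(path):
    """Scan one file on disk; binary mode keeps Python from decoding anything itself."""
    with open(path, "rb") as handle:
        return scan_lines(handle)


# Small in-memory demonstration with made-up data.
sample = [
    b"plain ascii\n",
    b"don\xe2\x80\x99t\n",  # intact U+2019
    b"don\xe2t\n",          # lead byte without its continuation bytes
]
print(scan_lines(sample))
```

To use it on a real capture, call scan_file with the path of one of the .csv files. Splitting on newlines is safe for this purpose because the newline byte never occurs inside a multi-byte UTF-8 sequence.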

@libertysoft3
Owner

did you try the windows thing here? https://github.com/libertysoft3/reddit-html-archiver#install

@libertysoft3
Owner

potential duplicate of #23

@MattPeterson0
Author

Oh, this bit?

chcp 65001
set PYTHONIOENCODING=utf-8

I should have mentioned that. Yes, with or without doing that, I get the same issue. I even re-ran fetch_links.py for the entire thing, because I hadn't done the 65001/utf-8 step the first time. It didn't help.
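
A possibly relevant detail: chcp 65001 changes the console code page, and PYTHONIOENCODING overrides the encoding of stdin, stdout and stderr. Neither changes the encoding Python uses for files opened without an explicit encoding argument. Whether that matters for fetch_links.py or write_html.py is not established in this thread. The sketch below only illustrates the general effect; read_rows is a hypothetical helper and the author/body columns are made-up sample data.

```python
import csv
import io


def read_rows(binary_stream, encoding="utf-8"):
    """Decode a binary CSV stream with an explicit encoding and return its rows as dicts."""
    # newline="" leaves line-ending handling to the csv module.
    text = io.TextIOWrapper(binary_stream, encoding=encoding, newline="")
    return list(csv.DictReader(text))


# Made-up CSV content containing U+2019, stored as UTF-8 bytes.
raw = "author,body\nsomeone,don\u2019t\n".encode("utf-8")

# Decoding with the encoding the bytes were written in round-trips cleanly.
rows_utf8 = read_rows(io.BytesIO(raw))

# Decoding the same bytes as cp1252, a common Windows default,
# turns the three UTF-8 bytes into three unrelated characters.
rows_cp1252 = read_rows(io.BytesIO(raw), encoding="cp1252")

# ascii() keeps the output printable on any console.
print(ascii(rows_utf8[0]["body"]))
print(ascii(rows_cp1252[0]["body"]))
```

When reading from disk, the equivalent is passing encoding="utf-8" and newline="" to open() so the result does not depend on the machine's default.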

I also came back here and grabbed the current copy of write_html.py in case some update since my original download changed things. Nope, same problem.

It's very mysterious! Haven't had time to poke at it more, maybe next week.
