Encoding issues with many popular sites #434

Closed
pirate opened this issue Jan 23, 2019 · 11 comments

@pirate

pirate commented Jan 23, 2019

Describe the bug

When using wayback in both --proxy and --proxy-record mode, many sites seem to display gibberish, possibly because of an incorrect encoding being forced somewhere.

Working sites:

Broken sites:

I'm not sure what makes some sites work and not others.

Steps to reproduce the bug

  1. Use Python 3.7.2 and either pywb v2.1.0 or pywb-2.2.0.dev0 (latest develop branch)
  2. Using an empty config.yaml, run:
wb-manager init demo
wayback --proxy-record --proxy demo
  3. Open a page in a browser:
google-chrome --proxy-server=http://localhost:8080 --ignore-certificate-errors --disable-web-security https://cloudflare.com

Expected behavior

A readable site is recorded and replayed instead of ��������

Screenshots

image

Environment

  • OS: macOS 10.14
  • Browser: Google Chrome 72.0.3626.64 beta
  • Version: Python 3.7.2 and pywb v2.1.0
@N0taN3rd
Contributor

Have you tried using a non-empty config file, ensured that pywb auto-generated a CA (docs), and/or tried the dev branch?

I was not able to reproduce the issue using the config below when running from the dev branch with Python 3.7.2 on Linux.

collections:
    all: $all
    live: $live

# Settings for each collection
use_js_obj_proxy: true

# Memento support, enable
enable_memento: true

# Replay content in an iframe
framed_replay: true

@pirate
Author

pirate commented Jan 24, 2019

I just pulled and installed the dev branch, and it's still happening. I'm using 9597a63 aka pywb-2.2.0.dev0 now for this test.

It correctly autogenerated a CA, and the issue persists with the config you gave. I checked in both proxy and non-proxy mode and the issue is present:

screen shot 2019-01-24 at 4 33 11 pm

screen shot 2019-01-24 at 4 35 02 pm

And here are the logs while running it:

2019-01-24 16:39:06,229: [INFO]: Proxy recording into collection "squash"
2019-01-24 16:39:06,339: [INFO]: Starting Gevent Server on 8080
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fcloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 1220 0.046708
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fcloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 1242 0.064246
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Faccounts.google.com%2FListAccounts%3Fgpsia%3D1%26source%3DChromiumBrowser%26json%3Dstandard&closest=now&matchType=exact HTTP/1.1" 200 2166 0.070288
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Faccounts.google.com%2FListAccounts%3Fgpsia%3D1%26source%3DChromiumBrowser%26json%3Dstandard&closest=now&matchType=exact HTTP/1.1" 200 2181 0.086463
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fclientservices.googleapis.com%2Fchrome-variations%2Fseed%3Fosname%3Dmac%26channel%3Dcanary%26milestone%3D73&closest=now&matchType=exact HTTP/1.1" 200 1417 0.106623
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fclientservices.googleapis.com%2Fchrome-variations%2Fseed%3Fosname%3Dmac%26channel%3Dcanary%26milestone%3D73&closest=now&matchType=exact HTTP/1.1" 200 1439 0.124798
127.0.0.1 - - [2019-01-24 16:39:13] "CONNECT clientservices.googleapis.com:443 HTTP/1.1" 200 - 0.471312
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fwww.cloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 17505 0.059444
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fwww.cloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 17527 0.068256
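
The `url=` parameter in each `postreq` line is percent-encoded, which makes the log hard to scan. A minimal sketch, using only the Python standard library, for pulling the target URL out of a log line like the ones above. The helper name `recorded_url` is hypothetical and not part of pywb:

```python
from typing import Optional
from urllib.parse import parse_qs, urlsplit


def recorded_url(log_line: str) -> Optional[str]:
    """Return the decoded `url` query parameter from an access-log line, if any."""
    # The request line is the quoted part: "POST /path?query HTTP/1.1"
    try:
        request = log_line.split('"')[1]
        path = request.split(" ")[1]
    except IndexError:
        return None
    # parse_qs percent-decodes the values for us.
    values = parse_qs(urlsplit(path).query).get("url")
    return values[0] if values else None


line = ('127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq'
        '?param.recorder.coll=squash&url=https%3A%2F%2Fwww.cloudflare.com%2F'
        '&closest=now&matchType=exact HTTP/1.1" 200 17505 0.059444')
print(recorded_url(line))  # https://www.cloudflare.com/
```

Run over the log above, this shows that only the two Cloudflare HTML pages and a few Chrome background requests were fetched through the proxy.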

@pirate
Author

pirate commented Jan 25, 2019

Here's an upload of my entire archive folder in case you want to see if it replays properly on a different machine:

wayback.zip

It's a fresh demo collection created with the config.yml you gave and the latest version of develop branch.

Thanks for helping debug this 👍

@ikreymer
Member

Thanks for sending the zip. Unfortunately, I can't repro the issue that you're seeing. The WARC in demo does contain valid HTML, but it contains only the HTML for cloudflare, not any of the stylesheets (presumably since they didn't load in the browser).
I've tried both with latest Chrome and Chrome Canary, everything records and replays correctly.

Perhaps there is some other service running on your system that could interfere? It might be worth trying a different port than 8080, just in case. You can change the port with the -p flag.
The HTML is Brotli encoded, but it should be getting decoded for rewriting, and pywb auto-decodes if the browser doesn't support Brotli anyway.
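
For readers unfamiliar with the decoding step mentioned here, a rough sketch of what "decode according to Content-Encoding" means. This is not pywb's code: `decode_body` is a hypothetical helper, and it assumes the optional third-party `brotli` module exposes `decompress()`.

```python
import gzip

try:
    import brotli  # optional third-party module; may not be installed
except ImportError:
    brotli = None


def decode_body(body: bytes, content_encoding: str) -> bytes:
    """Undo the Content-Encoding of an HTTP body (illustrative only)."""
    encoding = content_encoding.strip().lower()
    if encoding in ("", "identity"):
        return body
    if encoding == "gzip":
        return gzip.decompress(body)
    if encoding == "br":
        if brotli is None:
            # Without a decoder the raw compressed bytes cannot be rewritten or displayed.
            raise RuntimeError("body is Brotli-encoded but no brotli module is importable")
        return brotli.decompress(body)
    raise ValueError(f"unsupported Content-Encoding: {content_encoding!r}")


print(decode_body(gzip.compress(b"<html>ok</html>"), "gzip"))
```

If the `br` branch cannot run, the compressed bytes would be passed along untouched, which is worth ruling out when a page renders as gibberish.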

Other things to try: upgrade the gevent library with pip install -U gevent in case there is an issue with that. The logs look a bit sparse; there should be more log entries than that if everything is working.

@pirate
Author

pirate commented Jan 26, 2019

Very strange, maybe it's a macOS thing? Or an encoding problem with my shell? What's your LC_ALL and LANG?

Is there a more verbose log mode I can use? I'll try upgrading gevent and running it in a fresh macOS user account and then a linux VM.

@ikreymer
Member

I'm on a Mac as well. LC_ALL is unset and LANG=en_US.UTF-8

@ikreymer
Member

ikreymer commented Jan 26, 2019

Does this happen in URL rewriting/non-proxy mode as well? What happens if you load http://localhost:8080/live/https://cloudflare.com/ in the browser, or run curl -L http://localhost:8080/live/mp_/https://cloudflare.com/ ?
Does it happen even without recording? And then with http://localhost:8080/demo/record/https://cloudflare.com/ ?
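
The three URLs above share one shape: host, then a collection prefix, then the full target URL. A small sketch that rebuilds them so the same checks can be scripted against other sites. `rewrite_url` is a hypothetical helper, and the host and `demo` collection name are simply the ones used in this thread:

```python
BASE = "http://localhost:8080"      # host and port used throughout this thread
TARGET = "https://cloudflare.com/"  # site under test


def rewrite_url(prefix: str, target: str = TARGET, base: str = BASE) -> str:
    """Join base, collection prefix, and target URL in rewriting (non-proxy) style."""
    return f"{base}/{prefix.strip('/')}/{target}"


checks = {
    "live, in the browser": rewrite_url("live"),
    "live, mp_ form for curl": rewrite_url("live/mp_"),
    "recording into demo": rewrite_url("demo/record"),
}
for label, url in checks.items():
    print(f"{label}: {url}")
```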

@pirate
Author

pirate commented Jan 28, 2019

Two new findings: it only happens in Chrome, and I don't think it's an encoding issue. I tried decoding the page in a number of different encodings with Chrome, and none of them looked like real text. It's also not a font issue because copy pasting the output into a different program with a different font doesn't fix it. Could it be that it's not ungzipping the page for some reason in Chrome, and trying to show the compressed bytes as UTF-8 anyway?
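
The hypothesis in the last sentence is easy to illustrate with the standard library alone. The sketch below uses gzip and a made-up HTML snippet; it only shows the symptom and says nothing about where pywb or Chrome would skip the decompression:

```python
import gzip

html = "<html><body>Hello, Cloudflare</body></html>".encode("utf-8")
compressed = gzip.compress(html)

# What a client shows if it skips decompression and treats the bytes as text:
# invalid byte sequences become U+FFFD replacement characters.
as_text = compressed.decode("utf-8", errors="replace")
print(repr(as_text[:20]))

# Decompressing first recovers the original markup.
assert gzip.decompress(compressed) == html
```

No choice of text encoding turns such output back into markup, which matches what I saw when switching encodings in Chrome.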

While it appears to only happen in Chrome, it seems to happen reliably in all versions of Chrome (tested with fresh sessions & no extensions on Chrome Stable, Beta, and Canary).

It works fine in Safari & curl for both replaying and viewing, and in both proxy and rewriting modes. I would love to just use Safari, but unfortunately the primary reason I'm setting up pywb is to use it as a proxy recorder for scripted Chrome headless web archiving.

@pirate
Author

pirate commented Feb 6, 2019

Update for anyone else following this issue: it turned out that the Brotli module was not being imported, so all pages using Brotli compression in Chrome failed to decompress before playback/archiving, causing gibberish to show up.

Blocked by: python-hyper/brotlicffi#146
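
For anyone who wants to check their own environment, a quick sketch that tries to import a Brotli module and reports why it failed. The candidate names `brotli` and `brotlicffi` are the import names of common Brotli packages; which one a given pywb version looks for is an assumption here, and `find_brotli` is a hypothetical helper:

```python
import importlib
from typing import Optional


def find_brotli(candidates=("brotli", "brotlicffi")) -> Optional[str]:
    """Return the name of the first importable module in `candidates`, or None."""
    for name in candidates:
        try:
            importlib.import_module(name)
        except Exception as exc:  # a broken native build may raise more than ImportError
            print(f"{name}: not usable ({exc.__class__.__name__}: {exc})")
        else:
            return name
    return None


print("Brotli module:", find_brotli() or "none importable")
```

Run it with the same Python interpreter that runs wayback, since a module installed for a different interpreter will not help.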

ikreymer added a commit that referenced this issue Feb 19, 2019
and also remove 'br' from any Accept-Encoding to avoid recording with brotli, addresses #434
ikreymer added a commit that referenced this issue Feb 20, 2019
* brotli: if the brotli module can not be loaded, print warning
and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434
@ikreymer ikreymer added this to the 2.2.0 milestone Feb 20, 2019
@ikreymer
Member

The core issue was brotli not being installed properly. However, pywb will now not record in brotli if it can't decode it. Fixed in the 2.2 release.
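
As a rough illustration of the idea (not the actual pywb change), removing `br` from an Accept-Encoding value could look like the sketch below. `strip_br` is a hypothetical helper, meant to be applied only when no Brotli decoder is available:

```python
def strip_br(accept_encoding: str) -> str:
    """Remove the `br` token from an Accept-Encoding header value."""
    kept = []
    for item in accept_encoding.split(","):
        token = item.split(";", 1)[0].strip().lower()  # ignore any ;q= weight
        if token and token != "br":
            kept.append(item.strip())
    return ", ".join(kept)


print(strip_br("gzip, deflate, br"))  # gzip, deflate
```

With `br` no longer advertised, servers fall back to an encoding the recorder can decode, or to no compression at all.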

@pirate
Author

pirate commented Feb 28, 2019

Thanks @ikreymer !
