Encoding issues with many popular sites #434
Comments
Have you tried using a non-empty config file, ensured that pywb auto-generated a CA (docs), and/or tried the dev branch? I was not able to reproduce the issue using the config below when running from the dev branch using 3.7.2 on Linux.
collections:
    all: $all
    live: $live
# Settings for each collection
use_js_obj_proxy: true
# Memento support, enable
enable_memento: true
# Replay content in an iframe
framed_replay: true
I just pulled and installed the It correctly autogenerated a CA, and the issue persists with the config you gave. I checked in both proxy and non-proxy mode and the issue is present:
And here are the logs while running it:
2019-01-24 16:39:06,229: [INFO]: Proxy recording into collection "squash"
2019-01-24 16:39:06,339: [INFO]: Starting Gevent Server on 8080
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fcloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 1220 0.046708
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fcloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 1242 0.064246
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Faccounts.google.com%2FListAccounts%3Fgpsia%3D1%26source%3DChromiumBrowser%26json%3Dstandard&closest=now&matchType=exact HTTP/1.1" 200 2166 0.070288
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Faccounts.google.com%2FListAccounts%3Fgpsia%3D1%26source%3DChromiumBrowser%26json%3Dstandard&closest=now&matchType=exact HTTP/1.1" 200 2181 0.086463
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fclientservices.googleapis.com%2Fchrome-variations%2Fseed%3Fosname%3Dmac%26channel%3Dcanary%26milestone%3D73&closest=now&matchType=exact HTTP/1.1" 200 1417 0.106623
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fclientservices.googleapis.com%2Fchrome-variations%2Fseed%3Fosname%3Dmac%26channel%3Dcanary%26milestone%3D73&closest=now&matchType=exact HTTP/1.1" 200 1439 0.124798
127.0.0.1 - - [2019-01-24 16:39:13] "CONNECT clientservices.googleapis.com:443 HTTP/1.1" 200 - 0.471312
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fwww.cloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 17505 0.059444
127.0.0.1 - - [2019-01-24 16:39:13] "POST /live/resource/postreq?param.recorder.coll=squash&url=https%3A%2F%2Fwww.cloudflare.com%2F&closest=now&matchType=exact HTTP/1.1" 200 17527 0.068256
Here's an upload of my entire archive folder in case you want to see if it replays properly on a different machine:
It's a fresh demo collection created with the
Thanks for helping debug this 👍
Thanks for sending the zip. Unfortunately, I can't reproduce the issue that you're seeing. The WARC in demo does contain valid HTML, but only the HTML for cloudflare, not any of the stylesheets (presumably since they didn't load in the browser). Perhaps there is some other service running on your system that could interfere? It might be worth trying a different port than 8080, just in case. You can specify it with
Other things to try: try upgrading the gevent library with
Very strange, maybe it's a macOS thing? Or an encoding problem with my shell? What's your
Is there a more verbose log mode I can use? I'll try upgrading gevent and running it in a fresh macOS user account and then a Linux VM.
I'm on a Mac as well.
Does this happen in URL rewriting/non-proxy mode as well? What happens if you load
Two new findings: it only happens in Chrome, and I don't think it's an encoding issue. I tried decoding the page in a number of different encodings with Chrome, and none of them looked like real text. It's also not a font issue, because copy-pasting the output into a different program with a different font doesn't fix it. Could it be that it's not ungzipping the page for some reason in Chrome, and trying to show the compressed bytes as UTF-8 anyway? While it appears to only happen in Chrome, it seems to happen reliably in all versions of Chrome (tested with fresh sessions & no extensions on Chrome Stable, Beta, and Canary). It works fine in Safari & curl for both replaying and viewing, and in both proxy and rewriting modes. I would love to just use Safari, but unfortunately the primary reason I'm setting up pywb is to use it as a proxy recorder for scripted Chrome headless web archiving.
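For anyone wanting to see what "compressed bytes shown as UTF-8" looks like, here is a minimal sketch. It uses gzip from the Python standard library purely as a stand-in for whatever Content-Encoding the server applied, and the HTML string is made up for illustration; it is not taken from the archive discussed above.

```python
import gzip

# Made-up page body, standing in for a real response.
html = "<html><body><h1>Hello, archive!</h1></body></html>"

# What the server sends when it applies a Content-Encoding.
compressed = gzip.compress(html.encode("utf-8"))

# What gets displayed if decompression is skipped and the raw bytes
# are treated as UTF-8: invalid sequences become U+FFFD.
garbled = compressed.decode("utf-8", errors="replace")

# What gets displayed when the Content-Encoding is honored.
restored = gzip.decompress(compressed).decode("utf-8")

print(repr(garbled[:20]))
print(restored == html)
```

The garbled string contains the replacement character (U+FFFD) no matter which text encoding is tried afterwards, which would match the observation that no encoding choice in the browser produced readable text.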
Update for anyone else following this issue: it turned out to be a problem with Brotli not being imported, so all pages using Brotli compression in Chrome failed to decompress before playback/archiving, causing gibberish to show up.
Blocked by: python-hyper/brotlicffi#146
* brotli: if the brotli module can not be loaded, print warning and also remove `br` from any Accept-Encoding header to avoid recording with brotli, addresses #434
The core issue was brotli not being installed properly. However, pywb will now not record in brotli if it can't decode it. Fixed in the 2.2 release.
Thanks @ikreymer!
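The behavior described in the fix (warn if brotli cannot be loaded, and drop `br` from Accept-Encoding so nothing is recorded in a format that can't be decoded) can be sketched roughly as follows. This is an illustration only, not pywb's actual code: the function name `drop_br` and the `have_brotli` parameter are hypothetical.

```python
try:
    import brotli  # noqa: F401  (only checking that the module loads)
    HAVE_BROTLI = True
except ImportError:
    HAVE_BROTLI = False
    print("Warning: brotli could not be loaded; "
          "'br' will be removed from Accept-Encoding")


def drop_br(accept_encoding, have_brotli=HAVE_BROTLI):
    """Hypothetical helper: return the header value without 'br'
    when brotli is unavailable, otherwise return it unchanged."""
    if have_brotli:
        return accept_encoding
    kept = []
    for part in accept_encoding.split(","):
        # Each part looks like "br" or "br;q=0.8"; compare the coding name only.
        coding = part.split(";", 1)[0].strip().lower()
        if coding != "br":
            kept.append(part.strip())
    return ", ".join(kept)


# Simulate an environment where brotli failed to import.
print(drop_br("gzip, deflate, br", have_brotli=False))
```

With `br` removed from the request, the upstream server should fall back to gzip, deflate, or an uncompressed response, so the recorded content stays decodable at replay time.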
Describe the bug
When using wayback in both --proxy and --proxy-record mode, many sites seem to display gibberish, possibly because of an incorrect encoding being forced somewhere.
Working sites:
Broken sites:
I'm not sure what makes some sites work and not others.
Steps to reproduce the bug
develop branch)
config.yaml, run:
Expected behavior
A readable site is recorded and replayed instead of ��������
Screenshots
Environment