Fix Javascript rewriting: module detection + wombat JS URL rewriting / enhance Vimeo support #228
c3dee0b to ee3fa4b
Codecov Report
Attention: Patch coverage is
Additional details and impacted files

@@ Coverage Diff @@
## warc2zim2 #228 +/- ##
=============================================
+ Coverage   85.98%   86.03%   +0.05%
=============================================
  Files          13       13
  Lines        1049     1060     +11
  Branches      195      199      +4
=============================================
+ Hits          902      912     +10
  Misses        116      116
- Partials       31       32      +1
4da3252 to 83f4dad
- JS code used to set up wombat.js now lives in a dedicated JS subproject
- JS code is compiled by rollup
- Fuzzy rules are defined in a data-driven JSON file
- This JSON file is used to generate both the Python and JS code that will use them
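As a rough illustration of the data-driven approach described in this commit, the sketch below loads fuzzy rules from JSON and applies them on the Python side. The JSON layout, the field names `pattern` and `replace`, the `example.com` rule and both helper functions are assumptions made for this example; they are not taken from the actual warc2zim rule file or generated code.

```python
import json
import re

# Hypothetical rule file content; the real JSON schema may differ.
RULES_JSON = r"""
[
  {"pattern": "^example\\.com/video/(\\d+)\\?.*$", "replace": "example.com/video/\\1"}
]
"""


def load_rules(text):
    """Compile every rule once so it can be reused for each path."""
    return [(re.compile(rule["pattern"]), rule["replace"]) for rule in json.loads(text)]


def apply_fuzzy_rules(path, rules):
    """Return the path rewritten by the first matching rule, or unchanged."""
    for pattern, replacement in rules:
        if pattern.match(path):
            return pattern.sub(replacement, path)
    return path


RULES = load_rules(RULES_JSON)
```

Keeping the rules as plain data means the same file could feed a JS generator as well, so both sides would apply identical patterns.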
fcfc8aa to 2adb325
@rgaudin @mgautierfr good luck, I tried to make meaningful commits and not to break everything at once, but this is still a huge PR (the result is stunning though, so definitely worth it, I hope).
Impressive work. I have a few remarks, but it is mostly good.
I think your solution (registering JS modules as we find them, at the same time as we rewrite content) works because we handle content in the same order it has been recorded (browsertrix loads JS files after they are "requested" in parsed content).
I think it would be good to mention this somewhere.
I thought I did, but it looks like you did not find it and I cannot find it either now, so ... But anyway, yes, you are right.
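The ordering assumption discussed above can be sketched as follows. This is a simplified illustration only: the `classify_js` function, the two regular expressions and the `(url, content_type, body)` record shape are hypothetical and do not reflect the actual warc2zim implementation.

```python
import re
from urllib.parse import urljoin

# Simplified detection patterns, assumed for this sketch only.
MODULE_SCRIPT_RE = re.compile(r'<script[^>]*type="module"[^>]*src="([^"]+)"')
IMPORT_RE = re.compile(r'import\s+(?:[^"\']+\s+from\s+)?["\']([^"\']+)["\']')


def classify_js(records):
    """Map each JS URL to True when it was known to be a module at rewrite time.

    records: iterable of (url, content_type, body), in recorded order.
    """
    known_modules = set()
    result = {}
    for url, content_type, body in records:
        if content_type == "text/html":
            # Register modules referenced by the HTML document.
            for src in MODULE_SCRIPT_RE.findall(body):
                known_modules.add(urljoin(url, src))
        elif content_type == "text/javascript":
            is_module = url in known_modules
            result[url] = is_module
            if is_module:
                # Register "child" modules found in import statements.
                for child in IMPORT_RE.findall(body):
                    known_modules.add(urljoin(url, child))
    return result


RECORDS = [
    ("https://example.com/index.html", "text/html",
     '<script type="module" src="./main.js"></script>'),
    ("https://example.com/main.js", "text/javascript",
     'import { f } from "./lib.js";\nf();'),
    ("https://example.com/lib.js", "text/javascript",
     "export function f() {}"),
]
```

With `RECORDS` in this order both scripts are detected as modules; feeding the same records in reverse order would classify neither as a module, which is why the reliance on recording order is worth documenting.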
@rgaudin please review the code as-is; in fact, not many changes are expected for now, and making them now could even make your review harder.
Thank you; I've noted some minor suggestions.
e8e8e71 to 48beaf0
f0d8e80 to 52b1eef
@mgautierfr @rgaudin this is ready for a new review. I still hope it is not going to be too painful, but I know the truth is probably harder than that. As usual, I've left conversations open when I consider you might want to check what has been done since your last review. I hope I did not forget anything. Tell me if it is just too painful and you would rather merge and see how it behaves on real test cases.
Thank you; easier to read with those improvements. Spotted a couple of minor things that you should look into, but LGTM.
Renaud's comments have been addressed.
Just a small comment on code. Mostly a matter of style.
Will not block for that.
Otherwise we are good, so this is an approval.
It happens that JS code builds the whole URL, i.e. one already containing the rewritten bits from the prefix; in such a situation, we want to remove the prefix before rewriting the URL, so that we are back with a non-rewritten URL
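A minimal sketch of the idea in this commit message, written in Python for illustration even though the real logic lives in the JS setup code. The function names, the `PREFIX` value and the simplified rewriting (drop the scheme, prepend the prefix) are assumptions made for this example.

```python
# Hypothetical prefix under which ZIM content is served.
PREFIX = "https://zim.example/content/"


def strip_rewritten_prefix(url, prefix):
    """Remove the prefix when the URL has already been rewritten."""
    if url.startswith(prefix):
        return url[len(prefix):]
    return url


def rewrite(url, prefix):
    """Simplified rewriting: drop the scheme, then prepend the prefix."""
    original = strip_rewritten_prefix(url, prefix)
    for scheme in ("https://", "http://"):
        if original.startswith(scheme):
            original = original[len(scheme):]
            break
    return prefix + original
```

Stripping the prefix first makes the rewriting idempotent, so a URL assembled by JS from already rewritten parts does not end up with the prefix applied twice.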
We also need to rewrite the Vimeo preview image URL because it looks like the requested resolution is dynamic. The current rewrite expression is not perfect because we never know which resolution will be queried first, so the item added to the ZIM might be a low-res image; but at least an image is displayed in the ZIM UI, thanks to this fuzzy rule which leads to a ZIM item under all conditions
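The "first resolution queried wins" behaviour described above can be illustrated with the sketch below. The rule pattern, the `mw` query parameter and the helper names are hypothetical; the real fuzzy rule and the way Vimeo encodes the resolution may differ.

```python
import re

# Hypothetical fuzzy rule: ignore the query string of preview images.
VIMEO_IMAGE_RULE = (
    re.compile(r"^i\.vimeocdn\.com/([^?]+)\?.*$"),
    r"i.vimeocdn.com/\1",
)


def zim_path(path):
    """Apply the fuzzy rule so every resolution maps to one ZIM path."""
    pattern, replacement = VIMEO_IMAGE_RULE
    return pattern.sub(replacement, path)


def add_items(recorded):
    """Keep only the first payload seen for each ZIM path."""
    items = {}
    for path, payload in recorded:
        items.setdefault(zim_path(path), payload)
    return items


RECORDED = [
    ("i.vimeocdn.com/video/123?mw=80", "low-res bytes"),
    ("i.vimeocdn.com/video/123?mw=1280", "high-res bytes"),
]
```

Here `add_items(RECORDED)` keeps the low-res payload because it was recorded first, which is the imperfection the commit message mentions, while any later request for another resolution still resolves to an existing item.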
- Fix the JS template to be as much in line with prettier requirements as possible
- Ignore the generated file, however: we do not care about the remaining issues (lines too long) and they would be hard to fix
This change is needed to properly handle cases where the link is relative, already rewritten and containing the "original hostname" in the path
Fix #189
Fix #227
This also significantly improves support for Vimeo videos: #165 (but there is still some black magic from my PoV around DS rules, and there is still a zimit/crawler issue to properly retrieve the whole video)
Changes:
- Move wombat_setup.js to a "standalone" JS codebase
- wombatSetup.js (renamed to match JS conventions) is now generated with the rollup bundler, the same bundler that webrecorder uses to bundle wombat.js
- [...] wombatSetup.js, mainly around special characters in path and in hostname
  - [...] A–Z a–z 0–9 - _ . ~
  - [...] A–Z a–z 0–9 - _ . ~ ! * ' ( )
  - [...] doc-um)ent.html will be accessed from docu-m%29ent.html if the URL is statically rewritten by Python code and from docu-m)ent.html if the URL is dynamically rewritten by wombat.js and our JS rewriting function
- <script type="module"> and <link rel="modulepreload"> are both supported (even if the modulepreload probably makes no sense if the script is not used somewhere after)
- import ... statements are used to detect "child" JS modules
- [...] gcs-vimeo.akamaized.net, vod.akamaized.net and vod-progressive.akamaized.net, our test case on test-website is using vod-adaptive.akamaized.net
- [...] i.vimeocdn.com domain where resolution is dynamically adapted, most probably based on browser viewport size

Open issue left for later: #230 (wombat issue with Javascript importmap, not something we can fix on our own)
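The special-character difference listed in the changes above can be reproduced with Python's standard library. This sketch assumes that the two character sets in the list are the ones left unescaped by Python's `urllib.parse.quote` and by JavaScript's `encodeURIComponent` respectively; the helper names are made up for this example and are not warc2zim functions.

```python
from urllib.parse import quote

# Characters that encodeURIComponent leaves as-is in addition to
# the unreserved set A-Z a-z 0-9 - _ . ~
ENCODE_URI_COMPONENT_EXTRA = "!*'()"


def quote_python_default(segment):
    """Percent-encode everything except the unreserved characters."""
    return quote(segment, safe="")


def quote_like_encode_uri_component(segment):
    """Also keep ! * ' ( ) unescaped, as encodeURIComponent does."""
    return quote(segment, safe=ENCODE_URI_COMPONENT_EXTRA)
```

Both spellings decode to the same path, so whichever side builds the URL, lookups need to end up on the same ZIM item.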
Test WARC for the Javascript part: https://tmp.kiwix.org/ci/javascript-and-vimeo-fixes/crawl-javascript-20240412.warc.gz
Test ZIMs for the Javascript part:
Test WARCs for Vimeo part:
Test ZIMs for Vimeo part: