Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix Javascript rewriting: module detection + wombat JS URL rewriting / enhance Vimeo support #228

Merged
merged 21 commits into from
May 2, 2024

Conversation

benoit74
Copy link
Collaborator

@benoit74 benoit74 commented Apr 11, 2024

Fix #189
Fix #227

This also significantly enhance the situation regarding support of Vimeo videos: #165 (but there is still some black magic from my PoV around DS rules + there is still a zimit/crawler issue to properly retrieve the whole video)

Changes:

  • Move JS code in wombat_setup.js to a "standalone" JS codebase
    • This allows to apply traditional JS tooling on the codebase (prettier and eslint for now)
    • This allows to test and lint the codebase
    • wombatSetup.js (renamed to match JS conventions) is now generated with rollup bundler, same bundler than webrecorder is using to bundle wombat.js
  • Fuzzy rules are generated (via a Python script) from a JSON data source to both Python and JS code
    • The main driver was the move to a "standalone" JS codebase
    • This is however also a nice move to simplify logic (it is now way clearer where these rules are and where they are used)
    • This is also a nice move to allow to test them in the near future in JS as well (important to confirm they have the same effect in Python and in JS, since it must be kept in sync)
  • Fix a significant number of bugs in the JS URL rewriting function from wombatSetup.js, mainly around special characters in path and in hostname
    • This also showed the need for more tests in Python codebase around URL rewriting due to edge case detected while debugging JS code
    • A very attentive reader might note that Python and JS URL rewriting are not totally aligned due to small difference between Python and JS considerations of what needs to be encoded
      • Python quotes everything but A–Z a–z 0–9 - _ . ~
      • JS quotes everything but A–Z a–z 0–9 - _ . ~ ! * ' ( )
      • e.g. a document at ZIM path doc-um)ent.html will be accessed from docu-m%29ent.html if URL is statically rewritten by Python code and from docu-m)ent.html if URL is dynamically rewritten by wombat.js and our JS rewriting function
      • We probably do not care, ZIM path is never encoded anyway, and readers are expected to be sufficiently robust to support both encoded and unencoded (or more exactly less-encoded) URLs
  • JS modules are now detected by interpreting HTML and JS code rather than relying on regular expressions
    • in HTML code, <script type="module"> and <link rel="modulepreload"> are both supported (even if the modulepreload makes probably no sense if the script is not used somewhere after
    • in JS code, import ... statements are used to detect "child" JS modules
      • we assume a single JS file cannot be at the same time a module and not a module
    • Both HTML and JS rewriter are notifying that they've found a JS module based on its ZimPath
      • when notifying, the scraper store this information in a in-memory set
    • When rewriting JS, we check if current element is JS module based on the in-memory set
    • This change needed some code adaptations in HTMLRewriter so that utility function of this file have access to the current HTMLRewriter members
  • Vimeo fuzzy rules have been adapted as well
    • While wabac.js are supporting Vimeo URLs like gcs-vimeo.akamaized.net, vod.akamaized.net and vod-progressive.akamaized.net, our test case on test-website is using vod-adaptive.akamaized.net
    • I've detected another issue with image placeholders on i.vimeocdn.com domain where resolution is dynamically adapted based on browser viewport size most probably
  • Many JS test cases have been deployed on the test-website (most have already been merged before this PR)

Open issue left for later: #230 (wombat issue with Javascript importmap, not something we can fix on our own)

Test WARC for the Javascript part: https://tmp.kiwix.org/ci/javascript-and-vimeo-fixes/crawl-javascript-20240412.warc.gz

Test ZIMs for the Javascript part:

Test WARCs for Vimeo part:

Test ZIMs for Vimeo part:

@benoit74 benoit74 self-assigned this Apr 11, 2024
@benoit74 benoit74 force-pushed the fix_js_rewriting branch 2 times, most recently from c3dee0b to ee3fa4b Compare April 11, 2024 13:07
Copy link

codecov bot commented Apr 11, 2024

Codecov Report

Attention: Patch coverage is 93.33333% with 6 lines in your changes are missing coverage. Please review.

Project coverage is 86.03%. Comparing base (3de8ab8) to head (2adb325).
Report is 1 commits behind head on warc2zim2.

❗ Current head 2adb325 differs from pull request most recent head fe62acb. Consider uploading reports for the commit fe62acb to get more accurate results

Files Patch % Lines
src/warc2zim/content_rewriting/generic.py 57.14% 2 Missing and 1 partial ⚠️
src/warc2zim/content_rewriting/html.py 96.66% 1 Missing and 1 partial ⚠️
src/warc2zim/converter.py 80.00% 0 Missing and 1 partial ⚠️
Additional details and impacted files
@@              Coverage Diff              @@
##           warc2zim2     #228      +/-   ##
=============================================
+ Coverage      85.98%   86.03%   +0.05%     
=============================================
  Files             13       13              
  Lines           1049     1060      +11     
  Branches         195      199       +4     
=============================================
+ Hits             902      912      +10     
  Misses           116      116              
- Partials          31       32       +1     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

- JS code used to setup wombat.js now lives in a dedicated JS subproject
- JS code is compiled by rollup
- Fuzzy rules are defined in a data-driven JSON file
- This JSON file is used to generate both the Python and JS code that will
  use them
@benoit74 benoit74 force-pushed the fix_js_rewriting branch 5 times, most recently from fcfc8aa to 2adb325 Compare April 12, 2024 11:47
@benoit74 benoit74 changed the title Fix Vimeo and Javascript rewriting: module detection + wombat JS URL rewriting Fix Javascript rewriting: module detection + wombat JS URL rewriting / enhance Vimeo support Apr 12, 2024
@benoit74 benoit74 marked this pull request as ready for review April 12, 2024 11:49
@benoit74 benoit74 marked this pull request as draft April 12, 2024 11:50
@benoit74 benoit74 marked this pull request as ready for review April 12, 2024 13:14
@benoit74 benoit74 requested review from mgautierfr and rgaudin April 12, 2024 13:15
@benoit74
Copy link
Collaborator Author

@rgaudin @mgautierfr good luck, I tried to made meaningful commits and to not break everything at once, but this is still a huge PR (but result is stunning, so definitely worth it I hope)

Copy link
Contributor

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Impressive work. I have few remarks but mostly good.

I think your solution (registering js module as we found them in the same time we rewrite content) works because we handle content in the same order it as been recorded (at browsertrix load js after they are "requested" in parsed content).

It think it would be good to mention this somewhere.

.gitignore Outdated Show resolved Hide resolved
javascript/test/wombatUrlRewriting.js Outdated Show resolved Hide resolved
javascript/test/wombatUrlRewriting.js Show resolved Hide resolved
javascript/src/wombatSetup.js Outdated Show resolved Hide resolved
javascript/test/wombatUrlRewriting.js Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting/html.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting/html.py Outdated Show resolved Hide resolved
rules/rules.json Show resolved Hide resolved
javascript/src/wombatSetup.js Show resolved Hide resolved
rules/rules.json Show resolved Hide resolved
@benoit74
Copy link
Collaborator Author

I think your solution (registering js module as we found them in the same time we rewrite content) works because we handle content in the same order it as been recorded (at browsertrix load js after they are "requested" in parsed content).

It think it would be good to mention this somewhere.

I thought I did, but it looks like you did not found it and I do not find it either now, so ... But anyway, yes, you are right.

@benoit74
Copy link
Collaborator Author

@rgaudin please review the code as-is, in fact there isn't much changes expected for now and changing them now could even make your review harder.

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ; I've noted somme minor suggestions.

docs/technical_architecture.md Outdated Show resolved Hide resolved
javascript/src/wombatSetup.js Show resolved Hide resolved
javascript/src/wombatSetup.js Outdated Show resolved Hide resolved
javascript/src/wombatSetup.js Outdated Show resolved Hide resolved
javascript/src/wombatSetup.js Outdated Show resolved Hide resolved
tests/conftest.py Show resolved Hide resolved
src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting/html.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting/html.py Outdated Show resolved Hide resolved
src/warc2zim/content_rewriting/html.py Show resolved Hide resolved
@benoit74 benoit74 force-pushed the fix_js_rewriting branch 3 times, most recently from f0d8e80 to 52b1eef Compare April 30, 2024 09:19
@benoit74 benoit74 requested review from mgautierfr and rgaudin April 30, 2024 09:29
@benoit74
Copy link
Collaborator Author

@mgautierfr @rgaudin this is ready for a new review. I still hope it is not going to be too painful, but I know the truth is probably harder than that. As usual, I've left conversations opened when I consider you might wanna check what has been done since your last review. I hope I did not forgot anything.

Tell me if it is just too painful and you consider you might merge and see what's going on in real test cases.

Copy link
Member

@rgaudin rgaudin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you ; easier to read with those improvements. Spotted a couple of minor stuff that you should look into but LGTM.

javascript/src/wombatSetup.js Outdated Show resolved Hide resolved
javascript/src/wombatSetup.js Show resolved Hide resolved
rules/generate_rules.py Outdated Show resolved Hide resolved
@benoit74
Copy link
Collaborator Author

renaud comments have been fixed

Copy link
Contributor

@mgautierfr mgautierfr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a small comment on code. Mostly a matter of style.
Will not block for that.

Else we are good so it is an approval.

src/warc2zim/url_rewriting.py Outdated Show resolved Hide resolved
benoit74 added 15 commits May 2, 2024 19:49
It happens that JS code builds the whole URL, i.e. already containing
the rewritten bits from prefix ; in such a situation, we want to remove
the prefix before rewriting the URL, so that we are back with a
non-rewriten URL
We also need to rewrite Vimeo preview image because it looks like the
resolution required is dynamic.

Current rewrite expression is not perfect because we never know which
resolution will be queried first, so the item added to the ZIM might be
a low-res image ; but at least, we have an image displayed in ZIM UI
thanks to this fuzzy rule which leads to a ZIM item under all conditions
- Fix the JS template to be as much in-line with prettier requirement
- Ignore however the generated file, we do not care about remaining
  issues (lines too long) / it would be hard to fix
This change is needed to properly handle cases where the link is
relative, already rewritten and containing the "original hostname" in
the path
@benoit74 benoit74 force-pushed the fix_js_rewriting branch from 1f07b64 to fe62acb Compare May 2, 2024 19:49
@benoit74 benoit74 merged commit c97832e into warc2zim2 May 2, 2024
4 checks passed
@benoit74 benoit74 deleted the fix_js_rewriting branch May 2, 2024 19:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants