Adblock based extractor does not consider the resource types. #165

Open
MRuecklCC opened this issue Sep 21, 2022 · 4 comments

Comments

@MRuecklCC
Contributor

Currently, the adblock-based extractors use the following configuration for all links that are checked:

    self.adblock_parser_options = {
        "script": True,
        "image": True,
        "stylesheet": True,
        "object": True,
        "xmlhttprequest": True,
        "object-subrequest": True,
        "subdocument": True,
        "document": True,
        "elemhide": True,
        "other": True,
        "background": True,
        "xbl": True,
        "ping": True,
        "dtd": True,
        "media": True,
        "third-party": True,
        "match-case": True,
        "collapse": True,
        "donottrack": True,
        "websocket": True,
        # "domain" key must be populated on use
    }

This means that the following content/filter combination produces false matches of the filter rules:

    content = "<img src='https://some-domain.com/ads/123.jpg'>"
    filters = ["/ads/*$script"]

I would expect this not to block the image, because the filter should only apply to scripts.
However, the current implementation throws all links into one bucket and, via the configuration above, treats each of them as every resource type at the same time.

The alternative adblock Python package (based on a Rust implementation) handles this differently: its filter engine API expects the resource type (roughly, the context in which the link appears) as an additional argument.
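To make the expected behaviour concrete, here is a minimal sketch of a check that takes the resource type into account. It is not the API of either adblock package. `parse_filter` and `should_block` are hypothetical helpers, and the matcher only understands `*` wildcards and positive type options, which is far less than the real filter syntax.

```python
import re
from typing import Pattern, Set, Tuple

# Resource-type options taken from the configuration shown above.
KNOWN_TYPES = {
    "script", "image", "stylesheet", "object", "xmlhttprequest",
    "subdocument", "document", "other", "ping", "media", "websocket",
}


def parse_filter(rule: str) -> Tuple[Pattern[str], Set[str]]:
    """Split 'pattern$opt1,opt2' into a URL regex and a set of resource types."""
    pattern, _, options = rule.partition("$")
    # Only '*' is treated as a wildcard; everything else is matched literally.
    regex = re.compile(".*".join(re.escape(part) for part in pattern.split("*")))
    # Options that are not resource types (e.g. third-party) are ignored here.
    types = {opt for opt in options.split(",") if opt in KNOWN_TYPES}
    return regex, types


def should_block(url: str, resource_type: str, rule: str) -> bool:
    """Return True only if the URL matches and the rule covers this type."""
    regex, types = parse_filter(rule)
    if types and resource_type not in types:
        return False
    return regex.search(url) is not None
```

With the example above, `should_block("https://some-domain.com/ads/123.jpg", "image", "/ads/*$script")` is false, while the same URL checked as `"script"` is blocked.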

@MRuecklCC changed the title from "Adblock based extractor does consider the resource types." to "Adblock based extractor does not consider the resource types." on Sep 21, 2022
@MRuecklCC
Contributor Author

See https://adblockplus.org/filter-cheatsheet for an explanation of the filter rule syntax.

@MRuecklCC
Contributor Author

MRuecklCC commented Sep 21, 2022

After a little bit of research... There are two different approaches:

  • Extract the links from the received HTML (obtained via playwright) and check whether each one should be blocked. This means metalookup would have to deduce the resource type from whether the link is part of an <img> or <a> tag, and from the other attributes of that tag. Essentially, we would reimplement logic that already exists in the browser. Another downside of this approach is that content loaded via a script block will not be checked at all (as it does not show up in the DOM).
  • Use playwright directly to hook into the request/response stream while the website is loaded. Examples can be found here: https://www.zenrows.com/blog/blocking-resources-in-playwright.
    This approach would avoid reimplementing the "resource-type deduction" in Python (as we get the resource type from the browser before the request is issued).

With the second approach, it may also be possible to drastically reduce the playwright page load time (while still tracking which requests were blocked). This would also benefit the user, because extraction is potentially faster.
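A rough sketch of the second approach, assuming Playwright's async Python API (`page.route`, `route.abort`, `route.continue_`, `request.resource_type`). `is_blocked`, `BLOCKED_TYPES` and `load_and_track` are hypothetical names, and the type-based policy is only a placeholder for a real filter engine lookup. The sketch launches a local Chromium rather than connecting over CDP as the current code does.

```python
from typing import List, Tuple

# Placeholder policy (assumption): block by resource type only.
# A real implementation would ask the adblock filter engine instead.
BLOCKED_TYPES = {"image", "media", "font"}


def is_blocked(url: str, resource_type: str) -> bool:
    """Stand-in for a filter engine lookup that receives the resource type."""
    return resource_type in BLOCKED_TYPES


async def load_and_track(url: str) -> List[Tuple[str, str]]:
    """Load `url`, abort blocked requests, and return what was blocked."""
    # Imported lazily so the decision logic above works without Playwright.
    from playwright.async_api import async_playwright

    blocked: List[Tuple[str, str]] = []

    async def handle(route):
        request = route.request
        if is_blocked(request.url, request.resource_type):
            # Record the decision, then skip the download entirely.
            blocked.append((request.url, request.resource_type))
            await route.abort()
        else:
            await route.continue_()

    async with async_playwright() as p:
        browser = await p.chromium.launch()
        page = await browser.new_page()
        await page.route("**/*", handle)  # intercept every request
        await page.goto(url)
        await browser.close()
    return blocked
```

It could be run with `asyncio.run(load_and_track("https://example.com"))`. Because blocked requests are aborted before any bytes are fetched, this is also where the load-time saving would come from.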

@MRuecklCC
Contributor Author

The current implementation should also allow a quick fix. It already tracks the issued requests:

 # from content.py
...
        async def _task():
            with runtime() as t:
                async with async_playwright() as p:
                    self._responses = []
                    self._requests = []
                    browser = await p.chromium.connect_over_cdp(endpoint_url=PLAYWRIGHT_WS_ENDPOINT)
                    page = await browser.new_page()

                    async def on_request(request: Request):
                        self._requests.append(request)
                        await request.all_headers()
...

This means that, instead of analyzing content.raw_links, we could analyze content.requests, each of which contains the requested URL and the resource type. In other words, we could decide in hindsight whether a request should have been blocked.
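A sketch of that hindsight check, under a few assumptions: `RecordedRequest` is a stand-in exposing only the `url` and `resource_type` attributes of a recorded Playwright request, `requests_to_block` is a hypothetical helper, and `should_block` is whatever filter lookup ends up being used. The type mapping is an assumption too. Playwright reports `xhr` and `fetch`, whereas the filter options above use `xmlhttprequest`.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class RecordedRequest:
    """Stand-in for a recorded request; only the fields used below."""
    url: str
    resource_type: str


# Assumed translation from Playwright resource types to filter option names.
# Types not listed here are passed through unchanged.
PLAYWRIGHT_TO_ADBLOCK = {
    "xhr": "xmlhttprequest",
    "fetch": "xmlhttprequest",
}


def requests_to_block(
    requests: Iterable[RecordedRequest],
    should_block: Callable[[str, str], bool],
) -> List[str]:
    """Return the URLs of already issued requests that a filter would block."""
    flagged = []
    for request in requests:
        adblock_type = PLAYWRIGHT_TO_ADBLOCK.get(
            request.resource_type, request.resource_type
        )
        if should_block(request.url, adblock_type):
            flagged.append(request.url)
    return flagged
```

For instance, with a rule that only targets scripts under `/ads/`, an image request to `/ads/123.jpg` is left alone while a script request to the same path is flagged.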

@MRuecklCC
Contributor Author

See #166 for more discussion on the adblocking topic.
