Adblock based extractor does not consider the resource types. #165
Comments
See https://adblockplus.org/filter-cheatsheet for an explanation of the filter rule syntax.
After a little bit of research... There are two different approaches:
In the second approach, it may also be possible to drastically reduce the Playwright page load time while still tracking which requests were blocked. This would also benefit the user, because extraction would potentially be faster.
The current implementation should also allow a quick fix. It already tracks the issued requests:
# from content.py
...
async def _task():
    with runtime() as t:
        async with async_playwright() as p:
            self._responses = []
            self._requests = []
            browser = await p.chromium.connect_over_cdp(endpoint_url=PLAYWRIGHT_WS_ENDPOINT)
            page = await browser.new_page()

            async def on_request(request: Request):
                self._requests.append(request)
                await request.all_headers()
            ...
This means, instead of analyzing the content.raw_links, we could also analyze the
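As a rough sketch of that quick fix: Playwright exposes the type of each request as `request.resource_type`, which could be translated into a filter type option per request. The mapping table, the `RecordedRequest` stand-in class, and the `filter_options_for` helper below are hypothetical and not part of the project. The `document` entry in particular is an assumption, since Playwright reports both main-frame and iframe loads as `document`.

```python
from __future__ import annotations

from dataclasses import dataclass

# Assumed mapping: Playwright resource type -> Adblock Plus filter type option.
RESOURCE_TYPE_TO_FILTER_OPTION = {
    "document": "subdocument",  # assumption: treats frame loads as subdocuments
    "stylesheet": "stylesheet",
    "image": "image",
    "media": "media",
    "font": "font",
    "script": "script",
    "xhr": "xmlhttprequest",
    "fetch": "xmlhttprequest",
    "websocket": "websocket",
}


@dataclass
class RecordedRequest:
    """Hypothetical stand-in for the two fields read from a Playwright Request."""

    url: str
    resource_type: str


def filter_options_for(request) -> dict[str, bool]:
    """Return an options dict that enables only the request's own type."""
    option = RESOURCE_TYPE_TO_FILTER_OPTION.get(request.resource_type, "other")
    return {option: True}


recorded = [
    RecordedRequest("https://ads.example.com/banner.png", "image"),
    RecordedRequest("https://ads.example.com/track.js", "script"),
    RecordedRequest("https://api.example.com/data", "fetch"),
]
for request in recorded:
    print(request.url, filter_options_for(request))
```

Because the helper only reads `url` and `resource_type`, the request objects already collected in `self._requests` could in principle be passed to it directly.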
See #166 for more discussion on the adblocking topic.
Currently, the adblock-based extractors use the following configuration for all links that are checked:
This means that the following content/filter combination results in wrong matches of the filter rules:
I would expect this not to block the image, because the filter should only apply to script blocks.
However, the current implementation throws all links into one bucket and, via the configuration, interprets each of them as "all different resource types at the same time".
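To make the failure mode concrete, here is a toy sketch. It is not the matching logic of any real adblock library: the `rule_blocks` helper only understands `||domain^` patterns with optional type options, and the rule and URL are made up for illustration.

```python
from __future__ import annotations


def rule_blocks(rule: str, url: str, active_types: set[str]) -> bool:
    """Toy matcher for '||domain^' rules with optional '$type,...' options."""
    pattern, _, options = rule.partition("$")
    domain = pattern.lstrip("|").rstrip("^")
    host = url.split("://", 1)[-1].split("/", 1)[0]
    if host != domain and not host.endswith("." + domain):
        return False
    rule_types = {opt for opt in options.split(",") if opt}
    # A rule without type options applies to every resource type.
    return not rule_types or bool(rule_types & active_types)


rule = "||ads.example.com^$script"
image_url = "https://ads.example.com/banner.png"

# All types enabled at once: the script-only rule also hits the image.
blocked_all_types = rule_blocks(rule, image_url, {"script", "image", "stylesheet"})
# Only the link's actual type enabled: the rule no longer applies.
blocked_image_only = rule_blocks(rule, image_url, {"image"})
print(blocked_all_types, blocked_image_only)  # prints: True False
```

The only difference between the two calls is the set of active resource types, which is exactly the context the current bucket-based check discards.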
Looking at the alternative adblock python package (based on rust implementation):
The filter engine API here expects the resource type (sort of a context in which the link appears) as additional argument.