Update the crawlers structure #764

Open
vdusek opened this issue Nov 29, 2024 · 1 comment
Labels: debt (Code quality improvement or decrease of technical debt), t-tooling (Issues with this label are in the ownership of the tooling team), v0.5

vdusek commented Nov 29, 2024

Place all the crawlers into a common directory named crawlers:

$ tree src/crawlee/crawlers/
src/crawlee/crawlers//
├── basic_crawler/
│   ├── _basic_crawler.py
│   ├── _context_pipeline.py
│   └── __init__.py
├── beautifulsoup_crawler/
│   ├── _beautifulsoup_crawler.py
│   ├── _beautifulsoup_crawling_context.py
│   ├── _beautifulsoup_parser.py
│   └── __init__.py
├── http_crawler/
│   ├── _http_crawler.py
│   ├── _http_crawling_context.py
│   └── __init__.py
├── parsel_crawler/
│   ├── __init__.py
│   ├── _parsel_crawler.py
│   ├── _parsel_crawling_context.py
│   └── _parsel_parser.py
├── playwright_crawler/
│   ├── __init__.py
│   ├── _playwright_crawler.py
│   ├── _playwright_crawling_context.py
│   ├── _playwright_pre_navigation_context.py
│   └── _utils.py
├── static_content_crawler/
│   ├── __init__.py
│   ├── _static_content_crawler.py
│   ├── _static_content_parser.py
│   └── _static_crawling_context.py
└── __init__.py

with __init__.py:

from .basic_crawler import BasicCrawler, BasicCrawlingContext
from .beautifulsoup_crawler import BeautifulSoupCrawler, BeautifulSoupCrawlingContext
from .http_crawler import HttpCrawler, HttpCrawlingContext
from .parsel_crawler import ParselCrawler, ParselCrawlingContext
from .playwright_crawler import PlaywrightCrawler, PlaywrightCrawlingContext
from .static_content_crawler import StaticContentCrawler, StaticContentCrawlingContext

This makes imports like the following possible:

from crawlee.crawlers import HttpCrawler, HttpCrawlingContext

@vdusek added the t-tooling and debt labels Nov 29, 2024
@vdusek added the v0.5 label Nov 29, 2024

janbuchar commented

I'm afraid that the __init__.py file you propose here won't work (without some black magic) - the individual crawlers have optional dependencies and if you don't have those installed, there will be mayhem.
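
One possible form of that "black magic" is a module-level `__getattr__` (PEP 562), which defers each import until the name is first accessed, so a missing optional dependency only fails for the crawler that needs it. The following is a minimal sketch under stated assumptions, not what the project actually does: `make_lazy_getattr`, the `demo_crawlers` stand-in package, the module name `not_installed_playwright_backend`, and the extra names are all hypothetical and exist only for illustration.

```python
import importlib
import sys
import types


def make_lazy_getattr(module_name: str, exports: dict[str, tuple[str, str]]):
    """Build a module-level __getattr__ (PEP 562) that imports names lazily.

    `exports` maps a public name to (module to import it from, extra to suggest).
    Both the helper and the mapping layout are assumptions made for this sketch.
    """

    def __getattr__(name: str):
        if name not in exports:
            raise AttributeError(f"module {module_name!r} has no attribute {name!r}")
        target, extra = exports[name]
        try:
            module = importlib.import_module(target)
        except ImportError as exc:
            # Fail only for the requested name, with a hint about what to install.
            raise ImportError(
                f"{name} needs an optional dependency; "
                f"try installing the {extra!r} extra."
            ) from exc
        value = getattr(module, name)
        # Cache on the package so later lookups bypass __getattr__.
        setattr(sys.modules[module_name], name, value)
        return value

    return __getattr__


# Stand-in for a package __init__.py, built in memory so the sketch is runnable.
demo = types.ModuleType("demo_crawlers")
sys.modules["demo_crawlers"] = demo
demo.__getattr__ = make_lazy_getattr(
    "demo_crawlers",
    {
        # Resolves, because the standard library always provides it.
        "OrderedDict": ("collections", "core"),
        # Hypothetical module that is not installed, to show the failure path.
        "PlaywrightCrawler": ("not_installed_playwright_backend", "playwright"),
    },
)

from demo_crawlers import OrderedDict  # imported only at this point

try:
    from demo_crawlers import PlaywrightCrawler
except ImportError as error:
    print(error)
```

With this pattern, importing the package itself never touches the optional dependencies. The trade-off is that static type checkers and IDEs cannot see lazily exported names unless they are also declared, for example under `if TYPE_CHECKING:`.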

@vdusek self-assigned this Dec 2, 2024