A high-performance parallel web scraper that converts websites into organized Markdown files. Built with Python on top of crawl4ai, requests, psutil, rich, and asyncio, this tool efficiently processes websites by leveraging their sitemaps and supports concurrent scraping with built-in memory monitoring.
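Under the hood, each page fetch boils down to rendering the page in a headless browser via crawl4ai and writing the result out as Markdown. A minimal sketch of that core idea, based on crawl4ai's basic `AsyncWebCrawler` usage rather than this project's actual internals (the `fetch_as_markdown` helper and the output path are illustrative):

```python
import asyncio
from pathlib import Path

from crawl4ai import AsyncWebCrawler

async def fetch_as_markdown(url: str, out_path: Path) -> None:
    # crawl4ai drives a headless browser and converts the page to Markdown.
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url=url)
        out_path.write_text(str(result.markdown), encoding="utf-8")

if __name__ == "__main__":
    asyncio.run(fetch_as_markdown("https://example.com", Path("index.md")))
```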
## Features

- 🚀 Parallel scraping with configurable concurrency
- 📑 Automatic sitemap detection and processing
- 📁 Organized output with clean directory structure
- 💾 Memory-efficient with built-in monitoring
- 🌐 Browser-based scraping using crawl4ai
- 📊 Progress tracking and detailed logging
- 🔍 Preview mode with dry-run option
## Requirements

- Python 3.7+
- crawl4ai
- rich
- psutil
- requests
## Installation

- Clone the repository:

  ```bash
  git clone https://github.com/rkabrick/scrape.git
  cd scrape
  ```

- Install dependencies:

  ```bash
  pip install -r requirements.txt
  ```
## Usage

Basic usage:

```bash
python scrape https://example.com
```
```
scrape [-h] [--max-concurrent MAX_CONCURRENT] [-v] [--dry-run] url
```

Arguments (an illustrative argparse sketch follows the list):

- `url`: The target URL to scrape (must include `http://` or `https://`)
- `--max-concurrent`: Maximum number of concurrent scrapers (default: 3)
- `-v`: Increase verbosity level:
  - `-v`: Show file names
  - `-vv`: Show browser output
  - `-vvv`: Show memory monitoring
- `--dry-run`: Preview the file structure without performing the scrape
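Here is one plausible way these options map onto Python's standard argparse. This is a hedged reconstruction, not the project's actual source (`build_parser` is a hypothetical name):

```python
import argparse

def build_parser() -> argparse.ArgumentParser:
    # Hypothetical parser matching the synopsis above.
    parser = argparse.ArgumentParser(prog="scrape")
    parser.add_argument("url",
                        help="target URL to scrape (must include http:// or https://)")
    parser.add_argument("--max-concurrent", type=int, default=3,
                        help="maximum number of concurrent scrapers")
    parser.add_argument("-v", "--verbose", action="count", default=0,
                        help="-v file names, -vv browser output, -vvv memory monitoring")
    parser.add_argument("--dry-run", action="store_true",
                        help="preview the file structure without scraping")
    return parser

if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args.url, args.max_concurrent, args.verbose, args.dry_run)
```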
## Examples

- Basic scraping:

  ```bash
  scrape https://example.com
  ```

- Scraping with increased concurrency:

  ```bash
  scrape --max-concurrent 5 https://example.com
  ```

- Preview mode with file structure:

  ```bash
  scrape --dry-run https://example.com
  ```

- Verbose output with memory monitoring:

  ```bash
  scrape -vvv https://example.com
  ```
## Output Structure

The scraper creates an organized directory structure based on the website's URL paths. For example:

```
example.com/
├── index.md
├── about/
│   └── index.md
├── blog/
│   ├── post1.md
│   └── post2.md
└── products/
    ├── category1/
    │   └── item1.md
    └── category2/
        └── item2.md
```
## Sitemap Handling

- Automatically detects and processes XML sitemaps
- Falls back to single-URL processing if no sitemap is found
- Supports both simple and nested sitemap structures (see the sketch below)
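A rough sketch of that detection flow, assuming the conventional `/sitemap.xml` location and the standard sitemap XML namespace (the helper names are illustrative, not the project's actual code):

```python
import xml.etree.ElementTree as ET

import requests

NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

def parse_sitemap(xml_bytes: bytes) -> list[str]:
    root = ET.fromstring(xml_bytes)
    if root.tag.endswith("sitemapindex"):
        # Nested case: a sitemap index points at further sitemaps; recurse.
        urls: list[str] = []
        for loc in root.findall("sm:sitemap/sm:loc", NS):
            child = requests.get(loc.text.strip(), timeout=10)
            urls.extend(parse_sitemap(child.content))
        return urls
    # Simple case: a <urlset> lists page URLs directly.
    return [loc.text.strip() for loc in root.findall("sm:url/sm:loc", NS)]

def collect_urls(base_url: str) -> list[str]:
    # Try the conventional sitemap location; fall back to the single URL.
    resp = requests.get(base_url.rstrip("/") + "/sitemap.xml", timeout=10)
    return parse_sitemap(resp.content) if resp.ok else [base_url]

print(collect_urls("https://example.com"))
```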
## Performance

- Built-in memory monitoring for resource-intensive operations
- Configurable concurrent scraping to balance performance and resource usage (sketched below)
- Automatic cleanup of browser instances
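As a hedged sketch of how the concurrency cap and memory monitoring can work together, combining an `asyncio.Semaphore` with psutil's resident-set-size reading (the names, threshold-free reporting, and placeholder fetch are illustrative):

```python
import asyncio

import psutil

MAX_CONCURRENT = 3  # mirrors the --max-concurrent default

def rss_mb() -> float:
    # Resident set size of the current process, in megabytes.
    return psutil.Process().memory_info().rss / (1024 * 1024)

async def scrape_one(url: str, sem: asyncio.Semaphore) -> None:
    async with sem:  # at most MAX_CONCURRENT fetches run at once
        print(f"scraping {url} (rss={rss_mb():.1f} MB)")
        await asyncio.sleep(0.1)  # stand-in for the real browser fetch

async def scrape_all(urls: list[str]) -> None:
    sem = asyncio.Semaphore(MAX_CONCURRENT)
    await asyncio.gather(*(scrape_one(u, sem) for u in urls))

asyncio.run(scrape_all([f"https://example.com/page{i}" for i in range(10)]))
```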
## File Organization

- Intelligent path handling and file naming (see the sketch below)
- Duplicate file name resolution
- Clean, SEO-friendly file structure
- Markdown output for compatibility
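A minimal sketch of how a URL could map to an output path, including duplicate-name resolution with a numeric suffix. This is an assumed approach for illustration, not the project's actual implementation:

```python
from pathlib import Path
from urllib.parse import urlparse

def url_to_path(url: str, used: set[Path]) -> Path:
    # Mirror the URL's path segments as directories; the site root becomes index.md.
    parsed = urlparse(url)
    segments = [s for s in parsed.path.split("/") if s]
    if segments:
        candidate = Path(parsed.netloc, *segments[:-1], f"{segments[-1]}.md")
    else:
        candidate = Path(parsed.netloc, "index.md")
    # Resolve clashes by appending -1, -2, ... before the extension.
    unique, n = candidate, 1
    while unique in used:
        unique = candidate.with_name(f"{candidate.stem}-{n}{candidate.suffix}")
        n += 1
    used.add(unique)
    return unique

seen: set[Path] = set()
for u in ("https://example.com", "https://example.com/blog/post1",
          "https://example.com/blog/post1"):
    print(url_to_path(u, seen))
```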
## License

This project is licensed under the MIT License; see the LICENSE file for details.
## Acknowledgments

- Built with crawl4ai for reliable web scraping
- Uses rich for beautiful terminal output
- Memory monitoring powered by psutil
## Support

For issues, questions, or contributions, please open an issue in the GitHub repository.