Indexbot is the primary web crawler of the Open Web Index. It is designed to crawl a large portion of the web in order to build an independent web index that can serve research projects and independent search engines.
The crawler is built with Scrapy, an open-source Python framework for writing web crawlers.
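If you have not used Scrapy before, the sketch below shows the general shape of a Scrapy spider. The spider name, start URL, and extracted fields are hypothetical illustrations, not Indexbot's actual code:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider for illustration; not Indexbot's actual code.
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Emit one record per page: its URL and title.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow outgoing links and parse them with this same method.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Scrapy handles request scheduling, politeness delays, and robots.txt compliance through its settings, which is part of why it suits large crawls.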
Our Indexbot respects robots.txt files. If you wish to block our bot, disallow it in your website's robots.txt file:
User-agent: indexbot
Disallow: /
If you want to unblock our bot again, add an empty Disallow rule for it in your website's robots.txt file:
User-agent: indexbot
Disallow:
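If you want to verify how a crawler identifying itself as indexbot would read your rules, one option is Python's standard-library robots.txt parser. A minimal sketch, with example.com standing in for your own site:

```python
from urllib.robotparser import RobotFileParser

# Point the parser at your site's robots.txt (example.com is a placeholder).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the file

# True if the current rules let "indexbot" fetch the given URL.
print(parser.can_fetch("indexbot", "https://example.com/some/page"))
```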
The Open Web Index is designed to be a publicly available, free-to-use, and open-source index of the web. We may ask for support or funding to cover the costs of indexing the web and storing our large datasets.
If our bot has captured personal information or illegal, copyrighted, or otherwise licensed material, please contact us immediately and list all affected files. Our bot only captures content that is publicly available on the internet.
- Pull the image and create a container
docker run -d -v $(pwd)/data:/data --name indexbot ghcr.io/openwebindex/indexbot:latest
- Check if the container is running
docker ps
- Manage the container
docker logs indexbot
docker start indexbot
docker stop indexbot
- Retrieve the output
docker cp indexbot:app/output/crawled_data.jl crawled_data.jl
docker cp indexbot:app/output/crawled_sites.txt crawled_sites.txt
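The .jl extension suggests JSON Lines output (one JSON object per line). Assuming that format holds, a short sketch for inspecting the retrieved file:

```python
import json

# Read the crawler's output, assuming JSON Lines format
# (one JSON object per line), as the .jl extension suggests.
with open("crawled_data.jl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)  # one crawled item per line
```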
- Clone the repository and `cd` into it
- Create a Python (3.11.x or above) virtual environment and activate it (optional but recommended)
- Install requirements
pip install -r requirements.txt
- Start the crawler
cd indexbot
scrapy crawl indexbot
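If you prefer launching the spider from Python rather than the command line, Scrapy's CrawlerProcess can run it programmatically. A minimal sketch, assuming it is executed from the project directory so the project settings and the indexbot spider resolve by name:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings so the "indexbot" spider can be
# resolved by name, just as `scrapy crawl indexbot` does.
process = CrawlerProcess(get_project_settings())
process.crawl("indexbot")
process.start()  # blocks until the crawl finishes
```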