Indexbot is the primary web crawler of the Open Web Index. It is designed to crawl a large portion of the web in order to build an independent web index that can serve research projects and independent search engines.
The crawler is built with Scrapy, an open-source Python framework for writing web crawlers.
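If you have not used Scrapy before, the sketch below shows the general shape of a Scrapy spider. The spider name, start URL, and extracted fields are hypothetical illustrations, not Indexbot's actual code:

```python
import scrapy

class ExampleSpider(scrapy.Spider):
    # Hypothetical spider for illustration; not Indexbot's actual code.
    name = "example"
    start_urls = ["https://example.com"]

    def parse(self, response):
        # Emit one record per page: its URL and title.
        yield {
            "url": response.url,
            "title": response.css("title::text").get(),
        }
        # Follow outgoing links and parse them with this same method.
        for href in response.css("a::attr(href)").getall():
            yield response.follow(href, callback=self.parse)
```

Scrapy handles request scheduling, politeness delays, and robots.txt compliance through its settings, which is part of why it suits large crawls.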
Our Indexbot respects robots.txt files. If you wish to block our bot, disallow it in your website's robots.txt file:
User-agent: indexbot
Disallow: /
If you want to unblock our bot again, add an empty Disallow rule for it in your website's robots.txt file:
User-agent: indexbot
Disallow:
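If you want to verify how a crawler identifying itself as indexbot would read your rules, one option is Python's standard-library robots.txt parser. A minimal sketch, with example.com standing in for your own site:

```python
from urllib.robotparser import RobotFileParser

# Point the parser at your site's robots.txt (example.com is a placeholder).
parser = RobotFileParser()
parser.set_url("https://example.com/robots.txt")
parser.read()  # fetch and parse the file

# True if the current rules let "indexbot" fetch the given URL.
print(parser.can_fetch("indexbot", "https://example.com/some/page"))
```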
The Open Web Index is designed to be a publicly available, free-to-use, and open-source index of the web. We may ask for support or funding to cover the costs of indexing the web and storing our large datasets.
If our bot has captured personal information or illegal, copyrighted, or otherwise licensed material, please contact us immediately and list all affected files. Our bot only captures content that is publicly available on the internet.
- Pull the image and create a container
docker run -d -v $(pwd)/data:/data --name indexbot ghcr.io/openwebindex/indexbot:latest
- Check if the container is running
docker ps
- Manage the container
docker logs indexbot
docker start indexbot
docker stop indexbot
- Retrieve the output
docker cp indexbot:app/output/crawled_data.jl crawled_data.jl
docker cp indexbot:app/output/crawled_sites.txt crawled_sites.txt
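The .jl extension suggests JSON Lines output (one JSON object per line). Assuming that format holds, a short sketch for inspecting the retrieved file:

```python
import json

# Read the crawler's output, assuming JSON Lines format
# (one JSON object per line), as the .jl extension suggests.
with open("crawled_data.jl", encoding="utf-8") as f:
    for line in f:
        record = json.loads(line)
        print(record)  # one crawled item per line
```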
- Clone the repository and `cd` into it
- Create a Python (3.11.x or above) virtual environment and activate it (optional but recommended)
- Install requirements
pip install -r requirements.txt
- Start the crawler
cd indexbot
scrapy crawl indexbot
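If you prefer launching the spider from Python rather than the command line, Scrapy's CrawlerProcess can run it programmatically. A minimal sketch, assuming it is executed from the project directory so the project settings and the indexbot spider resolve by name:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Load the project's settings so the "indexbot" spider can be
# resolved by name, just as `scrapy crawl indexbot` does.
process = CrawlerProcess(get_project_settings())
process.crawl("indexbot")
process.start()  # blocks until the crawl finishes
```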