This project is a comprehensive real estate data crawler and API service designed to scrape property data from the Bayut website. It uses Selenium for dynamic web scraping, Celery for task scheduling, FastAPI for serving data through a RESTful API, and PostgreSQL for robust data storage. The system is containerized with Docker and orchestrated using Docker Compose.
- Dynamic Web Scraping: Selenium with ChromeDriver enables scraping JavaScript-rendered content (see the sketch after this list).
- Background Task Queue: Celery manages asynchronous tasks for processing and storing scraped data.
- Data Storage: PostgreSQL serves as the database for reliable and structured data storage.
- RESTful API: FastAPI provides endpoints for accessing and aggregating data.
- Dockerized Setup: Docker Compose ensures seamless development and deployment.
- Robust Handling: The scraper skips irrelevant sections and avoids inserting duplicate records into the database.
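
For a flavor of the scraping setup, below is a minimal sketch of a headless-Chromium Selenium session. The flags shown are common container defaults, not necessarily the options the project actually uses in crawler/utils.py.

```python
# Minimal sketch of a headless-Chromium Selenium session. The flags below are
# common Docker defaults, not necessarily the project's exact configuration.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

options = Options()
options.add_argument("--headless=new")          # run Chrome without a display
options.add_argument("--no-sandbox")            # typically required inside containers
options.add_argument("--disable-dev-shm-usage") # avoid small /dev/shm in Docker

driver = webdriver.Chrome(options=options)
try:
    driver.get("https://www.bayut.com")         # JavaScript-rendered page loads fully
    print(driver.title)
finally:
    driver.quit()
```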
- Python: Core programming language.
- Selenium: Web scraping with headless Chromium.
- Celery: Task queue for distributed processing.
- PostgreSQL: Relational database for data persistence.
- FastAPI: Framework for building the API.
- Docker/Docker Compose: Containerization and orchestration.
- Redis: Message broker for Celery.
- SQLAlchemy: ORM for database interactions.
    real_estate_project/
    ├── crawler/
    │   ├── __init__.py
    │   ├── actual_crawler.py   # Main script for crawling and queuing tasks
    │   ├── tasks.py            # Celery tasks for background processing
    │   ├── database.py         # Database models and functions
    │   └── utils.py            # Utility functions for scraping
    ├── api/
    │   ├── __init__.py
    │   └── main.py             # FastAPI application
    ├── docker-compose.yml      # Docker Compose configuration
    ├── Dockerfile              # Dockerfile for building containers
    ├── requirements.txt        # Python dependencies
    └── README.md               # Documentation
- Fetch Listing URLs: The crawler uses Selenium to navigate the website and scrape property URLs dynamically.
- Process Each Listing: For each URL, a Celery task asynchronously processes the page and extracts detailed property information (see the task sketch after this list).
- Store in Database: The scraped data is stored in a PostgreSQL database, avoiding duplicates.
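
To make the pipeline concrete, here is a hedged sketch of the queueing step, assuming the Redis broker from the stack above; the task and helper names are illustrative, not the exact contents of crawler/tasks.py.

```python
# Hypothetical Celery task in the style of crawler/tasks.py. The broker URL
# assumes a Compose service named "redis"; the helpers are stand-ins.
from celery import Celery

app = Celery("crawler", broker="redis://redis:6379/0")

def scrape_listing(url: str) -> dict:
    # Stand-in for the real scraping logic in crawler/utils.py.
    return {"url": url}

def save_listing(data: dict) -> None:
    # Stand-in for the real persistence code in crawler/database.py.
    print("saved", data["url"])

@app.task(bind=True, max_retries=3)
def process_listing(self, url: str) -> None:
    """Scrape one listing page and persist it, retrying on transient errors."""
    try:
        save_listing(scrape_listing(url))
    except Exception as exc:
        raise self.retry(exc=exc, countdown=30)

# The crawler enqueues work instead of scraping inline:
# process_listing.delay("https://www.bayut.com/property/...")
```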
- `/`: Root endpoint with a welcome message.
- `/listings/`: Fetch listings with optional filtering by region and TruCheck status.
- `/listings/region_counts/`: Aggregate listing counts by region (a sketch of this route follows the list).
- `/listings/trucheck_counts/`: Aggregate counts of TruCheck-verified listings by region.
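
As a point of orientation, one way the region-count aggregation could be implemented is sketched below. The `Listing` model, `get_db` dependency, and connection string are illustrative stand-ins for the real code in `api/main.py` and `crawler/database.py`; only the `user` and `real_estate` database names appear elsewhere in this README.

```python
# Illustrative aggregation endpoint; model, dependency, and credentials are
# assumptions, not the project's confirmed code.
from fastapi import Depends, FastAPI
from sqlalchemy import Column, Integer, String, create_engine, func
from sqlalchemy.orm import Session, declarative_base, sessionmaker

Base = declarative_base()

class Listing(Base):
    __tablename__ = "listings"
    id = Column(Integer, primary_key=True)
    region = Column(String)

engine = create_engine("postgresql://user:password@db:5432/real_estate")
SessionLocal = sessionmaker(bind=engine)

app = FastAPI()

def get_db():
    db = SessionLocal()
    try:
        yield db
    finally:
        db.close()

@app.get("/listings/region_counts/")
def region_counts(db: Session = Depends(get_db)):
    # GROUP BY region and count rows per region.
    rows = (
        db.query(Listing.region, func.count(Listing.id))
        .group_by(Listing.region)
        .all()
    )
    return {region: count for region, count in rows}
```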
- Docker and Docker Compose installed on your machine.
- Python 3.12+ (for manual script testing).
- Clone the Repository:

      git clone <repository-url>
      cd real_estate_project

- Build and Run the Docker Containers:

      docker compose build
      docker compose up -d
- Verify the Setup:
  - Check the running containers:

        sudo docker ps

  - Access the FastAPI Swagger UI: navigate to http://localhost:8000/docs in your browser.
- Start the Crawler (a sketch of its core loop follows these steps):

      sudo docker exec -it <crawler_container_name> python crawler/actual_crawler.py
- Access Data via API:
  - Fetch all listings:

        curl http://localhost:8000/listings/

  - Get region counts:

        curl http://localhost:8000/listings/region_counts/
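
As noted in the crawler step above, here is a hypothetical sketch of the core loop in crawler/actual_crawler.py: walk a Bayut search page with Selenium and enqueue one Celery task per listing URL. The start URL, the CSS selector, and the `process_listing` import are assumptions, not the project's confirmed code.

```python
# Hypothetical core loop of crawler/actual_crawler.py. Start URL and selector
# are guesses about Bayut's markup; process_listing is the task sketched above.
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

from crawler.tasks import process_listing  # assumed Celery task module

options = Options()
options.add_argument("--headless=new")
options.add_argument("--no-sandbox")
driver = webdriver.Chrome(options=options)

try:
    driver.get("https://www.bayut.com/to-rent/property/dubai/")  # assumed start page
    seen = set()
    for anchor in driver.find_elements(By.CSS_SELECTOR, "a[href*='/property/']"):
        url = anchor.get_attribute("href")
        if url and url not in seen:     # skip in-page duplicates before queuing
            seen.add(url)
            process_listing.delay(url)  # hand off to a background worker
finally:
    driver.quit()
```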
- Add New Fields: Update `crawler/database.py` to modify the `Listing` model (a sketch follows this list).
- Extend Scraping Logic: Update `crawler/utils.py` to scrape additional property details.
- Expose New API Endpoints: Add routes in `api/main.py` for custom queries or new data endpoints.
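
For example, adding a field might look like the following minimal sketch. The existing columns are guesses based on this README, and `bathrooms` is purely an illustrative addition:

```python
# Hypothetical Listing model in crawler/database.py. Existing columns are
# guesses based on this README; `bathrooms` is an example of a new field.
from sqlalchemy import Boolean, Column, Float, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Listing(Base):
    __tablename__ = "listings"
    id = Column(Integer, primary_key=True)
    url = Column(String, unique=True)    # unique URL helps avoid duplicate rows
    region = Column(String)
    trucheck = Column(Boolean, default=False)
    price = Column(Float)
    bathrooms = Column(Integer)          # <-- new field added here
```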
- Inspect the Celery worker logs:

      sudo docker logs <celery_worker_container>

- Access the PostgreSQL database:

      sudo docker exec -it <postgres_container_name> psql -U user -d real_estate
To clear all data and reset the database:

    docker compose down -v
    docker compose up -d
- Pagination: Add pagination support for API endpoints (one possible shape is sketched after this list).
- Authentication: Secure API with user authentication.
- Advanced Analytics: Provide richer aggregated statistics and insights.
- Cloud Deployment: Deploy the service to AWS, GCP, or Azure for broader accessibility.
- Improved Error Handling: Enhance resilience against scraping errors and timeouts.
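
None of these improvements exist yet. As an illustration only, pagination could take the familiar limit/offset form:

```python
# Entirely hypothetical pagination for the /listings/ endpoint; the real
# endpoint does not support these parameters yet.
from fastapi import FastAPI, Query

app = FastAPI()

@app.get("/listings/")
def list_listings(
    limit: int = Query(50, ge=1, le=200),  # page size, capped at 200
    offset: int = Query(0, ge=0),          # number of rows to skip
):
    # A real implementation would apply .limit(limit).offset(offset) to the
    # SQLAlchemy query; this placeholder just echoes the parameters.
    return {"limit": limit, "offset": offset, "items": []}
```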
Special thanks to the open-source libraries and tools that made this project possible: Selenium, Celery, SQLAlchemy, FastAPI, Docker, and PostgreSQL.
This project is licensed under the MIT License. See LICENSE for details.
If you run into any problems, please open an issue.
Enjoy building and exploring with the Real Estate Data Crawler and API.