This little project aims to:
- get all comments from the Hacker News website;
- update this comment base from time to time;
- trigger a signal once some custom event occurs.
- Currently, the only custom event is the occurrence of specific words.
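The keyword event can be pictured as a case-insensitive substring check over each comment's text. This is a minimal illustration; `contains_keyword` is a hypothetical helper, not code from the project:

```python
def contains_keyword(comment_text, keywords):
    """Return True if any of the keywords occurs in the comment text."""
    text = comment_text.lower()
    return any(word.lower() in text for word in keywords)

# A comment mentioning "Linux" matches the keyword "linux"
print(contains_keyword("I switched to Linux last year", ["linux"]))  # True
```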
Install the dependencies and change the current directory to tutorial:
pip install -r requirements.txt
cd tutorial
# Crawl as many pages as you prefer
scrapy crawl comments
# To download all comments not yet downloaded since the
# last downloaded comment
scrapy crawl comments -s CLOSESPIDER_PAGECOUNT=0
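The incremental update can be pictured as fetching only the ids above the last one stored, relying on Hacker News assigning sequential item ids. This is a sketch of the idea, not the actual spider code:

```python
def ids_to_fetch(last_downloaded_id, current_max_id):
    """Ids published since the last crawl, assuming sequential HN item ids."""
    return list(range(last_downloaded_id + 1, current_max_id + 1))

print(ids_to_fetch(100, 105))  # [101, 102, 103, 104, 105]
```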
docker-compose -f local-compose.yml up
Alternatively, you can configure your own MongoDB server. Check tutorial/tutorial/settings.py for environment variables that may need to be set.
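Reading the connection settings from the environment could look like the snippet below. The variable names here are assumptions for illustration; check tutorial/tutorial/settings.py for the ones the project actually uses:

```python
import os

# Hypothetical variable names -- the real ones are defined in
# tutorial/tutorial/settings.py.
MONGO_URI = os.environ.get("MONGO_URI", "mongodb://localhost:27017")
MONGO_DATABASE = os.environ.get("MONGO_DATABASE", "hn_comments")

print(MONGO_URI, MONGO_DATABASE)
```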
- Install dependencies
pip install pytest==6.0.2
- Run tests
pytest
How to run on Docker
- Install docker
- Install docker-compose
- Run the application.
docker-compose up -d
- NOTE: if you modify the source and want to update the containerized
application, the following steps are required:
docker-compose down   # stop the application
docker-compose build  # rebuild the image
docker-compose up     # start the application again
- Missing tests: could not implement automated tests for the main spider.
- The only automated tests are for the helper.py file.
- Cannot access the hn_comments_crawler container through localhost: could not configure the network correctly. At the moment it is not possible to access the hn_comments_crawler container directly from localhost.
Currently the comments are crawled taking into account only the id field.
- It would be very good to be able to manually specify a lower bound for the IDs, or another criterion such as date. However, date was not taken into account for performance reasons, since it would be necessary to send an extra request for each comment crawled.
- It would also be interesting to have an option to crawl only specific comments.
- The "alarm" consists of dumping the ids of comments containing the linux substring in the linux_ids collection.
- The database name is defined on the settings.py file.
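The alarm logic can be sketched as collecting the ids of matching comments before writing them out. This is a simplified sketch; the actual MongoDB insert into the linux_ids collection is omitted so the snippet stays self-contained:

```python
def matching_ids(comments, substring="linux"):
    """Ids of comments whose text contains the substring, case-insensitively.

    In the project, these ids would then be inserted into the
    linux_ids MongoDB collection.
    """
    return [c["id"] for c in comments if substring in c["text"].lower()]

comments = [
    {"id": 1, "text": "Linux on the desktop"},
    {"id": 2, "text": "unrelated comment"},
]
print(matching_ids(comments))  # [1]
```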
- Hacker news API (github)
- Hacker news API docs
- DODFMiner (for structuring the tests)
- Project template
- Project structuring
- hackeRnews, an R package for getting data from HN
- Leap year
- xpath exact
- xpath and css equivalences cheat sheet
- xpath cheatsheet
- mongodb vs sql
- pymongo docs
- Scrapy architecture overview
- Scrapy+docker
- Docker docs
- docker-cron
- scrapyd
- scrapyd-client
- scrapyd-client installation