Overview

This is written in Python. There is still a lot of work to do, so feel free to help out with development. It takes around 1 hour to fetch all inventory , multithreading can improve the performance

COUPLE OF VAUES ARE HARDCODE WHICH ARE NOTHING JUST CLASS NAME OF HTML TAGS, IT WILL CHANGE WHENEVER THEY DO THE NEXT DEPLOYMENT, SO WE HAVE TO CHANGE ACCORDINGLY

How to run?

Project os developed in python 3.x

Install python 3.x and compatible scrapy

https://scrapy.org/download/

Go to main folder of project
Run

scrapy crawl quotes

or

scrapy crawl quotes -a url='https://sg.abc.com/'

or if you want to run in background

scrapy crawl quotes -a url='https://sg.sbc.com' > 404.txt &

output file will be there in main folder with time stamp.

STATS

We get various error during processing, these are details

'downloader/request_count': 25765,

'downloader/request_method_count/GET': 25765,

'downloader/response_bytes': 2694189714,

'downloader/response_count': 25761,

'downloader/response_status_count/200': 25432,

'downloader/response_status_count/404': 73,

'downloader/response_status_count/500': 30,

'downloader/response_status_count/502': 3,

'downloader/response_status_count/503': 14,

'downloader/response_status_count/504': 209,

Important Note

There are 25K inventory in car category and we are making request from same IP, they may disable in future.

******************************************* END ***************************************

Name		Name	Last commit message	Last commit date
Latest commit History 14 Commits
web_crawler		web_crawler
.gitignore		.gitignore
2019_04_20_22_56_35		2019_04_20_22_56_35
README.md		README.md
scrapy.cfg		scrapy.cfg

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

Overview

About

Releases

Packages

Languages

ianuragsingh/web_crawler

Folders and files

Latest commit

History

Repository files navigation

Overview

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages