GitHub - rmadhok/scrapy-crawl-asp: heavy-duty scraping framework for crawling ASP.net pages

Heavy-Duty Web Crawling Framework for Scraping Data from ASP.net Pages

Latest Update: The Scrapy version is currently broken. ec_crawler.py performs all scraping operations (described below) using only the requests library. scrape_functions.py contains functions for scraping central data, scraping state data, and getting the form data to post to subsequent pages. The next task is to troubleshoot the Scrapy version.

Introduction

This project aims to scrape data from all 250+ pages of this webpage. The site is built with ASP.NET, so there is no separate URL for each page. Instead, clicking the next-page link at the bottom triggers a JavaScript __doPostBack function, which takes some visible and some hidden arguments. The goal is to use the Scrapy framework to iterate through each page, scrape the fields I need, and pipe them to a CSV.

Method

Although Python cannot execute the JavaScript function directly, we can pass its arguments as form data using Scrapy's FormRequest class. The goal is to collect the hidden __VIEWSTATE and __EVENTVALIDATION arguments and pass them along with the other arguments to navigate to each subsequent page. The site should then treat the request like one made by a human clicking the link rather than by a bot. If this doesn't work, other options include Selenium or a headless browser.
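As a rough illustration of the idea above, the sketch below collects the hidden fields from a page and builds the form data for a postback, using only the standard library. It is not the code from ec_spider.py or scrape_functions.py. The sample HTML, the helper names (HiddenInputParser, build_postback_form), the event target "ctl00$GridView1", and the argument "Page$2" are all assumptions made for the example; the real values depend on the target page's markup.

```python
from html.parser import HTMLParser


class HiddenInputParser(HTMLParser):
    """Collect the name/value pair of every <input type="hidden"> element."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback_form(html, event_target, event_argument=""):
    """Return form data that mimics a __doPostBack(target, argument) call."""
    parser = HiddenInputParser()
    parser.feed(html)
    form = dict(parser.fields)  # includes __VIEWSTATE and __EVENTVALIDATION
    form["__EVENTTARGET"] = event_target
    form["__EVENTARGUMENT"] = event_argument
    return form


# Hypothetical page fragment; a real page carries much longer values.
SAMPLE_HTML = """
<form method="post" action="./Default.aspx">
  <input type="hidden" name="__VIEWSTATE" id="__VIEWSTATE" value="dDwtMTA4" />
  <input type="hidden" name="__EVENTVALIDATION" id="__EVENTVALIDATION" value="wEWAgK" />
  <input type="text" name="search" value="ignored" />
</form>
"""

# Hypothetical control name and paging argument.
form = build_postback_form(SAMPLE_HTML, "ctl00$GridView1", "Page$2")
print(form)
```

The resulting dictionary can be passed as formdata to scrapy.FormRequest or as data to requests.post. Because the hidden values change with each response, they need to be re-extracted from every page before requesting the next one.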

Files

items.py: Defines an item class that acts as a container for the data fields we need
pipelines.py: Sets up a pipeline that takes item objects and writes them to a CSV
ec_spider.py: Main scraping script. Passes each scraped item row by row through the pipeline and into the dataset
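To make the pipeline's role concrete, here is a minimal sketch of a CSV-writing pipeline. It is not the contents of pipelines.py. It follows the method names Scrapy conventionally calls on item pipelines (open_spider, process_item, close_spider), but it is written as a plain class so it runs without Scrapy installed. The class name, the field names, and the output path are assumptions for illustration.

```python
import csv
import os
import tempfile


class CsvExportPipeline:
    """Write each item as one CSV row (field names are hypothetical)."""

    FIELDS = ["project", "state", "date"]

    def __init__(self, path="output.csv"):
        self.path = path

    def open_spider(self, spider):
        # Open the file once and write the header row.
        self.file = open(self.path, "w", newline="", encoding="utf-8")
        self.writer = csv.DictWriter(self.file, fieldnames=self.FIELDS)
        self.writer.writeheader()

    def process_item(self, item, spider):
        # Missing fields become empty cells; the item is passed along unchanged.
        self.writer.writerow({k: item.get(k, "") for k in self.FIELDS})
        return item

    def close_spider(self, spider):
        self.file.close()


# Stand-alone demo with a plain dict in place of a scraped item.
path = os.path.join(tempfile.mkdtemp(), "demo.csv")
pipeline = CsvExportPipeline(path)
pipeline.open_spider(spider=None)
pipeline.process_item({"project": "A", "state": "X", "date": "2017-01-01"}, spider=None)
pipeline.close_spider(spider=None)

with open(path, newline="", encoding="utf-8") as f:
    rows = list(csv.DictReader(f))
print(rows)
```

In a Scrapy project, a class like this would be enabled through the ITEM_PIPELINES setting, and the spider would yield items rather than calling the methods directly.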

To-Do

~~Git~~
~~Download scrapy~~
~~Set up container, pipeline, and settings~~
~~Test scraping code for first page~~
Generalize to all pages
Figure out the __doPostBack issue
