Latest update: the Scrapy version is currently broken. ec_crawler.py performs all scraping operations (described below) using only the requests library. scrape_functions.py contains functions for scraping central data, scraping state data, and collecting the form data needed to post to subsequent pages. The next task is to troubleshoot the Scrapy version.
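A minimal sketch of the requests-only flow, assuming BeautifulSoup for parsing; the URL, the `__EVENTTARGET` control name, and the helper names are placeholders for illustration, not the project's actual values:

```python
import requests
from bs4 import BeautifulSoup

URL = "https://example.com/results.aspx"  # placeholder, not the real page

def get_form_state(html):
    """Collect the hidden ASP.NET fields needed for the next postback."""
    soup = BeautifulSoup(html, "html.parser")
    return {
        field: soup.find("input", id=field)["value"]
        for field in ("__VIEWSTATE", "__EVENTVALIDATION")
    }

def next_page(session, page_number, state):
    """Simulate __doPostBack by posting its arguments as form data."""
    data = {
        "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView1",  # assumed control name
        "__EVENTARGUMENT": f"Page${page_number}",
        **state,
    }
    return session.post(URL, data=data)

with requests.Session() as session:
    response = session.get(URL)
    for page in range(2, 252):
        response = next_page(session, page, get_form_state(response.text))
        # ...scrape the fields from response.text and append them to the CSV
```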
This project aims to scrape data from all 250+ pages of this webpage. The page is built in the ASP.NET environment, so there is no separate URL for each page. Clicking the next page link at the bottom instead triggers a JavaScript __doPostBack function, which takes some visible and some hidden arguments. The goal is to use the Scrapy framework to iterate through each page, scrape the fields I need, and pipe them to a CSV.
Although Python cannot execute the JavaScript function, we can pass its arguments as form data using Scrapy's FormRequest class. The goal is to collect the VIEWSTATE and EVENTVALIDATION hidden arguments and pass them along with the other arguments to navigate to each subsequent page. The site should then interpret the requests as coming from a human rather than a bot. If this doesn't work, other options include Selenium or a headless browser.
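A hedged sketch of that approach: FormRequest.from_response reads the hidden __VIEWSTATE and __EVENTVALIDATION inputs from the current page automatically, so only the postback arguments need to be supplied. The URL and the __EVENTTARGET control name below are placeholders:

```python
import scrapy

class ECSpider(scrapy.Spider):
    name = "ec_spider"
    start_urls = ["https://example.com/results.aspx"]  # placeholder URL

    def parse(self, response):
        # ...yield one item per row on the current page here...
        page = response.meta.get("page", 1) + 1
        yield scrapy.FormRequest.from_response(
            response,
            formdata={
                "__EVENTTARGET": "ctl00$ContentPlaceHolder1$GridView1",  # assumed
                "__EVENTARGUMENT": f"Page${page}",
            },
            dont_click=True,  # submit only the postback arguments, not a button
            meta={"page": page},
            callback=self.parse,
        )
        # a real spider would also stop once the last page is reached
```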
- items.py: Defines an Item class that acts as a container for the data fields we need
- pipelines.py: Sets up a pipeline that takes Item objects and writes them to a CSV
- ec_spider.py: Main scraping script. Passes each scraped item row-by-row through the pipeline and into the dataset (minimal sketches of the item and pipeline follow this list)
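The sketches below use placeholder field names rather than the project's actual columns; the pipeline would be enabled in settings.py via the ITEM_PIPELINES setting:

```python
import csv
import scrapy

class ECItem(scrapy.Item):
    # placeholder fields; the real item holds the columns we scrape
    state = scrapy.Field()
    candidate = scrapy.Field()
    votes = scrapy.Field()

class CsvPipeline:
    def open_spider(self, spider):
        self.file = open("output.csv", "w", newline="")
        self.writer = csv.DictWriter(self.file, fieldnames=["state", "candidate", "votes"])
        self.writer.writeheader()

    def process_item(self, item, spider):
        self.writer.writerow(dict(item))
        return item

    def close_spider(self, spider):
        self.file.close()
```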
- Git
- Download scrapy
- Set up container, pipeline, and settings
- Test scraping code for first page
- Generalize to all pages
- Figure out doPostback issue