This is a project made by a junior at Cleveland High School. Flowchart at the bottom.
Make sure you have a Python version from 2.7.10 through 2.7.14 installed.
Libraries used: requests, pandas, BeautifulSoup4, and splinter (all pip installed), plus time, argparse, and ConfigParser (these three ship with Python, so they need no install).
Install the newest version of ChromeDriver for the correct platform here: https://sites.google.com/a/chromium.org/chromedriver/downloads
(ChromeDriver allows the program to control browsers, like opening windows and going to sites).
Extract the .zip file and add the extracted folder to your Path environment variable by doing this (Windows):
- Open the System Control Panel applet (search for "System" and click the result; it should be the Control Panel one).
- Select the Advanced system settings tab.
- Click the Environment Variables button.
- Under System Variables, select Path, click Edit, then click Browse.
- Find the folder (it's probably in Downloads) and put it in the Path (you can type `path` into the console to check if it's in Path).
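If you'd rather check from Python than read through the `path` output, here is a minimal sketch that looks for ChromeDriver in every folder on Path. The helper name `find_on_path` is made up for this example and is not part of weatherscraper.py; it only uses the standard library, so it should run on Python 2.7 as well.

```python
from __future__ import print_function

import os


def find_on_path(name, path=None):
    """Return the full path of the first file called `name` on Path, or None."""
    if path is None:
        path = os.environ.get("PATH", "")
    # On Windows the file is chromedriver.exe, so try that spelling too.
    candidates = [name]
    if os.name == "nt" and not name.lower().endswith(".exe"):
        candidates.append(name + ".exe")
    for folder in path.split(os.pathsep):
        folder = folder.strip('"')
        if not folder:
            continue
        for candidate in candidates:
            full = os.path.join(folder, candidate)
            if os.path.isfile(full):
                return full
    return None


if __name__ == "__main__":
    location = find_on_path("chromedriver")
    if location:
        print("ChromeDriver found at", location)
    else:
        print("ChromeDriver is not on Path yet")
```

If it reports that ChromeDriver is not on Path, open a new console window first; a console that was already open keeps the old Path.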
Next, follow these steps:
- Download (or clone) this repository.
- Put the requirements.txt and weatherscraper.py files in the folder with your virtual environment (the folder you execute python scripts in).
- Set up your virtual environment.
- Before running the .py file, type `pip install -r requirements.txt` into the console to download the required packages for scraping.
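To confirm the install worked before running the scraper, a small check like the one below can help. It is only a sketch: `missing_modules` is a made-up helper, and the list of import names is an assumption based on the libraries named in this README (BeautifulSoup4 is imported as `bs4`); the real requirements.txt may list more.

```python
from __future__ import print_function

import importlib


def missing_modules(names):
    """Return the module names from `names` that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # Assumed import names for the pip-installed libraries in this README.
    required = ["requests", "pandas", "bs4", "splinter"]
    missing = missing_modules(required)
    if missing:
        print("Still missing:", ", ".join(missing))
    else:
        print("All required packages are importable")
```

Run it with the same Python (and the same virtual environment) you will use for weatherscraper.py, otherwise the result can be misleading.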
You can look at the code here, on GitHub. The file has a good number of comments.
The .py file scrapes the weather forecast from forecast.weather.gov and collects the time period, high and low temperatures, short description, and long description. It arranges all the data into a pandas table (pandas.DataFrame) and prints any changes it detects. It also saves the data to a separate INI file in the same folder, so when you come back later it can check whether anything has changed.
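The save-and-compare idea can be sketched with the standard library alone. This is not the code from weatherscraper.py: the section name `forecast`, the key names, and the three helper functions are all assumptions made for illustration, and the real INI layout may differ. The import fallback is there so the sketch runs on Python 2.7 (`ConfigParser`) and Python 3 (`configparser`).

```python
try:
    import configparser  # Python 3 name
except ImportError:
    import ConfigParser as configparser  # Python 2.7 name


def load_saved(path, section="forecast"):
    """Read the values saved last time; a missing file gives an empty dict."""
    parser = configparser.RawConfigParser()
    parser.read(path)  # files that do not exist are skipped without an error
    if not parser.has_section(section):
        return {}
    return dict(parser.items(section))


def save_current(path, data, section="forecast"):
    """Overwrite the INI file with the latest values (keys should be lowercase)."""
    parser = configparser.RawConfigParser()
    parser.add_section(section)
    for key, value in data.items():
        parser.set(section, key, value)
    with open(path, "w") as handle:
        parser.write(handle)


def detect_changes(old, new):
    """Return {key: (old_value, new_value)} for every key whose value differs."""
    return {key: (old.get(key), value)
            for key, value in new.items()
            if old.get(key) != value}
```

A run would call `load_saved`, compare the result against the freshly scraped values with `detect_changes`, print whatever differs, and then call `save_current`. On the very first run the file does not exist, so every value shows up as a change from `None`.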
To run the file without skipping any parameters, you can type this example into the console:
python weatherscraper.py 2 4 seconds "Portland"
The `2` specifies the number of seconds between each scrape. This can be any number.
The `4 seconds` specifies how long the program runs. You can type any number, and after it you have to write `seconds`, `minutes`, or `hours` to specify the unit of time.
The `"Portland"` specifies the US city. Type in any US city, but put `"` quotes around it.
To end the program, either wait for the time to run out or force it to end by pressing `Ctrl + C`.
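To make the timing arguments concrete, here is a rough sketch of how an interval and a run time with a unit could drive a loop. It is an assumption about the general approach, not the actual loop in weatherscraper.py; `to_seconds`, `run_for`, and the `clock`/`sleep` parameters are names invented for this example. The real program may scrape one more or one fewer time at the boundaries.

```python
from __future__ import print_function

import time

# Seconds per unit for the three unit words the command line accepts.
UNIT_SECONDS = {"seconds": 1, "minutes": 60, "hours": 3600}


def to_seconds(amount, unit):
    """Convert e.g. (4, "seconds") or (2, "hours") into a number of seconds."""
    if unit not in UNIT_SECONDS:
        raise ValueError("unit must be seconds, minutes, or hours")
    return amount * UNIT_SECONDS[unit]


def run_for(interval, duration, task, clock=time.time, sleep=time.sleep):
    """Call task() every `interval` seconds until `duration` seconds have passed.

    Returns how many times task() ran. Ctrl+C ends the loop early.
    """
    runs = 0
    deadline = clock() + duration
    try:
        while clock() < deadline:
            task()
            runs += 1
            sleep(interval)
    except KeyboardInterrupt:
        print("Stopped early by Ctrl+C")
    return runs
```

For example, `run_for(2, to_seconds(4, "seconds"), scrape_once)` would call a (hypothetical) `scrape_once` function every 2 seconds for about 4 seconds. Passing `clock` and `sleep` as parameters is just a convenience so the loop can be tried out without actually waiting.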
NOTE: If a window pops up saying "chromedriver.exe has stopped working" or something, that's fine; just click OK.
Optional Parameters
To skip any of the 4 parameters, add the matching flag after weatherscraper.py and before the numbers:
- `--skipperiod` (skip time period)
- `--skiptemp` (skip temperatures)
- `--skipdesc` (skip long description)
- `--skipshort_desc` (skip short description)
For example:
python weatherscraper.py --skipdesc 2 4 seconds "Portland"
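For anyone curious how a command line like this can be read with argparse, here is a minimal sketch. The four `--skip...` flags are the ones listed above; the positional names `interval`, `duration`, `unit`, and `city`, and the `build_parser` function itself, are assumptions for illustration and may not match the names used in weatherscraper.py.

```python
import argparse


def build_parser():
    """Build a parser that accepts the command lines shown in this README."""
    parser = argparse.ArgumentParser(
        description="Sketch of the weatherscraper.py command line")
    # Optional flags: each one switches off one scraped parameter.
    parser.add_argument("--skipperiod", action="store_true",
                        help="skip time period")
    parser.add_argument("--skiptemp", action="store_true",
                        help="skip temperatures")
    parser.add_argument("--skipdesc", action="store_true",
                        help="skip long description")
    parser.add_argument("--skipshort_desc", action="store_true",
                        help="skip short description")
    # Positional arguments, in the order used in the examples.
    parser.add_argument("interval", type=float, help="seconds between scrapes")
    parser.add_argument("duration", type=float, help="how long to run")
    parser.add_argument("unit", choices=["seconds", "minutes", "hours"],
                        help="unit of time for the duration")
    parser.add_argument("city", help='US city, in "quotes"')
    return parser


if __name__ == "__main__":
    args = build_parser().parse_args()
    print(args)
```

With `action="store_true"`, a flag that is left out simply comes back as `False`, so the positional arguments work the same whether or not any flags are given.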
If you got everything to work, then yay! You're done!
Here's the flowchart of the core algorithm:
Master Python Web Scraping: Get Your Data Back
How to scrape websites with Python and BeautifulSoup
Python Web Scraping Tutorial using BeautifulSoup
Python Documentation - argparse
google.com
Brian