A Python Scrapy repo designed to extract location data from well-known grocery chains (Aldi, Kroger, Safeway, Super One, and Whole Foods).
Data collected from the internet is messy. To produce meaningful and usable data, the following normalization and cleaning techniques are used:
- Housekeeping values, such as the date and time of data extraction and the name of the crawling spider, are automatically added
- Missing values like the location country and retailer name are inferred and added
- If the lat/lng coordinates are missing, they are looked up from a zip code data file using the existing zip code
- Lat/lng coordinate values are converted to floats
- The US state name is normalized; the two-letter abbreviation is always used in place of the full name
- Zip codes are formatted to include only the first five digits
- A deduplication pass removes repeated locations, using the retailer website and store ID as the location key (a sketch of such a pipeline follows this list)
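A minimal sketch of what these steps could look like as Scrapy item pipelines, assuming illustrative field names (`lat`, `lng`, `state`, `zip_code`, `website`, `store_id`) and a hypothetical `US_STATE_ABBREVIATIONS` mapping; the repo's actual identifiers may differ:

```python
from datetime import datetime, timezone

from scrapy.exceptions import DropItem

# Hypothetical mapping; the repo may source abbreviations differently.
US_STATE_ABBREVIATIONS = {"california": "CA", "texas": "TX"}  # ...and so on


class NormalizeLocationPipeline:
    """Illustrates the normalization steps listed above."""

    def process_item(self, item, spider):
        # Housekeeping values: extraction timestamp and spider name.
        item["extracted_at"] = datetime.now(timezone.utc).isoformat()
        item["spider_name"] = spider.name

        # Coordinates as floats.
        for key in ("lat", "lng"):
            if item.get(key):
                item[key] = float(item[key])

        # Prefer the two-letter state abbreviation over the full name.
        state = (item.get("state") or "").strip()
        item["state"] = US_STATE_ABBREVIATIONS.get(state.lower(), state.upper())

        # Keep only the first five digits of the zip code (drop ZIP+4).
        if item.get("zip_code"):
            item["zip_code"] = str(item["zip_code"])[:5]

        return item


class DeduplicateLocationPipeline:
    """Drops items already seen, keyed on retailer website + store ID."""

    def __init__(self):
        self.seen = set()

    def process_item(self, item, spider):
        key = (item.get("website"), item.get("store_id"))
        if key in self.seen:
            raise DropItem(f"Duplicate location: {key}")
        self.seen.add(key)
        return item
```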
Proper data parsing and normalization are important, so a series of tests lives in `retail_locations/tests`, covering geographic parsing and data normalization.
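For illustration, a test of the state and zip normalization rules might look like the following; the helper names (`normalize_state`, `truncate_zip`) are assumed for this sketch and may not match the repo's actual functions:

```python
# Hypothetical test sketch; the repo's real tests and helpers may differ.
from retail_locations.location_data import normalize_state, truncate_zip  # assumed helpers


def test_state_is_abbreviated():
    assert normalize_state("Washington") == "WA"
    assert normalize_state("wa") == "WA"


def test_zip_is_truncated_to_five_digits():
    assert truncate_zip("98101-1234") == "98101"
```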
Various geographic utility methods are included in `location_data.py`. These are used for generating missing coordinate values, and for generating an efficient (as small as possible while keeping full coverage) zip code grid: the minimum number of zip codes needed, at a given search radius, to request all retail locations.
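One plausible way to build such a grid is a greedy pass over zip code centroids with a haversine distance check; this is a sketch of the idea under those assumptions, not necessarily the algorithm in `location_data.py`:

```python
import math


def haversine_miles(lat1, lng1, lat2, lng2):
    """Great-circle distance between two lat/lng points, in miles."""
    radius = 3958.8  # mean Earth radius in miles
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lng2 - lng1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * radius * math.asin(math.sqrt(a))


def covering_zip_grid(zip_points, radius_miles):
    """Greedily pick zip codes so every centroid is within radius of a pick.

    zip_points: iterable of (zip_code, lat, lng) tuples.
    Returns a small (not provably minimal) covering list of zip codes.
    """
    selected = []
    for zip_code, lat, lng in zip_points:
        if not any(
            haversine_miles(lat, lng, sel_lat, sel_lng) <= radius_miles
            for _, sel_lat, sel_lng in selected
        ):
            selected.append((zip_code, lat, lng))
    return [z for z, _, _ in selected]
```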
- Install the virtual environment from the Pipfile (e.g. with `pipenv install`).
- Open `debug_runner.py` (a sketch of a typical runner follows this list).
- Edit the desired values for the spider to run, and the output filename and type.
- Common output types include CSV, JSON, and JL.
- This executes the command `scrapy crawl safeway -s HTTPCACHE_ENABLED=false -o sample_safeway_output.csv`
- A sample data output (from the above command) is included: `sample_safeway_output.csv`
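For reference, a typical shape for a runner like `debug_runner.py` uses Scrapy's `CrawlerProcess`; the values below are placeholders matching the sample command, not necessarily the file's actual contents:

```python
from scrapy.crawler import CrawlerProcess
from scrapy.utils.project import get_project_settings

# Edit these to choose the spider and the output file/format.
SPIDER_NAME = "safeway"
OUTPUT_FILE = "sample_safeway_output.csv"

settings = get_project_settings()
settings.set("HTTPCACHE_ENABLED", False)
# FEEDS controls the export; "format" can be csv, json, or jl.
settings.set("FEEDS", {OUTPUT_FILE: {"format": "csv"}})

process = CrawlerProcess(settings)
process.crawl(SPIDER_NAME)
process.start()  # blocks until the crawl finishes
```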