A workshop with the Massive Data Institute, Georgetown University
When it comes to data collection, web-crawling (also known as web-scraping or screen-scraping) is a common approach in our increasingly digital era, and a common stumbling block. With such a wide range of tools and languages available (Selenium, Requests, and HTML, to name just a few), developing and implementing a web-crawling pipeline is often a frustrating experience for researchers, especially those without a computer science background.
Whatever your background, this workshop will give you the foundation to use web-crawling in your research. We will tackle common problems including collecting web addresses/URLs (via automated Google search), downloading website copies (with wget), non-scalable website scraping (with Requests), and scalable crawling of text (with Scrapy). No web-crawling experience is required, but some Python know-how is expected.
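To make the non-scalable approach concrete, here is a minimal sketch that fetches a single page with Requests and parses it with BeautifulSoup. The URL is a placeholder assumption; swap in a page you have permission to scrape.

```python
# Minimal sketch: fetch one page with Requests, parse it with BeautifulSoup.
# The URL below is a placeholder, not a workshop example.
import requests
from bs4 import BeautifulSoup

url = "https://example.com"   # placeholder URL (assumption)
response = requests.get(url, timeout=10)
response.raise_for_status()   # raise an error on 4xx/5xx responses

soup = BeautifulSoup(response.text, "html.parser")
print(soup.title.string)                    # the page's <title> text
for link in soup.find_all("a", href=True):  # list every link on the page
    print(link["href"])
```

This approach is "non-scalable" in the sense that it fetches and parses one page at a time: fine for a handful of pages, slow for whole sites.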
- Understand how web-crawling and -scraping are useful for digital data collection
- Build intuitions around the uses and limits of:
  - APIs (Application Programming Interfaces)
  - Exploiting website structure (HTML/CSS)
  - Scalable crawling
- Be familiar with common problems in web-crawling and their fixes (see the spider sketch after this list), like:
  - Nested websites --> vertical crawling (link extraction)
  - Getting blocked --> polite pauses
- Gain practice with:
  - Collecting domains to scrape
  - Scalable and non-scalable website scraping
  - Parsing website text (with BeautifulSoup)
  - wget, Requests, and Scrapy
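To preview two of the fixes above, here is a minimal sketch of a Scrapy spider. The spider and the quotes.toscrape.com scraping-sandbox site are illustrative assumptions, not workshop code: it follows "next page" links (vertical crawling) and pauses between requests via Scrapy's DOWNLOAD_DELAY setting (polite pauses).

```python
# Minimal Scrapy spider sketch (illustrative; quotes.toscrape.com is a
# public scraping sandbox, not a workshop dataset).
import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = ["http://quotes.toscrape.com/"]
    custom_settings = {
        "DOWNLOAD_DELAY": 2,     # polite pause: wait 2 seconds between requests
        "ROBOTSTXT_OBEY": True,  # respect the site's robots.txt rules
    }

    def parse(self, response):
        # Exploit website structure: pull text out with CSS selectors
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
            }
        # Vertical crawling: extract the "next page" link and follow it
        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)
```

You can run a standalone spider like this with `scrapy runspider quotes_spider.py -o quotes.json`, which writes the scraped items to a JSON file.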
We will get our hands dirty implementing an assortment of simple web-crawling tools. To follow along with the code, which is the point, you will need some familiarity with Python and Jupyter Notebooks. If you haven't programmed in Python or haven't used Jupyter Notebooks, please do some self-teaching before this workshop using resources like those listed below.
For simplicity, just click the "Launch Binder" button (at the top of this Readme) to create a virtual environment ready for this workshop. It may take a few minutes to load; if it takes longer than 10 minutes, try again.
If you want to run the code on your own computer, you have two options. You can use Anaconda to make installation easy: download Anaconda. Or, if you already have Python 3.x installed along with the libraries listed in requirements.txt, you're welcome to clone this repository and follow along on your own machine. You can install all the necessary packages like so:
`pip3 install -r requirements.txt`
- Slides for day 1 (also in folder above)
- Introduction to Jupyter Notebooks (Real Python)
- Quick Python intro (a Jupyter Notebook)
- Great book on Python (with exercises): Python for Everybody (Charles Severance)
- Official Python Tutorial
- Python tutorials for social scientists (Neal Caren)
- Official Scrapy tutorial
- Blog on using item pipelines in Scrapy
- Storing Scrapy output to MongoDB (humongous database!)
- Examples of using wget to download from websites
- Scale Scrapy with a pre-cooked Docker assemblage: Scrapy Cluster
- Video tutorial on APIs, RSS, and Scraping from PyCon
- Parse HTML, by author of Requests library: requests-HTML
- Extract different parts of URL: furl
- Parse CSS files: cssutils
- Parse common formats in newspaper data: newspaper
- Scrape from the past: Internet Archive Python Library
- Browser automation for handling interactive, JavaScript-heavy pages: Selenium
These are available free to Georgetown students/affiliates (log in here, then search for the books)
- Popular intro book with Ch. 12 on web-scraping: Automate the Boring Stuff with Python
- Complete scraping intro: Web Scraping with Python
- Detailed look at mechanics and approaches in Scrapy
If you spot a problem with these materials, please open an issue describing the problem or contact Jaren at [email protected]. If you want to suggest additional resources or materials, please branch and make a pull request!
- D-Lab at the University of California, Berkeley
- Summer Institute in Computational Social Science
- Geoff Bacon, especially his Introduction to web scraping workshop
- Rochelle Terman, especially her Web Scraping and Data Management in R summer course