Web Scraping
Web scraping is the process of extracting data or information from websites and turning that information into a useful format for further analysis. A typical web scraping process first fetches the target webpage and then parses the desired information from that page. Next, the information is brought into a useful format and stored in an archivable file format, a database, or on a server for further analysis.
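As an illustration, this whole fetch, parse, and store pipeline can fit in a few lines of Python. The following is a minimal sketch assuming the third-party requests and beautifulsoup4 packages; the URL, the h2 selector, and the output file name are placeholders, not part of any particular workflow.

```python
import csv

import requests
from bs4 import BeautifulSoup

# 1. Fetch the target webpage.
response = requests.get("https://example.com/articles", timeout=10)
response.raise_for_status()

# 2. Parse the desired information from the HTML.
soup = BeautifulSoup(response.text, "html.parser")
headlines = [h.get_text(strip=True) for h in soup.select("h2")]

# 3. Store the data in an archivable format (here: a CSV file).
with open("headlines.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.writer(f)
    writer.writerow(["headline"])
    writer.writerows([h] for h in headlines)
```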
- Web scraping
- Table of contents
- Web scraper
- Useful open source scrapers
A web scraper is a computer program used for web scraping. A web scraper often combines crawler and scraper functionality: a crawler is an algorithm built to discover websites with desirable data, while the scraper is the tool that extracts this data from a website. Usually, when a scraper needs to scrape data from a website, the URLs of the website are provided first (e.g., by a crawler). The scraper then loads the HTML code (which mostly contains the content and its overall structure), sometimes alongside CSS code (which determines much of the design) and JavaScript elements (which usually make a website interactive), depending on the scraper's abilities. Next, the scraper extracts the desired data (e.g., links, or names of politicians from online articles) and saves it in a useful format. Most scrapers use CSV-like formats or JSON to save the data.
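To make the extraction step concrete, here is a hedged sketch (again assuming requests and beautifulsoup4) that pulls every link from a single page, whose URL would normally come from a crawler, and saves the result as JSON:

```python
import json

import requests
from bs4 import BeautifulSoup

# Placeholder URL; in practice a crawler would supply the URLs to scrape.
url = "https://example.com"
soup = BeautifulSoup(requests.get(url, timeout=10).text, "html.parser")

# Extract the desired data: here, every link target on the page.
links = [a["href"] for a in soup.find_all("a", href=True)]

# Save the data in a useful format (JSON in this case).
with open("links.json", "w", encoding="utf-8") as f:
    json.dump(links, f, indent=2)
```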
Using a scraper has several advantages:
- Unique, rich, and independent datasets can be acquired by using a scraper. A researcher does not depend on any third party to get the data.
- Instead of copying and pasting data from the internet or buying data from a third party, we can choose exactly what data we want to collect.
- Data collection can be automated and repeated; e.g., we can run the scraper on a daily basis and collect data for every day.

There are also drawbacks and caveats:
- Building a scraper might require a lot of programming knowledge. Ready-made scraping software can be used instead, but it might be costly, and third-party software can limit the customizability of the data to be collected.
- Websites change their structure regularly, which can demand a great deal of maintenance for long-term collections.
- Scraping a website means using its resources, so best practices involve being respectful, avoiding plagiarism, respecting privacy expectations, and setting a gentle request rate limit (see the sketch after this list). Scraping also often involves greater risks of violating ethical guidelines or legal restrictions.
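As a concrete illustration of a gentle request rate and respect for a site's wishes, the following sketch uses only Python's standard library; the base URL, paths, user-agent string, and delay are illustrative assumptions:

```python
import time
import urllib.robotparser
from urllib.request import urlopen

BASE = "https://example.com"        # placeholder site
USER_AGENT = "my-research-scraper"  # hypothetical identifier for this sketch
DELAY_SECONDS = 5                   # gentle pause between requests

# Fetch and parse the site's robots.txt to learn which URLs may be scraped.
robots = urllib.robotparser.RobotFileParser(BASE + "/robots.txt")
robots.read()

for path in ["/page1", "/page2"]:   # placeholder paths
    url = BASE + path
    if not robots.can_fetch(USER_AGENT, url):
        continue  # the site asks scrapers to stay away from this URL
    html = urlopen(url, timeout=10).read()
    # ... parse and store `html` here ...
    time.sleep(DELAY_SECONDS)       # keep the request rate gentle
```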
This page lists a handful of useful news scrapers that are open source and already documented on our website.
The following list is sorted by ease of access (open-source status and required programming knowledge).
Scrapy is a powerful web crawler and scraper which can be used to scrape data from a website and then store the data in a structured way. However, Scrapy requires a little Python programming knowledge.
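For orientation, a minimal Scrapy spider can look like the following sketch; the spider name, start URL, and CSS selector are placeholders:

```python
import scrapy

class HeadlineSpider(scrapy.Spider):
    name = "headlines"
    start_urls = ["https://example.com/news"]  # placeholder start page

    def parse(self, response):
        # Yield one structured item per headline found on the page.
        for headline in response.css("h2::text").getall():
            yield {"headline": headline}
```

Such a file can be run without a full project via `scrapy runspider headline_spider.py -o headlines.json`, which stores the yielded items as structured JSON.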
Heritrix is a Java-based open-source crawler that provides a web-browser user interface for operating it. Heritrix requires a strong programming background, so it is not for beginners.