Basics of Web scraping

A series of simple projects that I did while practicing web scraping and parsing.

Welcome to the Web Scraping Mission. In this mission, you will learn the core concepts of web scraping and get comfortable scraping different types of websites and their data. The problem statement is simple: scrape data from the Wikipedia home page and parse it using several web scraping techniques. Along the way you will become familiar with those techniques, with the Python modules used for web scraping, and with the processes of data extraction and data processing.

This mission will be useful for graduates, postgraduates, and research students who either have an interest in this subject or study it as part of their curriculum. Web scraping is the automated process of extracting information from the web. This mission covers how web scraping compares with web crawling, why you might opt for web scraping, and the components and working of a web scraper.

KEY POINTS & OBJECTIVES –

  • Use of Python
  • Creating a virtual environment
  • Working with a virtual environment
  • Web scraping libraries
  • Legality

Starters Pack:

  • We need a Python IDE and should be familiar with using it.
  • Virtualenv is a tool for creating isolated Python environments. With virtualenv, we can create a folder that contains all the executables needed to use the packages our Python project requires. Inside it we can add and modify Python modules without affecting any global installation.
  • We install the Python modules and libraries we need with the pip command.
  • We should always keep in mind whether scraping the website we are targeting is legal.
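
The isolated-environment idea described above can be sketched in Python itself. This is a minimal illustration that swaps in the standard-library `venv` module for the virtualenv tool, since it needs no extra installation; the `create_env` helper and the `scraper-env` folder name are hypothetical and not part of this repo.

```python
import os
import tempfile
import venv
from pathlib import Path


def create_env(target: Path, with_pip: bool = False) -> Path:
    """Create an isolated Python environment at `target` and return its path.

    `create_env` is a hypothetical helper for illustration only.
    """
    builder = venv.EnvBuilder(
        with_pip=with_pip,            # set True to bootstrap pip into the environment
        clear=True,                   # start from an empty folder if one already exists
        symlinks=(os.name != "nt"),   # symlink the interpreter except on Windows
    )
    builder.create(target)
    return target


if __name__ == "__main__":
    # Build the environment in a throwaway temp directory (assumed location).
    env_dir = create_env(Path(tempfile.mkdtemp()) / "scraper-env")
    # pyvenv.cfg marks the folder as a virtual environment.
    print(env_dir, (env_dir / "pyvenv.cfg").is_file())
```

Packages installed after activating such an environment live inside that folder, so the global Python installation stays untouched.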

Requirements -

  • Requests:- An HTTP library used for accessing web pages.
  • urllib3:- An HTTP client library used for retrieving data from URLs.
  • Selenium:- An open source automated testing suite for web applications across different browsers and platforms.
  • Beautiful Soup:- A library for parsing HTML and XML documents.
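
As a rough sketch of how two of these libraries fit together, the example below fetches a page with Requests and parses it with Beautiful Soup. The URL, the User-Agent string, and the function names `fetch_html` and `extract_title_and_links` are assumptions made for illustration; the projects in this repo may be organised differently.

```python
import requests
from bs4 import BeautifulSoup

# Assumed URL for the Wikipedia home page mentioned in the mission statement.
WIKIPEDIA_HOME = "https://en.wikipedia.org/wiki/Main_Page"


def fetch_html(url: str, timeout: float = 10.0) -> str:
    """Download a page and return its HTML, raising on HTTP errors."""
    # Hypothetical User-Agent; identify your own scraper here.
    headers = {"User-Agent": "web-scraping-basics-example/0.1"}
    response = requests.get(url, headers=headers, timeout=timeout)
    response.raise_for_status()
    return response.text


def extract_title_and_links(html: str) -> tuple[str, list[str]]:
    """Return the page title and the href of every link that has one."""
    soup = BeautifulSoup(html, "html.parser")
    title = soup.title.get_text(strip=True) if soup.title else ""
    links = [a["href"] for a in soup.find_all("a", href=True)]
    return title, links


# Offline demonstration on a small hand-written snippet.
sample = "<html><head><title>Demo</title></head><body><a href='/wiki/A'>A</a></body></html>"
print(extract_title_and_links(sample))
```

To try it on the live page, pass the result of `fetch_html(WIKIPEDIA_HOME)` to `extract_title_and_links`. Keeping the download and the parsing in separate functions makes the parsing step easy to test without network access.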

Resources

  1. https://realpython.com/python-web-scraping-practical-introduction/
  2. https://www.tutorialspoint.com/python_web_scraping/python_web_scraping_getting_started_with_python.htm
  3. https://www.promptcloud.com/blog/scraping-dynamic-websites-web-scraping/
  4. https://www.webharvy.com/articles/what-is-web-scraping.html#:~:text=Web%20Scraping%20(also%20termed%20Screen,in%20table%20(spreadsheet)%20format.
  5. https://www.dataquest.io/blog/web-scraping-tutorial-python/
  6. https://medium.com/@pknerd/scraping-dynamic-websites-using-scraper-api-and-python-a8d041fc97ac

Guidelines for Contributing

  • Raise an issue regarding the topic you will be contributing to.
  • Fork this repo to your own GitHub profile.
  • Clone the repo to your machine using the command $ git clone FORK_URL
  • Create a new branch using the command $ git checkout -b BRANCH_NAME
  • Make the changes that you wish to implement.
  • Test the changes.
  • Once you are satisfied with the testing, commit the changes.
  • Change to the master branch with the command $ git checkout master
  • Merge the branch where you made changes into the master branch. Use the command $ git merge BRANCH_NAME
  • Push the changes to your fork with $ git push -u origin
  • Create a pull request from your fork.
  • If there are no conflicts, you can merge the PR.
  • If there are conflicts and you are uncomfortable resolving them, contact the maintainers for support.
  • Delete the fork once you are done.

Happy Contributing! 😁