kick-off-Scraping.py
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
"""
Created on Wed Jun 6 11:00:10 2016
@author: Rohan
Basics of Python Web Scraping:
Hi Buddy! This is a kick-off script to get hands-on.
Script Requirements: Python3, BeautifulSoup, urllib
I have implemented a few basic examples using selenium, do check them out! This script covers approximately 0.1% of
Python web scraping. My motive here is to get you familiar with the tools that Python provides if you foresee a career in
web automation.
"""
from urllib.error import HTTPError
from urllib.request import urlopen
from bs4 import BeautifulSoup
import re
# Simple Ex. 1
html = urlopen("https://en.wikipedia.org/wiki/Cristiano_Ronaldo")
bsObj = BeautifulSoup(html.read(), "html.parser")  # name the parser explicitly so bs4 does not have to guess one
# print(bsObj)  # shows the entire page source HTML, i.e. the DOM (Document Object Model) structure of the page at the URL
print(bsObj.h1)  # first <h1> tag in the document
print(bsObj.h2)  # first <h2> tag in the document
print(bsObj.body.h3)  # first <h3> tag inside <body>
'''
There are two main things that can go wrong in the urlopen() line:
The page is not found on the server (or there was some error in retrieving it)
The server is not found
Along with this, there is a fair chance that a tag we try to access is not present in the DOM.
Hence, we must handle these cases. Handling these basic exceptions is the most important web scraping ritual.
'''
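# A minimal sketch of the handling described above, under a few assumptions:
# the helper names fetch_html and first_heading are hypothetical (they are not
# part of this script), and URLError is imported in addition to HTTPError to
# cover the "server is not found" case. HTTPError is a subclass of URLError,
# so it has to be caught first.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen

from bs4 import BeautifulSoup


def fetch_html(url):
    """Return the page body as bytes, or None if the request fails."""
    try:
        with urlopen(url) as response:
            return response.read()
    except HTTPError as e:
        # The server answered, but with an error status such as 404 or 500.
        print("HTTP error:", e)
    except URLError as e:
        # The resource or server could not be reached at all.
        print("URL error:", e)
    return None


def first_heading(html):
    """Return the text of the first <h1>, or None if there is no <h1>."""
    try:
        return BeautifulSoup(html, "html.parser").h1.get_text()
    except AttributeError:
        # .h1 is None when the tag is missing, so .get_text() raises.
        return None


# Offline usage with an inline page instead of a live URL.
print(first_heading("<html><body><h1>Title</h1></body></html>"))
print(first_heading("<html><body><p>no heading</p></body></html>"))
```

# Returning None instead of letting the exception escape is only one possible
# design; a real scraper might log the error and retry instead.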
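# Simple Ex. 2: Invalid link
try:
    html = urlopen("https://en.wikipedia.org/wiki/Cristio_Ronaldo")  # misspelled title, so the request should fail
except HTTPError as e:
    print(e)
# Simple Ex. 3: Best practice
try:
    html = urlopen("https://en.wikipedia.org/wiki/Cristio_Ronaldo")
except (HTTPError, AttributeError) as e:
    # AttributeError is what tag access on a missing element raises, if such access is added inside this try block
    print(e)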
# A few BeautifulSoup methods
html = urlopen("https://en.wikipedia.org/wiki/Cristiano_Ronaldo")
bsObj = BeautifulSoup(html, "html.parser")
nameList = bsObj.findAll("span", {"class": "nowrap"})  # all <span> tags with class "nowrap"; find() returns only the first match
for name in nameList:
    print(name.get_text())
print(len(nameList))
crnamecount = bsObj.findAll(text="Cristiano Ronaldo")  # every text node that is exactly this string
print(len(crnamecount))  # number of matches (17 when this script was written)
contentDivs = bsObj.findAll(id="mw-content-text")  # renamed from "id" to avoid shadowing the built-in id()
print(len(contentDivs))
print(contentDivs)  # prints the matching DOM; findAll() returns a list-like ResultSet, so get_text() cannot be called on it directly
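# The difference between find() and findAll() can be checked offline on a
# small inline page. This is only a sketch: the HTML snippet is made up for
# illustration, and find_all is used here as the current spelling of findAll.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML standing in for a downloaded page.
sample = """
<div id="content">
  <span class="nowrap">first</span>
  <span class="nowrap">second</span>
  <span class="other">third</span>
</div>
"""

soup = BeautifulSoup(sample, "html.parser")

all_nowrap = soup.find_all("span", {"class": "nowrap"})  # list-like ResultSet
first_nowrap = soup.find("span", {"class": "nowrap"})    # a single Tag, or None

# get_text() works on each Tag, so loop over the ResultSet to collect text.
texts = [span.get_text() for span in all_nowrap]
print(texts)
print(first_nowrap.get_text())

# Searching by id still returns a ResultSet, even with one match.
content = soup.find_all(id="content")
print(len(content))
```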
# parsing table
html = urlopen("http://www.pythonscraping.com/pages/page3.html")
bsObj = BeautifulSoup(html, "html.parser")
tabledata = bsObj.findAll("table", {"id": "giftList"})  # fetches the DOM of the whole table
print(tabledata)
# images = bsObj.findAll("img", {"src": re.compile(r"\.\./img/gifts/img.*\.jpg")})  # "." matches any one character, so we write "\." to match a literal dot
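# To see why the dots are escaped, the image pattern can be tried on plain
# strings with the standard re module. The src values below are hypothetical,
# and loose_pattern is a deliberately unescaped variant added for contrast.

```python
import re

# Raw string, so the backslashes reach the regex engine unchanged.
image_pattern = re.compile(r"\.\./img/gifts/img.*\.jpg")

# Same pattern with the dots left unescaped: each "." matches any character.
loose_pattern = re.compile(r"../img/gifts/img.*.jpg")

candidates = [
    "../img/gifts/img1.jpg",  # expected to match
    "../img/gifts/logo.jpg",  # file name does not start with "img"
    "../img/gifts/img2xjpg",  # no literal "." before "jpg"
]

matches = [src for src in candidates if image_pattern.search(src)]
loose_matches = [src for src in candidates if loose_pattern.search(src)]
print(matches)
print(loose_matches)
```

# The unescaped variant also accepts "../img/gifts/img2xjpg", which is the
# kind of false positive that the "\." escape prevents.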
# crawl deep
'''
Find the div with id "bodyContent", then find all article links inside that div.
'''
html = urlopen("http://en.wikipedia.org/wiki/Cristiano_Ronaldo")
bsObj = BeautifulSoup(html, "html.parser")
for link in bsObj.find("div", {"id": "bodyContent"}).findAll("a", href=re.compile("^(/wiki/)((?!:).)*$")):
    print(link)  # each link in that div whose href starts with /wiki/ and contains no ":"
    # OP: <a href="/wiki/Frank_McLintock" title="Frank McLintock">McLintock</a>
    # if 'href' in link.attrs:
    #     print(link.attrs['href'])
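# The href pattern above can be tested on its own, without downloading
# anything. In this sketch the list of hrefs is hypothetical, apart from the
# Frank_McLintock link taken from the sample output. The pattern keeps hrefs
# that start with /wiki/ and rejects any that contain a ":" afterwards, which
# is how namespace pages such as Category: or File: are filtered out.

```python
import re

# "^(/wiki/)" anchors the prefix; "((?!:).)*$" allows any characters up to
# the end of the string as long as none of them is a colon.
article_link = re.compile("^(/wiki/)((?!:).)*$")

hrefs = [
    "/wiki/Frank_McLintock",          # ordinary article link
    "/wiki/Category:Living_people",   # contains ":" after /wiki/
    "/wiki/File:Example.jpg",         # contains ":" after /wiki/
    "#cite_note-1",                   # in-page anchor, wrong prefix
    "https://example.org/wiki/Page",  # absolute URL, wrong prefix
]

kept = [href for href in hrefs if article_link.search(href)]
print(kept)
```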