Kirkland scraper #445

Open
Wants to merge 35 commits into base: master
Changes from 5 commits
35 commits
49fd814
Fix Kirkland scraper
rafe-murray May 30, 2024
6f52701
Update Montreal Est scraper
rafe-murray Jun 5, 2024
707e503
Merge branch 'master' into montreal_est_scraper
samJMA Oct 29, 2024
ca587f3
Fixed Errors
samJMA Oct 29, 2024
d35d876
Merge branch 'master' into kirkland_scraper
samJMA Oct 29, 2024
1bd0cde
Similar solution to burnaby web scraper.
samJMA Oct 29, 2024
0b49b23
Updated formatting
samJMA Nov 6, 2024
f8d2789
Updated formatting
samJMA Nov 6, 2024
f8a69cf
Merge branch 'opencivicdata:master' into kirkland_scraper
iepmas Nov 6, 2024
29c7f83
Merge branch 'opencivicdata:master' into montreal_est_scraper
iepmas Nov 6, 2024
3083d46
ca_qc_montreal_est: Make changes closer to original
jpmckinney Nov 7, 2024
6e7c15e
Fixed email encryption error
samJMA Nov 8, 2024
a4770b7
ca_qc_sainte_anne_de_bellevue: Squash #447 after simplifying changes
jpmckinney Nov 11, 2024
3fb90a2
build: Upgrade opencivicdata
jpmckinney Nov 11, 2024
fef5d3f
ca_on_wilmot: Skip executive officer
jpmckinney Nov 11, 2024
1e96ba1
ca_ns_halifax: Allow Jean St.Amand
jpmckinney Nov 11, 2024
d6249fc
ca_on_kawartha_lakes: Ignore names including content like "RESIGNED A…
jpmckinney Nov 11, 2024
e50e780
ca_on_thunder_bay: Fix SSL error
jpmckinney Nov 12, 2024
744835b
ca_yt: Set user-agent and cookie (DEFAULT_USER_AGENT and cookie from …
jpmckinney Nov 12, 2024
b5c125a
ca_on_thunder_bay: patching requests didn't work on Heroku
jpmckinney Nov 12, 2024
d68412c
ca_ns_cape_breton: Escape quotation marks in name
jpmckinney Nov 12, 2024
b48c025
ca_yt: Add comment about Cloudflare bot products
jpmckinney Nov 12, 2024
6ceb72b
Update people.py
bzhangjma Nov 18, 2024
adff484
Update people.py
bzhangjma Nov 18, 2024
4c6e6d4
Update people.py
bzhangjma Nov 18, 2024
4f8f4a7
chore: Remove unused imports
jpmckinney Nov 18, 2024
6de91ee
ca_bc: Use multiline string for readability
jpmckinney Nov 18, 2024
05054d9
ca_bc: Fix validation to allow "Hon Chan" and "A'aliya
jpmckinney Nov 18, 2024
d1f8d1e
Merge branch 'ca_bc_fix_2'
jpmckinney Nov 18, 2024
8f5acfb
Removed unused function
samJMA Nov 20, 2024
83a5a04
Fix Kirkland scraper
rafe-murray May 30, 2024
01536a3
Similar solution to burnaby web scraper.
samJMA Oct 29, 2024
42571bd
Fixed email encryption error
samJMA Nov 8, 2024
d3894c4
Removed unused function
samJMA Nov 20, 2024
ac99311
Merge branch 'kirkland_scraper' of https://github.com/JMAConsulting/s…
samJMA Nov 20, 2024
24 changes: 19 additions & 5 deletions ca_qc_kirkland/people.py
@@ -8,9 +8,18 @@

 class KirklandPersonScraper(CanadianScraper):
     def scrape(self):
-        page = self.lxmlize(COUNCIL_PAGE)
+        def decode_email(e):
Member

This function isn't called. Please delete it.

+            de = ""
+            k = int(e[:2], 16)

-        councillors = page.xpath('//div[@class="container_content"]//tbody/tr')
+            for i in range(2, len(e) - 1, 2):
+                de += chr(int(e[i : i + 2], 16) ^ k)
+
+            return de
+
+        page = self.lxmlize(COUNCIL_PAGE, "iso-8859-1")
+
+        councillors = page.xpath('//table/tbody[not(@id)]/tr/td[@valign="top"]')
         assert len(councillors), "No councillors found"
         for councillor in councillors:
             if councillor == councillors[0]:
@@ -23,19 +32,24 @@ def scrape(self):

             name = councillor.xpath(".//strong/text()")[0]

             # Using self.get_phone does not include the extension #
             phone = (
                 councillor.xpath('.//div[contains(text(), "#")]/text()')[0]
                 .replace("T ", "")
                 .replace(" ", "-")
-                .replace(".", ",") # correcting a typo
+                .replace(".", ",")
                 .replace(",-#-", " x")
             )
-            email = self.get_email(councillor)
+            encrypted_email = councillor.xpath('.//@href[contains(., "email")]')[0].split("#")[1]
+            email = decode_email(encrypted_email)
+
+            # cloudflare encrypts the email data
+            email = councillor.xpath(".//div/*/*/@href | .//div/*/@href | .//@href")[0]
+            decoded_email = decode_email(email.split("#", 1)[1])
jpmckinney marked this conversation as resolved.
             p = Person(primary_org="legislature", name=name, district=district, role=role)
             p.add_source(COUNCIL_PAGE)
             p.add_contact("voice", phone, "legislature")
-            p.add_contact("email", email)
+            p.add_contact("email", decoded_email)
             image = councillor.xpath(".//img/@src")
             if image:
                 p.image = image[0]
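
For reference, the email decoding in this diff can be tried outside the scraper. The following is a minimal sketch, assuming the obfuscated value is a hex string whose first byte is an XOR key for the remaining bytes, which is what `decode_email` above implements. The names `decode_cfemail` and `encode_cfemail`, the sample address, and the sample href are hypothetical and are not part of this PR.

```python
def decode_cfemail(encoded: str) -> str:
    """Decode a hex string whose first byte is the XOR key for the rest."""
    key = int(encoded[:2], 16)
    return "".join(
        chr(int(encoded[i : i + 2], 16) ^ key) for i in range(2, len(encoded), 2)
    )


def encode_cfemail(address: str, key: int = 0x42) -> str:
    """Inverse of decode_cfemail, used here only to build sample input."""
    return f"{key:02x}" + "".join(f"{ord(ch) ^ key:02x}" for ch in address)


# Hypothetical href: the address and the URL path are made up for illustration.
href = "/cdn-cgi/l/email-protection#" + encode_cfemail("mayor@example.com")

# Same extraction step as in the diff: keep everything after the first "#".
decoded = decode_cfemail(href.split("#", 1)[1])
print(decoded)
```

The loop in the diff stops at `len(e) - 1` rather than `len(e)`; for well-formed, even-length hex input both bounds visit the same byte offsets.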
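
The phone cleanup chain in the second hunk is easier to follow on a concrete string. This is a sketch only: `normalize_phone` is a hypothetical wrapper around the same four `replace` calls, and the sample inputs are invented, since the actual page text is not shown in this PR.

```python
def normalize_phone(raw: str) -> str:
    """Apply the same replacements as the scraper, in the same order."""
    return (
        raw.replace("T ", "")      # drop the leading "T " label
        .replace(" ", "-")         # remaining spaces become dashes
        .replace(".", ",")         # a "." before the extension becomes ","
        .replace(",-#-", " x")     # ", # " (now ",-#-") becomes " x"
    )


# Invented sample strings, one with a comma and one with a period.
with_comma = normalize_phone("T 514 555-0100, # 123")
with_period = normalize_phone("T 514 555-0100. # 123")
print(with_comma, with_period)
```

Both inputs normalize to the same value, which is why the `"."` to `","` replacement has to run before the `",-#-"` replacement.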