New Scottish Parliament Scraper #172

ajparsons · 2024-04-15T11:11:07Z

This adds a new scraper for the Scottish Parliament's new site.

I've made a new sp_2024 folder and pulled across some of the elements needed for the ID parser.

There are three main steps:

Download: Uses the search page to find all agendas for that day, downloads the items of business, and composites them into one file per committee.
Parse: Creates structured information from the debates, containing the IDs and information passed by the SP website.
Convert: Converts this structured information into TWFY format, adding debate and person_ids.
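As a rough illustration of how the three steps chain together, here is a minimal sketch. The function names, file paths, and the `Stage` helper are hypothetical stand-ins, not the actual sp_2024 code; each step just hands its output to the next.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List


@dataclass
class Stage:
    """One named step of the pipeline (hypothetical helper)."""
    name: str
    run: Callable[[str], str]


def run_pipeline(date: str, stages: Iterable[Stage]) -> List[str]:
    """Run each stage in order, feeding each stage's output to the next."""
    log = []
    payload = date
    for stage in stages:
        payload = stage.run(payload)
        log.append(f"{stage.name}: {payload}")
    return log


# Hypothetical stand-ins: the real steps fetch and transform documents,
# these only rename a path so the flow is visible.
def download(date: str) -> str:
    return f"raw/{date}.html"


def parse(path: str) -> str:
    return path.replace("raw/", "parsed/").replace(".html", ".xml")


def convert(path: str) -> str:
    return path.replace("parsed/", "twfy/")


STAGES = [
    Stage("download", download),
    Stage("parse", parse),
    Stage("convert", convert),
]

if __name__ == "__main__":
    for line in run_pipeline("2024-04-15", STAGES):
        print(line)
```

Keeping the steps separate like this means a failed parse or convert can be re-run without downloading again.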

(common and resolvenames are lightly re-formatted versions of the modules from the old scraper).

This seems to work as I'd expect for some recent ones - haven't tested actually loading the data.

There's some more special case stuff that could be loaded from the old scraper, but probably makes more sense to bring it across as things break?

dracos

Really good start, but I think we need to do a bit more before switching to it. I have made a start on some at the download end. To sum up what I think are the important ones:

Remembering the Presiding Officer / Deputy from their first use
And/or using the SP person IDs
Spotting topical question subheadings both in parse and convert
Handling unspoken names using or-bill-section-bold
Spotting timestamps (which include suspend/resume as meta text to include I think)
Spotting or-italic
Fixing speech after division
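To make the first item concrete, here is a minimal sketch of remembering a role holder from their first use. It assumes, purely for illustration, that the first mention looks like "The Presiding Officer (Name)" and later mentions omit the name; the label format, the regular expression, and `resolve_speakers` are hypothetical, not taken from the scraper.

```python
import re
from typing import Dict, List

# Assumed label shape: "The Presiding Officer (Name)" or
# "The Deputy Presiding Officer (Name)" on first use.
ROLE_WITH_NAME = re.compile(
    r"^(?P<role>The (?:Deputy )?Presiding Officer) \((?P<name>[^)]+)\)$"
)


def resolve_speakers(labels: List[str]) -> List[str]:
    """Replace bare role labels with the name seen at the role's first use."""
    remembered: Dict[str, str] = {}
    resolved = []
    for label in labels:
        match = ROLE_WITH_NAME.match(label)
        if match:
            # First (or repeated) full mention: record who holds the role.
            remembered[match.group("role")] = match.group("name")
            resolved.append(match.group("name"))
        else:
            # Bare role label: use the remembered name if there is one,
            # otherwise leave the label untouched.
            resolved.append(remembered.get(label, label))
    return resolved
```

A role that has not yet been introduced with a name is left as-is, so unresolved cases stay visible rather than being guessed.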

pyscraper/sp_2024/__main__.py

pyscraper/sp_2024/download.py

pyscraper/sp_2024/convert.py

ajparsons · 2024-04-25T16:04:31Z

That should be all the major issues tidied up - does this also need an adjustment in TWFY to pull from the sp_2024 directory it puts the finished files in?

dracos · 2024-04-26T09:56:57Z

Yep, once the parser is updated and has pulled in some data, we can update it so it starts loading in from there

ajparsons added 2 commits April 15, 2024 10:54

Add mirroring action

5e42d2a

Update requirements

c669da2

ajparsons requested a review from dracos April 15, 2024 11:11

Remove unused code.

7c70d2f

dracos requested changes Apr 19, 2024

dracos force-pushed the new-sp-scraper-2024 branch 2 times, most recently from ae430fc to 3c19e8e Compare April 19, 2024 15:59

ajparsons added 2 commits April 25, 2024 15:46

Add scraper for new Scottish Parliament site

277c208

Update scripts to reference new SP scraper

f87d77a

ajparsons force-pushed the new-sp-scraper-2024 branch from 762a000 to f87d77a Compare April 25, 2024 15:51

dracos merged commit f87d77a into master Apr 26, 2024
1 check passed
