New Scottish Parliament Scraper #172

Merged
merged 5 commits into from
Apr 26, 2024

Conversation

ajparsons
Contributor

This adds a new scraper for the Scottish Parliament's new site.

I've made a new sp_2024 folder and pulled across some of the elements needed for the ID parser.

There are three main steps:

  • Download: Uses the search page to find all agendas for that day, downloads the items of business, and combines them into one file per committee.
  • Parse: Creates structured information from the debates, containing the IDs and information being passed by the SP website.
  • Convert: Converts this structured information into TWFY format, adding debate and person IDs.
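
As a rough illustration of how the three steps fit together, here is a minimal sketch. All names here (`Item`, `download`, `parse`, `convert`, the dict keys and the ID format) are hypothetical and are not taken from the sp_2024 code; the real scraper fetches pages from the SP website, whereas this sketch works on in-memory dicts.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Item:
    """One item of business after parsing (hypothetical structure)."""
    committee: str
    heading: str
    speeches: List[str] = field(default_factory=list)


def download(agenda: List[dict]) -> Dict[str, List[dict]]:
    """Group raw items of business so there is one bundle per committee."""
    by_committee: Dict[str, List[dict]] = {}
    for raw in agenda:
        by_committee.setdefault(raw["committee"], []).append(raw)
    return by_committee


def parse(raw_items: List[dict]) -> List[Item]:
    """Turn raw dicts into structured items."""
    return [
        Item(r["committee"], r["heading"], list(r.get("speeches", [])))
        for r in raw_items
    ]


def convert(items: List[Item], date: str) -> List[dict]:
    """Attach an illustrative debate ID to each item (made-up ID format)."""
    return [
        {"id": f"{date}.{i}", "heading": item.heading, "speeches": item.speeches}
        for i, item in enumerate(items)
    ]
```

Keeping each stage as a separate function that hands plain data to the next means a stage can be re-run on its own when something breaks.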

(common and resolvenames are lightly re-formatted versions of the modules from the old scraper).

This seems to work as I'd expect for some recent ones - haven't tested actually loading the data.

There's some more special-case handling that could be brought over from the old scraper, but it probably makes more sense to bring it across as things break?

@ajparsons ajparsons requested a review from dracos April 15, 2024 11:11
Member

@dracos dracos left a comment

Really good start, but I think we need to do a bit more before switching to it. I have made a start on some at the download end. To sum up the ones I think are important:

  • Remembering the Presiding Officer / Deputy from their first use
  • And/or using the SP person IDs
  • Spotting topical question subheadings both in parse and convert
  • Handling unspoken names using or-bill-section-bold
  • Spotting timestamps (which I think include suspend/resume as meta text that we should include)
  • Spotting or-italic
  • Fixing speech after division
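
To illustrate the first point, remembering the Presiding Officer / Deputy from their first use could look something like the sketch below. It assumes, purely for illustration, that a speaker label names the person in brackets the first time (e.g. `The Presiding Officer (Jane Example):`) and uses the bare role afterwards. The function name, regex and label format are hypothetical and not taken from the scraper.

```python
import re
from typing import Dict, List, Optional, Tuple

# Assumed label format: "The Presiding Officer (Name)" or
# "The Deputy Presiding Officer (Name)".
ROLE_WITH_NAME = re.compile(
    r"^(?P<role>The (?:Deputy )?Presiding Officer) \((?P<name>[^)]+)\)$"
)


def resolve_speakers(labels: List[str]) -> List[Tuple[str, Optional[str]]]:
    """Pair each speaker label with a remembered name, if one is known."""
    seen: Dict[str, str] = {}  # role -> most recently named holder
    resolved = []
    for label in labels:
        label = label.strip().rstrip(":")
        match = ROLE_WITH_NAME.match(label)
        if match:
            # Named use: remember who holds the role from here on.
            seen[match.group("role")] = match.group("name")
            resolved.append((label, match.group("name")))
        elif label in seen:
            # Bare role: reuse the remembered name.
            resolved.append((label, seen[label]))
        else:
            # Ordinary speaker, or a role not yet named.
            resolved.append((label, None))
    return resolved
```

Because a later named use overwrites the remembered name, this also copes with a different Deputy taking the chair partway through a sitting.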

@dracos dracos force-pushed the new-sp-scraper-2024 branch 2 times, most recently from ae430fc to 3c19e8e Compare April 19, 2024 15:59
@ajparsons ajparsons force-pushed the new-sp-scraper-2024 branch from 762a000 to f87d77a Compare April 25, 2024 15:51
@ajparsons
Contributor Author

That should be all the major issues tidied up - does this also need an adjustment in TWFY to pull from the sp_2024 directory it puts the finished files in?

@dracos dracos merged commit f87d77a into master Apr 26, 2024
1 check passed
@dracos
Member

dracos commented Apr 26, 2024

Yep, once the parser is updated and has pulled in some data, we can update TWFY so it starts loading in from there.
