Scraping public datasets #213

yiwen-h · 2024-11-06T12:52:04Z

e.g. Appointments in General Practice
How to download a webpage
How to find releases
How to troubleshoot problems
How to be responsible in what you download
Could include {rvest} and {polite}; also point to APIs where possible, using {httr2} (Matt)
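
A possible starting point for the "be responsible" item: check a site's robots.txt before downloading anything. In R, {polite} is the package suggested above for this; the sketch below instead uses only Python's standard library, purely as an illustration. The robots.txt content, the example.org URLs and the user agent name are all made-up assumptions, inlined so the example runs without network access.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, inlined so the sketch needs no network access.
EXAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent name for the demo scraper.
USER_AGENT = "coffee-and-coding-demo"


def build_parser(robots_txt: str) -> RobotFileParser:
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def is_allowed(parser: RobotFileParser, url: str) -> bool:
    """Return True if the rules permit USER_AGENT to fetch the URL."""
    return parser.can_fetch(USER_AGENT, url)


parser = build_parser(EXAMPLE_ROBOTS_TXT)
allowed_public = is_allowed(parser, "https://example.org/statistics/release.csv")
allowed_private = is_allowed(parser, "https://example.org/private/draft.csv")
# Seconds the site asks scrapers to wait between requests (None if not stated).
delay_seconds = parser.crawl_delay(USER_AGENT)

print(allowed_public, allowed_private, delay_seconds)
```

Against a real site, the parser would be pointed at the live robots.txt (for example with `set_url()` and `read()`), and the crawl delay would be respected with a pause between requests.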

StatsRhian · 2024-11-27T15:58:30Z

Notes from our chat (2024-11-27)

Scraping varies in difficulty. Some sources (e.g. links on a web page) are harder to scrape reliably than others (e.g. an RSS feed).
How do we watch for changes?
Jonathan could share how he has done this on other projects and what the challenges were.
What makes scraping easy? What makes it hard?
What if the website changes its structure?
How can we quickly identify key parts of a website?
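
A rough sketch of two of the questions above (finding the key links on a page, and watching for changes), assuming the files of interest can be recognised by their extension. It uses Python's standard library only rather than {rvest}, and the HTML snippets, example.org URLs and function names are invented for illustration; a real release page will be messier.

```python
import hashlib
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if href:
            self.links.append(urljoin(self.base_url, href))


def extract_links(html: str, base_url: str, suffixes=(".csv", ".xlsx", ".zip")):
    """Return the sorted, de-duplicated links that look like data files."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return sorted({link for link in collector.links if link.lower().endswith(suffixes)})


def fingerprint(links) -> str:
    """Hash the link list so two snapshots can be compared cheaply."""
    return hashlib.sha256("\n".join(links).encode("utf-8")).hexdigest()


# Hypothetical snapshots of the same release page taken a month apart.
OLD_PAGE = '<a href="/files/oct-2024.csv">October 2024</a><a href="/about">About</a>'
NEW_PAGE = (
    '<a href="/files/nov-2024.csv">November 2024</a>'
    '<a href="/files/oct-2024.csv">October 2024</a>'
    '<a href="/about">About</a>'
)

BASE_URL = "https://example.org/releases"  # hypothetical
old_links = extract_links(OLD_PAGE, BASE_URL)
new_links = extract_links(NEW_PAGE, BASE_URL)

changed = fingerprint(old_links) != fingerprint(new_links)
added = sorted(set(new_links) - set(old_links))
print(changed, added)
```

Hashing only the extracted links, rather than the whole page, means cosmetic changes to the site do not trigger a false alarm. If the page structure changes enough that no links are found at all, that is itself a useful signal to check by hand.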

Jonathan & Rhian would like to mock up an example of scraping unstructured data
For example, the Appointments in General Practice web pages
Could try and adapt the workflow we use to see if it also works with similar NHS Digital pages (e.g. Workforce statistics)

Jonathan is interested in taking this work further with the more challenging board meeting data.
This could make a nice NLP data set
C&C is a good way to gauge if anyone in SU would be interested in looking into this

yiwen-h added the C&C ☕ Session idea for Coffee & Coding label Nov 6, 2024

github-project-automation bot added this to Coffee and Coding ☕🧑‍💻 Nov 6, 2024

github-project-automation bot moved this to Potential in Coffee and Coding ☕🧑‍💻 Nov 6, 2024

StatsRhian assigned StatsRhian and jspncr Nov 27, 2024

StatsRhian moved this from Potential to Scheduled in Coffee and Coding ☕🧑‍💻 Nov 27, 2024