Scraping public datasets #213

Open
yiwen-h opened this issue Nov 6, 2024 · 1 comment
Labels
C&C ☕ Session idea for Coffee & Coding

Comments

@yiwen-h
Member

yiwen-h commented Nov 6, 2024

  • e.g. Appointments in General Practice
  • How to download a webpage
  • How to find releases
  • How to troubleshoot problems
  • How to be responsible about what you download
  • Could include {rvest} and {polite}, also point to APIs where possible with {httr2} (Matt)
@StatsRhian
Member

StatsRhian commented Nov 27, 2024

Notes from our chat (2024-11-27)

Overview

  • Data scraping varies in difficulty. Some sources (such as links on a web page) are harder to scrape reliably than others (such as an RSS feed).
  • How do we watch for changes?
  • Jonathan shares how he has done this on other projects and what the challenges were.
  • What makes scraping easy? What makes it hard?
  • What if the website changes its structure?
  • How can we quickly identify key parts of a website?
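
One way to approach the "How do we watch for changes?" question is to fingerprint only the part of a page we care about (here, the set of links) and compare fingerprints between visits. This is a minimal sketch under assumptions, not something taken from the chat: it uses Python's standard library rather than {rvest}, and the HTML snippets, file paths, and the names `LinkCollector` and `link_fingerprint` are all hypothetical.

```python
import hashlib
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag in a page."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def link_fingerprint(html: str) -> str:
    """Hash the sorted set of links so purely cosmetic changes are ignored."""
    parser = LinkCollector()
    parser.feed(html)
    joined = "\n".join(sorted(set(parser.links)))
    return hashlib.sha256(joined.encode("utf-8")).hexdigest()


# Hypothetical snapshots of a publication page (made-up markup and paths).
old_page = '<ul><li><a href="/data/oct-2024.csv">October</a></li></ul>'
restyled_page = '<div class="new"><a href="/data/oct-2024.csv">Oct 2024</a></div>'
new_release_page = old_page + '<a href="/data/nov-2024.csv">November</a>'

# Same links, different layout: fingerprints match.
print(link_fingerprint(old_page) == link_fingerprint(restyled_page))
# A new release link appears: fingerprints differ.
print(link_fingerprint(old_page) == link_fingerprint(new_release_page))
```

Hashing the extracted links rather than the raw HTML means a restyled page does not trigger a false alarm, although a site that renames its link paths would still need a human to look.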

Being polite

  • What is the etiquette when scraping?
  • What do we need to bear in mind?
  • What tools can help us scrape nicely?
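
As a rough illustration of the etiquette questions above, a scraper can check a site's robots.txt before requesting a URL and wait between requests. The thread suggests {polite} for this in R; the sketch below instead uses Python's standard-library `urllib.robotparser` so that it runs without a network connection. The robots.txt content, the user agent string, the example URLs, and the helper names `allowed` and `polite_delay` are assumptions made up for the example.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; a real script would fetch it from the site.
# Crawl-delay is a non-standard but widely used directive.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 5
"""

# Hypothetical user agent; in practice, identify yourself honestly.
USER_AGENT = "coffee-and-coding-demo"

robots = RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def allowed(url: str) -> bool:
    """Return True if robots.txt permits this user agent to fetch the URL."""
    return robots.can_fetch(USER_AGENT, url)


def polite_delay(default_seconds: float = 1.0) -> float:
    """Use the site's Crawl-delay if it sets one, else a default pause."""
    delay = robots.crawl_delay(USER_AGENT)
    return float(delay) if delay is not None else default_seconds


for url in ("https://example.org/data/file.csv",
            "https://example.org/private/notes.html"):
    print(url, "allowed" if allowed(url) else "disallowed")
print("seconds to wait between requests:", polite_delay())
```

In a real scraper, a call such as `time.sleep(polite_delay())` between requests would apply the pause, and disallowed URLs would simply be skipped.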

Related topics

These adjacent topics are relevant to data scraping, but we won't go into detail on them:

  • Job scheduling (Windows Task Scheduler, cron jobs, GitHub Actions, Connect scheduler)
  • Data validation (does the data that I downloaded meet the expected structure?)
  • APIs (how to write a query)
  • Scraping structured data (XML, RSS, etc.)

Specific Project

Future work

  • Jonathan is interested in taking this work further with the more challenging board meeting data.
  • This could make a nice NLP data set
  • C&C is a good way to gauge if anyone in SU would be interested in looking into this

Projects
Status: Scheduled
Development

No branches or pull requests

3 participants