Add documentation for scrapers #25
- name: Set up Quarto
  if: steps.docs-changed.outputs.any_changed == 'true'
  uses: quarto-dev/quarto-actions/setup@v2

- name: Render and Publish
  if: steps.docs-changed.outputs.any_changed == 'true'
  uses: quarto-dev/quarto-actions/publish@v2
  with:
    path: docs
    target: gh-pages
  env:
    GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
These steps are only run if there are changes in the docs/
directory.
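As a rough illustration of the condition those steps depend on, here is a minimal Python sketch of a "did anything under docs/ change?" check. It is not the workflow's actual implementation: the function name `docs_changed` and the example file paths are hypothetical, and the real workflow gets this answer from the `docs-changed` step's `any_changed` output.

```python
from pathlib import PurePosixPath


def docs_changed(changed_paths, docs_dir="docs"):
    """Return True if any changed path lives under docs_dir.

    Hypothetical helper for illustration only; the workflow itself
    relies on the `any_changed` output of its `docs-changed` step.
    """
    docs = PurePosixPath(docs_dir)
    # A path is "under" docs_dir when docs_dir is one of its parent directories.
    return any(docs in PurePosixPath(path).parents for path in changed_paths)


# Example file lists (made up for this sketch).
print(docs_changed(["docs/debugging.qmd", "lametro/bills.py"]))
print(docs_changed(["lametro/bills.py", "docs-old/notes.md"]))
```

Comparing against parent directories, rather than testing a string prefix, keeps a sibling directory such as `docs-old/` from counting as a docs change.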
version: '2.4'

services:
  scrapers:
    image: scrapers-lametro
    container_name: scrapers-lametro
    build: .
    stdin_open: true
    tty: true
    volumes:
      - .:/app
    environment:
      # Populate the local Councilmatic database
      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/lametro
      DJANGO_SETTINGS_MODULE: pupa.settings
      OCD_DIVISION_CSV: "/app/lametro/divisions.csv"
    command: pupa update lametro
    # Connect the scraper container to the app network
    networks:
      - app_net

networks:
  # Define connection to the app's Docker network
  app_net:
    name: la-metro-councilmatic_default
    external: true
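To make the role of the external network concrete, the sketch below splits the `DATABASE_URL` from this compose file into its parts using Python's standard library. The helper name `describe_database_url` is hypothetical. The point it illustrates is that the host is the bare name `postgres`, which presumably is the database service in the app's compose project (an assumption, since that file is not shown here), so it would only resolve once the scraper container joins that project's network.

```python
from urllib.parse import urlsplit

# Value copied from the compose file in this pull request.
DATABASE_URL = "postgresql://postgres:postgres@postgres:5432/lametro"


def describe_database_url(url):
    """Split a database URL into the pieces a client would connect with.

    Hypothetical helper for illustration; pupa reads DATABASE_URL itself.
    """
    parts = urlsplit(url)
    return {
        "user": parts.username,
        "host": parts.hostname,  # a Docker service name, not a public hostname
        "port": parts.port,
        "database": parts.path.lstrip("/"),
    }


print(describe_database_url(DATABASE_URL))
```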
Cc @xmedr
ah this makes things way easier, ty!
@@ -0,0 +1,224 @@
---
Migrated from the Councilmatic wiki.
docker compose -f docker-compose.councilmatic.yml run --rm scrapers
```

### Useful pupa commands
Adapted from the Councilmatic wiki.
This is really great! And the quarto docs run/update locally. Got some little comments below.
I also wanted to ask: I found that it took some time to figure out how some terms in Legistar mapped to Metro. For example, that "matter" equates to "board report", or a "body" refers to a "legislation" (I might be wrong about that). Would it be useful to have a small note of alternative names that certain items might be called in the individual scrapers sections?
docs/debugging.qmd
Outdated
#### Legistar contains the wrong data

Data issues can occur when the Legistar API or web interface displays the wrong information. This generally happens with Metro staff enters information that is incorrect.
might be a small typo. was it supposed to be something like:
This generally happens <when> Metro staff enters information that is incorrect.
The dashboard lives on GitHub, [here](https://github.com/Metro-Records/la-metro-dashboard).

Apache maintains thorough documentation of [core concepts](http://airflow.apache.org/docs/stable/concepts.html), as well as [navigating the UI](http://airflow.apache.org/docs/stable/ui.html). If you've never used Airflow before, these are great resources to start with!
this is super helpful!
docs/debugging.qmd
Outdated
![windowed_bill_scraping DAG tree view](https://i.imgur.com/9NDdEpy.png)

Note that the scraping DAGs employ [branch operators](https://airflow.apache.org/docs/stable/concepts.html?highlight=branch#branching) to determine what kind of scraping task to run. Be sure to look closely to verify that the task you're expecting is green (succeeded), not pink (skipped).
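For readers new to branching, the sketch below shows the idea in plain Python, without importing Airflow. In Airflow, a branch operator's callable returns the ID of the task to follow, and the tasks on the branches not chosen are marked skipped. The task IDs (`full_scrape`, `windowed_scrape`) and the `run_full_scrape` flag are hypothetical and are not taken from the dashboard's DAGs.

```python
def choose_scrape_task(run_full_scrape):
    """Return the ID of the task a branch step would follow.

    This has the shape of a callable one might hand to Airflow's
    BranchPythonOperator. The task IDs are invented for this sketch.
    """
    return "full_scrape" if run_full_scrape else "windowed_scrape"


def expected_states(chosen, candidate_tasks):
    """Map each candidate task to the state you would expect in the UI."""
    return {
        task: "success (green)" if task == chosen else "skipped (pink)"
        for task in candidate_tasks
    }


tasks = ["full_scrape", "windowed_scrape"]
chosen = choose_scrape_task(run_full_scrape=False)
print(expected_states(chosen, tasks))
```

Seeing a pink task is therefore not a failure in itself; it only matters when the pink task is the one you expected to run.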
this link leads to the core concepts page without highlighting anything. would this link make sense?
https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#concepts-branching
docs/scrapers/events.qmd
Outdated
- Metro streams audio in both English and Spanish. They cannot associated multiple
broadcast link with one event in Legistar, so they create two nearly identical events that
super small typo. is it supposed to be:
They cannot associate multiple broadcast links with one event in Legistar, ...
Amazing @hancush! Just one question inline.
docs/debugging.qmd
Outdated
If scrapes are running, by far the most common source of issues is the scraper failing to capture changes to a bill or event. The root issue is that we rely on timestamps that should indicate an update to determine which bills and events to scrape. In reality, these timestamps do not always update when a change is made to an event or bill in Legistar. We have a couple of strategies to get around this:

1. During [the Friday support window](https://github.com/datamade/la-metro-dashboard/blob/ac416e5e03f6a97fb9b0c6112093c679cefb0d1c/dags/constants.py#L41-L59), we scrape all events and bills at the top of every hour.
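The failure mode described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the scraper's code: the record shape (`id`, `updated`), the function name `select_for_scrape`, and the sample data are all hypothetical. It shows why a timestamp filter misses an edit whose timestamp never moved, and why scraping everything catches it.

```python
from datetime import datetime


def select_for_scrape(records, last_scrape, full=False):
    """Pick the records a scrape would visit.

    Hypothetical sketch: a windowed scrape trusts the `updated` timestamp,
    while a full scrape ignores it and visits every record.
    """
    if full:
        return [record["id"] for record in records]
    return [record["id"] for record in records if record["updated"] > last_scrape]


last_scrape = datetime(2024, 8, 29, 12, 0)
records = [
    # Edited after the last scrape, and the timestamp reflects it.
    {"id": "bill-1", "updated": datetime(2024, 8, 29, 13, 0)},
    # Also edited after the last scrape, but the timestamp was never bumped.
    {"id": "bill-2", "updated": datetime(2024, 8, 28, 9, 0)},
]

print(select_for_scrape(records, last_scrape))             # misses bill-2
print(select_for_scrape(records, last_scrape, full=True))  # visits both
```

The trade-off is cost: a full scrape visits every record, so it runs longer than a windowed scrape.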
Should we remove references to the Friday support window?
Yes! Done.
@xmedr, @antidipyramid Thanks for the review, you two! Addressed your comments and published the changes manually: https://metro-records.github.io/scrapers-lametro/

Xavier, thanks for flagging synonyms! I've added those to each of the scraper pages, and I also added core concepts to the Jurisdiction scraper, since I found that really confusing when I got started. I also added links to the Open Civic Data documentation where relevant. They're really useful for understanding why data is structured the way it is.

Let me know if this looks good to merge!
These are hugely helpful, thank you! Looks good to me
Description
Hark! A documentation site!
Succeeding build here: https://github.com/Metro-Records/scrapers-lametro/actions/runs/10622927449/job/29448271525
Testing