
Add documentation for scrapers #25

Merged · 12 commits merged into main · Sep 26, 2024
Conversation

@hancush (Collaborator) commented Aug 29, 2024

Description

Hark! A documentation site!

Succeeding build here: https://github.com/Metro-Records/scrapers-lametro/actions/runs/10622927449/job/29448271525

Testing

  • Review the documentation at https://metro-records.github.io/scrapers-lametro/ and confirm it's a good first draft.
  • Try to run the Quarto docs locally following the instructions in the README and confirm you can preview the built site and make updates.

Comment on lines +23 to +34
```yaml
      - name: Set up Quarto
        if: steps.docs-changed.outputs.any_changed == 'true'
        uses: quarto-dev/quarto-actions/setup@v2

      - name: Render and Publish
        if: steps.docs-changed.outputs.any_changed == 'true'
        uses: quarto-dev/quarto-actions/publish@v2
        with:
          path: docs
          target: gh-pages
        env:
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```
hancush (author):
These steps are only run if there are changes in the docs/ directory.
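For context, the `docs-changed` step these conditionals reference is not shown in this excerpt. It could be defined with a changed-files action; a sketch, assuming `tj-actions/changed-files` (the actual step in the workflow may differ):

```yaml
      # Hypothetical: the real docs-changed step is not shown in this excerpt
      - name: Check for docs changes
        id: docs-changed
        uses: tj-actions/changed-files@v45
        with:
          files: docs/**
```

The `any_changed` output of such a step is `'true'` whenever any file under `docs/` changed, which is what the two `if:` conditions above test.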

Comment on lines +1 to +26
```yaml
version: '2.4'

services:
  scrapers:
    image: scrapers-lametro
    container_name: scrapers-lametro
    build: .
    stdin_open: true
    tty: true
    volumes:
      - .:/app
    environment:
      # Populate the local Councilmatic database
      DATABASE_URL: postgresql://postgres:postgres@postgres:5432/lametro
      DJANGO_SETTINGS_MODULE: pupa.settings
      OCD_DIVISION_CSV: "/app/lametro/divisions.csv"
    command: pupa update lametro
    # Connect the scraper container to the app network
    networks:
      - app_net

networks:
  # Define connection to the app's Docker network
  app_net:
    name: la-metro-councilmatic_default
    external: true
```
hancush (author):
Cc @xmedr

Collaborator:
ah this makes things way easier, ty!

```diff
@@ -0,0 +1,224 @@
---
```
hancush (author):
Migrated from the Councilmatic wiki.

```sh
docker compose -f docker-compose.councilmatic.yml run --rm scrapers
```

### Useful pupa commands
hancush (author):
Adapted from the Councilmatic wiki.
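For readers skimming this thread, the commands in that section look roughly like the following. These are illustrative invocations of the standard `pupa` CLI (the exact list lives in the rendered docs, so verify there):

```sh
# Run only the scrape phase, skipping the import into the database
docker compose -f docker-compose.councilmatic.yml run --rm scrapers \
    pupa update lametro --scrape

# Scrape a single object type, e.g. bills
docker compose -f docker-compose.councilmatic.yml run --rm scrapers \
    pupa update lametro bills
```

Both forms override the container's default `pupa update lametro` command from the compose file above.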

xmedr (Collaborator) left a comment:
This is really great! And the quarto docs run/update locally. Got some little comments below.

I also wanted to ask: I found that it took some time to figure out how some terms in Legistar mapped to Metro. For example, that "matter" equates to "board report", or a "body" refers to a "legislation" (I might be wrong about that). Would it be useful to have a small note of alternative names that certain items might be called in the individual scrapers sections?


#### Legistar contains the wrong data

Data issues can occur when the Legistar API or web interface displays the wrong information. This generally happens with Metro staff enters information that is incorrect.
xmedr commented Sep 3, 2024:
might be a small typo. was it supposed to be something like:

This generally happens <when> Metro staff enters information that is incorrect.


The dashboard lives on GitHub, [here](https://github.com/Metro-Records/la-metro-dashboard).

Apache maintains thorough documentation of [core concepts](http://airflow.apache.org/docs/stable/concepts.html), as well as [navigating the UI](http://airflow.apache.org/docs/stable/ui.html). If you've never used Airflow before, these are great resources to start with!
Collaborator:
this is super helpful!


![windowed_bill_scraping DAG tree view](https://i.imgur.com/9NDdEpy.png)

Note that the scraping DAGs employ [branch operators](https://airflow.apache.org/docs/stable/concepts.html?highlight=branch#branching) to determine what kind of scraping task to run. Be sure to look closely to verify that the task you're expecting is green (succeeded), not pink (skipped).
Collaborator:
this link leads to the core concepts page without highlighting anything. would this link make sense?

https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html#concepts-branching
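For anyone unfamiliar with branch operators: the branch callable simply returns the `task_id` to follow, and every task not on the returned branch is skipped (hence pink in the tree view). A minimal sketch of the kind of decision the scraping DAGs make — the task names and the hour-based rule here are hypothetical, not taken from la-metro-dashboard:

```python
from datetime import datetime, timezone

# Hypothetical task ids; the real DAG's names may differ
FULL_SCRAPE = "full_scrape"
WINDOWED_SCRAPE = "windowed_scrape"


def choose_scrape_task(now: datetime) -> str:
    """Return the task_id a BranchPythonOperator would follow.

    The chosen branch runs (green in the tree view); the other
    branch is skipped (pink).
    """
    # Illustrative rule: full scrape at the top of the day,
    # otherwise a windowed scrape of recently updated items.
    if now.hour == 0:
        return FULL_SCRAPE
    return WINDOWED_SCRAPE


print(choose_scrape_task(datetime(2024, 9, 5, 0, 30, tzinfo=timezone.utc)))   # full_scrape
print(choose_scrape_task(datetime(2024, 9, 5, 14, 30, tzinfo=timezone.utc)))  # windowed_scrape
```

In Airflow this function would be wrapped in a `BranchPythonOperator`, which is why a "failed" scrape sometimes just means you are looking at the skipped branch.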

Comment on lines 11 to 12
- Metro streams audio in both English and Spanish. They cannot associated multiple
broadcast link with one event in Legistar, so they create two nearly identical events that
Collaborator:
super small typo. is it supposed to be:

They cannot associate multiple broadcast links with one event in Legistar, ...

antidipyramid (Collaborator) left a comment:
Amazing @hancush! Just one question inline.


If scrapes are running, by far the most common source of issues is the scraper failing to capture changes to a bill or event. The root issue is that we rely on timestamps that should indicate an update to determine which bills and events to scrape. In reality, these timestamps do not always update when a change is made to an event or bill in Legistar. We have a couple of strategies to get around this:

1. During [the Friday support window](https://github.com/datamade/la-metro-dashboard/blob/ac416e5e03f6a97fb9b0c6112093c679cefb0d1c/dags/constants.py#L41-L59), we scrape all events and bills at the top of every hour.
Collaborator:
Should we remove references to the Friday support window?

hancush (author):
Yes! Done.
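The timestamp-windowing strategy the quoted docs describe can be sketched as follows. Field and function names here are illustrative, not the scraper's actual code:

```python
from datetime import datetime, timedelta, timezone


def items_to_scrape(items, last_scrape, full_scrape=False):
    """Select items to re-scrape.

    `items` is a list of dicts with a `last_modified` timestamp, as
    reported by the Legistar API. Because these timestamps are not
    always updated when an item changes, a periodic full scrape
    re-scrapes everything regardless of timestamp.
    """
    if full_scrape:
        return list(items)
    return [i for i in items if i["last_modified"] > last_scrape]


now = datetime(2024, 9, 5, tzinfo=timezone.utc)
items = [
    {"id": 1, "last_modified": now - timedelta(hours=1)},
    {"id": 2, "last_modified": now - timedelta(days=2)},
]
last_scrape = now - timedelta(hours=6)

print([i["id"] for i in items_to_scrape(items, last_scrape)])        # [1]
print([i["id"] for i in items_to_scrape(items, last_scrape, True)])  # [1, 2]
```

Item 2 changed before the last scrape (or changed without its timestamp updating), so only a full scrape picks it up — which is exactly the failure mode the periodic full scrapes guard against.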

@hancush (Collaborator, Author) commented Sep 5, 2024

@xmedr, @antidipyramid Thanks for the review, you two! Addressed your comments and published the changes manually: https://metro-records.github.io/scrapers-lametro/

Xavier, thanks for flagging synonyms! I've added those to each of the scraper pages, and I also added core concepts to the Jurisdiction scraper, since I found that really confusing when I got started. I also added links to the Open Civic Data documentation where relevant. They're really useful for understanding why data is structured the way it is.

Let me know if this looks good to merge!

xmedr (Collaborator) left a comment:
These are hugely helpful, thank you! Looks good to me

@hancush hancush merged commit a512658 into main Sep 26, 2024
1 check passed