Docs (#13)
* Update contributor docs #5
* Update usage docs #6
* update deps
* Update main README #11
* add stub page for maintainer docs #9
zstumgoren authored Apr 10, 2024
1 parent de28b11 commit 309e677
Showing 6 changed files with 142 additions and 103 deletions.
1 change: 1 addition & 0 deletions Pipfile
@@ -22,6 +22,7 @@ typing-extensions = "*"
types-retry = "*"
types-beautifulsoup4 = "*"
types-openpyxl = "*"
blacken-docs = "*"

[packages]
beautifulsoup4 = "*"
53 changes: 31 additions & 22 deletions Pipfile.lock

Some generated files are not rendered by default.

20 changes: 14 additions & 6 deletions README.md
@@ -1,13 +1,21 @@
# CLEAN scrapers

This repo contains scrapers to gather police bodycam video footage and other files from police department websites, as part of the Community Law Enforcement Accountability Network.
Welcome to the `clean-scraper` project!

We welcome open-source contributions to this project. If you'd like to pitch in, check out the [growing list of agencies](https://docs.google.com/spreadsheets/d/e/2PACX-1vTBcJKRsufBPYLsX92ZhaHrjV7Qv1THMO4EBhOCmEos4ayv6yB6d9-VXlaKNr5FGaViP20qXbUvJXgL/pubhtml?gid=0&single=true) we need to scrape and ping us in Discussions :point_up: to claim an agency.
This repo contains scrapers to gather police bodycam video footage and other files from police department websites, as part of the Community Law Enforcement Accountability Network (CLEAN).

> :warning: This is a new scraping effort (as of March 2024). We're planning to provide Developer guidelines and sample code in the near future. In the meantime, please ping if you have questions about how to get started.
The CLEAN project is a collaborative effort between the ACLU, [Big Local News][] at Stanford, the [UC Berkeley Investigative Reporting Program][], and a variety of civil liberties organizations and news partners around the country. We're gathering police body camera footage, disciplinary records and other important information to help shine a light on the use of harmful and lethal force by law enforcement.

Our accountability reporting has produced a number of [news stories](docs/stories.md) already, but there's still much work to be done.

## Scrapers
We welcome open-source contributions to this project. Or if you'd just like to nab our code for your own purposes, you're welcome to do so.

- CA
- san_diego_pd
Please check out the resources below to get started:

- [Usage docs](docs/usage.md) - For folks who just need the code
- [Contributor docs](docs/contributing.md) - Want to write a scraper? Check this out.
- [Maintainer docs](docs/maintainers.md) - For those intrepid souls helping to manage the project


[Big Local News]: https://biglocalnews.org/content/about/
[UC Berkeley Investigative Reporting Program]: https://journalism.berkeley.edu/programs/mj/investigative-reporting/
148 changes: 74 additions & 74 deletions docs/contributing.md
@@ -1,18 +1,25 @@
# Contributing

Our project welcomes new contributors who want to help us fix bugs and improve our scrapers. The current status of our effort is documented in our [issue tracker](https://github.com/biglocalnews/clean-scrapers/issues).
Our project welcomes new contributors who want to help us add scrapers, fix bugs, or improve our existing codebase. The current status of our effort is documented in our [issue tracker](https://github.com/biglocalnews/clean-scraper/issues).

You can also chat with us over on [GitHub Discussions](https://github.com/biglocalnews/clean-scraper/discussions).

We want your help. We need your help. Here's how to get started.


Adding features and fixing bugs is managed using GitHub's [pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) system.

The tutorial that follows assumes you have the [Python](https://www.python.org/) programming language, the [pipenv](https://pipenv.pypa.io/) package manager and the [git](https://git-scm.com/) version control system already installed. If you don't, you'll want to address that first.

Below are details on the typical workflow.

## Claim an agency

If you'd like to write a new scraper, check out the [growing list of law enforcement agencies](https://docs.google.com/spreadsheets/d/e/2PACX-1vTBcJKRsufBPYLsX92ZhaHrjV7Qv1THMO4EBhOCmEos4ayv6yB6d9-VXlaKNr5FGaViP20qXbUvJXgL/pubhtml?gid=0&single=true) we need to scrape and ping us in Discussions :point_up: to claim an agency. Just make sure to pick one that hasn't yet been claimed.

## Create a fork

Start by visiting our project's repository at [github.com/biglocalnews/clean-scrapers](https://github.com/biglocalnews/clean-scrapers) and creating a fork. You can learn how from [GitHub's documentation](https://docs.github.com/en/get-started/quickstart/fork-a-repo).
Start by visiting our project's repository at [github.com/biglocalnews/clean-scraper](https://github.com/biglocalnews/clean-scraper) and creating a fork. You can learn how from [GitHub's documentation](https://docs.github.com/en/get-started/quickstart/fork-a-repo).

## Clone the fork

@@ -21,15 +28,15 @@ Now you need to make a copy of your fork on your computer using GitHub's cloning
A typical terminal command will look something like the following, with your username inserted in the URL.

``` bash
git clone git@github.com:yourusername/clean-scrapers.git
git clone git@github.com:yourusername/clean-scraper.git
```

## Install dependencies

You should [change directory](https://manpages.ubuntu.com/manpages/trusty/man1/cd.1posix.html) into the folder where your code was downloaded.

``` bash
cd clean-scrapers
cd clean-scraper
```

The `pipenv` package manager can install all of the Python tools necessary to run and test our code.
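
A typical invocation for a pipenv-managed project looks like the following (the exact command for this repo may differ slightly):

``` bash
# Install runtime and development dependencies into the project's virtualenv (assumed invocation)
pipenv install --dev
```
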
@@ -58,9 +65,9 @@ You can do this by running a line of code like the one below. You should substit
git switch -c your-branch-name
```

We ask that you follow a pattern where the **branch name includes the postal code of the state you're working on, combined with the issue number generated by GitHub**.
We ask that you name your branch using the following convention: **the state postal code + the GitHub issue number**.

For example, let's say you were working on a scraper for the San Diego Police Department and the related GitHub issue is `#100`.
For example, let's say you were working on a scraper for the San Diego Police Department in California and the related GitHub issue is `#100`.

You create a branch named `ca-100` and switch over to it (i.e. "check it out" locally, in git lingo) using the command below.
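
Presumably that is the same `git switch` invocation shown earlier, with the concrete branch name:

``` bash
# Create the example branch for the San Diego PD scraper (GitHub issue #100) and switch to it
git switch -c ca-100
```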

@@ -74,7 +81,7 @@ Now you can begin your work. You can start editing the code on your computer, ma

## Creating a new scraper

When adding a new state, you should create a new Python file in the following directory structure and format: `clean/{state_postal}/{agency_slug}`. Try to keep the agency slug, or abbreviation short but meaningful. If in doubt, hit us up and we can hash out a name. Naming things is hard, after all.
When adding a new state, you should create a new Python file in the following directory structure and format: `clean/{state_postal}/{agency_slug}`. Try to keep the agency slug, or abbreviation, short but meaningful. If in doubt, hit us up on an issue or in the [GitHub Discussions forum](https://github.com/biglocalnews/clean-scraper/discussions) and we can hash out a name. After all, [naming things is hard](https://martinfowler.com/bliki/TwoHardThings.html).

Here is the folder structure for the San Diego Police Department in California:

@@ -84,78 +91,59 @@ clean
   └── san_diego_pd.py
```

You can use the code for San Diego as a reference example to jumpstart
your own state.

When coding a new scraper, the new important conventions to follow are:

- Always create a top-level `scrape` function with the interface seen
below
- Always ensure that the `scrape` function downloads and stores files to
a standardized, configurable location (configured as parameters to the
`scrape` function)

``` python
from pathlib import Path

from .. import utils
from ..cache import Cache

You can use the code for San Diego as a reference example to jumpstart your own state.

def scrape(
    data_dir: Path = utils.CLEAN_DATA_DIR,
    cache_dir: Path = utils.CLEAN_CACHE_DIR,
) -> Path:
    """
    Scrape data from Iowa.
When coding a new scraper, there are a few important conventions to follow:

    Keyword arguments:
    data_dir -- the Path were the result will be saved (default WARN_DATA_DIR)
    cache_dir -- the Path where results can be cached (default WARN_CACHE_DIR)
- Add the agency's scraper module to a state-based folder (e.g. `clean/ca/san_diego_pd.py`)
- If it's a new state folder, add an empty `__init__.py` to the folder
- Create a `Site` class inside the agency's scraper module with the following attributes/methods:
- `name` - Official name of the agency
- `scrape_meta` - generates a CSV with metadata about videos and other available files (file name, URL, and size at minimum)
- `scrape` - uses the CSV generated by `scrape_meta` to download videos and other files

    Returns: the Path where the file is written
    """
    # Grab the page
    page = utils.get_url("https://xx.gov/yy.html")
    html = page.text
Below is a pared down version of San Diego's [Site](https://github.com/biglocalnews/clean-scraper/blob/main/clean/ca/san_diego_pd.py) class to illustrate these conventions.

    # Write the raw file to the cache
    cache = Cache(cache_dir)
    cache.write("xx/yy.html", html)
> The San Diego PD scraper code is in `clean/ca/san_diego_pd.py`
    # Parse the source file and convert to a list of rows, with a header in the first row.
    ## It's up to you to fill in the blank here based on the structure of the source file.
    ## You could do that here with BeautifulSoup or whatever other technique.
    pass
```python
class Site:

    # Set the path to the final CSV
    # We should always use the lower-case state postal code, like nj.csv
    output_csv = data_dir / "xx.csv"
    name = "San Diego Police Department"

    # Write out the rows to the export directory
    utils.write_rows_to_csv(output_csv, cleaned_data)
    def __init__(self, data_dir=utils.CLEAN_DATA_DIR, cache_dir=utils.CLEAN_CACHE_DIR):
        self.data_dir = data_dir
        self.cache_dir = cache_dir
        # etc.
        # etc.

    # Return the path to the final CSV
    return output_csv
    def scrape_meta(self, throttle=0):
        # 1. Scrape metadata about available files, making sure to download and save file
        # artifacts such as HTML pages along the way (we recommend using Cache.download)
        # 2. Generate a metadata CSV and store in the cache
        # 3. Return the path to the metadata CSV
        pass


if __name__ == "__main__":
    scrape()
    def scrape(self, metadata_csv):
        # 1. Use the metadata CSV generated by `scrape_meta` to download available files
        # to the cache directory (once again, check out Cache.download)
        # 2. Return the paths to the downloaded files
        pass
```
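
If you want to poke at a scraper interactively while developing (outside the CLI covered below), driving the class directly might look roughly like this; the import path and return values are assumptions based on the conventions above:

```python
# Hypothetical interactive usage, following the conventions sketched above
from clean.ca import san_diego_pd

site = san_diego_pd.Site()              # uses the default data and cache directories
metadata_csv = site.scrape_meta()       # writes a metadata CSV to the cache
downloaded_files = site.scrape(metadata_csv)  # downloads the files listed in that CSV
```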

When creating a scraper, there are a few rules of thumb (a short code sketch illustrating them follows this list).

1. The raw data being scraped --- whether it be HTML, video files, PDFs ---
should be saved to the cache unedited. We aim to store pristine
versions of our source data.
2. The data extracted from source files should be exported as a single
file. Any intermediate files generated during data processing should
1. The metadata about source files should be stored in a single
CSV file. Any intermediate files generated during file/data processing should
not be written to the data folder. Such files should be written to
the cache directory.
3. The final export should be the state's postal code and agency slug, in lower case.
For example, San Diego's final file should be saved `ca_san_diego_pd.csv`.
4. For simple cases, use a cache name identical to the final export name.
5. If many files need to be cached, create a subdirectory using the lower-case state postal code and agency slug (`ca_san_diego_pd`) and apply a sensible naming scheme to the cached files (e.g. `ca_san_diego_pd/page_1.html`)
1. Files should be cached in a site-specific cache folder using the agency slug name: `ca_san_diego_pd`.
If many files need to be cached, apply a sensible naming scheme to the cached files (e.g. `ca_san_diego_pd/index_page_1.html`)
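
To make these rules concrete, below is a minimal sketch of the steps a `scrape_meta` implementation might take, written as a standalone function for brevity. The helper names (`utils.get_url`, `Cache.write`, `utils.CLEAN_CACHE_DIR`) follow this repo's conventions, but the exact signatures and the agency URL are assumptions:

```python
import csv
from pathlib import Path

from .. import utils
from ..cache import Cache


def scrape_meta_sketch(cache_dir=utils.CLEAN_CACHE_DIR):
    cache = Cache(cache_dir)

    # 1. Save the pristine source page to a site-specific cache folder
    page = utils.get_url("https://example-pd.gov/video-releases.html")  # hypothetical URL
    cache.write("ca_san_diego_pd/index_page_1.html", page.text)

    # 2. Extract file metadata (parsing omitted here) and write it to a single CSV
    rows = [
        {"name": "video_1.mp4", "url": "https://example-pd.gov/files/video_1.mp4", "size": "1024"},
    ]
    metadata_csv = Path(cache_dir) / "ca_san_diego_pd" / "metadata.csv"
    metadata_csv.parent.mkdir(parents=True, exist_ok=True)
    with open(metadata_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["name", "url", "size"])
        writer.writeheader()
        writer.writerows(rows)

    # 3. Return the path to the metadata CSV so `scrape` can pick it up
    return metadata_csv
```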

## Running the CLI

@@ -164,36 +152,48 @@ After a scraper has been created, the command-line tool provides a method to tes
``` bash
pipenv run python -m clean.cli --help

Usage: python -m clean.cli [OPTIONS] [SCRAPERS]...

Command-line interface for downloading CLEAN police files.
Usage: python -m clean.cli [OPTIONS] COMMAND [ARGS]...

SCRAPERS -- a list of one or more scrapers to run. Pass `all` to
scrape all supported states and agencies.
Command-line interface for downloading CLEAN files.

Options:
--data-dir PATH The Path were the results will be saved
--cache-dir PATH The Path where results can be cached
--delete / --no-delete Delete generated files from the cache
-l, --log-level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
Set the logging level
--help Show this message and exit.
--help Show this message and exit.

Commands:
list List all available agencies and their slugs.
scrape Command-line interface for downloading CLEAN files.
scrape-meta Command-line interface for generating metadata CSV about...
```

Running a state is as simple as passing arguments to that same command.
Running a state is as simple as passing arguments to the appropriate subcommand.

If you were trying to develop the San Diego PD scraper found in `clean/ca/san_diego_pd.py`, you could run something like this.

``` bash
pipenv run python -m clean.cli ca_san_diego_pd
# List available agencies (and their slugs, which are required for scraping commands)
pipenv run python -m clean.cli list

# Trigger the metadata scraper using agency slug
pipenv run python -m clean.cli scrape-meta ca_san_diego_pd

# Trigger file downloads using agency slug
pipenv run python -m clean.cli scrape ca_san_diego_pd
```

For more verbose logging, you can ask the system to showing debugging information.
For more verbose logging, you can ask the system to show debugging information.

``` bash
pipenv run python -m clean.cli -l DEBUG ca_san_diego_pd
```

To be a good citizen of the Web and avoid IP blocking, you can throttle (i.e. slow down the scrapers with a time delay):

``` bash
# Pause 2 seconds between web requests
pipenv run python -m clean.cli -t 2 ca_san_diego_pd
```

You could continue to iterate with code edits and CLI runs until you've completed your goal.

## Run tests
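
The test suite runs through `pipenv` as well; assuming a standard pytest setup, the command is along these lines:

``` bash
# Run the project's test suite inside its virtualenv (assumed pytest setup)
pipenv run pytest
```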
7 changes: 7 additions & 0 deletions docs/maintainers.md
@@ -0,0 +1,7 @@
# Maintainers

Some helpful bits for managing this project.

- TK: Local testing
- TK: CI on GitHub Actions
- TK: release to PyPI