Commit 309e677 (1 parent: de28b11)

* Update contributor docs #5
* Update usage docs #6
* update deps
* Update main README #11
* add stub page for maintainer docs #9

Showing 6 changed files with 142 additions and 103 deletions.

**README.md**

@@ -1,13 +1,21 @@
# CLEAN scrapers

-This repo contains scrapers to gather police bodycam video footage and other files from police department websites, as part of the Community Law Enforcement Accountability Network.
+Welcome to the `clean-scraper` project!
+
+This repo contains scrapers to gather police bodycam video footage and other files from police department websites, as part of the Community Law Enforcement Accountability Network (CLEAN).

-We welcome open-source contributions to this project. If you'd like to pitch in, check out the [growing list of agencies](https://docs.google.com/spreadsheets/d/e/2PACX-1vTBcJKRsufBPYLsX92ZhaHrjV7Qv1THMO4EBhOCmEos4ayv6yB6d9-VXlaKNr5FGaViP20qXbUvJXgL/pubhtml?gid=0&single=true) we need to scrape and ping us in Discussions :point_up: to claim an agency.
-
-> :warning: This is a new scraping effort (as of March 2024). We're planning to provide Developer guidelines and sample code in the near future. In the meantime, please ping if you have questions about how to get started.
+The CLEAN project is a collaborative effort between the ACLU, [Big Local News][] at Stanford, the [UC Berkeley Investigative Reporting Program][], and a variety of civil liberty organizations and news partners around the country. We're gathering police body camera footage, disciplinary records and other important information to help shine a light on the use of harmful and lethal force by law enforcement.
+
+Our accountability reporting has produced a number of [news stories](docs/stories.md) already, but there's still much work to be done.

-## Scrapers
-
-- CA
-  - san_diego_pd
+We welcome open-source contributions to this project. Or if you'd just like to nab our code for your own purposes, you're welcome to do so.
+
+Please check out the resources below to get started:
+
+- [Usage docs](docs/usage.md) - For folks who just need the code
+- [Contributor docs](docs/contributing.md) - Want to write a scraper? Check this out.
+- [Maintainer docs](docs/maintainers.md) - For those intrepid souls helping to manage the project
+
+[Big Local News]: https://biglocalnews.org/content/about/
+[UC Berkeley Investigative Reporting Program]: https://journalism.berkeley.edu/programs/mj/investigative-reporting/

**docs/contributing.md**

@@ -1,18 +1,25 @@
# Contributing

-Our project welcomes new contributors who want to help us fix bugs and improve our scrapers. The current status of our effort is documented in our [issue tracker](https://github.com/biglocalnews/clean-scrapers/issues).
+Our project welcomes new contributors who want to help us add scrapers, fix bugs, or improve our existing codebase. The current status of our effort is documented in our [issue tracker](https://github.com/biglocalnews/clean-scraper/issues).
+
+You can also chat with us over on [GitHub Discussions](https://github.com/biglocalnews/clean-scraper/discussions).

We want your help. We need your help. Here's how to get started.

Adding features and fixing bugs is managed using GitHub's [pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/about-pull-requests) system.

The tutorial that follows assumes you have the [Python](https://www.python.org/) programming language, the [pipenv](https://pipenv.pypa.io/) package manager and the [git](https://git-scm.com/) version control system already installed. If you don't, you'll want to address that first.

Below are details on the typical workflow.

+## Claim an agency
+
+If you'd like to write a new scraper, check out the [growing list of law enforcement agencies](https://docs.google.com/spreadsheets/d/e/2PACX-1vTBcJKRsufBPYLsX92ZhaHrjV7Qv1THMO4EBhOCmEos4ayv6yB6d9-VXlaKNr5FGaViP20qXbUvJXgL/pubhtml?gid=0&single=true) we need to scrape and ping us in Discussions :point_up: to claim an agency. Just make sure to pick one that hasn't yet been claimed.
+
## Create a fork

-Start by visiting our project's repository at [github.com/biglocalnews/clean-scrapers](https://github.com/biglocalnews/clean-scrapers) and creating a fork. You can learn how from [GitHub's documentation](https://docs.github.com/en/get-started/quickstart/fork-a-repo).
+Start by visiting our project's repository at [github.com/biglocalnews/clean-scraper](https://github.com/biglocalnews/clean-scraper) and creating a fork. You can learn how from [GitHub's documentation](https://docs.github.com/en/get-started/quickstart/fork-a-repo).

## Clone the fork
@@ -21,15 +28,15 @@ Now you need to make a copy of your fork on your computer using GitHub's cloning
A typical terminal command will look something like the following, with your username inserted in the URL.

``` bash
-git clone git@github.com:yourusername/clean-scrapers.git
+git clone git@github.com:yourusername/clean-scraper.git
```

## Install dependencies

You should [change directory](https://manpages.ubuntu.com/manpages/trusty/man1/cd.1posix.html) into the folder where your code was downloaded.

``` bash
-cd clean-scrapers
+cd clean-scraper
```

The `pipenv` package manager can install all of the Python tools necessary to run and test our code.
@@ -58,9 +65,9 @@ You can do this by running a line of code like the one below. You should substit
``` bash
git switch -c your-branch-name
```

-We ask that you follow a pattern where the **branch name includes the postal code of the state you're working on, combined with the issue number generated by GitHub**.
+We ask that you name your branch using the following convention: **the postal code + the GitHub issue number**.

-For example, let's say you were working on a scraper for the San Diego Police Department and the related GitHub issue is `#100`.
+For example, let's say you were working on a scraper for the San Diego Police Department in California and the related GitHub issue is `#100`.

You create a branch named `ca-100` and switch over to it (i.e. "check it out" locally, in git lingo) using the below command.
@@ -74,7 +81,7 @@ Now you can begin your work. You can start editing the code on your computer, ma
## Creating a new scraper

-When adding a new state, you should create a new Python file in the following directory structure and format: `clean/{state_postal}/{agency_slug}`. Try to keep the agency slug, or abbreviation short but meaningful. If in doubt, hit us up and we can hash out a name. Naming things is hard, after all.
+When adding a new state, you should create a new Python file in the following directory structure and format: `clean/{state_postal}/{agency_slug}`. Try to keep the agency slug, or abbreviation, short but meaningful. If in doubt, hit us up on an issue or in the [GitHub Discussions forum](https://github.com/biglocalnews/clean-scraper/discussions) and we can hash out a name. After all, [naming things is hard](https://martinfowler.com/bliki/TwoHardThings.html).

Here is the folder structure for the San Diego Police Department in California:
@@ -84,78 +91,59 @@ clean
```
└── san_diego_pd.py
```

-You can use the code for San Diego as a reference example to jumpstart
-your own state.
-
-When coding a new scraper, the new important conventions to follow are:
-
-- Always create a top-level `scrape` function with the interface seen
-  below
-- Always ensure that the `scrape` function downloads and stores files to
-  a standardized, configurable location (configured as parameters to the
-  `scrape` function)
-
-``` python
-from pathlib import Path
-
-from .. import utils
-from ..cache import Cache
-
-
-def scrape(
-    data_dir: Path = utils.CLEAN_DATA_DIR,
-    cache_dir: Path = utils.CLEAN_CACHE_DIR,
-) -> Path:
-    """
-    Scrape data from Iowa.
-
-    Keyword arguments:
-    data_dir -- the Path were the result will be saved (default WARN_DATA_DIR)
-    cache_dir -- the Path where results can be cached (default WARN_CACHE_DIR)
-
-    Returns: the Path where the file is written
-    """
-    # Grab the page
-    page = utils.get_url("https://xx.gov/yy.html")
-    html = page.text
-
-    # Write the raw file to the cache
-    cache = Cache(cache_dir)
-    cache.write("xx/yy.html", html)
-
-    # Parse the source file and convert to a list of rows, with a header in the first row.
-    ## It's up to you to fill in the blank here based on the structure of the source file.
-    ## You could do that here with BeautifulSoup or whatever other technique.
-    pass
-
-    # Set the path to the final CSV
-    # We should always use the lower-case state postal code, like nj.csv
-    output_csv = data_dir / "xx.csv"
-
-    # Write out the rows to the export directory
-    utils.write_rows_to_csv(output_csv, cleaned_data)
-
-    # Return the path to the final CSV
-    return output_csv
-
-
-if __name__ == "__main__":
-    scrape()
-```
+You can use the code for San Diego as a reference example to jumpstart your own state.
+
+When coding a new scraper, there are a few important conventions to follow:
+
+- Add the agency's scraper module to a state-based folder (e.g. `clean/ca/san_diego_pd.py`)
+- If it's a new state folder, add an empty `__init__.py` to the folder
+- Create a `Site` class inside the agency's scraper module with the following attributes/methods:
+  - `name` - Official name of the agency
+  - `scrape_meta` - generates a CSV with metadata about videos and other available files (file name, URL, and size at minimum)
+  - `scrape` - uses the CSV generated by `scrape_meta` to download videos and other files
+
+Below is a pared down version of San Diego's [Site](https://github.com/biglocalnews/clean-scraper/blob/main/clean/ca/san_diego_pd.py) class to illustrate these conventions.
+
+> The San Diego PD scraper code is in `clean/ca/san_diego_pd.py`
+
+```python
+class Site:
+
+    name = "San Diego Police Department"
+
+    def __init__(self, data_dir=utils.CLEAN_DATA_DIR, cache_dir=utils.CLEAN_CACHE_DIR):
+        self.data_dir = data_dir
+        self.cache_dir = cache_dir
+        # etc.
+        # etc.
+
+    def scrape_meta(self, throttle=0):
+        # 1. Scrape metadata about available files, making sure to download and save file
+        #    artifacts such as HTML pages along the way (we recommend using Cache.download)
+        # 2. Generate a metadata CSV and store in the cache
+        # 3. Return the path to the metadata CSV
+        pass
+
+    def scrape(self, metadata_csv):
+        # 1. Use the metadata CSV generated by `scrape_meta` to download available files
+        #    to the cache directory (once again, check out Cache.download)
+        pass
+```
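
To see how these pieces fit together end to end, here is a small, hypothetical driver script (not part of the repository) that exercises a `Site` class written to the interface above; it assumes the `clean/ca/san_diego_pd.py` module layout shown earlier and that `scrape_meta` returns the path to the metadata CSV.

```python
# Hypothetical usage sketch -- not actual repo code.
# Assumes clean/ca/san_diego_pd.py defines a Site class with the interface described above.
from clean.ca import san_diego_pd

site = san_diego_pd.Site()  # defaults to utils.CLEAN_DATA_DIR and utils.CLEAN_CACHE_DIR

# First pass: gather metadata about available files and cache the source pages
metadata_csv = site.scrape_meta(throttle=2)  # pause between requests to be polite

# Second pass: download the files listed in the metadata CSV
site.scrape(metadata_csv)

print(f"Metadata written to {metadata_csv}")
```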

When creating a scraper, there are a few rules of thumb.

1. The raw data being scraped --- whether it be HTML, video files, PDFs ---
   should be saved to the cache unedited. We aim to store pristine
   versions of our source data.
-2. The data extracted from source files should be exported as a single
-   file. Any intermediate files generated during data processing should
+1. The metadata about source files should be stored in a single
+   CSV file. Any intermediate files generated during file/data processing should
   not be written to the data folder. Such files should be written to
   the cache directory.
-3. The final export should be the state's postal code and agency slug, in lower case.
-   For example, San Diego's final file should be saved `ca_san_diego_pd.csv`.
-4. For simple cases, use a cache name identical to the final export name.
-5. If many files need to be cached, create a subdirectory using the lower-case state postal code and agency slug (`ca_san_diego_pd`) and apply a sensible naming scheme to the cached files (e.g. `ca_san_diego_pd/page_1.html`)
+1. Files should be cached in a site-specific cache folder using the agency slug name: `ca_san_diego_pd`.
+   If many files need to be cached, apply a sensible naming scheme to the cached files (e.g. `ca_san_diego_pd/index_page_1.html`)
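
To make the caching rules above concrete, here is a rough sketch (again, not code from the repository) of how a `scrape_meta` implementation might lay out its cache folder and metadata CSV. The `Cache(cache_dir)` and `cache.write()` calls, `utils.get_url`, and the `ca_san_diego_pd/...` naming scheme come from the examples above; the URL, CSV file name, and column names are illustrative assumptions.

```python
# Illustrative sketch only -- the URL, CSV name, and column names are assumptions.
import csv
import time
from pathlib import Path

from .. import utils
from ..cache import Cache


def scrape_meta(cache_dir: Path = utils.CLEAN_CACHE_DIR, throttle: int = 0) -> Path:
    slug = "ca_san_diego_pd"  # lower-case state postal code + agency slug

    # Save a pristine copy of each source page in the agency's cache folder
    cache = Cache(cache_dir)
    page = utils.get_url("https://example.com/bodycam/index.html")  # placeholder URL
    cache.write(f"{slug}/index_page_1.html", page.text)
    time.sleep(throttle)

    # Extract at least a file name, URL and size for each asset, then store the
    # metadata CSV in the cache as well (intermediate files never go in the data folder)
    rows = [
        {"asset_name": "example_clip.mp4", "asset_url": "https://example.com/example_clip.mp4", "size": 1048576},
    ]
    metadata_csv = cache_dir / slug / f"{slug}_metadata.csv"
    metadata_csv.parent.mkdir(parents=True, exist_ok=True)
    with open(metadata_csv, "w", newline="") as fh:
        writer = csv.DictWriter(fh, fieldnames=["asset_name", "asset_url", "size"])
        writer.writeheader()
        writer.writerows(rows)
    return metadata_csv
```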

## Running the CLI

@@ -164,36 +152,48 @@ After a scraper has been created, the command-line tool provides a method to tes
``` bash
pipenv run python -m clean.cli --help

-Usage: python -m clean.cli [OPTIONS] [SCRAPERS]...
-
-  Command-line interface for downloading CLEAN police files.
-
-  SCRAPERS -- a list of one or more scrapers to run. Pass `all` to
-  scrape all supported states and agencies.
+Usage: python -m clean.cli [OPTIONS] COMMAND [ARGS]...
+
+  Command-line interface for downloading CLEAN files.

Options:
-  --data-dir PATH          The Path were the results will be saved
-  --cache-dir PATH         The Path where results can be cached
-  --delete / --no-delete   Delete generated files from the cache
-  -l, --log-level [DEBUG|INFO|WARNING|ERROR|CRITICAL]
-                           Set the logging level
-  --help                   Show this message and exit.
+  --help  Show this message and exit.
+
+Commands:
+  list         List all available agencies and their slugs.
+  scrape       Command-line interface for downloading CLEAN files.
+  scrape-meta  Command-line interface for generating metadata CSV about...
```

-Running a state is as simple as passing arguments to that same command.
+Running a state is as simple as passing arguments to the appropriate subcommand.

If you were trying to develop the San Diego PD scraper found in `clean/ca/san_diego_pd.py`, you could run something like this.

``` bash
-pipenv run python -m clean.cli ca_san_diego_pd
+# List available agencies (and their slugs, which are required for scraping commands)
+pipenv run python -m clean.cli list
+
+# Trigger the metadata scraper using agency slug
+pipenv run python -m clean.cli scrape-meta ca_san_diego_pd
+
+# Trigger file downloads using agency slug
+pipenv run python -m clean.cli scrape ca_san_diego_pd
```

-For more verbose logging, you can ask the system to showing debugging information.
+For more verbose logging, you can ask the system to show debugging information.

``` bash
pipenv run python -m clean.cli -l DEBUG ca_san_diego_pd
```

To be a good citizen of the Web and avoid IP blocking, you can throttle (i.e. slow down the scrapers with a time delay):

``` bash
# Pause 2 seconds between web requests
pipenv run python -m clean.cli -t 2 ca_san_diego_pd
```

You could continue to iterate with code edits and CLI runs until you've completed your goal.

## Run tests

**docs/maintainers.md**

@@ -0,0 +1,7 @@
+# Maintainers
+
+Some helpful bits to help manage this project.
+
+- TK: Local testing
+- TK: CI on GitHub Actions
+- TK: release to PyPI