generated from City-Bureau/city-scrapers-template
Commit
0 parents
commit ae4fb6e
Showing 25 changed files with 1,744 additions and 0 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,10 @@ | ||
#!/bin/bash | ||
pipenv run scrapy list | xargs -I {} pipenv run scrapy crawl {} -s LOG_ENABLED=False & | ||
|
||
# Output to the screen every 9 minutes to prevent a travis timeout | ||
# https://stackoverflow.com/a/40800348 | ||
export PID=$! | ||
while [[ $(ps -p "$PID" | tail -n +2) ]]; do | ||
echo 'Deploying' | ||
sleep 540 | ||
done |
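The script above sends the crawl to the background and prints a line every 540 seconds for as long as that process is alive, so the CI runner keeps seeing output. The following is a minimal Python sketch of the same heartbeat pattern, not code from this commit: `run_with_heartbeat` and its parameters are hypothetical names. It differs from the shell loop in two ways: it waits before printing rather than printing first, and it returns the child's exit code, which the shell version discards.

```python
import subprocess
import sys


def run_with_heartbeat(cmd, interval=540.0, message="Deploying"):
    """Run cmd and print message every `interval` seconds until it exits.

    Hypothetical helper for illustration only. Returns a tuple of
    (child exit code, number of heartbeat lines printed).
    """
    proc = subprocess.Popen(cmd)
    beats = 0
    while True:
        try:
            # Block for up to `interval` seconds waiting for the child.
            proc.wait(timeout=interval)
            break
        except subprocess.TimeoutExpired:
            # Child is still running: emit output so the CI job is not
            # considered stalled, then keep waiting.
            print(message, file=sys.stdout, flush=True)
            beats += 1
    return proc.returncode, beats
```

Under these assumptions, a call such as `run_with_heartbeat(["pipenv", "run", "scrapy", "crawl", "<spider-name>"])` would play the role of the backgrounded command plus the `while` loop.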
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,9 @@ | ||
[flake8] | ||
ignore = E203,E741,W503,W504 | ||
exclude = | ||
.git, | ||
.venv, | ||
venv, | ||
*/__pycache__/*, | ||
tests/files/*, | ||
max-line-length = 88 |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1 @@ | ||
tests/files/* linguist-generated |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,28 @@ | ||
## What does this PR do? | ||
<!-- eg. This PR updates the scraper for Cleveland City Council because of changes to how they display their meeting schedule. --> | ||
|
||
## Why are we doing this? | ||
<!-- eg. The website's layout was recently updated, causing our existing scraper to fail. This change ensures our scraper remains functional and continues to provide timely updates on council meetings. --> | ||
|
||
## Steps to manually test | ||
<!-- Text here is not always necessary but it is generally recommended in order to aid a reviewer. | ||
eg. | ||
1. Ensure the project is installed: | ||
``` | ||
pipenv sync --dev | ||
``` | ||
2. Activate the virtual env and enter the pipenv shell: | ||
``` | ||
pipenv shell | ||
``` | ||
3. Run the spider: | ||
``` | ||
scrapy crawl <spider-name> -O test_output.csv | ||
``` | ||
4. Monitor the output and ensure no errors are raised. | ||
5. Inspect `test_output.csv` to ensure the data looks valid. | ||
--> | ||
|
||
## Are there any smells or added technical debt to note? | ||
<!-- eg. The new scraping logic includes a more complex parsing routine, which might be less efficient. Future optimization or a more robust parsing strategy may be needed if the website's layout continues to evolve. --> |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
name: Archive | ||
|
||
on: | ||
schedule: | ||
- cron: "7 11 * * *" | ||
workflow_dispatch: | ||
|
||
env: | ||
CI: true | ||
PYTHON_VERSION: "3.9" | ||
PIPENV_VENV_IN_PROJECT: true | ||
SCRAPY_SETTINGS_MODULE: city_scrapers.settings.archive | ||
AUTOTHROTTLE_MAX_DELAY: 30.0 | ||
AUTOTHROTTLE_START_DELAY: 1.5 | ||
AUTOTHROTTLE_TARGET_CONCURRENCY: 3.0 | ||
|
||
jobs: | ||
crawl: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v2 | ||
|
||
- name: Set up Python ${{ env.PYTHON_VERSION }} | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: ${{ env.PYTHON_VERSION }} | ||
|
||
- name: Install Pipenv | ||
run: pip install --user pipenv | ||
|
||
- name: Cache Python dependencies | ||
uses: actions/cache@v2 | ||
with: | ||
path: .venv | ||
key: ${{ env.PYTHON_VERSION }}-${{ hashFiles('**/Pipfile.lock') }} | ||
restore-keys: | | ||
${{ env.PYTHON_VERSION }}- | ||
pip- | ||
- name: Install dependencies | ||
run: pipenv sync | ||
env: | ||
PIPENV_DEFAULT_PYTHON_VERSION: ${{ env.PYTHON_VERSION }} | ||
|
||
- name: Run scrapers | ||
run: | | ||
export PYTHONPATH=$(pwd):$PYTHONPATH | ||
./.deploy.sh |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,69 @@ | ||
name: CI | ||
|
||
on: [push, pull_request] | ||
|
||
env: | ||
CI: true | ||
PIPENV_VENV_IN_PROJECT: true | ||
AUTOTHROTTLE_MAX_DELAY: 30.0 | ||
AUTOTHROTTLE_START_DELAY: 1.5 | ||
AUTOTHROTTLE_TARGET_CONCURRENCY: 3.0 | ||
|
||
jobs: | ||
check: | ||
runs-on: ubuntu-latest | ||
strategy: | ||
max-parallel: 4 | ||
matrix: | ||
python-version: ["3.9"] | ||
|
||
steps: | ||
- uses: actions/checkout@v1 | ||
|
||
- name: Set up Python ${{ matrix.python-version }} | ||
uses: actions/setup-python@v1 | ||
with: | ||
python-version: ${{ matrix.python-version }} | ||
|
||
- name: Install Pipenv | ||
run: pip install --user pipenv | ||
|
||
- name: Cache Python dependencies | ||
uses: actions/cache@v1 | ||
with: | ||
path: .venv | ||
key: pip-${{ matrix.python-version }}-${{ hashFiles('**/Pipfile.lock') }} | ||
restore-keys: | | ||
pip-${{ matrix.python-version }}- | ||
pip- | ||
- name: Install dependencies | ||
run: pipenv sync --dev | ||
env: | ||
PIPENV_DEFAULT_PYTHON_VERSION: ${{ matrix.python-version }} | ||
|
||
- name: Check imports with isort | ||
run: | | ||
pipenv run isort . --check-only | ||
- name: Check style with black | ||
run: | | ||
pipenv run black . --check | ||
- name: Lint with flake8 | ||
run: | | ||
pipenv run flake8 . | ||
- name: Test with pytest | ||
# Ignores exit code 5 (no tests collected) | ||
run: | | ||
pipenv run pytest || [ $? -eq 5 ] | ||
- name: Validate output with scrapy | ||
if: github.event_name == 'pull_request' | ||
run: | | ||
git checkout ${{ github.base_ref }} | ||
git checkout $(git show-ref | grep pull | awk '{ print $2 }') | ||
git diff-index --name-only --diff-filter=d $(git merge-base HEAD ${{ github.base_ref }}) | \ | ||
grep -Pio '(?<=/spiders/).*(?=\.py)' | \ | ||
xargs pipenv run scrapy validate |
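The final step of the workflow above lists the files changed relative to the merge base, keeps only the part of each path between `/spiders/` and `.py`, and passes those names to `scrapy validate`. The sketch below mirrors only the `grep -Pio` filter in Python to show what it extracts; `changed_spiders` and the example paths are hypothetical and not part of this commit.

```python
import re

# Same pattern as the grep -Pio call: case-insensitive, match whatever
# sits between "/spiders/" and ".py" without including either marker.
SPIDER_PATH = re.compile(r"(?<=/spiders/).*(?=\.py)", re.IGNORECASE)


def changed_spiders(paths):
    """Return the spider names found in a list of changed file paths.

    Hypothetical helper for illustration only. Paths that do not
    contain "/spiders/" followed by a ".py" suffix are skipped.
    """
    names = []
    for path in paths:
        match = SPIDER_PATH.search(path)
        if match:
            names.append(match.group(0))
    return names
```

Note that the pattern is purely textual: any changed Python file under a `spiders/` directory, including a package `__init__.py`, would yield a name.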
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
name: Cron | ||
|
||
on: | ||
schedule: | ||
# Set any time that you'd like scrapers to run (in UTC) | ||
- cron: "1 6 * * *" | ||
workflow_dispatch: | ||
|
||
env: | ||
CI: true | ||
PYTHON_VERSION: "3.9" | ||
PIPENV_VENV_IN_PROJECT: true | ||
SCRAPY_SETTINGS_MODULE: city_scrapers.settings.prod | ||
WAYBACK_ENABLED: true | ||
AUTOTHROTTLE_MAX_DELAY: 30.0 | ||
AUTOTHROTTLE_START_DELAY: 1.5 | ||
AUTOTHROTTLE_TARGET_CONCURRENCY: 3.0 | ||
# Add secrets for the platform you're using and uncomment here | ||
# AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }} | ||
# AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }} | ||
# S3_BUCKET: ${{ secrets.S3_BUCKET }} | ||
# AZURE_ACCOUNT_KEY: ${{ secrets.AZURE_ACCOUNT_KEY }} | ||
# AZURE_ACCOUNT_NAME: ${{ secrets.AZURE_ACCOUNT_NAME }} | ||
# AZURE_CONTAINER: ${{ secrets.AZURE_CONTAINER }} | ||
# GOOGLE_APPLICATION_CREDENTIALS: ${{ secrets.GOOGLE_APPLICATION_CREDENTIALS }} | ||
# GCS_BUCKET: ${{ secrets.GCS_BUCKET }} | ||
# Setup Sentry, add the DSN to secrets and uncomment here | ||
# SENTRY_DSN: ${{ secrets.SENTRY_DSN }} | ||
|
||
jobs: | ||
crawl: | ||
runs-on: ubuntu-latest | ||
steps: | ||
- uses: actions/checkout@v2 | ||
|
||
- name: Set up Python ${{ env.PYTHON_VERSION }} | ||
uses: actions/setup-python@v2 | ||
with: | ||
python-version: ${{ env.PYTHON_VERSION }} | ||
|
||
- name: Install Pipenv | ||
run: pip install --user pipenv | ||
|
||
- name: Cache Python dependencies | ||
uses: actions/cache@v2 | ||
with: | ||
path: .venv | ||
key: ${{ env.PYTHON_VERSION }}-${{ hashFiles('**/Pipfile.lock') }} | ||
restore-keys: | | ||
${{ env.PYTHON_VERSION }}- | ||
pip- | ||
- name: Install dependencies | ||
run: pipenv sync | ||
env: | ||
PIPENV_DEFAULT_PYTHON_VERSION: ${{ env.PYTHON_VERSION }} | ||
|
||
- name: Run scrapers | ||
run: | | ||
export PYTHONPATH=$(pwd):$PYTHONPATH | ||
./.deploy.sh | ||
- name: Combine output feeds | ||
run: | | ||
export PYTHONPATH=$(pwd):$PYTHONPATH | ||
pipenv run scrapy combinefeeds -s LOG_ENABLED=False | ||
- name: Prevent workflow deactivation | ||
uses: gautamkrishnar/keepalive-workflow@v1 | ||
with: | ||
committer_username: "citybureau-bot" | ||
committer_email: "[email protected]" |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,136 @@ | ||
# Byte-compiled / optimized / DLL files | ||
__pycache__/ | ||
*.py[cod] | ||
*$py.class | ||
|
||
# C extensions | ||
*.so | ||
|
||
# Distribution / packaging | ||
.Python | ||
env/ | ||
build/ | ||
develop-eggs/ | ||
dist/ | ||
docs/_site/ | ||
downloads/ | ||
eggs/ | ||
.eggs/ | ||
lib/ | ||
lib64/ | ||
parts/ | ||
sdist/ | ||
var/ | ||
wheels/ | ||
*.egg-info/ | ||
.installed.cfg | ||
*.egg | ||
|
||
# PyInstaller | ||
# Usually these files are written by a python script from a template | ||
# before PyInstaller builds the exe, so as to inject date/other infos into it. | ||
*.manifest | ||
*.spec | ||
|
||
# Installer logs | ||
pip-log.txt | ||
pip-delete-this-directory.txt | ||
|
||
# Unit test / coverage reports | ||
htmlcov/ | ||
.tox/ | ||
.coverage | ||
.coverage.* | ||
.cache | ||
nosetests.xml | ||
coverage.xml | ||
.pytest_cache | ||
*.cover | ||
.hypothesis/ | ||
|
||
# Translations | ||
*.mo | ||
*.pot | ||
|
||
# Django stuff: | ||
*.log | ||
local_settings.py | ||
|
||
# Flask stuff: | ||
instance/ | ||
.webassets-cache | ||
|
||
# Scrapy stuff: | ||
.scrapy | ||
tutorial/ | ||
|
||
# Sphinx documentation | ||
docs/_build/ | ||
|
||
# PyBuilder | ||
target/ | ||
|
||
# Jupyter Notebook | ||
.ipynb_checkpoints | ||
|
||
# pyenv | ||
.python-version | ||
|
||
# celery beat schedule file | ||
celerybeat-schedule | ||
|
||
# SageMath parsed files | ||
*.sage.py | ||
|
||
# dotenv | ||
.env | ||
|
||
# IDEs and editors | ||
/.idea | ||
.vscode | ||
.project | ||
.classpath | ||
.c9/ | ||
*.launch | ||
.settings/ | ||
*.sublime-workspace | ||
|
||
# virtualenv | ||
.venv | ||
venv/ | ||
ENV/ | ||
documenters-aggregator/ | ||
city-scrapers/ | ||
|
||
# Spyder project settings | ||
.spyderproject | ||
.spyproject | ||
|
||
# Rope project settings | ||
.ropeproject | ||
|
||
# mkdocs documentation | ||
/site | ||
|
||
# mypy | ||
.mypy_cache/ | ||
|
||
# cyrus | ||
json_conversions.md | ||
.citybureau/ | ||
city_scrapers/caps.py | ||
|
||
# legistar cache | ||
_cache/ | ||
|
||
# src dir from git commit packages in requirements.txt | ||
src/ | ||
|
||
# OS files | ||
.DS_Store | ||
|
||
# validation logs | ||
logs/*.log | ||
travis/*.json | ||
|
||
# output files: local gitignore added to city_scrapers/local_outputs/ |